kbase / project_guides

This repo contains documents and guides that describe project principles, how-to docs, etc.
MIT License

initial thinking of ER object diagram, spec file to follow shortly #46

Closed jkbaumohl closed 9 years ago

jkbaumohl commented 9 years ago

This is the thinking from Matt and me on the Genome type. We will submit the spec tomorrow.

dangunter commented 9 years ago

.bmp as in bitmap? Can you convert to PNG?

jkbaumohl commented 9 years ago

Hi Dan,

I can do a PNG tomorrow. However, note that you can click all the way through to it and see it: click the "view raw" link.

-Jason


samseaver commented 9 years ago

OK, so I'm reading through this. I've got some initial questions about this:

1) reference_assembly/genome_ref: I take it that we can treat these in a boolean-like manner, i.e. if a taxon does not have such a reference, then the organism as a whole does not have a reference genome available (at least not in KBase) and if the genome does not have such a reference, then it is in turn not a reference genome. Am I thinking about this right?

2) I take it that we are no longer going to be holding contig sequences in the WS object, which is a good thing with respect to plants. If this was an oversight, can you please ensure that the contig sequence is optional.

3) Why is the genome_annotations mapping in the Assembly structure and not the Genome structure? Whilst I do understand the relationship between Genome and Assembly, it appears that many of the keys in the Assembly structure would be better off being directly available in the Genome structure. In order to retrieve the GenomeAnnotation for a Genome, currently, one would have to iterate through the set of Assemblies, and then collect the GenomeAnnotation objects from each Assembly.

4) I like the feature grouping concept, particularly along with the use of the GenBank feature keys, and I believe it can be taken a step further by also defining within the feature_grouping structure the relationship between component_types, so a field could be added like:

mapping <string component_type, list> corresponding_components;

Then for each component type, you would be modeling how it's linked to other component types.

However, I see that you've not listed locus, mRNA, cDNA etc. so I'm unsure if you intend to use feature_grouping for these main classes?

5) Having feature_grouping, feature_set_map and FeatureTypeSet seems redundant?

6) Having all these extra lists in the Feature structure such as coexpressed_fids seems counter-intuitive to what we're trying to achieve here. Certainly we could come up with a similar structure to the feature_grouping, called perhaps results_grouping, where features that are collated as a result of external processing can then be grouped in a separate structure. Indeed the Feature structure appears unnecessarily burdened here.

7) Why must there be a 1-1 relationship between a feature and its corresponding protein? It's often the case that a biologist will talk about the relationship between a gene and the protein product it encodes as if it's a direct relationship, and will explore it as such. Can there not be simply a list of corresponding proteins?
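
For illustration only, points 4 and 7 above could be sketched as plain Python data. The component type names, feature IDs, and the protein_refs field are hypothetical and are not taken from the spec.

```python
# Hypothetical sketch; type names, IDs, and field names are illustrative only.

# Point 4: for each component type, the other component types it links to.
corresponding_components = {
    "gene": ["mRNA"],
    "mRNA": ["gene", "CDS"],
    "CDS": ["mRNA"],
}

# Point 7: a feature carries a list of protein references instead of exactly one.
feature = {
    "id": "gene_1",
    "type": "gene",
    "protein_refs": ["protein_1a", "protein_1b"],
}


def linked_types(component_type):
    """Return the component types linked to the given type (empty if unknown)."""
    return corresponding_components.get(component_type, [])
```

A lookup such as linked_types("mRNA") then answers which component types a given type relates to without walking the features themselves.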

jkbaumohl commented 9 years ago

Hi Sam, Sorry I did not intend for the spec to be read at this time. I have been saving it along the way, but it is not a finished product as of yet. I am still changing things around and formatting it. I should have it done by the end of today. I would suggest holding off on comments until then as things are in flux. I will send out a comment when I am done. Sorry for the confusion. -Jason

jkbaumohl commented 9 years ago

Ok Sam here is the first draft of the Genome spec. Please check out and ask questions. If there are any flaws or specific needs that this will not be able to meet please let us know.

Thanks, Jason

mlhenderson commented 9 years ago

Deleting it all and falling back to the previous object design is not an option at this point. We have to redesign to support 1) building the knowledgebase and capturing proper semantic information about what data is and how it is being used, 2) eukaryotic organisms, 3) multiple assemblies and annotations per genome that can be compared and operated on, 4) simple and easy extraction of important information while also balancing the sizes of data that need to be moved around, 5) flexibility and extensibility that allows the types to evolve over time as understanding of Biology increases, 6) more efficient navigation for indexing and computation.

Biological data itself is complex, and we need to be able to capture that. I don't view this as complicating our existing design, but simplifying a model of biology that will open up capabilities for us in the coming months and years. I know it will not be perfect, and this has to be actually tested against something to know that it works well. Additionally, I know that this would be disruptive and so the plan was to provide an API for accessing pieces of the data that would remove the need for a service or method to have a deep understanding of the type structure, and a conversion method that you could call during a transition period that would allow you to get objects in the current production form so that we don't break the system. This may be a lossy conversion because there may be differences in what information is captured, but you would have time to migrate without loss of functionality. Ultimately my goal is to make life easier for everyone, which doesn't always mean simpler.

jkbaumohl commented 9 years ago

Hi Sam,

I will try to comment on these inline. Also, in reading your questions I realized it appears you are looking at an older version.

On Wed, Jun 3, 2015 at 8:36 AM, samseaver notifications@github.com wrote:

OK, so I'm reading through this. I've got some initial questions about this:

1) reference_assembly/genome_ref: I take it that we can treat these in a boolean-like manner, i.e. if a taxon does not have such a reference, then the organism as a whole does not have a reference genome available (at least not in KBase) and if the genome does not have such a reference, then it is in turn not a reference genome. Am I thinking about this right?

Potentially a non-reference assembly could have a reference genome annotation. Meaning this is the default annotation for this assembly.

The concept of the reference is so that there is a "default". From a KBase standpoint, things from RefSeq may be our initial reference data. Personally I feel there is a need for the reference concept for three reasons:

-- First, a researcher searching our public data may be inundated with potentially many options at all these different levels and may not want to go through the decision-making process of which option to take.

-- Secondly, from my perspective of dealing with the expression data from GEO, it would be much more efficient if the expression data is hung on the reference genome annotation instead of all the possible genome annotations. As a result the size of the data stored would be much more reasonable and would not spiral out of control. This would not preclude us from associating it with non-reference annotations as a service.

-- Thirdly, it allows a user to set a default if they are dealing with many variations of their own private data.

2) I take it that we are no longer going to be holding contig sequences in the WS object, which is a good thing with respect to plants. If this was an oversight, can you please ensure that the contig sequence is optional.

You are correct the contig sequence will no longer be in the contig subobject. The idea is to leverage some of Shock's indexing ability on fasta files to quickly pull the sequence. This keeps the subobjects smaller so that they do not go over the WS object size limit (a problem we had with many plants). It also keeps the object smaller when moving it around in the system.

3) Why is the genome_annotations mapping in the Assembly structure and not the Genome structure? Whilst I do understand the relationship between Genome and Assembly, it appears that many of the keys in the Assembly structure would be better off being directly available in the Genome structure. In order to retrieve the GenomeAnnotation for a Genome, currently, one would have to iterate through the set of Assemblies, and then collect the GenomeAnnotation objects from each Assembly.

I am not sure if I am completely following you here. Yes if you started at the Genome level you would need to go to the assembly level. However if you were looking for the reference assembly and its corresponding reference genome annotation then you could use the reference ids to directly access the nonversioned ws reference you want from the maps.

4) I like the feature grouping concept, particularly along with the use of the GenBank feature keys, and I believe it can be taken a step further by also defining within the feature_grouping structure the relationship between component_types, so a field could be added like:

mapping <string component_type, list> corresponding_components;

Then for each component type, you would be modeling how it's linked to other component types.

However, I see that you've not listed locus, mRNA, cDNA etc. so I'm unsure if you intend to use feature_grouping for these main classes?

I am confused here. In an early draft we had a separate feature grouping subobject that we took out. Are you referring to that or the feature set? We can absolutely add a mapping if it would make things easier. We just need to make sure we are talking about the same thing.

5) Having feature_grouping, feature_set_map and FeatureTypeSet seems redundant?

Are you looking at the most recent version?

6) Having all these extra lists in the Feature structure such as coexpressed_fids seems counter-intuitive to what we're trying to achieve here. Certainly we could come up with a similar structure to the feature_grouping, called perhaps results_grouping, where features that are collated as a result of external processing can then be grouped in a separate structure. Indeed the Feature structure appears unnecessarily burdened here.

7) Why must there be a 1-1 relationship between a feature and its corresponding protein? It's often the case that a biologist will talk about the relationship between a gene and the protein product it encodes as if it's a direct relationship, and will explore it as such. Can there not be simply a list of corresponding proteins?

We can make a list but realize the protein ref is only on the mRNA and CDS properties level.
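
To make the answer to question 3 concrete, here is a minimal sketch of the two access patterns discussed: jumping straight to the reference annotation via the reference ids, versus iterating over every assembly. All field names and reference strings below are assumptions made for illustration; the actual spec may differ.

```python
# Hypothetical sketch: field names and reference strings are assumptions.
genome = {
    "reference_assembly": "asm_A",
    "assemblies": {"asm_A": "ws/assembly_A", "asm_B": "ws/assembly_B"},
}

# Stand-in for workspace lookups: reference string -> Assembly object.
assemblies = {
    "ws/assembly_A": {
        "reference_annotation": "ann_1",
        "genome_annotations": {"ann_1": "ws/annotation_1", "ann_2": "ws/annotation_2"},
    },
    "ws/assembly_B": {
        "reference_annotation": "ann_3",
        "genome_annotations": {"ann_3": "ws/annotation_3"},
    },
}


def reference_annotation_ref(genome, assemblies):
    """Follow the reference ids: Genome -> reference Assembly -> reference annotation."""
    asm = assemblies[genome["assemblies"][genome["reference_assembly"]]]
    return asm["genome_annotations"][asm["reference_annotation"]]


def all_annotation_refs(genome, assemblies):
    """Collect every annotation reference by iterating over all assemblies."""
    refs = []
    for asm_ref in genome["assemblies"].values():
        refs.extend(assemblies[asm_ref]["genome_annotations"].values())
    return refs
```

The first function needs two map lookups regardless of how many assemblies exist; the second is the iteration Sam describes and grows with the number of assemblies.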


jkbaumohl commented 9 years ago

Oops, I just realized I am responding to your message from yesterday. Please check out the new version of the doc.

-Jason


cshenry commented 9 years ago

Well, I can't say I'm even slightly surprised by this response. You want to ignore my input. So be it. This is your mess. Your transition. Your work. I'll have no association with it.


aparkin commented 9 years ago

I don't think anyone is ignoring your input, Chris. I think we are trying to work multiple problems simultaneously. I don't think anyone wants to show a user 20 objects every time they load a genome into WS, for example. But we may have other ways of solving this problem than a large stuffed genome object. Right now we are discussing this. We are pushing back on your point of view. I am happy to have a more formal discussion of this object as a summit and we can work it through together. We have had a LOT of discussion locally which is one of our challenges...we end up thinking somehow that everyone can hear what we say in a room over weeks. And then we plop a spec here and everyone not in the room feels left out. But this isn't a DECISION yet. We (or I) are working out all the details and arguing back.

mlhenderson commented 9 years ago

@cshenry That is an unfortunate response. I hope you will reconsider. I'm trying to incorporate input from everyone. We may not always agree, and that's ok, but I would hope that we can reach some common ground that would provide support for you.

cshenry commented 9 years ago

So we're crossing two threads here. Matt's response (on the other thread) had the definite feel of a "decision", with little to no room for discussion.

Right now we are discussing this. We are pushing back on your point of view. I am happy to have a more formal discussion of this object as a summit and we can work it through together. We have had a LOT of discussion locally which is one of our challenges...we end up thinking somehow that everyone can hear what we say in a room over weeks.

I think that's my fundamental issue here. We are all subject to the data policies you guys are establishing. But we have essentially no input, because we are not privy to your local decisions. So I am seeing absolutely zero justification... I'm just seeing a spec with massive changes with no real justification. I'm not seeing why this spec is so much better... what problem is this spec meant to solve? Why couldn't the earlier spec be modified in a slight way rather than throwing everything out?

I am fine and happy about so many things in KBase. I love our platform, and I think we're building something awesome. But, when we start digging into these data modeling issues... this is why I get very concerned. Because I just don't see a path from here to there. Not a short path at least... it's a very long path with loads of work... and I'm not seeing why the destination is worth the large amount of work we're going to need to do to get there.


aparkin commented 9 years ago

Matt has put a lot of thought into this and has talked to a lot of people. He has tested out ideas in prototype and done a lot of due diligence. I don't think final decisions have been made. Our process now is designed so we can have these discussions out in the open like we are.

But we ARE dealing with a challenge, Chris. Part of the reorg was done so teams could work without too much kibitzing across the project. This IS a big change and is frictional. Decisions need to get made and the leads are responsible for it.

That said, I know Matt is very sensitive to ensuring a reasonable transition to the new system. Believe me-- he and Shane have reined in some of my more--er ambitious? Nuts? -- ideas so I don't bring us all down. :)

I think though that if we ARE going to make a big transition-- better to do it now than have to retool many many other services and update many many other users' objects later.

samseaver commented 9 years ago

I have to say I realize now that though we've been implicitly discussing how the data would be accessed, we appear to be doing so in light of the current functions with which we can access it. I do cringe at the thought of the number of get_objects() calls I'd have to make if, given a genome, I need to retrieve the complete set of proteins associated with it.

It seems that we really should be discussing, in conjunction with this increased complexity, a rich/deep Data API that would allow us to perform simple calls that traverse the entire structure efficiently (along with caching), much like a stored SQL query. We would not be able to get such complexity to work within the KBase framework if the API itself fails to meet standards.

We will also want to consider, given the current modularity of the structure, restricting the set of objects and sub-objects that we present to a user/developer, the ones they would actually need for individual apps/methods, making caching easier and improving response times.
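
As a rough sketch of the kind of Data API call described above, the following hides the Genome -> Assembly -> annotation traversal behind one method and caches fetched objects. The class name, the object layout, and the in-memory store standing in for the workspace are all assumptions for illustration, not an existing KBase interface.

```python
# Hypothetical sketch: names, object layout, and the dict-based store are assumptions.
class GenomeAPI:
    def __init__(self, store):
        self._store = store   # maps reference -> object; stands in for the workspace
        self._cache = {}      # objects already fetched
        self.fetches = 0      # number of underlying fetches actually performed

    def _get(self, ref):
        """Fetch an object once and serve later requests from the cache."""
        if ref not in self._cache:
            self.fetches += 1
            self._cache[ref] = self._store[ref]
        return self._cache[ref]

    def get_proteins(self, genome_ref):
        """Traverse genome -> assemblies -> annotations and collect all proteins."""
        proteins = []
        genome = self._get(genome_ref)
        for asm_ref in genome["assemblies"]:
            for ann_ref in self._get(asm_ref)["annotations"]:
                proteins.extend(self._get(ann_ref)["proteins"])
        return proteins


store = {
    "genome": {"assemblies": ["asm"]},
    "asm": {"annotations": ["ann"]},
    "ann": {"proteins": ["p1", "p2"]},
}
api = GenomeAPI(store)
```

The caller makes a single get_proteins() call; repeated calls for the same genome are served from the cache rather than triggering further fetches.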

mlhenderson commented 9 years ago

That's a great idea, let's talk about what a Genome API would look like that would work for KBase. The API can help inform the types and vice versa.

I can start an empty markdown document that everyone can PR against with thoughts on the Genome API. Does that seem reasonable?

There were some queries defined here that could be useful in thinking about the API, and feel free to PR against this document as well in general: https://github.com/kbase/nextgen/blob/master/docs/design/data/queries.md


cshenry commented 9 years ago

I agree with this Sam. One way we will make the new spec manageable is through a data API. But it doesn't solve other major problems. Such as, how will the user manage things when their genomes balloon to 5 objects instead of 1? How will you keep all these interconnected objects together? This is hard enough now with genomes and contigsets.

So yes, let's discuss the API, but before we can really commit to this new spec, these other issues must be addressed as well.

samseaver commented 9 years ago

So I came back to this, to re-visit the topic of what we're doing with the domain-agnostic data model for the genome, and also looking at the list of queries Matt posted (which appear comprehensive).

Given that we will be developing an API to enable one to traverse the data model easily, I can't really find anything to comment on right now, at least not without actively testing the API itself, so I'm going to hold off on commenting on this further until we've implemented the spec in a workspace/narrative testing environment within which I can load and explore plant genomes.

fperez commented 9 years ago

I think this should go together with the genome_data_api one in #46, and those should be part of the technical documentation and discussions of the core data machinery. For the same reasons as #45 and #46 I think we should close here.

Once those materials are put into the relevant repo, both this PR and #36 can be referenced, so the discussion and context remains available. This means none of this work and ideas are lost.

But I really think this is too technically specific to be listed in this high-level repo of project-wide policies and practices.

If anyone disagrees we can obviously reopen.