kbase / project_guides

This repo contains documents and guides that describe project principles, how-to docs, etc.

Initial Rationale for Genome object #44

Closed samseaver closed 9 years ago

samseaver commented 9 years ago

I agree to an extent with what you are saying. My approach has been data-agnostic rather than domain-agnostic, though adopting the former will not preclude the latter. In other words, I'm hoping that any structural and functional features provided by a user, generated from whatever source, can be loaded into a single Genome object.

I decided then that we could use a universal feature which can represent any structural or functional feature of a genome. We can then use a well-defined ontology to enable a developer to retrieve precisely the feature types they need for their methods and apps. We can maintain an external lookup of this ontology that links different feature classes and types with specific methods and apps, so as to avoid any confusion that arises from attempting to use the wrong set of features.
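A minimal sketch of the universal-feature idea, for illustration only. The `Feature` fields, the `APP_FEATURE_TYPES` lookup, and the app names are all hypothetical and are not taken from the actual Genome spec; they only show how an ontology term on each feature plus an external lookup could select the right features for a given app.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """A universal feature: any structural or functional region of a genome."""
    feature_id: str
    feature_type: str  # ontology term, e.g. "CDS", "mRNA", "promoter"
    location: tuple    # (contig_id, start, end, strand); hypothetical layout
    attributes: dict = field(default_factory=dict)


# Hypothetical external lookup linking each app to the feature types it accepts.
APP_FEATURE_TYPES = {
    "protein_alignment": {"CDS"},
    "expression_analysis": {"mRNA", "CDS"},
}


def features_for_app(features, app_name):
    """Return only the features whose ontology type the given app can use."""
    allowed = APP_FEATURE_TYPES[app_name]
    return [f for f in features if f.feature_type in allowed]


genome_features = [
    Feature("f1", "CDS", ("contig1", 100, 400, "+")),
    Feature("f2", "promoter", ("contig1", 40, 99, "+")),
    Feature("f3", "mRNA", ("contig1", 90, 450, "+")),
]

selected = features_for_app(genome_features, "expression_analysis")
print([f.feature_id for f in selected])  # ['f1', 'f3']
```

Keeping the lookup outside the Genome object means new apps or feature types can be registered without touching stored genomes.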

In terms of modeling biology, for example where you are describing the murkiness of alternative splicing along with introns and exons, we can keep an equally flexible list of feature_relationships. Indeed, to model biology means to circumscribe both entities and their relationships. But ultimately, no matter how closely you want to represent biology, you will always get user data that is essentially a subset of the biology you're modeling, and I believe you'd want to use the Genome object flexibly so.

aparkin commented 9 years ago

In many ways we ARE in agreement.. but my point did have a deeper issue.. that the genome object is, to me, abstract-- it is not a particular instantiation of annotations. We can know about genomes without knowing their sequences, and we can know about sequence without having asserted an annotation, so I was advocating for separation of these concepts. Once we GET to annotation, then I agree with the idea of generic features-- and that all these features are "sequence encoding something" -- that is, they are not proteins, or mRNAs, but "pre-mRNA coding sequence", "Protein coding sequence", "Riboswitch coding sequence". Then actual objects like mRNAs can point to features or feature sets for their origin biologically.

mlhenderson commented 9 years ago

We need to think about breaking up the genome object, not only because of Adam's reasoning, but because there are size issues for storing and accessing the data. I also want to move away from large lists of things, so representing the features as a list is problematic for being able to look up and extract specific features. Additionally, as we've talked about before, we need to maintain references back to any data uploaded by a user. There is no explicit information about the assembly associated with the genome here other than a fasta reference. Is that meant to be the assembly?

aparkin commented 9 years ago

There are also some complexities about features that Jason was talking about yesterday and that I've struggled with. For example... some features are really ordered lists of other features (e.g. an operon, which is some set of promoters, regulatory sites and genes) and others are just really hard to express (the shufflon example above). Obviously, the perfect is the enemy of the good, but it is something to consider when defining/partitioning feature types.
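One way to picture a feature that is "an ordered list of other features" is sketched below. The `parts` field and the `expand` helper are hypothetical names introduced for illustration; the sketch covers only the operon-style case and makes no attempt at the harder examples.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    feature_id: str
    feature_type: str
    # Ordered IDs of the features this one is composed of (empty for simple features).
    parts: list = field(default_factory=list)


def expand(feature_id, by_id):
    """Flatten a composite feature into its simple features, preserving order."""
    feature = by_id[feature_id]
    if not feature.parts:
        return [feature_id]
    flat = []
    for part_id in feature.parts:
        flat.extend(expand(part_id, by_id))
    return flat


features = [
    Feature("p1", "promoter"),
    Feature("r1", "regulatory_site"),
    Feature("g1", "gene"),
    Feature("g2", "gene"),
    Feature("op1", "operon", parts=["p1", "r1", "g1", "g2"]),
]
by_id = {f.feature_id: f for f in features}

print(expand("op1", by_id))  # ['p1', 'r1', 'g1', 'g2']
```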

samseaver commented 9 years ago

So, first of all, in response to Matt, I was taking a Eukaryotic approach, so I took out everything except for what may be used by Plants, initially anyway. Given that we're talking about a domain-agnostic approach, I decided to change the fasta/reads refs to a list of data refs, for any external data file used to compose the genome. This can be extended or proliferated to link to the variety of data types.

Secondly, with regard to breaking things down, I'm starting to understand what Adam meant, but I'm also starting to realize that calling this a Genome in the first place perhaps wasn't the best choice. Perhaps what we, as in the plants team, really need moving forward is simply a collection of structural and functional annotations (I don't yet see how the abstraction of the Genome in the spec would be of practical use). But the question, given that Matt says we need to break things down, is: do we create a separate object for the domain-agnostic Genome spec, or simply rename this one?

I understand that the large number of features for a plant genome may make a workspace object unwieldy, particularly if you were to populate it with every feature type at once, but the alternative is to create separate objects for each feature type, perhaps called FeatureSets, to be retrieved independently. Is this what you're thinking of?

samseaver commented 9 years ago

Following on from Matt's comment about breaking down the Genome, I realized I could emulate the use of some of the old "Set" objects in the previous Genome spec, but making them generic, so I introduced a generic FeatureSet that could be any set of structural and functional features.

mlhenderson commented 9 years ago

This is going down an old path that does not solve any of the issues. We can't squeeze everything into a single object, and we can't store large lists of things that have to be traversed completely to extract one element. We'll (data team) propose an alternative structure and then maybe we can resolve potential conflicts.

samseaver commented 9 years ago

But my point of allowing us to store separate feature sets means I can make one call for a single set of Protein or cDNA features. If, given a protein sequence, we want to retrieve its original gene attributes, then yes, we'd have to perform traversal, but the only way to avoid that is to ensure that every feature is essentially a conglomerate of all features that can originate from a single region of DNA, so if you were to retrieve a feature object that contains a protein sequence, the same feature object would also contain the original gene attributes, and no traversal would be needed. I can see that being a possibility, but if a user were to concentrate on a set of protein sequences, then they'd be retrieving a lot of data they won't be using.

We could implement both perspectives at the same time, thereby allowing for a large number of different feature sets, so users would be able to retrieve a single set of sequences, and also allowing for a single set of conglomerate features, so that users can easily traverse the structural and functional associations. The Genome object itself would then simply hold a list of references.
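A rough sketch of a Genome that only holds references, assuming a plain dictionary as a stand-in for the object store. The reference strings, field names, and helper functions are hypothetical; they are not the workspace API.

```python
# In-memory stand-in for an object store; the keys play the role of object references.
STORE = {
    "fs/protein": [
        {"id": "gene1.p", "sequence": "MKT"},
        {"id": "gene2.p", "sequence": "MAL"},
    ],
    "fs/cdna": [{"id": "gene1.c", "sequence": "ATGAAAACC"}],
    # A conglomerate feature bundles everything that originates from one DNA region.
    "cf/gene1": {
        "gene": {"id": "gene1", "function": "kinase"},
        "protein": {"id": "gene1.p", "sequence": "MKT"},
    },
}

# The Genome itself holds only references, never the feature data.
genome = {
    "id": "genome_A",
    "feature_set_refs": {"protein": "fs/protein", "cdna": "fs/cdna"},
    "conglomerate_refs": {"gene1": "cf/gene1"},
}


def get_feature_set(genome, feature_type):
    """One call retrieves a single set, e.g. all protein features."""
    return STORE[genome["feature_set_refs"][feature_type]]


def get_gene_attributes(genome, gene_id):
    """The conglomerate gives gene attributes without traversing feature sets."""
    return STORE[genome["conglomerate_refs"][gene_id]]["gene"]


print(len(get_feature_set(genome, "protein")))
print(get_gene_attributes(genome, "gene1")["function"])
```

The cost of supporting both views is duplication: the protein for `gene1` appears in the protein set and again in the conglomerate, so the two would need to be kept in sync.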

aparkin commented 9 years ago

I can't say I'm following either argument about sets completely-- so I apologize if I am out of left field.

I think we have to separate the very biological model in which we somehow allow a genome to point to the replicating molecules that compose its heritable materials, and for these molecules to have sequences and annotations; from the very practical need to be able to manipulate compendia of objects.

In the former we want to be able to very quickly, for example, find all promoter annotations and their stops and starts in the genome of taxon X, or all coding sequence annotations in the tryptophan biosynthesis pathway for that taxon. We then want to be able to get all sequences associated with these annotations, and for the CDSs we want to form a set of all translations to get a protein sequence set. This last thing (the protein set) is a generic type-- not really part of the biological data model, since these sets can come from anywhere. Whereas the set of annotations we are querying is a very particular thing-- a model of functional regions within taxon X's genome. Since the queries that precede the formation of the generic object (the protein set or promoter sequence set that we want to perform analysis on) are likely to be very common asks by users to extract information about biology -- these queries have to be very efficient.

The sets of objects we have for analysis (the arbitrary protein set which happens to contain tryptophan biosynthesis genes from Taxon X or promoter sequences from Taxon X) are things we are not likely to have as part of general query/search (though they might be part of specialized searches) -- they are organizations of biological data that have meaning to the user but not necessarily to anyone else. For example, the protein set could just as easily be the set of all proteins in taxa X, Y, Z that have more than two prolines in a row. These more generic objects are useful because many algorithms don't care about the biological model but are fairly generic analyses of sequence sets (e.g. hydropathy calculators, secondary structure predictors), or feature (meaning numbers) sets/compendia (e.g. clustering).

So my point is that there should be a difference between how we think about storing and querying things that are specifically about the structured representation of living systems in contact with their environments and the sets of things we compose to make it easier to calculate things.

A

dangunter commented 9 years ago

I think Adam is saying, among other things, that we should model the new Genome relationally, i.e. break it down into logical atomic pieces and model explicitly how those are composed into larger units. Matt and Jason have a relational schema already (that's what Matt's talking about in his last message). I would advocate that discussion/revisions occur at this level first. These models shouldn't contain all the details of the actual measurements; they should really be restricted to the common fields that users can search on. At some point of course all the measurements need to get represented, but including them too early will slow down the main discussion. I am selfishly interested in this because any change of datastore technology will be much easier if we have clean models.
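To make the relational idea concrete, here is a small sketch using Python's built-in `sqlite3` as a stand-in for whatever datastore is eventually chosen. The table and column names are hypothetical and are not Matt and Jason's schema; the tables carry only searchable common fields, and the query is the "all promoter annotations in taxon X" example from Adam's comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical atomic pieces: genome -> contig -> annotation.
conn.executescript("""
CREATE TABLE genome (genome_id TEXT PRIMARY KEY, taxon TEXT);
CREATE TABLE contig (
    contig_id TEXT PRIMARY KEY,
    genome_id TEXT REFERENCES genome(genome_id)
);
CREATE TABLE annotation (
    annotation_id TEXT PRIMARY KEY,
    contig_id TEXT REFERENCES contig(contig_id),
    feature_type TEXT,
    start_pos INTEGER,
    end_pos INTEGER
);
""")
conn.execute("INSERT INTO genome VALUES ('g1', 'taxon X')")
conn.execute("INSERT INTO contig VALUES ('c1', 'g1')")
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
    [
        ("a1", "c1", "promoter", 40, 99),
        ("a2", "c1", "CDS", 100, 400),
        ("a3", "c1", "promoter", 500, 560),
    ],
)


def promoters_for_taxon(conn, taxon):
    """Return (annotation_id, start, end) for every promoter in a taxon's genome."""
    rows = conn.execute(
        """
        SELECT a.annotation_id, a.start_pos, a.end_pos
        FROM annotation AS a
        JOIN contig AS c ON a.contig_id = c.contig_id
        JOIN genome AS g ON c.genome_id = g.genome_id
        WHERE g.taxon = ? AND a.feature_type = 'promoter'
        ORDER BY a.start_pos
        """,
        (taxon,),
    )
    return rows.fetchall()


print(promoters_for_taxon(conn, "taxon X"))
```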

dangunter commented 9 years ago

More specifically, I suggest we use the types and basic model (not all the bells & whistles) from the Hive DDL. This is basically straight SQL plus some struct/union types. The nice thing is that it's compatible with both Hive & Spark (see this paper).

samseaver commented 9 years ago

Dan, I'm unsure if your comments here were directed at this PR or the one Jason submitted yesterday?

mlhenderson commented 9 years ago

@samseaver They are separate PR requests, but we're all talking about the same thing in terms of getting a genome definition that would support Euks. I also want to be very clear that I'm putting forward with Jason a different way of thinking about how the genome would be structured ultimately in order to address problems we have had and to enable a lot of extra functionality in the system in the next 18 months, including for eukaryotic organisms. I want to iterate on this process with you and others, so as we go along please make suggestions and propose improvements. We can also schedule calls/meetings to help the discussion.

cshenry commented 9 years ago

1.) If you don't like traversing lists, then please just add a dictionary to the object with object IDs pointing to indices in the lists. Lists are useful because they capture order, they're efficient, and they ultimately support subselection. Throwing out lists with numerous major advantages because of one easily surmounted disadvantage seems like an error to me.

2.) I'd rather you remove excessive data items from the feature object and just leave the function and ID intact rather than remove the features from the genome entirely. In Jason's new design, the genome object is essentially worthless. It has almost no usable data in it. You need to fetch another object to even get the taxonomy. You have to fetch about 5-6 objects to get the features at all... because they're buried 5 layers deep. Why do this? The taxonomy isn't even a big object. It's not like it's making the genome too big.
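The dictionary suggested in point 1 can be sketched as follows. The `feature_index` field name, the feature IDs, and the helper functions are hypothetical, chosen only to show the idea of an ID-to-index map sitting alongside an ordered list.

```python
def build_index(features):
    """Map each feature ID to its position in the ordered feature list."""
    return {feature["id"]: i for i, feature in enumerate(features)}


genome = {
    "features": [
        {"id": "peg.1", "function": "DNA polymerase"},
        {"id": "peg.2", "function": "tryptophan synthase"},
        {"id": "peg.3", "function": "hypothetical protein"},
    ]
}
# Stored alongside the list, so a lookup by ID never scans the whole list.
genome["feature_index"] = build_index(genome["features"])


def get_feature(genome, feature_id):
    """Fetch one feature through the index instead of traversing the list."""
    return genome["features"][genome["feature_index"][feature_id]]


print(get_feature(genome, "peg.2")["function"])  # tryptophan synthase
```

The list still carries order and supports subselection; the index has to be rebuilt whenever features are inserted, removed, or reordered.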

We have an object-oriented database that our users interact with on the level of high-level objects. As such we really do need a small number of objects that correspond with major biological entities. If you break the genome down into 20-30 smaller objects, it's going to be completely unwieldy to work with.

Imagine a narrative with 100 genomes now where each genome is 1 object. Now imagine a narrative with 100 genomes in your design, when 1 genome is 10-20 objects. It's unwieldy.

How do you copy a genome and make sure you have every part of it? How do you edit an annotation?

We have a system now that users understand and want to use. We are presenting workshops that go quite well. The apps are reviewing well. I'm really concerned that our entire data model is being thrown out and redesigned by people who don't even attend those workshops. Please make sure you're getting exposure to our current active users and how they use the system before making major decisions like this.

One last comment. If you simply design a SOLR schema and index your typed objects in SOLR, you could access the data in ANY way you want. You don't need to completely redesign the typed objects to support new access patterns. Then it doesn't really matter how big your objects are. That's just your underlying data store. If we're talking about the level of re-design that you're proposing, then I would much rather go down this road than endlessly obsess over typed object designs in the hope of finding the one design that supports every access pattern imaginable.

In reply to Matt Henderson's comment of Jun 3, 2015, 12:26 PM.

aparkin commented 9 years ago

Chris, I suspect you and I will have to talk personally about why I think breaking up the genome object is necessary. I don't think this has to cause confusion to users if we get the proper data API in place. They can ask for Genome Sequence or Genome Proteins and use them in the appropriate algorithms. There ARE things that might be unfamiliar to users who are not thinking about biology the way I am aiming to make them think about it. But I think we can make a soft transition.

For the complexity of handling genome associated data, there is some thought that needs to be put into this. Whether you are parsing many subpieces of a single object or traversing a tree of references, there are challenges and efficiencies/inefficiencies.

As for the charge that we/I am not attending workshops-- I can tell you I am a working biologist and computational biologist and am pretty well informed about how others and I use data. I understand that there is change proposed here- in part to support things well beyond what we are currently workshopping- but this is the time to mull it over. We have suffered from just doing and seeing what happens too much. We need a better prototyping environment yes. We need to be able to test different ways of doing things. But all this thought is for a real purpose.

cshenry commented 9 years ago

This is a massive scale change on the core object of the entire system. Not even a little change. A massive change. We have something like 200K genomes in the system. Maybe more. I just don't see how there could ever be a "soft transition".

I remember the transition from the old workspace to the new workspace, and that was bad enough. This makes that look like child's play.

That is what is driving my previous comment. I really do think the transition is going to be an utter nightmare... and I don't even want to contemplate it. So it's somewhat frustrating when my objections to the design meet a stone wall of "this is just the way it needs to be".

I'm not charging that anyone isn't a working computational biologist. I'm concerned if people are proposing these kinds of changes... that they may not be familiar with how people use the system right now. And thus far, it's been Jason and Matt on this thread, not you... so if you're tied up in this design, then I withdraw the comment. Fair enough. I know you use the system a ton.

In reply to Adam Arkin's comment of Jun 4, 2015, 12:39 PM.

aparkin commented 9 years ago

Yeah.. I'm involved. Doesn't mean I'm right. But I am thinking hard about all the different classes of use I am trying to put the data through. I completely understand the challenges you are posing here and in the other thread (I've grown confused)...

But it is true that I am proposing something that breaks a little bit the document store design and yet I don't want to go to full relational. I think the struggle here is between what is efficient for core valuable workflow and what will be important for knowledge representation; inference comparison; and so on. Also, I am exercised about what I see as intellectual conflation of concepts and measurement. :) I tend to get on a purist hobby horse with theory and have to be driven back a little (a little).

I do think it would be great to get in a room with you and go over some of my thought process sometime soon.

mlhenderson commented 9 years ago

Chris, so your assertion is that I don't know how people use data in the system and therefore I should not have anything to do with the data types?

cshenry commented 9 years ago

As far as I'm aware, your involvement in teaching and working with others who are using the system has been limited. I don't think this precludes you from leading work on data types, but one needs complete information to make these decisions that are at the core of the entire system. And data is at the core of the entire system. You realize you are literally leading what is, in my view, the most important part of the entire system?

Data sits at the core of all tools, all user experience, all sharing, all social interactions, and all interface with real biology. If you change data, you impact all those things. I don't think this is true of anything else in the system.

You are making very bold changes in a system that works right now in a certain way (and works pretty well). These changes will "break" the way the system works right now. Thus, these changes will call for changes pretty much everywhere else.

The UI will need to change. The data store will need to change. And essentially all tools will need to change.

I'm uncertain if this is clear to you. If it is, then I apologize for my presumption. I know that personally... knowing these facts that I just stated above... I would never dream of making the changes to the genome object that you have proposed.

So if you know these things... and you're proposing these things... then dude... you've got balls the size of a house... and I don't know how you manage to walk upright.

In reply to Matt Henderson's comment of Jun 4, 2015, 4:23 PM.

aparkin commented 9 years ago

I'm not sure I've seen him walk upright for the last little while. :)

But I need to take some blame since I've been an instigator of a lot of this. I agree with everything you just said. This is HUGE. It is huge for the system now and in the future. It is an incredibly heavy lift and we ARE going to need to pull together to get it done.

cshenry commented 9 years ago

Well fine. I apologize for reacting so negatively. But the implications of this are pretty overwhelming... and there needs to be a whole game plan around this that I'm not seeing (e.g. changes to workspace, UI, widgets, services). In isolation, this looks insane because it doesn't fit with the current system. We really need to link these genome documents with a document where we build that broader plan...

In reply to Adam Arkin's comment of Jun 4, 2015, 4:50 PM.

scanon commented 9 years ago

Sorry I am jumping into this thread late. I just wanted to stress that we are worried about how we carry out this transition. We have some ideas, but we are far from a complete plan. As production lead, I certainly don't intend to burn down the house just so we can rebuild.

samseaver commented 9 years ago

There are now two separate PRs essentially covering the same topic, so I'm going to close this one. For anyone who's late to the party, this discussion will continue here:

https://github.com/kbase/project_guides/pull/46