change feature ids to md5 sums of sequences?

gregcaporaso commented 8 years ago

Currently they are the sequences themselves, which is probably a bit longer than they need to be.

wdwvt1 commented 8 years ago

@gregcaporaso - it seems like storing the feature IDs as md5's adds unnecessary overhead (now we have two files rather than just the dataframe or whatever), without any significant benefit. Having the feature IDs be the sequences makes it easy to just pass the table to someone, to derive taxonomy, to cluster, etc. What is the benefit of having the md5sum?

gregcaporaso commented 8 years ago

I'm planning to revert this to go back to just the sequences themselves as the feature ids, discussed this with @wasade yesterday and we agreed that'd be better. That'll happen next week, thanks for the feedback!

Sent from my phone, please excuse typos and brevity.

On Sep 16, 2016 3:39 PM, "Will Van Treuren" notifications@github.com wrote:

@gregcaporaso https://github.com/gregcaporaso - it seems like storing the feature IDs as md5's adds unnecessary overhead (now we have two files rather than just the dataframe or whatever), without any significant benefit. Having the feature IDs be the sequences makes it easy to just pass the table to someone, to derive taxonomy, to cluster, etc. What is the benefit of having the md5sum?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/benjjneb/q2-dada2/issues/4#issuecomment-247724910, or mute the thread https://github.com/notifications/unsubscribe-auth/AALvdB96Sq2ZOf7mnpUYckZksdjo-sAYks5qqxqbgaJpZM4J9Ldu .

gregcaporaso commented 8 years ago

This change has been reverted, and the feature ids are sequences again.

We're going to stick with this for now.

gregcaporaso commented 8 years ago

Flip-flopping again on this. Here's my thinking (paraphrased from an email I sent on this topic recently):

I also like the sequences being the feature ids as it's really convenient. However, I do really think it's the wrong decision. One thing to note is that this decision about what the feature ids are is made on a per-plugin basis, it's not something that we could (or would want to) enforce across the QIIME 2 plugin ecosystem. For example, feature IDs from an OTU picker that performed clustering, or which defined features from metabolome data, would make a different choice. Given that, what I think we need is an easily accessible mapping from the feature id to the rich information about the feature (sequence/etc, depends, based on the data type) - we don't want to embed information in the feature ids themselves (which is what we're doing when we make them the sequences) because it doesn't work well for the different types of data that we're trying to support.

If visualization developers need to be able to handle short identifiers (e.g., <10 characters long, such as integers) or up to hundreds or even a thousand characters long (sequences now, or in a year) that's a big burden for them. And not everyone is going to handle it well - we basically have the choice of doing it once in the dada2 plugin, or in every visualization that uses feature ids.

What I want to do is make it really easy to get the mapping from feature id to rich feature data (sequence/etc). This could then be used well in visualizations, but also accessible for the user who wants to just blast their sequence. It also scales to other types of data, because we could start thinking about that richer information mapping to database entry URLs, automated BLAST searches (click the ID and it BLASTs the sequence against NCBI for you), etc.

mortonjt commented 8 years ago

I think the md5sum is a reasonable short-term solution. It's extremely inconvenient dealing with full length sequences when interpreting OTU statistics (i.e. correlation, differential abundance, ...).

If there is push for keeping the sequence, I think it is ok to have a feature_mapping data structure to store this sort of information.

gregcaporaso commented 8 years ago

Also, from another email conversation on this topic early today (since some people have brought up the possibility of md5 collisions).

Collisions (different sequences mapping to the same md5) are extremely unlikely - the space of md5 is actually three bits bigger than the space of UUIDs, so collisions are less likely than in UUID space, and we're relying on UUIDs being unique in various places. (In fact, the probability of a collision of two md5s is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456 - sorry, I couldn't resist. Our case is actually more similar to the second one brought up in that SO post, so still extraordinarily unlikely).

gregcaporaso commented 8 years ago

I think it is ok to have a feature_mapping data structure to store this sort of information.

Yep - we actually need that mapping regardless of whether we use sequences as feature ids (since that decision is made on a per plugin basis), and I'd add a new visualizer that makes it really easy to access that info (which I'm also going to do regardless).

jairideout commented 8 years ago

it seems like storing the feature IDs as md5's adds unnecessary overhead (now we have two files rather than just the dataframe or whatever),

It's actually the opposite case here because QIIME 2 stores representative sequences in a separate artifact from the feature table. Thus, you'd have sequences (i.e. very long IDs) stored in two files: the feature table artifact and feature metadata artifact. By using short IDs (e.g. md5), the representative sequences are only stored once, in a single feature metadata artifact. Concretely, the feature ID -> representative sequence mapping would be stored in a FeatureData[Sequence] artifact, which already exists in q2-types. It is trivial (~20 mins of work) to add a visualizer that accepts a FeatureData[Sequence] artifact and produces a nice HTML page of that mapping (even with clickable links to generate BLAST results on-the-fly!). This visualizer would exist in the q2-feature-table plugin.

I fully support using MD5 sums as IDs, to avoid encoding information in identifiers and to have a short, predicable length. This is the whole point of the cual-id paper, the only difference being that they are intended for sample IDs. I don't think a different axis (features) should be special-cased.

We are already assuming that UUIDs (version 4) will not collide in two places: QIIME 2 artifacts and cual-id. Thus, using MD5s for feature IDs shouldn't be a problem (any more than the others are problematic), especially because the chance of a collision is even smaller.

It would be great to add an option to the dada2 plugin (off by default) that stores rep seqs as the feature IDs, that way we can support both types of IDs.

gregcaporaso commented 8 years ago

It would be great to add an option to the dada2 plugin (off by default) that stores rep seqs as the feature IDs, that way we can support both types of IDs.

I'm good with this solution, as I think it gets the best of both worlds. I'll add this.

wdwvt1 commented 8 years ago

my response in an email to @gregcaporaso Hi Greg,

Given that, what I think we need is an easily accessible mapping from the feature id to the rich information about the feature (sequence/etc, depends, based on the data type) - we don't want to embed information in the feature ids themselves because it doesn't scale to the different types of data that we're trying to support (which is what we're doing when we make them the sequences).

I don't really understand this. If we don't want to embed feature information into the feature IDs, then why should any visualization/tool ever put a feature ID in a plot? It just seems like adding an MD5 is creating a unique mapping from an already unique mapping so that a meaningless but unique piece of information can be displayed on a plot. As an example: lets say we go the MD5 approach and ANCOM now has a bunch of MD5's plotted. I will then go look up that MD5 in my MD5-taxonomy table. All we've done is make it so we are displaying a shorter length, but meaningless piece of information. Wouldn't it be just as easy to have ANCOM (or whomever else) plot a number and let me look up that number in the associated table I generated it with? I can see situations where it would be easier to have the MD5 sum plotted so that you could always say which feature a plot was referring to (no necessity of knowing the table it came from). I can't see this outweighing the activation energy concern, but I definitely understand how you can come to a different conclusion to this and support that.

What I want to do is make it really easy to get the mapping from feature id to rich feature data (sequence/etc).

That makes a lot of sense and will probably alleviate most of my concerns. I still don't think it will be easy for most amateur developers or for people who get out of the QIIME ecosystem, but I'm probably overly concerned there.

I definitely understand your points and I am sure MD5's will be okay.

gregcaporaso commented 8 years ago

I think we have a solution here that will make everyone happy (basically make this an option that the user has), so just going to hit a couple of points from this.

All we've done is make it so we are displaying a shorter length, but meaningless piece of information. Wouldn't it be just as easy to have ANCOM (or whomever else) plot a number and let me look up that number in the associated table I generated it with?

Simply using numbers (as with de novo OTU picking in QIIME 1) doesn't work, as we want feature ids to be comparable across different tables.

If someone who has done a lot of development is saying this...

I clarified with @wdwvt1 by email that the developer in question hasn't actually written any code that uses QIIME 2, or approached the dev team for help as far as I know, so I think this is an unfair comment. We do care a lot about making it easy to develop for QIIME 2. Creating the types and formats is currently fairly challenging, but we're working on improving that, and once we have the full pipeline flushed out, it won't be necessary to create those very often.

gregcaporaso commented 8 years ago

If someone who has done a lot of development is saying this...

This is a bit off the main topic, but since it came up here I also want to note that we're extremely interested in getting feedback on how to simplify development for QIIME 2. We are really hoping to build a framework that will make microbiome bioinformatics better by making it easier to integrate different tools in a reliable, reproducible, and user-friendly way (e.g, I ran DADA2 through a GUI this week!). So if anyone ever has suggestions/questions/comments/concerns, please get in touch by Slack, issue trackers, or email (though public forums are strongly preferred). It may be rough around the edges right now, but we're working hard on it, and that's why we're describing ourselves as being in alpha release stage.

qiime2 / q2-dada2

change feature ids to md5 sums of sequences? #4