qiita-spots / qiita

Qiita - A multi-omics databasing effort
http://qiita.microbio.me
BSD 3-Clause "New" or "Revised" License
120 stars 80 forks source link

suggested rarefied level #650

Closed antgonza closed 8 years ago

antgonza commented 9 years ago

From @clozupone. Rewriting: add a way to know the sequences per sample in a meta-analysis before defining the rarefaction level.

squirrelo commented 9 years ago

This has come up a bunch, and the idea is to have an on-the-fly histogram of sequences per sample created on the page where you select rarefaction level. Not sure how computationally intensive this will be though since that's potentially a lot of biom tables to go through.

antgonza commented 9 years ago

The most expensive step is to create the stats of the biom file per study, if we can store this at time of creation then a histogram should be pretty fast.

wasade commented 9 years ago

It isn't a violation of the biom file format to add in hdf5 attributes. on the other hand, we can do this specific computation much faster than the summarize method as we only need to look at indptr, not the full matrix. should actually be very quick to compute

On Tue, Dec 2, 2014 at 3:12 PM, Antonio Gonzalez notifications@github.com wrote:

The most expensive step is to create the stats of the biom file per study, if we can store this at time of creation then a histogram should be pretty fast.

— Reply to this email directly or view it on GitHub https://github.com/biocore/qiita/issues/650#issuecomment-65314886.

wasade commented 9 years ago

From chatting with @antgonza and @josenavas, there is another subtly to this which is that in a meta analysis, the BIOM table doesn't yet exist. Regardless of if we want to present sequence counts to the user prior to or following table creation, I think we can do something reasonably fast by just interrogating the compressed axis directly:

from h5py import File

def get_sample_stats(open_hdf5_file, samples=None):
    if samples is None:
        samples = open_hdf5_file['sample_ids']

    samples = set(samples)

    indptr = open_hdf5_file['sample/indptr']
    data = open_hdf5_file['sample/data']

    for idx, id_ in enumerate(open_hdf5_file['sample/ids']):
        if id_ in samples:
            start = indptr[idx]
            end = indptr[idx + 1]

            yield (id_, data[start:end].sum())

Asking how many unique OTUs are observed per sample is even lighter. While in general it is nice to have the BIOM table in memory, it does represent a fair bit of overhead to assess basic summaries. The overhead will be significant if the number of tables that need to be interrogated is large

gregcaporaso commented 9 years ago

What if you store the number of sequences per sample in the database, and then when the user selects their samples you could display the histogram of number of sequences per sample? To assist the user in making their choice, you could annotate (e.g.) the 10th, 20th, 25th, 50th, 75th, 80th, 90th percentiles in that distribution and note what fraction of their samples they'd discard with each level.

rob-knight commented 9 years ago

Great idea!

On Mar 11, 2015, at 7:23 PM, Greg Caporaso notifications@github.com<mailto:notifications@github.com> wrote:

What if you store the number of sequences per sample in the database, and then when the user selects their samples you could display the histogram of number of sequences per sample? To assist the user in making their choice, you could annotate (e.g.) the 10th, 20th, 25th, 50th, 75th, 80th, 90th percentiles in that distribution and note what fraction of their samples they'd discard with each level.

— Reply to this email directly or view it on GitHubhttps://github.com/biocore/qiita/issues/650#issuecomment-78412691.

wasade commented 9 years ago

Just an extension, I like the thought of storing within the BIOM table similar to how we store stats in the demux files. Since we're pending a minor release for biom, we could actually just get this lumped in as a new option to the table summarizer (--save-stats). Fetching the stats would not require a parse of the table, just opening the h5py file. However, storing in the db is light as well, but it puts the stats further away from the actual data they represent.

On Wed, Mar 11, 2015 at 8:27 PM, Rob Knight notifications@github.com wrote:

Great idea!

On Mar 11, 2015, at 7:23 PM, Greg Caporaso <notifications@github.com mailto:notifications@github.com> wrote:

What if you store the number of sequences per sample in the database, and then when the user selects their samples you could display the histogram of number of sequences per sample? To assist the user in making their choice, you could annotate (e.g.) the 10th, 20th, 25th, 50th, 75th, 80th, 90th percentiles in that distribution and note what fraction of their samples they'd discard with each level.

— Reply to this email directly or view it on GitHub< https://github.com/biocore/qiita/issues/650#issuecomment-78412691>.

— Reply to this email directly or view it on GitHub https://github.com/biocore/qiita/issues/650#issuecomment-78413026.

antgonza commented 8 years ago

I have been thinking about this and I can see the benefits of the described approaches. However, thinking pragmatically about this, the histograms need to be good but not prefect. With this in mind, what about adding a step in the processing step (otu picking) so a default histogram is generated and we add this information to the process data table as an array or ints? The default histogram can be from 0->50K (we can discuss this or we can create a histogram of all the current biom tables so we get a better range) with 100 or 1K points. With this approach we can quickly display the histogram of one biom but more importantly can show histograms of each process data and the combination of all for metaanalyses. Finally, this solution is really easy to implement and has minimal DB changes. What do you think?

wasade commented 8 years ago

This should break down to:

Table = load_table(foo) Res = hist(table.sum(axis='sample'))

Recommend storing the bar positions and height instead of a PDF so that the visual can be changed in UI if needed, but path of least resistance is just generate the image for display On Oct 23, 2015 07:59, "Antonio Gonzalez" notifications@github.com wrote:

I have been thinking about this and I can see the benefits of the described approaches. However, thinking pragmatically about this, the histograms need to be good but not prefect. With this in mind, what about adding a step in the processing step (otu picking) so a default histogram is generated and we add this information to the process data table as an array or ints? The default histogram can be from 0->50K (we can discuss this or we can create a histogram of all the current biom tables so we get a better range) with 100 or 1K points. With this approach we can quickly display the histogram of one biom but more importantly can show histograms of each process data and the combination of all for metaanalyses. Finally, this solution is really easy to implement and has minimal DB changes. What do you think?

— Reply to this email directly or view it on GitHub https://github.com/biocore/qiita/issues/650#issuecomment-150598761.

antgonza commented 8 years ago

Yes, simple addition. We could store positions and height with the default 10 values and on merge we simply display the combination. Note that this might not work well when you start filtering down samples by metadata ...

gregcaporaso commented 8 years ago

I've got code for this (and for making the plots interactive) - code here https://github.com/gregcaporaso/q2d2/blob/master/q2d2/__init__.py#L265, example image attached. Feel free to use - just throw an attribution in as a comment.

[image: Inline image 1]

On Fri, Oct 23, 2015 at 8:09 AM, Antonio Gonzalez notifications@github.com wrote:

Yes, simple addition. We could store positions and height with the default 10 values and on merge we simply display the combination. Note that this might not work well when you start filtering down samples by metadata ...

— Reply to this email directly or view it on GitHub https://github.com/biocore/qiita/issues/650#issuecomment-150603489.

clozupone commented 8 years ago

Sounds great!

On Oct 23, 2015, at 9:09 AM, Antonio Gonzalez notifications@github.com<mailto:notifications@github.com> wrote:

Yes, simple addition. We could store positions and height with the default 10 values and on merge we simply display the combination. Note that this might not work well when you start filtering down samples by metadata ...

— Reply to this email directly or view it on GitHubhttps://github.com/biocore/qiita/issues/650#issuecomment-150603489.

antgonza commented 8 years ago

Solved in https://github.com/biocore/qiita/pull/1738.

Now, every artifact has it's own summary. For example a biom has:

screen shot 2016-04-03 at 8 47 49 am

Which should help you decide your rarefaction level.