Prevalence differences in experiments

brendanwee commented 2 years ago

Hi Tyler,

I have been running a couple experiments in preparation to run a large 2000+ sample through CoPTR.

The samples are all human gut metagenomic sequencing samples, with an average read depth of about 20M. For all of these experiments, I used the default parameters and the IGGdb High-Quality Gut Genomes database provided in the documentation.

First I ran 75 samples and predicted PTRs for each sample separately This resulted in a sparse result, where most species detected were only found in 5-10% of the samples

75_prevalence

Next, I took 10 samples that had the most results from the previous experiment, and I predicted PTRs for all of them together This resulted in a much less sparse result, where most species were detected in at least 50% of the samples. This is the type of distribution I could work with with my large run!

10_prevalence

Finally, I ran all 2337 samples. I aligned the reads and computed coverage maps for each sample separately. Then, I copied over all the coverage maps, and predicted all of the PTRs together. I also gave the --plot flag when I ran the final estimate command. Unfortunately, I got a prevalence distribution similar to the first experiment where the samples had been estimated separately. Most species detected are found in less than 10% of the samples.

2337_prevalence

Can you think of a reason why the prevalence distribution in my large experiment turned out so different from the 10 sample experiment?

tyjo commented 2 years ago

I suspect that the differences you see here are due to differences in sample size, and running all samples separately or together is secondary. The run from 75 samples is a better estimator for the run using 2337 samples. Even in the same individual, the composition of species in the same changes over time (see Figure 2 in this paper for example), and the number of shared species between two individuals is low. This suggests to me that the prevalence curve you have observed reflects real biological phenomena.

Running samples together allows you to estimate PTRs from draft assemblies (species need to be observed in at least 5 samples), and improves estimates for reference genomes. But in principle, whether a species has a draft assembly or complete genome should not affect the prevalence curve (ignoring that more common species could be more likely to have complete genomes).

There are a couple ways you can sanity check:

You could estimate relative abundances using a different method (e.g. https://huttenhower.sph.harvard.edu/metaphlan2/), and compute a prevalence curve from relatives abundances.
You could also check if you as missing any PTRs in the run of 2337 samples compared to the run with 10 samples.

brendanwee commented 2 years ago

Thanks for your response!

Since 2. is a quick check I went ahead and did it. In the run with only ten samples, we find 10 species that do not show up in the complete run with 2337 samples. Each of these samples are detected in at least 5 / 10 samples. These are the same coverage maps used for each experiment.

Can you think of a reason these species would not have a PTR predicted in the larger dataset? Given that they are detected in at least 5 samples as shown in the small 10 sample experiment.

tyjo commented 2 years ago

Can you tell me if these species have complete genomes, draft assemblies, or both?

On Oct 7, 2021, at 5:31 PM, Brendan Wee @.***> wrote:

Thanks for your response!

Since 2. is a quick check I went ahead and did it. In the run with only ten samples, we find 10 species that do not show up in the complete run with 2337 samples. Each of these samples are detected in at least 5 / 10 samples. These are the same coverage maps used for each experiment.

Can you think of a reason these species would not have a PTR predicted in the larger dataset? Given that they are detected in at least 5 samples as shown in the small 10 sample experiment.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

brendanwee commented 2 years ago

These are the species that were missing: '28025.10.patric', '339860.19.patric', '349741.6.patric', '43675.28.patric', '165186.24.patric', '1917876.3.patric', '2025659.3.patric', '28116.19.patric', '435842.3.patric', '754307.3.patric', '765821.6.patric', '888828.3.patric', 'ERS235560_81.hgm', 'SRS820622_7.hgm'

How do you tell if they have complete genomes or draft assemblies?

brendanwee commented 2 years ago

I checked a few on Patric - the first 4 or 5 species had complete genomes. The next few had Good genomes with 10-100 contigs.

tyjo commented 2 years ago

One more question, are you using the same coverage maps for the 10 sample experiment as the 2337 sample experiment? Did you redo the read mapping?

brendanwee commented 2 years ago

I used the same ones in both experiments

tyjo commented 2 years ago

I was able to replicate the issue for contigs. It looks like the filtering criteria was too restrictive. If a species failed the filtering criteria in too many samples, no estimate would be produced. I've changed the threshold so that species will not be lost if you increase the number of samples. This change is reflect in the latest commit. I'm not sure this will change the prevalence curve.

However, I was not able to replicate the issue for complete genomes. I first ran CoPTR on a small dataset of 10 samples, then increased the number of samples to 50 and 100. The species with complete genomes present in the 10 sample data were also present in the same samples when run on the 50 and 100 sample datasets.

If you are willing to provide a minimal reproducible example for the second case, I can look further into the issue.

brendanwee commented 2 years ago

Thank you so much for looking into this and trying to fix it so quickly!

I really wish I could, but the data I am working with cannot be shared :( If I find a way a sharable dataset that replicates the issue, I will provide it.

That being said, I will pull the latest commit and try the estimation step again with my large dataset and see how the results change.

brendanwee commented 2 years ago

Great news!

I went ahead and reran the 2000+ samples with the latest commit.

Overall, the prevalence curve did not change drastically, just like you expected. In total ~30,000 more PTRs were predicted. Comparing with the same subset of 10 samples before, we see no dropouts of species! I wonder if the complete genomes we saw before were not complete when the database was created...

Regardless, things look clean thanks to your recent fix. Thank you so much!

tyjo commented 2 years ago

Great - thanks for the update!

tyjo / coptr

Prevalence differences in experiments #13