sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
453 stars 78 forks source link

comparing the resemblance & containment of two metagenomes using read sets #1135

Open yue-clare-lou opened 3 years ago

yue-clare-lou commented 3 years ago

Hi,

I have pairs of metagenomes that I want to compare the similarity of them as well as figure out the % of one metagenome that is contained within the other one. sourmash gather & search seem to be two optimal choices. I was wondering if I could use PE read sets only to run the analysis I am interested in?

Also, I have also run Mash (including Screen) on my metagenome pairs. However, I am a bit confused with the outputs from Mash vs. sourmash gather/search --containment. For instance, metagenomes that show less mash dist are shown to share less reads from the sourmash gather outputs.. I am guessing the reason for seeing this disagreement is that the size of these metagenomes could vary, which could affect mash dist, but not much with sourmash gather since it is calculating the containment. I also compared the outputs of Mash screen and sourmash gather but the fraction of shared bases aren't similar from two tools. If possible, could you please explain the relationship b/w Mash output (mash dist) and sourmash search --containment/gather output?

Thanks!

ctb commented 3 years ago

hi @clarelou0128, thanks for your questions!

I'm not sure what "PE read sets only" means - do you mean only reads where both pairs are present?

I think the three commands you want are

# calculate similarity
sourmash search A.sig B.sig --threshold=0 
# calculate how much of A is contained by B
sourmash search A.sig B.sig --threshold=0 --containment
# calculate how much of B is contained by A
sourmash search B.sig A.sig --threshold=0 --containment

(the last two commands could also be done with gather and should return the same results.)

You could also do sourmash sig overlap A.sig B.sig if you want detailed output for just the comparison of the two.

My believe that mash dist calculates similarity and estimates ANI. That agrees with "metagenomes that show less mash dist are shown to share less reads from the sourmash gather outputs", I think - similarity approximately tracks with containment, but is usually much less than. So I'm not sure it's disagreement?

I've never used mash screen, I'm afraid, and I thought it was mostly for estimating genome containment by metagenomes - which should work for containment between two metagenomes. @luizirber do you have any insight here?

(Hopefully the sourmash sig overlap will give you some of the details you need to dig into the math - it gives you the raw numbers from which we calculate similarity and containment.)

luizirber commented 3 years ago

I've never used mash screen, I'm afraid, and I thought it was mostly for estimating genome containment by metagenomes - which should work for containment between two metagenomes. @luizirber do you have any insight here?

I don't think it is going to work for the containment of two metagenomes by default. I think it might work if you use a very large sketch (n=100000, at least), and then run mash screen on it. (mash screen will hash the query and check each hash in the sketch you calculated with n=100000, and that's why it needs a large n)

And I never thought about using sourmash sig overlap, that is a pretty nice output!

ctb commented 3 years ago

I've never used mash screen, I'm afraid, and I thought it was mostly for estimating genome containment by metagenomes - which should work for containment between two metagenomes. @luizirber do you have any insight here?

I don't think it is going to work for the containment of two metagenomes by default. I think it might work if you use a very large sketch (n=100000, at least), and then run mash screen on it. (mash screen will hash the query and check each hash in the sketch you calculated with n=100000, and that's why it needs a large n)

Oh! I'm glad I asked! Thanks, yes, that makes it clear.

So, @clarelou0128, I think mash screen is going to (incorrectly) return much higher estimated overlaps than sourmash search --containment/gather unless you tweak the parameters significantly.

And I never thought about using sourmash sig overlap, that is a pretty nice output!

:executes courtly bow: at your service!

yue-clare-lou commented 3 years ago

hi @clarelou0128, thanks for your questions!

I'm not sure what "PE read sets only" means - do you mean only reads where both pairs are present?

sorry @ctb for the confusion but yeah "PE read sets" refer to pair-end reads. Specifically, they are interleaved reads. I don't want to use assembled contigs or genomes since I think reads has the most information if I want to see how similar two communities are and if one community is almost 100% contained within the other.

I think the three commands you want are

# calculate similarity
sourmash search A.sig B.sig --threshold=0 
# calculate how much of A is contained by B
sourmash search A.sig B.sig --threshold=0 --containment
# calculate how much of B is contained by A
sourmash search B.sig A.sig --threshold=0 --containment

(the last two commands could also be done with gather and should return the same results.)

Thanks! My sourmash search commands are similar to what you suggested here except that 1) I used default threshold 2) set k=51 (I want matches to be very stringent). I tried to adjust the threshold to 0 and the output I am getting is the same as the default one.

Can I ask what does --threshold do in sourmash search? From reading #995, this is my understanding: since I set --scaled to 1000, --threshold=0 means sourmash search will report all matches including hashes that share 0 bases? And for the default --threshold=0.08, any matching hashes that share >= 8% bases will be reported?

This is the output when I ran sourmash search A.sig B.sig -k 51 --threshold=0 --containment

selecting specified query k=51
loaded query: A.sig (k=51, DNA)
loaded 1 signatures.

1 matches:
similarity   match
----------   -----
 22.4%       B.sig

I also tried sourmash gather and compared the output with sourmash search and their outputs were the same when ignoring abundance. However when considering abundance, the %of match increased significantly. Is it normal to see such a drastic difference? I have included two sourmash gather run b/w A.sig & B.sig below.

sourmash gather w/o tracking abundance (sourmash gather A.sig B.sig -k 51 --threshold=0 --ignore-abundance) (setting threshold=0 or using the default threshold=50kb results the same output)

overlap     p_query p_match
---------   ------- -------
125.2 Mbp     22.4%   18.3%    B.sig
found less than 0 bp  in common. => exiting

found 1 matches total;
the recovered matches hit 22.4% of the query

sourmash gather tracking abundance (sourmash gather A.sig B.sig -k 51 --threshold=0)

overlap     p_query p_match avg_abund 
---------   ------- ------- ---------
125.2 Mbp     91.9%   18.3%      46.1    B.sig
found less than 0 bp  in common. => exiting

found 1 matches total;
the recovered matches hit 91.9% of the query

My interpretation is that the majority of reads that seen in B came from A since when considering abundance, sample A is ~92% contained within sample B.

You could also do sourmash sig overlap A.sig B.sig if you want detailed output for just the comparison of the two.

Ah, thanks for the suggestion! Just tried this command and agree with @luizirber that the output looks amazing!

My believe that mash dist calculates similarity and estimates ANI. That agrees with "metagenomes that show less mash dist are shown to share less reads from the sourmash gather outputs", I think - similarity approximately tracks with containment, but is usually much less than. So I'm not sure it's disagreement?

Correct me if I am wrong -- I thought if two metagenomes that are more similar, they should share more reads and hence the mash dist between them should be less?

And below shows an example of the disagreement that I am seeing when comparing the outputs of mash dist & sourmash search. (when calculating mash dist, I tried sketch sizes 1000 & 10000 and both output the same mash distance)

compared   mash         sourmash 
samples    distance    similarity
---------   -------       -------   
A vs. B       0.03           25%
A vs. C       0.04           74%

I've never used mash screen, I'm afraid, and I thought it was mostly for estimating genome containment by metagenomes - which should work for containment between two metagenomes. @luizirber do you have any insight here?

(Hopefully the sourmash sig overlap will give you some of the details you need to dig into the math - it gives you the raw numbers from which we calculate similarity and containment.)

yue-clare-lou commented 3 years ago

I've never used mash screen, I'm afraid, and I thought it was mostly for estimating genome containment by metagenomes - which should work for containment between two metagenomes. @luizirber do you have any insight here?

I don't think it is going to work for the containment of two metagenomes by default. I think it might work if you use a very large sketch (n=100000, at least), and then run mash screen on it. (mash screen will hash the query and check each hash in the sketch you calculated with n=100000, and that's why it needs a large n)

Thanks for letting me know. I will not use mash screen in this case!

And I never thought about using sourmash sig overlap, that is a pretty nice output!

ctb commented 3 years ago

ok! some answers!

can I use the reads only?

yes! Your containment and similarity %s will be lower than they should be because of errors that inflate the total number of k-mers, but there is no problem in using the reads! There are some preprocessing steps you can use (see these instructions) that might help, but they take a lot of time so I wouldn't bother the first time through.

what does --threshold do in sourmash search? ... will sourmash search report all matches including hashes that share 0 bases?

The thresholding is done on number of hashes that match; --threshold=0 will return something as long as there is at least one hash matching. If you have a scaled=1000, one hash ~= 1000 bases.

As you saw in your gather results, you get the same output with and without --threshold=0; that's because you have a good match :). If you didn't have a good match, without --threshold=0 you wouldn't see any results.

My interpretation is that the majority of reads that seen in B came from A since when considering abundance, sample A is ~92% contained within sample B.

yes, exactly!

Note that this could well be due to sequencing errors: if you don't do k-mer based error trimming (as above), and you have two communities that are very similar and have been deeply sequenced, this is the result I would expect to see. The reason is that erroneous k-mers will always be low abundance, while your true k-mers in a deeply sequenced metagenome will be high abundance.

Correct me if I am wrong -- I thought if two metagenomes that are more similar, they should share more reads and hence the mash dist between them should be less?

yes, my mistake! I just swizzled 'dist' and similarity in my head!

yue-clare-lou commented 3 years ago

Thank you so much for this detailed explanation!

ctb commented 3 years ago

welcome! I'll keep this open for a bit if you have any more questions ;).

yue-clare-lou commented 3 years ago

Sounds great. Thank you so much!! Really appreciate it.