Difference between searching from reads and assembly

jodyphelan commented 1 year ago

I'm looking into using sourmash to find the max ANI of a new sample to existing sequences. I can do this either straight from the reads or assembling first and then searching using the assembly.

I noticed the ANI drops considerably (outside of what might be considered the same species) from 99% using the assembly to 93% using the reads. Is this expected?

Here is an example for reference:

# download files
wget wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR211/000/SRR21133400/SRR21133400_*.fastq.gz
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_001373395.1/download?include_annotation_type=GENOME_FASTA&filename=GCF_001373395.1.zip" -H "Accept: application/zip"
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_022374895.2/download?include_annotation_type=GENOME_FASTA&filename=GCF_022374895.2.zip" -H "Accept: application/zip"
unzip -o GCF_001373395.1.zip
unzip -o GCF_022374895.2.zip

# create sketches
sourmash sketch dna --merge reads -o reads.sig SRR21133400_*
sourmash sketch dna -o GCF_001373395.1.sig ncbi_dataset/data/GCF_001373395.1/GCF_001373395.1_PRJEB8430_assembly_1_genomic.fna
sourmash sketch dna -o GCF_022374895.2.sig ncbi_dataset/data/GCF_022374895.2/GCF_022374895.2_ASM2237489v2_genomic.fna

# perform searches
sourmash search reads.sig GCF_001373395.1.sig --containment -o reads.csv --estimate-ani
sourmash search GCF_022374895.2.sig GCF_001373395.1.sig --containment -o asm.csv --estimate-ani
head *.csv

ctb commented 1 year ago

hey @jodyphelan thanks for the commands!!

my guess here is that the assembly is removing a lot of sequence, probably by collapsing strain variation. This is a well-known problem that we are working to measure in an ongoing project. This does seem like an extreme case tho!

The simplest way to confirm is to look at the k-mer containment values directly. They should mirror what you see with ANI, but stronger, because k-mer containment scales exponentially with ANI.

An alternative/addition would be to do mapping - the 99% reads should map just fine to the reference.

@bluegenes any thoughts?

super cool stuff - if you nail down any details I'd love to hear about them!

jodyphelan commented 1 year ago

Hi @ctb,

Thanks for your answer, super informative. Based on that I tested to see if you track the abundance and then filter out hashes with low abundance and it seems to push up the ANI to 99%. I was wondering if perhaps filtering of low-abundance hashes would be a useful feature to build in? No worries if it isn't a priority, I can use the script for now.

Here if a working example building on the previous comment. Here is the python script to perform the filtering of the signature (using a min value of 5):

import json
import sys

input = sys.argv[1]
output = sys.argv[2]

data = json.load(open(input))

num_mins = len(data[0]['signatures'][0]['mins'])

mins = [data[0]['signatures'][0]['mins'][i] for i in range(num_mins) if data[0]['signatures'][0]['abundances'][i] > 5]
abunds = [data[0]['signatures'][0]['abundances'][i] for i in range(num_mins) if data[0]['signatures'][0]['abundances'][i] > 5]

data[0]['signatures'][0]['mins'] = mins
data[0]['signatures'][0]['abundances'] = abunds

json.dump(data, open(output, 'w'))

Then I ran the following sketch command with a lower scaling value (as some of the hashes will be removed later).

sourmash sketch dna -p scaled=100,abund --merge reads -o reads.abund.sig SRR21133400_*

Performed the filtering with the script above. This brings the number of hashes down from 517328 to 61343.

python filter-sig.py reads.abund.sig reads.abund.filt.sig

# Check the difference 
sourmash sig describe reads.abund.sig
sourmash sig describe reads.abund.filt.sig

Then running search which gives an ANI estimate of 99%

sourmash search reads.abund.filt.sig GCF_001373395.1.sig --containment -o reads.filt.csv --estimate-ani --ignore-abundance
head *.csv

ctb commented 1 year ago

more later, but see: sourmash sig filter ;)

jodyphelan commented 1 year ago

Oh wow, I need to learn how to read the manual!

ctb commented 1 year ago

nah, no worries - we've added a lot of functionality over the years and it's easy to lose track even for us 😆

jodyphelan commented 1 year ago

haha ok!

Just for reference, I'm planning to use this as a method to quickly ascertain what species of Mycobacterium we have before we go down the mapping/variant calling and all that and it seems like it does it perfectly, so thanks again for a great tool!

ctb commented 11 months ago

just to put a cap on this - I made a note to revisit based on the last comment, but now upon re-reading I see:

assembly had ANI of 99% (based on sketching)
reads had estimated ANI of 93% (based on sketching)
filtering out low-abundance hashes from the sketches made the reads ANI match the assembly ANI

the straightforward interpretation is that assembly is indeed assembling the high abundance sequences, so filtering out low abundance k-mers makes the reads look like the assembly.

I had misread the initial question to show that assembly had an ANI of 93%, and that made me wonder. but all seems good now!

Note that search is not that informative compared to prefetch and prefetch does exactly what you want here - search by containment :).

I would also like to take this time to suggest that you use gather instead of search --containment, which should give you the "right" species of mycobacterium with very high certainty and a lot less fuss. If you try it out and get different results I'd be VERY interested to hear about it.

Last but not least you might be interested in this discussion about k-mer trimming: https://github.com/sourmash-bio/sourmash/issues/2122 - no hurry or anything, but your feedback would be very welcome :)

sourmash-bio / sourmash

Difference between searching from reads and assembly #2750