Open AmstlerStephan opened 2 months ago
Is it possible that this discrepancy is a due to the different algorithms between blast and minimap2 (and perhaps different databases..i'm unclear here if you are using the same database in both emu and your blast run here. If using a different database, the same species classification may be represented by a different representative sequence). If you were to run minimap2 using just your selected reads against the database you are using for emu, does the discrepancy still appear?
Thank you for the fast response. I don't think that it is an actual classification problem. It very much seems like the readID column (readID referring to the fastq read name) does not match the classification probabilities in the read-assignment-distribution file or somehow gets mixed up. I extracted a handful of sequences, which were the top-hits in the read-assignment-distribution (assignment probability = 1) file of different taxa and samples. See the output_sequences.txt. In this fasta file I annotated the readID, classification, and barcode (sample) in the header of each sequence. I rerun emu on this subset and in the new assignment-distribution file, the taxa assignment changed for some sequences and now matches the blast search results. I attached the annotated fasta, the new emu assignment-distribution file and the rel-abundance file.
My command for emu using the emu-default database:
emu abundance --keep-counts --keep-read-assignments --db ./ --output-unclassified --threads 12 <input_fasta>
If you need more information or a detailed overview how I extracted the sequences etc. please let me know.
And thanks a lot for your help. output_sequences_read-assignment-distributions.txt output_sequences_rel-abundance.txt output_sequences.txt
We would expect the read classification distributions to change depending on which reads are included in the input. The algorithm behind emu upweights likelihood for more abundant species, and downweights less abundant species. Thus, the classification distribution for each read varies depending on the other reads in the sample.
I ran emu on your latest provided sequences. I choose to focus on the fourth read (97fe77d8-c844-40ff-8e18-a9ae7d3c0047_Rickettsia) since you have labeled this as Rickettsia and emu has labeled it has Wolbachia (split between tax ids 307502 and 66084 per the read-assignment-distribution file). I looked at the minimap2 output (sam file) that is produced as part of emu. When looking at the alignments for this read (attached), indeed the top 8 alignments are of Wolbachia, and the later alignments are for Rickettsia species. You can see in the chaining score (s1), Wolbachia references are in the 1054-1122 range, while Rickettsia has the highest s1 of 691. Minimap2 favors Wolbachia for this read.
One thing I noticed about this read is that is that it is over 6kbp. In the provided fasta file, many reads appear to be over 3kbp, especially towards the top of the file. Depending on the type of sequencing done, I suspect these may be chimeric? I'm curious how a chimeric read corrector would perform on these reads.
97fe77d8-c844-40ff-8e18-a9ae7d3c0047_Rickettsia_minimap2.sam.txt
Hi there! I utilized EMU to classify 16S and 18S rRNA amplicons sequenced using nanopore sequencing with the SQK-NBD114.96 kit. However, I noticed that some of the taxonomic assignments made by EMU didn't match our expectations, so I wanted to verify some of these assignments manually. I used the read-assignment-distributions file to extract the readIDs for some arbitrarily chosen reads. I next greped the readIDs in the according to fastq files and extracted the according sequence. When I now blast the sequences, some of them have different assignments by blast than by emu. All assignments from blasting do occur in the taxa assigned by emu, but some of them don't match the readID. Attached please find the files containing the information to blast readID (c509fbd9-628a-4fd4-b62e-96d6781e5484).
The sequence to readID c509fbd9-628a-4fd4-b62e-96d6781e5484: ATTACATCTACTTCGTTCAGTTACGTATTGCTAAGGTTAACACAAAGACACCGACAACTTTCTTCAGCACCTTACGGCTACCTTGTTACGACTTCACCCCAGTCGCTAATTTTACCGTGGTTGGCTGCCTCTTGCGTTAGCTCACCACCTTCAGGTAAAACCAACTCCCATGGCGTGACGGGCAGTGTGTACAAGGCCCGAGAACGTATTCACCGCGGCATGCTGATCCGCGATTACTAGCGATTCCAACTTCATGCTCTCGAGTTGCAGAGAACAATCCGAACTGAGATGTCTTTAGGGATTTGCTCCACGTCGCCGTATTGCTTCCCTCTGTAAACACCATTGTAGCACGCGTGTAGCCCAACCCATAAGGGCCATGATGACTTGACGTCGTCCCCACCTTCCTCCGGCTTATCACCGGCAGTTTTCTTATAGTTCCCGGCATTACCCGCTGGCAAATAAGAAGTCGGGGTTGCGCTGAGTTGCGGGACTTAACCCGACATCTCACGACACGAGCTGACGACAGCCATGCAACACCTGTGTGTGGTCCAGCCGAACTGAAGGAAAGCATCTCTGCGATCCGTAACCACCATGTCAAGGGTTGGTAAGGTTTTTCGCGTAACATCGAATTAAACCGCATGCTCCACCGCTTGTGCGAGCCCCCGTCAATTCCTTTGAGTTTTAATCTTGCGACCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTACGAAACCGAAAGAAATTCCTCCGATATCTAGCACTCATCGTTTACGGCGTGACTACCGGGTATCTGATCCTCTTTGCTCCACGCTTTCGTGCATCAGTGTCATTATCCTCTTACCGAGATGACCGCCTTCGCCACCGGTGTTCCTCCTAATATCTAAGAATTTCACCTCTACACTAGGAATTCCATCATCCCCTACTACACTCTAGATTAGTAGTTTTGAAAGCTATTCCGAGGTTAAGCCCCGGGCTTTCACTTTCCAACTTACTAAACCACCTACGCACTCTTTACGCCCAGTAATTCCGAACAACGCTAGCCCCCTCCGTCTTACCGCCGGCTGCTGGCACGGAGTTAGCCGGGGCTTTTTCTGCAGGTAACGTCATTATCTTCCCTGCTAAAAGAGCTTTACAACCCTAAGGCCTTCGTCACTCACTCGGTATTGCTGGATCAGGCTTTCGCCCATTGTCCAATATTCCCCACTGCTGCCTCCAGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCTGATCATCCTCTCAGACCAGCTACAGATCGTAGGCTTGGTAGGCCATTACCCCACCAACTACCTAATCTGACGCGGGCTCATCCATCAGCGATAAATCTTTCCTCCTCGGAGAATATACGGTATTAGCTTTTATTTCTAAAAGTTATTCCGTACTGATGGGCAGATTCCCACGTGTTTCACCAGTCTGCCACTAATTAATTAGAGCAAGCCCTAATTAACTCGTTCGACTTGCATGTGTTAAGCATACCGATAGCGTTCGTTCTGAGCCATGATCAAACTCTATACGGTTACCTTGTTACGACTTCACCCCAGTCATGAATCACAAAGTGGTAAGCGCCCTCCCGAAGGTTGAGCTTACCTACTTCTTTTGCAACCCACTCCCATGGTGTGACGGGCGGTGTTCCAAGGCCCGGGAACGTATTCACCGTAGCATTCTGATCTACGATTACTAGCGATTCCGACTTCACGGAGTCGAGTTGCAGACTCCGATCCGGACTACGACGTACTTTATGAGGTCCGCTTGCTCTCGCGAGTTCGCTTTTCTTTGTATACGCCATTGTAGCACGTGTGTAGCCCTACTCGTAAGGGCCATGATAGCTTGACGTCATACCCACCTTCCTCCGGTTTATCACCGGCAGTCTCCTTTGAGTTCCCACCATTACGTGCTGGCAACAAAGGATAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATTTCACAACACGAGCTGACGACAGCCATGCAGCACCTGTCTCAGAGTTCCCGAAGGCACTAAGCTATCTCTAGCAAATTCTCTGGATGTCAAGAGTAGGTAAGGTTCTTCGCGTTGCATCGAATTAAACCACATGCTCCACCCGCTTGTGCGGGCCCCCGTCAATTCATTTGAGTTTTAACCTTGCGGCCGTACTCCCCAGGCGGTCGACTTAACGCGTTAGCTCCGGAAGCCACGCCTCGAGGGCACAACCTACAGTCGACATCGTTTACAGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCTACCCACGCTTTCAGCCTGCCTGACGTCAGTCTTTGTCCAGGGGGCCGCCTTCACCACCGGTATTCCTCCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCCTCCCCAGACTCTAGCTTGCCAGTTTCAAATGCAGTTCCCACGTTAAGCGCGGGGATTTCACATCTGACTTAACAAACCGCCTGCGTGCGCTTTACGCCCAGTAATTCCGATTAACGCTTGCACCCTCCGTATTACCGCGGCTGCTGGCACGGAGTTAGCCGGTGCTTCTTCTGCGAGTAACGTCAATGAACAGTGCTATTAACACTGAACCCTTCCTCCTCGCTGAAAGTGCTTTACAACCCGAAGGCCTTCTTCACACACGCGGCATGGCTGCATCAGGCTTGCGCCCATTGTGCAATATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGACCGTGTCTCAGTTCCAGTGTGGCTGGTCATCCTATCTTCAGACCAGCTAGGGATCGTCGCCTAGGTGAGCCATTACCCCACCTCTAGCTAATCCCATCTGGGCACATCTGATGGCATGAGCCCAGAAGGTCCCCCACTTTGGTCCGTAGACGTTATGCGGTATTAGCTACCGTTTCCAGTAGTTATCCCCCTCCATCAGGCAGTTTCCCAGACATTACTCACCCGTCCGCCGCTCGTCACCCAGAGAGCAAGCTCTCCTGTGCTACCGCTCGACTTGCATGTGTTAGGCCTGCCGCCAGCGTTCAATCTGAGCCATGATCAAACTCTAGGTGCTGAAGAAAGTTGTCGGTGTCTTTGTGTTAACCTTCACAATACGT
It was assigned to Rickettsia bellii, but the blast search resulted in Serratia liquefaciens.
So far, I have not found any systematic shift in the rowIDs that could explain the inconsistencies. help would be much appreciated.
Kind regards, Stephan
read-assignments.txt example_fastq.txt