Can sourmash be used to remove microbal contamination from long reads?

Hi,

I recently discovered sourmash from a benchmark study (Portik et al. 2022) and tested it myself. It's very fast and memory-effiecient.

I tried to use sourmash to remove microbal contamination from long reads but failed. My data is a pacbio hifi WGS for de novo assembly (SRR28491883). Since the target species is an insect, microbes like symbionts could not be removed completely before sequencing, which is showed in SRA taxonomy analysis.

I used the commands from #3095 and #3252.

# sourmash 4.8.11 
# sourmash_plugin_branchwater 0.9.7

# Nm_hifi.csv file for sketch 
# Nm_hifi.fasta is converted by bam2fasta in pbtk
cat Nm_hifi.csv
name,genome_filename,protein_filename
Nm_hifi,Nm_hifi.fasta,

# manysketch with --singleton
sourmash scripts manysketch Nm_hifi.csv -o Nm_hifi.manysketch.k31.singleton.zip -p dna,k=31,scaled=1000,abund --singleton
sourmash scripts index -k 31 gtdb-rs214-k31.zip -o gtdb-rs214-k31.rocksdb
sourmash scripts fastmultigather -k 31 -o Nm_hifi.manysketch.k31.singleton.fastmultigather.gtdb-rs214.csv Nm_hifi.manysketch.k31.singleton.zip gtdb-rs214-k31.rocksdb

I expected it to be a large file(taxomony assignment for 3.4M reads). However, fastmultigather finished within a minute and there was only one line of result in the output csv file. Nm_hifi.manysketch.k31.singleton.fastmultigather.gtdb-rs214.csv

# manysketch without --singleton
sourmash scripts manysketch Nm_hifi.csv -o Nm_hifi.manysketch.k31.zip -p dna,k=31,scaled=1000,abund
sourmash scripts fastgather -k 31 -o Nm_hifi.manysketch.k31.fastgather.gtdb-rs214.csv Nm_hifi.manysketch.k31.zip gtdb-rs214-k31.zip

Then I tried manysketch without --singleton and used fastgather the output file seems normal. Nm_hifi.manysketch.k31.fastgather.gtdb-rs214.csv

I also tried genbank-2022.03 databases and with -k 51. The output for manysketch with --singletons and fastmultigather either only one line(for k31 genbank bacteria) or empty(for all k51 databases) but for manysketch without --singletons and fastgather the output seems normal.

I'm confused about what's going wrong here. Can sourmash perform taxonomy assignment for long reads? Or is this unexpected behavior due to some specific commands/parameters I'm using?

To quickly reproduce this issue, I randomly sampled 10k reads from Nm_hifi.fasta(Nm_hifi_sample10k.fasta.gz on google drive).

gzip -d Nm_hifi_sample10k.fasta.gz
echo -e "name,genome_filename,protein_filename\nNm_hifi_sample10k,Nm_hifi_sample10k.fasta," > Nm_hifi_sample10k.csv
sourmash scripts manysketch Nm_hifi_sample10k.csv -o Nm_hifi_sample10k.manysketch.k31.singleton.zip -p dna,k=31,scaled=1000,abund --singleton
sourmash scripts fastmultigather -k 31 -o Nm_hifi_sample10k.manysketch.k31.singleton.fastmultigather.gtdb-rs214.csv Nm_hifi_sample10k.manysketch.k31.singleton.zip gtdb-rs214-k31.rocksdb
sourmash scripts manysketch Nm_hifi_sample10k.csv -o Nm_hifi_sample10k.manysketch.k31.zip -p dna,k=31,scaled=1000,abund
sourmash scripts fastgather -k 31 -o Nm_hifi_sample10k.manysketch.k31.fastgather.gtdb-rs214.csv Nm_hifi_sample10k.manysketch.k31.zip gtdb-rs214-k31.zip

There are three lines of results in the fastgather output but the fastmultigather output is empty. Nm_hifi_sample10k.manysketch.k31.fastgather.gtdb-rs214.csv Nm_hifi_sample10k.manysketch.k31.singleton.fastmultigather.gtdb-rs214.csv

sourmash-bio / sourmash

Can sourmash be used to remove microbal contamination from long reads? #3291