Closed by bede 1 year ago
Hi Bede,
Yeah, so the workflows are slightly out of date; they are just missing two required options. I forgot to update them with (1) the -o parameter, which specifies the index prefix, and (2) a minimizer option: if you do want to use minimizer digestion to reduce the size of your index, you have to specify either a minimizer alphabet (-m) or DNA-based minimizers (-t). That missing option seems to be what is causing the errors you are showing.
So as a first step, you can build the index using either of these commands. Minimizer digestion is helpful when your reference file is large, on the order of gigabytes.
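To illustrate why minimizer digestion shrinks the index, here is a minimal sketch of the standard (w,k)-minimizer scheme: for each window of w consecutive k-mers, only the lexicographically smallest k-mer is kept. This is a generic illustration of the concept, not SPUMONI's actual implementation (which uses its own digestion and ordering).

```python
def minimizers(seq, k=4, w=5):
    """Return the set of (w,k)-minimizers of seq: for each window of w
    consecutive k-mers, keep the lexicographically smallest k-mer.
    Illustrative only; SPUMONI's internal scheme differs in detail."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    mins = set()
    for i in range(len(kmers) - w + 1):
        mins.add(min(kmers[i:i + w]))
    return mins

seq = "ACGTACGTTGCAACGTACGT"
# Far fewer minimizers survive than the total number of k-mers,
# which is why the digested index is much smaller.
print(len(minimizers(seq)), "minimizers vs", len(seq) - 4 + 1, "k-mers")
```

The larger the window w, the stronger the compression, at the cost of some matching resolution.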
(1) Build the index
# Uses minimizer digestion because of -m
spumoni build -r /data/GCF_000001405.40_GRCh38.p14_genomic.fa -o /index/GCF_000001405.40_GRCh38.p14_genomic -m -M -P
# Does not use minimizer digestion because of -n
spumoni build -r /data/GCF_000001405.40_GRCh38.p14_genomic.fa -o /index/GCF_000001405.40_GRCh38.p14_genomic -n -M -P
(2) Classify reads against the database
And then you can classify using the run subcommand, telling it which minimizer scheme you used when building. Here are the two corresponding commands, depending on which build command you ran. The -c option is what produces a classification report for each read.
# Use this command if you built the index using -m
spumoni run -r /index/GCF_000001405.40_GRCh38.p14_genomic -p reads.fa -m -P -c
# Use this command if you built the index using -n
spumoni run -r /index/GCF_000001405.40_GRCh38.p14_genomic -p reads.fa -n -P -c
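Once the per-read report is generated, you may want a quick summary of how many reads fell into each class. Below is a small hedged sketch that tallies a whitespace-separated report where the first column is the read ID and the second is its classification; this column layout is an assumption, so check your own *.report file and adjust the field index accordingly.

```python
from collections import Counter

def summarize_report(lines):
    """Tally per-read classifications, assuming each line looks like
    '<read_id> <classification> ...' (hypothetical layout; verify
    against the actual SPUMONI report before relying on this)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts

example = ["read1 found", "read2 not_found", "read3 found"]
print(summarize_report(example))
```

For a real run you would pass `open("reads.fa.report")` (or whatever the output file is named) instead of the example list.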
Let me know if anything is unclear or you run into further errors. I'll update the wiki page to reflect what I've mentioned here.
Thanks, Omar
Hi Omar, thank you so much for taking the time to write a thorough reply; this makes sense now. I'm running my first test and a report file is being generated.
Thanks again, and for developing such an interesting tool. Bede
Sounds good, I updated the workflows on the wiki page. Thank you for pointing this out! I'll close this issue, but feel free to reopen it if any aspects of the wiki page are confusing.
Hi again @oma219, I'm really interested in using this tool for contaminant detection but am having some problems classifying reads using the databases I created successfully. This has led me to the example workflows page in the wiki. I am unable to execute the example commands shown using a multifasta of the human genome. Can you advise what I might be doing wrong? Are the example workflows up to date with spumoni 2?
Wiki examples scenario 1:
Wiki examples scenario 2:
My attempts:
More generally, how may I perform classification with pseudo-matching lengths using the database I created using the following command?