tseemann / mlst

:id: Scan contig files against PubMLST typing schemes
GNU General Public License v2.0
198 stars 47 forks source link

cgMLST database creation documentation #106

Open lskatz opened 3 years ago

lskatz commented 3 years ago

I am trying to make a guide for myself to create a cgMLST or even a wgMLST scheme. I hope this helps others. I suggest adding this in some way to the documentation, although I might still go with @tseemann's suggestion and just go with Chewie. Anyway, if you're like me and just have to try it out first to convince yourself...

Step 1: download the whole scheme from https://chewbbaca.online into a folder Listeria_monocytogenes.chewbbaca

Step 2: add git to the scheme so that you can check it back out in case you make a mistake.

cd Listeria_monocytogenes.chewbbaca
git init
git add *
git add .*
git commit -m init
git tag v1

Step 2b, undocumented: make sure there are no deflines with * indicating custom alleles not in the universal set of alleles.

Step 3a, not documented here: install mlst

Step 3b: mlst db creation

# load dependencies
module load mlst ncbi-blast+ perl
cd Listeria_monocytogenes.chewbbaca
# Check if any long IDs are present.
# If so, makeblastdb in mlst-build will not run
grep ">" -h *.fasta | perl -lane 'print if(length($_) > 50);' | head
# If long IDs, checkout v1 which I already versioned
# when I downloaded from chewbbaca.online
git checkout v1 -- '*'
pushd ~/bin/mlst/db/pubmlst
mkdir lmonocgmlst
cd lmonocgmlst
cp Listeria_monocytogenes.chewbbaca/*.fasta .
# rename to tfa.  Remove unnecessary prefix to filenames (although they are necessary elsewhere especially for Chewie)
for i in *.fasta; do mv $i $(basename $i .fasta).tfa; done;
for i in *.tfa; do target=$(sed 's/Pasteur_//' <<< $i); mv $i $target; done;
for i in *.tfa; do target=$(sed 's/-//' <<< $i); mv $i $target; done;
# Remove Pasteur_ prefix to avoid locus/allele confusion in mlst
perl -MFile::Copy=mv -MBio::SeqIO -e 'for $f(glob("*.tfa")){print "$f\n"; $in=Bio::SeqIO->new(-file=>$f); $out=Bio::SeqIO->new(-file=>">$f.tmp.fasta"); while($seq=$in->next_seq){$id=$seq->id; $id=~s/Pasteur_//; $seq->id($id); $out->write_seq($seq);} mv("$f.tmp.fasta", $f);}'
# Remove the dash too
perl -MFile::Copy=mv -MBio::SeqIO -e 'for $f(glob("*.tfa")){print "$f\n"; $in=Bio::SeqIO->new(-file=>$f); $out=Bio::SeqIO->new(-file=>">$f.tmp.fasta"); while($seq=$in->next_seq){$id=$seq->id; $id=~s/-//; $seq->id($id); $out->write_seq($seq);} mv("$f.tmp.fasta", $f);}'
# Make the scheme file
(echo -ne "ST\t"; \ls *.tfa | sed 's/.tfa//' | tr '\n' '\t'; echo clonal_complex
# fake genotype to avoid undefined genotype error
echo -ne 1
for i in *.tfa; do echo -ne "\t1"; done;
echo;
) > lmonocgmlst.txt
# Run automated makeblastdb to add the new scheme
mlst-make_blast_db
# ensure it worked
mlst --longlist | grep lmonocgmlst | perl -lane 'print "@F[0..7] ..."'
popd
cd my/Lmono/assemblies/
mlst --scheme lmonocgmlst *.fasta --threads 4 > lmonocgmlst.tsv
lskatz commented 3 years ago

To cap this off, my results were indeed not promising for my test set. I would expect about 1748 locus results because these are the core loci. However, the results only approximated 1k per genome. I have indeed shown myself to go with Chewie instead of mlst for cgMLST analysis.

\ls *_spades.fasta | xargs -P 12 -n 1 bash -c 'mlst --scheme lmonocgmlst --threads 2 --novel $0.novel.fasta.tmp $0 > $0.tsv.tmp; touch $0.novel.fasta.tmp;'
for tsv in *.tsv.tmp; do name=$(basename $tsv .fasta.tsv.tmp); touch $name.fasta.novel.fasta.tmp; novel=$(( $(wc -l < $name.fasta.novel.fasta.tmp) / 2)); echo -ne "$name\t$novel\t";  head -n1 $tsv | perl -lane 'for(@F){ if(/(\(\d+\))/){ $num++} } print "\t$num";'; done

...to produce a table of novel and exact allele matches per assembly in my benchmarking dataset.

assembly                                numNovel        numExact
Listeria_shovill_SRR10323923_spades     4               987
Listeria_shovill_SRR10483479_spades     994             139
Listeria_shovill_SRR10505985_spades     1               396
Listeria_shovill_SRR10696096_spades     101             265
Listeria_shovill_SRR13296922_spades     328             75
Listeria_shovill_SRR14044344_spades     5               355
Listeria_shovill_SRR14404488_spades     4               439
Listeria_shovill_SRR14669035_spades     4               397
Listeria_shovill_SRR15356214_spades     0               271
Listeria_shovill_SRR9973979_spades      4               902
lskatz commented 3 years ago

Not shown: Chewie results gave me on average 1740 loci per assembly