Closed SchwarzEM closed 4 years ago
I've also tried running gather-by-contig.py with sourmash versions 2.1.0 or 2.2.0 (both via bioconda), rather than version 2.0.0a6. For either version 2.1.0 or version 2.2.0, I get the following error (which is at least somewhat different from the error with 2.0.0a6):
"""
Traceback (most recent call last):
File "/pylon5/mc5phvp/schwarze/sourmash/gather-by-contig.py", line 14, in <module>
"""
which obviously doesn't work either.
A key issue here seems to be that the bioconda build of sourmash 2.0.0a6 (which used to work with gather-by-contig.py) is no longer pinned to the exact version of khmer that gather-by-contig.py needs in order to run.
If only there were some way to force sourmash 2.0.0a6 in bioconda to stay tightly linked to whatever earlier version of khmer it used to ship with!
A final point that may help in figuring out how to fix all this: I'm attaching the copy of gather-by-contig.py that I had previously successfully used for decontaminating assembly contigs.
Attaching, not just posting here, so that github won't munge the text of gather-by-contig.py.
Attaching as .py.txt rather than .py, so that github will let me post the thing at all.
Finally -- note that this copy is slightly different from the version that I had initially downloaded from https://gist.githubusercontent.com/ctb/fa1c11b5e2f9614128685600911c842a/raw/8feb65684bb9534f4549d058de0db46e0e3c06b6/gather-by-contig.py.
OK, I have fixed the bug. My solution:
Upgrade to sourmash 2.2.0 after all.
Revise my gather-by-contig.py so that the function SearchMinHashesFindBestIgnoreMaxHash is replaced by GatherMinHashesFindBestIgnoreMaxHash, as follows:
""" nano gather-by-contig.py ; diff gather-by-contig.previous.py gather-by-contig.py ;
14c14
< from sourmash.sbtmh import SearchMinHashesFindBestIgnoreMaxHash
---
> from sourmash.sbtmh import GatherMinHashesFindBestIgnoreMaxHash
50c50
< search_fn = SearchMinHashesFindBestIgnoreMaxHash().search
---
> search_fn = GatherMinHashesFindBestIgnoreMaxHash().search
"""
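The rename above can also be handled without editing the script twice: a small fallback helper tries the newer class name first and falls back to the old one, so the script imports under either sourmash version. This is my own sketch (`import_first` is a hypothetical helper, not part of sourmash or gather-by-contig.py):

```python
import importlib


def import_first(module_name, *candidates):
    """Return the first attribute found on the named module among candidates.

    Sketch of a compatibility shim: SearchMinHashesFindBestIgnoreMaxHash was
    renamed to GatherMinHashesFindBestIgnoreMaxHash, so trying both names
    keeps a script importable across sourmash versions.
    """
    mod = importlib.import_module(module_name)
    for name in candidates:
        if hasattr(mod, name):
            return getattr(mod, name)
    raise ImportError(f"none of {candidates} found in {module_name}")
```

Hypothetical usage near the top of gather-by-contig.py: `FindBest = import_first("sourmash.sbtmh", "GatherMinHashesFindBestIgnoreMaxHash", "SearchMinHashesFindBestIgnoreMaxHash")`, then `search_fn = FindBest().search`.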
I figured out the latter solution by seeing an 8-month-old comment in Titus' blog:
http://ivory.idyll.org/blog/2018-detecting-contamination-in-long-read-assemblies.html
With this github post, hopefully, the solution will become more reliably accessible...
And, I'm attaching my corrected version of gather-by-contig.py here (as before, with *.txt so that github will let me post it as an un-munged file).
Well, [redacted verb] me -- I declared victory before looking at the actual results of the seemingly successful sourmash run. It's failing: for some weird reason, sourmash is no longer making a clean binary division between contaminant contigs and non-contaminant contigs. It's somehow managing to lose contigs -- lots of contigs. It never did that before.
Here's an example of what's going very wrong.
Input assembly:
Total nt: 166,592,905 Scaffolds: 107 Contigs: 113
Output with a match to GenBank (contaminants):
Total nt: 23,610,398 Scaffolds: 17 Contigs: 17
Output without a match to GenBank (non-contaminants):
Total nt: 98,892,689 Scaffolds: 79 Contigs: 79
The latter two don't come close to adding up to 166 Mb.
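Spelling the arithmetic out (a quick sanity check on the numbers quoted above):

```python
# Totals copied from the assembly statistics above, in nt.
total_nt = 166_592_905      # input assembly
match_nt = 23_610_398       # output with a GenBank match (contaminants)
nomatch_nt = 98_892_689     # output without a match (non-contaminants)

missing_nt = total_nt - (match_nt + nomatch_nt)
print(f"{missing_nt:,} nt unaccounted for")   # 44,089,818 nt

# Contig counts: 113 in, 17 + 79 = 96 out.
missing_contigs = 113 - (17 + 79)
print(f"{missing_contigs} contigs lost")      # 17 contigs
```

So roughly 44 Mb of sequence, spread over 17 contigs, simply vanishes between input and output.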
Oh, how I wish I could still make gather-by-contig.py work with sourmash 2.0.0a6, using bioconda to install sourmash version 2.0.0a6!
Looking at one run that did work (wtdbg2) versus one that clearly didn't (Flye): the difference may be that Flye emits both contigs and scaffolds, whereas wtdbg2 (like Canu) assembles pure contigs only.
Weird to have a long-read assembler scaffold anything, but it would explain why Flye's sourmash run went so very wrong; if gather-by-contig.py can only parse contigs, then scaffolds will just disappear rather than being assigned to matching (contaminant) or non-matching (non-contaminant) sets.
hi @SchwarzEM can you show me the command lines and also drop me a few MB of fasta files to test it on, pls? Thanks!
ok, I see the command lines above, sorry! Some test files would be helpful tho.
Sorry about the delay -- there were Things...
I have two test files: Raxei_flye_50x_2019.10.19_scaffold_206.fa will not work with sourmash; Raxei_flye_50x_2019.10.19_scaffold_206.contigs.fa will.
github won't let me post the files, so I will e-mail them to you directly.
I think there is a pretty obvious reason why one of them is failing -- Raxei_flye_50x_2019.10.19_scaffold_206.fa contains 100 nucleotides of scaffolding ("N" residues) linking the two contigs found in Raxei_flye_50x_2019.10.19_scaffold_206.contigs.fa.
Here are their detailed statistics:
Sequence: Raxei_flye_50x_2019.10.19_scaffold_206.fa
Total nt: 1,795,346 Scaffolds: 1 Contigs: 2
ACGT nt: 1,795,246   N-res. nt: 100   % non-N: 99.99   % GC: 35.36
% softmasked [1]: 0.00  ([1]: versus entire assembly size)
% softmasked [2]: 0.00  ([2]: versus non-N, ACGT only, assembly size)
Scaffold N50 nt: 1,795,346.0 Scaffold N90 nt: 1,795,346.0 Scaf. max. nt: 1,795,346 Scaf. min. nt: 1,795,346
Contig N50 nt: 1,722,405.0 Contig N90 nt: 1,722,405.0 Contig max. nt: 1,722,405 Contig min. nt: 72,841
Sequence: Raxei_flye_50x_2019.10.19_scaffold_206.contigs.fa
Total nt: 1,795,246 Scaffolds: 2 Contigs: 2
ACGT nt: 1,795,246   N-res. nt: 0   % non-N: 100.00   % GC: 35.36
% softmasked [1]: 0.00  ([1]: versus entire assembly size)
% softmasked [2]: 0.00  ([2]: versus non-N, ACGT only, assembly size)
Scaffold N50 nt: 1,722,405.0 Scaffold N90 nt: 1,722,405.0 Scaf. max. nt: 1,722,405 Scaf. min. nt: 72,841
Contig N50 nt: 1,722,405.0 Contig N90 nt: 1,722,405.0 Contig max. nt: 1,722,405 Contig min. nt: 72,841
The take-home on this seems to be: although most long-read assemblers will generate nothing but pure contigs, Flye (at least) will not, and this makes Flye's products not directly usable for sourmash decontamination. (As an immediate work-around for the problem, I split the scaffolds from Flye into contigs and ran sourmash on the contigs.)
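The work-around described above (splitting scaffolds at runs of N before running sourmash) can be sketched as follows; `split_scaffolds` is my own hypothetical helper, not part of gather-by-contig.py:

```python
import re


def split_scaffolds(records):
    """Split each (name, sequence) scaffold at runs of N into contigs.

    Yields (name.ctgK, subsequence) pairs, dropping the N linkers, so that
    every non-N base ends up in exactly one contig.
    """
    for name, seq in records:
        contigs = [c for c in re.split(r"[Nn]+", seq) if c]
        for i, contig in enumerate(contigs, 1):
            yield (f"{name}.ctg{i}", contig)


# Example: a scaffold with a 100-N gap, mirroring scaffold_206 above.
scaffold = [("scaffold_206", "ACGT" * 5 + "N" * 100 + "GGCC" * 5)]
for name, seq in split_scaffolds(scaffold):
    print(name, len(seq))
```

The design choice here is to drop the N linkers entirely rather than assign them to either flanking contig, which matches the contigs-only files that worked with sourmash.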
figured it out, it's the Ns, patch happening :)
best, --titus
On Tue, Oct 22, 2019 at 02:42:55PM -0700, Erich Schwarz wrote:
The take-home on this seems to be: although most long-read assemblers will generate nothing but pure contigs, Flye (at least) will not, and this makes the products of Flye not directly usable for a sourmash decontamination. (As an immediate work-around for problem, I split the scaffolds from Flye into contigs and ran sourmash on the contigs.)
-- C. Titus Brown, ctbrown@ucdavis.edu
tackled in #940
I used to be able to make sourmash decontaminate contigs in long-read assemblies. Here are all of the steps that used to function properly. But now, as of Oct. 20, 2019, I am getting a weird bug that seems to be due to bioconda sourmash not being able to pull in the correct version of khmer: it needs a khmer with the attribute 'load_nodegraph', but bioconda is no longer providing one.
Here's the walkthrough:
mkdir $SCRATCH/sourmash ;
cd $SCRATCH/sourmash ;
curl -O -L https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz ; tar xvfz microbe-genbank-sbt-k31-2017.05.09.tar.gz ;
curl -O -L https://gist.githubusercontent.com/ctb/fa1c11b5e2f9614128685600911c842a/raw/8feb65684bb9534f4549d058de0db46e0e3c06b6/gather-by-contig.py ; chmod +x gather-by-contig.py ;
[Do the very same bioconda installation that has worked previously:] conda create -n sourmash_2.0.0a6 sourmash=2.0.0a6 ; conda activate sourmash_2.0.0a6 ;
which sourmash ;
/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/bin/sourmash
which python ;
/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/bin/python
python --version ;
Python 3.6.7 # originally this was Python 3.6.5; why any change at all?
cd $SCRATCH/rhabditella/assemblies/flye/100x_decont ;
[Try this line that used to always work:]
$SCRATCH/sourmash/gather-by-contig.py \
    --output-match raxei_flye_100x_2019.10.20.01.match.orig.fa \
    --output-nomatch raxei_flye_100x_2019.10.20.01.no_match.orig.fa \
    --csv raxei_flye_100x_2019.10.20.01.sm.summary.tsv.txt \
    ../100x/Raxei_flye_100x_2019.10.19/assembly.fasta \
    $SCRATCH/sourmash/genbank-k31.sbt.json ;
[It starts to work, but then fails, for what appears to be the inability of bioconda-sourmash to get the exactly correct version of khmer that will let everything Just Keep Working.]
found SBT database /pylon5/mc5phvp/schwarze/sourmash/genbank-k31.sbt.json
using ksize=31, scaled=2000
loading sequences from ../100x/Raxei_flye_100x_2019.10.19/assembly.fasta
saving match sequences to raxei_flye_100x_2019.10.20.01.match.orig.fa
saving nomatch sequences to raxei_flye_100x_2019.10.20.01.no_match.orig.fa
outputting CSV summary to raxei_flye_100x_2019.10.20.01.sm.summary.tsv.txt
Traceback (most recent call last):
File "/pylon5/mc5phvp/schwarze/sourmash/gather-by-contig.py", line 87, in <module>
main()
File "/pylon5/mc5phvp/schwarze/sourmash/gather-by-contig.py", line 61, in main
for leaf in tree.find(search_fn, query, args.threshold):
File "/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/lib/python3.6/site-packages/sourmash/sbt.py", line 191, in find
if search_fn(node_g, *args):
File "/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/lib/python3.6/site-packages/sourmash/sbtmh.py", line 191, in search
matches = sum(1 for value in mins if node.data.get(value))
File "/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/lib/python3.6/site-packages/sourmash/sbtmh.py", line 191, in <genexpr>
matches = sum(1 for value in mins if node.data.get(value))
File "/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/lib/python3.6/site-packages/sourmash/sbt.py", line 684, in data
self._data = khmer.load_nodegraph(f.name)
AttributeError: module 'khmer' has no attribute 'load_nodegraph'
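The AttributeError above is the signature of an API mismatch between sourmash and the installed khmer. A generic defensive check like the following (my own sketch, not part of either package) would surface the mismatch as one clear message at startup, rather than as a deep traceback in the middle of a run:

```python
import importlib


def require_attr(module_name, attr):
    """Fail fast, with a readable message, if a module lacks an expected attribute."""
    mod = importlib.import_module(module_name)
    if not hasattr(mod, attr):
        version = getattr(mod, "__version__", "unknown version")
        raise SystemExit(
            f"{module_name} ({version}) has no attribute {attr!r}; "
            f"a different version of {module_name} is probably needed"
        )
    return getattr(mod, attr)


# Hypothetical usage at the top of gather-by-contig.py:
# load_nodegraph = require_attr("khmer", "load_nodegraph")
```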