sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
467 stars 80 forks source link

can't make sourmash work for decontaminating contigs any more #749

Closed SchwarzEM closed 4 years ago

SchwarzEM commented 4 years ago

I used to be able to make sourmash decontaminate contigs in long-read assemblies. Here are all of the steps I would run that used to function properly. But now, as of Oct. 20, 2019, I am getting a weird bug, that seems to be due to bioconda sourmash not being able to use the correct version of khmer. It needs to have a version of khmer with the attribute 'load_nodegraph', but bioconda isn't letting that happen any more.

Here's the walkthrough:

mkdir $SCRATCH/sourmash ;

cd $SCRATCH/sourmash ;

curl -O -L https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k31-2017.05.09.tar.gz ; tar xvfz microbe-genbank-sbt-k31-2017.05.09.tar.gz ;

curl -O -L https://gist.githubusercontent.com/ctb/fa1c11b5e2f9614128685600911c842a/raw/8feb65684bb9534f4549d058de0db46e0e3c06b6/gather-by-contig.py ; chmod +x gather-by-contig.py ;

[Do the very same bioconda installation that has worked previously:] conda create -n sourmash_2.0.0a6 sourmash=2.0.0a6 ; conda activate sourmash_2.0.0a6 ;

which sourmash ;

/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/bin/sourmash

which python ;

/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/bin/python

python --version ;

Python 3.6.7 # originally this was Python 3.6.5; why any change at all?

cd $SCRATCH/rhabditella/assemblies/flye/100x_decont ;

[Try this line that used to always work:]

$SCRATCH/sourmash/gather-by-contig.py \ --output-match raxei_flye_100x_2019.10.20.01.match.orig.fa \ --output-nomatch raxei_flye_100x_2019.10.20.01.no_match.orig.fa \ --csv raxei_flye_100x_2019.10.20.01.sm.summary.tsv.txt \ ../100x/Raxei_flye_100x_2019.10.19/assembly.fasta \ $SCRATCH/sourmash/genbank-k31.sbt.json ;

[It starts to work, but then fails, for what appears to be the inability of bioconda-sourmash to get the exactly correct version of khmer that will let everything Just Keep Working.]

found SBT database /pylon5/mc5phvp/schwarze/sourmash/genbank-k31.sbt.json using ksize=31, scaled=2000 loading sequences from ../100x/Raxei_flye_100x_2019.10.19/assembly.fasta saving match sequences to raxei_flye_100x_2019.10.20.01.match.orig.fa saving nomatch sequences to raxei_flye_100x_2019.10.20.01.no_match.orig.fa outputting CSV summary to raxei_flye_100x_2019.10.20.01.sm.summary.tsv.txt Traceback (most recent call last): File "/pylon5/mc5phvp/schwarze/sourmash/gather-by-contig.py", line 87, in main() File "/pylon5/mc5phvp/schwarze/sourmash/gather-by-contig.py", line 61, in main for leaf in tree.find(search_fn, query, args.threshold): File "/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/lib/python3.6/site-packages/sourmash/sbt.py", line 191, in find if search_fn(node_g, *args): File "/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/lib/python3.6/site-packages/sourmash/sbtmh.py", line 191, in search matches = sum(1 for value in mins if node.data.get(value)) File "/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/lib/python3.6/site-packages/sourmash/sbtmh.py", line 191, in matches = sum(1 for value in mins if node.data.get(value)) File "/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.0.0a6/lib/python3.6/site-packages/sourmash/sbt.py", line 684, in data self._data = khmer.load_nodegraph(f.name) AttributeError: module 'khmer' has no attribute 'load_nodegraph'

SchwarzEM commented 4 years ago

I've also tried running gather-by-contig.py with sourmash versions 2.1.0 or 2.2.0 (both via bioconda), rather than version 2.0.0a6. For either version 2.1.0 or version 2.2.0, I get the following error (which is at least somewhat different from the error with 2.0.0a6):

""" Traceback (most recent call last): File "/pylon5/mc5phvp/schwarze/sourmash/gather-by-contig.py", line 14, in from sourmash.sbtmh import SearchMinHashesFindBestIgnoreMaxHash ImportError: cannot import name 'SearchMinHashesFindBestIgnoreMaxHash' from 'sourmash.sbtmh' \ (/pylon5/mc5phvp/schwarze/anaconda2/envs/sourmash_2.2.0/lib/python3.7/site-packages/sourmash/sbtmh.py) """

which obviously doesn't work either.

A key issue here seems to be that the bioconda version of sourmash 2.0.0a6 (which used to work with gather-by-contig.py) is not strictly keeping register with the exact version of khmer that will allow gather-by-contig.py to run.

If there was some way to enforce having sourmash version 2.0.0a6 in bioconda be tightly linked with whatever earlier version of khmer it used to have!

SchwarzEM commented 4 years ago

A final point that may help in figuring out how to fix all this: I'm attaching the copy of gather-by-contig.py that I had previously successfully used for decontaminating assembly contigs.

Attaching, not just posting here, so that github won't munge the text of gather-by-contig.py.

Attaching as .py.txt rather than .py, so that github will let me post the thing at all.

Finally -- note that this copy is slightly different from the version that I had initially downloaded from https://gist.githubusercontent.com/ctb/fa1c11b5e2f9614128685600911c842a/raw/8feb65684bb9534f4549d058de0db46e0e3c06b6/gather-by-contig.py.

gather-by-contig.py.txt

SchwarzEM commented 4 years ago

OK, I have fixed the bug. My solution:

  1. Upgrade to sourmash 2.2.0 after all.

  2. Revise my gather-by-contig.py so that the function SearchMinHashesFindBestIgnoreMaxHash is replaced by GatherMinHashesFindBestIgnoreMaxHash, as follows:

""" nano gather-by-contig.py ; diff gather-by-contig.previous.py gather-by-contig.py ;

14c14
< from sourmash.sbtmh import SearchMinHashesFindBestIgnoreMaxHash
---
> from sourmash.sbtmh import GatherMinHashesFindBestIgnoreMaxHash
50c50
<         search_fn = SearchMinHashesFindBestIgnoreMaxHash().search
---
>         search_fn = GatherMinHashesFindBestIgnoreMaxHash().search

"""

I figured out the latter solution by seeing an 8-month-old comment in Titus' blog:

http://ivory.idyll.org/blog/2018-detecting-contamination-in-long-read-assemblies.html

With this github post, hopefully, the solution will become more reliably accessible...

SchwarzEM commented 4 years ago

And, I'm attaching my corrected version of gather-by-contig.py here (as before, with *.txt so that github will let me post it as an un-munged file).

gather-by-contig.py.txt

SchwarzEM commented 4 years ago

Well, [redacted verb] me -- I declared victory before looking at the actual results of the seemingly succesful sourmash run. It's failing, because for some weird reason sourmash is no longer making a successful binary division between contaminant contigs and non-contaminant contigs. It's managing somehow to lose contigs -- lots of contigs. Never did that before.

Here's an example of what's going very wrong.

Input assembly:

Total nt: 166,592,905 Scaffolds: 107 Contigs: 113

Output with a match to GenBank (contaminants):

Total nt: 23,610,398 Scaffolds: 17 Contigs: 17

Output without a match to GenBank (non-contaminants):

Total nt: 98,892,689 Scaffolds: 79 Contigs: 79

The latter two don't come close to adding up to 166 Mb.

Oh, how I wish I could still make gather-by-contig.py work with sourmash 2.0.0a6, using bioconda to install sourmash version 2.0.0a6!

SchwarzEM commented 4 years ago

Looking at one run that did work (wtdbg2) versus one that clearly didn't (Flye): the difference may be that Flye emits both contigs and scaffolds, whereas wtdbg2 (like Canu) assembles pure contigs only.

Weird to have a long-read assembler scaffold anything, but it would explain why Flye's sourmash run went so very wrong; if gather-by-contig.py can only parse contigs, then scaffolds will just disappear rather than being assigned to matching (contaminant) or non-matching (non-contaminant) sets.

ctb commented 4 years ago

hi @SchwarzEM can you show me the command lines and also drop me a few MB of fasta files to test it on, pls? Thanks!

ctb commented 4 years ago

ok, I see the command lines above, sorry! Some test files would be helpful tho.

SchwarzEM commented 4 years ago

Sorry about the delay -- there were Things...

I have two test files Raxei_flye_50x_2019.10.19_scaffold_206.fa will not work with sourmash; Raxei_flye_50x_2019.10.19_scaffold_206.contigs.fa will.

github won't let me post the files, so I will e-mail them to you directly.

I think is a pretty obvious reason why one of them is failing -- Raxei_flye_50x_2019.10.19_scaffold_206.fa contains 100 nucleotides of scaffolding ("N" residues) which link the two contigs in Raxei_flye_50x_2019.10.19_scaffold_206.contigs.fa.

Here are their detailed statistics:

Sequence: Raxei_flye_50x_2019.10.19_scaffold_206.fa

Total nt: 1,795,346 Scaffolds: 1 Contigs: 2

ACGT nt: 1,795,246 N-res. nt: 100 % non-N: 99.99 % GC: 35.36 % softmasked [1]: 0.00 ([1]: versus entire assembly size) % softmasked [2]: 0.00 ([2]: versus non-N, ACGT only, assembly size)

Scaffold N50 nt: 1,795,346.0 Scaffold N90 nt: 1,795,346.0 Scaf. max. nt: 1,795,346 Scaf. min. nt: 1,795,346

Contig N50 nt: 1,722,405.0 Contig N90 nt: 1,722,405.0 Contig max. nt: 1,722,405 Contig min. nt: 72,841

Sequence: Raxei_flye_50x_2019.10.19_scaffold_206.contigs.fa

Total nt: 1,795,246 Scaffolds: 2 Contigs: 2

ACGT nt: 1,795,246 N-res. nt: 0 % non-N: 100.00 % GC: 35.36 % softmasked [1]: 0.00 ([1]: versus entire assembly size) % softmasked [2]: 0.00 ([2]: versus non-N, ACGT only, assembly size)

Scaffold N50 nt: 1,722,405.0 Scaffold N90 nt: 1,722,405.0 Scaf. max. nt: 1,722,405 Scaf. min. nt: 72,841

Contig N50 nt: 1,722,405.0 Contig N90 nt: 1,722,405.0 Contig max. nt: 1,722,405 Contig min. nt: 72,841

SchwarzEM commented 4 years ago

The take-home on this seems to be: although most long-read assemblers will generate nothing but pure contigs, Flye (at least) will not, and this makes the products of Flye not directly usable for a sourmash decontamination. (As an immediate work-around for problem, I split the scaffolds from Flye into contigs and ran sourmash on the contigs.)

ctb commented 4 years ago

figured it out, it's the Ns, patch happening :)

best, --titus

On Tue, Oct 22, 2019 at 02:42:55PM -0700, Erich Schwarz wrote:

The take-home on this seems to be: although most long-read assemblers will generate nothing but pure contigs, Flye (at least) will not, and this makes the products of Flye not directly usable for a sourmash decontamination. (As an immediate work-around for problem, I split the scaffolds from Flye into contigs and ran sourmash on the contigs.)

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/dib-lab/sourmash/issues/749#issuecomment-545168841 -- C. Titus Brown, ctbrown@ucdavis.edu

ctb commented 4 years ago

tackled in #940