kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License

Tutorial: Error on running kevlar filter #376

Closed exeter-matthew-wakeling closed 4 years ago

exeter-matthew-wakeling commented 4 years ago

I'm getting an error when I try to run kevlar on the tutorial data. When I run:

kevlar filter novel.augfastq

then I get the following error message:

[kevlar] running version 0.7
Traceback (most recent call last):
  File "/gpfs/ts0/home/mw501/.local/bin/kevlar", line 11, in <module>
    load_entry_point('biokevlar==0.7', 'console_scripts', 'kevlar')()
  File "/gpfs/ts0/home/mw501/.local/lib/python3.6/site-packages/kevlar/__main__.py", line 30, in main
    mainmethod(args)
  File "/gpfs/ts0/home/mw501/.local/lib/python3.6/site-packages/kevlar/filter.py", line 100, in main
    mask = kevlar.sketch.load(args.mask)
  File "/gpfs/ts0/home/mw501/.local/lib/python3.6/site-packages/kevlar/sketch.py", line 87, in load
    if not filename.endswith(extensions):
AttributeError: 'NoneType' object has no attribute 'endswith'

Any assistance would be greatly appreciated.

standage commented 4 years ago

Hi @exeter-matthew-wakeling, could you share all of the commands you ran prior to kevlar filter on the tutorial data? That will help me troubleshoot what went wrong and where. Thanks.

exeter-matthew-wakeling commented 4 years ago

These are the commands that I ran:

kevlar count --memory 250M mother.ct mother.fq.gz
kevlar count --memory 250M father.ct father.fq.gz 
kevlar count --memory 250M proband.ct proband.fq.gz 
kevlar novel --case proband.fq.gz --case-counts proband.ct --control-counts father.ct mother.ct -o novel.output
mv novel.output novel.augfastq
kevlar filter -o novel_filtered.augfastq novel.augfastq 

This last command failed. At that point, the contents of the directory were:

-rw-rw-r-- 1 mw501 research 249999800 Feb 14 16:13 father.ct
-rw-rw-r-- 1 mw501 research  32314239 Feb 14 15:22 father.fq.gz
-rw-rw-r-- 1 mw501 research 249999800 Feb 14 16:12 mother.ct
-rw-rw-r-- 1 mw501 research  32314752 Feb 14 15:19 mother.fq.gz
-rw-rw-r-- 1 mw501 research    394075 Feb 14 16:22 novel.augfastq
-rw-rw-r-- 1 mw501 research 249999800 Feb 14 16:16 proband.ct
-rw-rw-r-- 1 mw501 research  32312614 Feb 14 15:25 proband.fq.gz
-rw-rw-r-- 1 mw501 research    717771 Feb 14 15:28 refr.fa.gz
-rw-rw-r-- 1 mw501 research        12 Feb 14 15:29 refr.fa.gz.amb
-rw-rw-r-- 1 mw501 research        39 Feb 14 15:29 refr.fa.gz.ann
-rw-rw-r-- 1 mw501 research   2500088 Feb 14 15:29 refr.fa.gz.bwt
-rw-rw-r-- 1 mw501 research    625002 Feb 14 15:29 refr.fa.gz.pac
-rw-rw-r-- 1 mw501 research   1250056 Feb 14 15:29 refr.fa.gz.sa

standage commented 4 years ago

I get the following when I run the first command.

$ kevlar count --memory 250M mother.ct mother.fq.gz
[kevlar] running version 0.7+15.gebabd62
[kevlar::count] Storing k-mers in a count table, a CountMin sketch with a counter size of 8 bits, for k-mer abundance queries (max abundance 255)
[kevlar::count] - processing "mother.fq.gz"
[kevlar::count] Done loading k-mers;
    7500000 reads processed, 86100250 distinct k-mers stored;
    estimated false positive rate is 0.399 (FPR too high, bailing out!!!)

Two thoughts.

  1. The quick start data is different from the tutorial data but is named the same. Did you download the new data, or did you re-use data from the quick start?
  2. It looks like you're using kevlar version 0.7. Several changes to the software have been made since then. I'm hoping to release a new version soon, but in the meantime you may consider installing the latest version from GitHub with pip install git+https://github.com/dib-lab/kevlar.git.

exeter-matthew-wakeling commented 4 years ago

You must be right - the files I am using are smaller than that, so they must be the quickstart files. I got the following when I ran the first command:

[mw501@login01 kevlar]$ kevlar count --memory 250M mother.ct mother.fq.gz 
[kevlar] running version 0.7
[kevlar::count] Storing k-mers in a count table, a CountMin sketch with a counter size of 8 bits, for k-mer abundance queries (max abundance 255)
[kevlar::count] - processing "mother.fq.gz"
[kevlar::count] Done loading k-mers;
    750000 reads processed, 2507236 distinct k-mers stored;
    estimated false positive rate is 0.000;
    saved to "mother.ct"
[kevlar::count] Total time: 22.01 seconds

I just updated kevlar using the command you quoted (with --user added, as I don't have root access on this box). I then tried running kevlar, but got an error about khmer missing, so I installed that (again?). Now, when I try to run the first command, I get:

[mw501@login01 kevlar]$ kevlar count --memory 250M mother.ct mother.fq.gz 
[kevlar] running version 0.7+15.gebabd62
[kevlar::count] Storing k-mers in a count table, a CountMin sketch with a counter size of 8 bits, for k-mer abundance queries (max abundance 255)
[kevlar::count] - processing "mother.fq.gz"
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/gpfs/ts0/shared/software/Miniconda3/4.7.10/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/gpfs/ts0/shared/software/Miniconda3/4.7.10/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
TypeError: argument 1 must be str, not _khmer.ReadParser

[kevlar::count] Done loading k-mers;
    0 reads processed, 0 distinct k-mers stored;
    estimated false positive rate is 0.000;
    saved to "mother.ct"
[kevlar::count] Total time: 0.47 seconds

standage commented 4 years ago

Which version of khmer are you running? (You can check with normalize-by-median.py --version.) That might be the cause of the error in your latest post. The latest version of kevlar also relies on some updates to khmer that haven't yet been published in a stable release 😞. If you can install the latest version of khmer from GitHub (https://github.com/dib-lab/khmer.git), that would be best. That may require getting a sysadmin's help if you don't have root privileges, though.

Otherwise, you can try downgrading back to kevlar version 0.7 and increasing the memory you use for k-mer counting. If you're seeing differences between kevlar 0.7 and the latest documentation, you could also consider using the kevlar 0.7 documentation.
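
To see why increasing the memory lowers the false positive rate that kevlar count reports, here is a rough back-of-the-envelope sketch. It is not kevlar's or khmer's actual calculation: it assumes the memory is split evenly across a fixed number of tables with one byte per counter, and the table count of 4 is a guess used only for illustration, so the numbers will not match the reported values exactly.

```python
import math


def estimated_fpr(distinct_kmers: int, memory_bytes: float, num_tables: int = 4) -> float:
    """Approximate false positive rate of a CountMin sketch with 8-bit counters.

    Assumptions (not taken from kevlar itself): memory is divided evenly
    among `num_tables` tables, each counter is one byte, and hashing is
    uniform. `num_tables=4` is a hypothetical default.
    """
    table_size = memory_bytes / num_tables
    # Expected fraction of occupied slots in a single table.
    occupied_fraction = 1.0 - math.exp(-distinct_kmers / table_size)
    # A false positive requires a collision in every table.
    return occupied_fraction ** num_tables


# Distinct k-mer count taken from the tutorial-data run in this thread.
for memory in (250e6, 500e6, 1e9, 2e9):
    fpr = estimated_fpr(86_100_250, memory)
    print(f"{memory / 1e6:>6.0f}M  estimated FPR {fpr:.3f}")
```

Under these assumptions, the roughly 2.5 million distinct k-mers of the quick start data fit comfortably in 250M, while the roughly 86 million of the tutorial data do not, which is consistent with the two kevlar count outputs above.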

exeter-matthew-wakeling commented 4 years ago

[mw501@login01 kevlar]$ normalize-by-median.py --version
|| This is the script normalize-by-median.py in khmer.
|| You are running khmer version 2.1.1
|| You are also using screed version 1.0.4
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2015. http://dx.doi.org/10.12688/f1000research.6924.1
||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

khmer 2.1.1
[mw501@login01 kevlar]$ pip install --user git+https://github.com/dib-lab/khmer.git
Collecting git+https://github.com/dib-lab/khmer.git
  Cloning https://github.com/dib-lab/khmer.git to /tmp/pip-req-build-y38jy0gp
  Running command git clone -q https://github.com/dib-lab/khmer.git /tmp/pip-req-build-y38jy0gp
Requirement already satisfied: screed>=1.0 in /gpfs/ts0/home/mw501/.local/lib/python3.7/site-packages (from khmer==3.0.0a3) (1.0.4)
Requirement already satisfied: bz2file in /gpfs/ts0/home/mw501/.local/lib/python3.7/site-packages (from khmer==3.0.0a3) (0.98)
Building wheels for collected packages: khmer
  Building wheel for khmer (setup.py) ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-0nh1y_ub/wheels/6b/c2/6a/ec82249e368a3b7a8efe8514e946e845451960517d9c50d8e8
Successfully built khmer
Installing collected packages: khmer
  Found existing installation: khmer 2.1.1
    Uninstalling khmer-2.1.1:
      Successfully uninstalled khmer-2.1.1
Successfully installed khmer-3.0.0a3
[mw501@login01 kevlar]$ 
[mw501@login01 kevlar]$ normalize-by-median.py --version

|| This is the script normalize-by-median.py in khmer.
|| You are running khmer version 3.0.0a3
|| You are also using screed version 1.0.4
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1
||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

khmer 3.0.0a3

Ok, so the latest git version is now installed. I'm going to try the commands again - note that I'm running it on the quickstart files.

[mw501@login01 kevlar]$ kevlar count --memory 250M mother.ct mother.fq.gz 
[kevlar] running version 0.7+15.gebabd62
[kevlar::count] Storing k-mers in a count table, a CountMin sketch with a counter size of 8 bits, for k-mer abundance queries (max abundance 255)
[kevlar::count] - processing "mother.fq.gz"
[kevlar::count] Done loading k-mers;
    750000 reads processed, 2507236 distinct k-mers stored;
    estimated false positive rate is 0.000;
    saved to "mother.ct"
[kevlar::count] Total time: 20.20 seconds
[mw501@login01 kevlar]$ kevlar count --memory 250M father.ct father.fq.gz 
[kevlar] running version 0.7+15.gebabd62
[kevlar::count] Storing k-mers in a count table, a CountMin sketch with a counter size of 8 bits, for k-mer abundance queries (max abundance 255)
[kevlar::count] - processing "father.fq.gz"
[kevlar::count] Done loading k-mers;
    750000 reads processed, 2507691 distinct k-mers stored;
    estimated false positive rate is 0.000;
    saved to "father.ct"
[kevlar::count] Total time: 22.25 seconds
[mw501@login01 kevlar]$ kevlar count --memory 250M proband.ct proband.fq.gz 
[kevlar] running version 0.7+15.gebabd62
[kevlar::count] Storing k-mers in a count table, a CountMin sketch with a counter size of 8 bits, for k-mer abundance queries (max abundance 255)
[kevlar::count] - processing "proband.fq.gz"
[kevlar::count] Done loading k-mers;
    750000 reads processed, 2507598 distinct k-mers stored;
    estimated false positive rate is 0.000;
    saved to "proband.ct"
[kevlar::count] Total time: 20.41 seconds
[mw501@login01 kevlar]$ kevlar novel --case proband.fq.gz --case-counts proband.ct --control-counts father.ct mother.ct -o novel.augfastq
[kevlar] running version 0.7+15.gebabd62
[kevlar::novel] Loading control samples
[kevlar::novel]    INFO: counttables for 2 sample(s) provided, any corresponding FASTA/FASTQ input will be ignored for computing k-mer abundances
[kevlar::sketch]     loading sketchfile "father.ct"...done! estimated false positive rate is 0.000
[kevlar::sketch]     loading sketchfile "mother.ct"...done! estimated false positive rate is 0.000
[kevlar::novel] Control samples loaded in 0.94 sec
[kevlar::novel] Loading case samples
[kevlar::novel]    INFO: counttables for 1 sample(s) provided, any corresponding FASTA/FASTQ input will be ignored for computing k-mer abundances
[kevlar::sketch]     loading sketchfile "proband.ct"...done! estimated false positive rate is 0.000
[kevlar::novel] Case samples loaded in 0.48 sec
[kevlar::novel] All samples loaded in 1.42 sec
[kevlar::novel] Iterating over reads from 1 case sample(s)
[kevlar::novel] Found 4274 instances of 370 unique novel kmers in 134 reads in 108.68 seconds
[kevlar::novel] Iterated over all case reads in 108.68 seconds
[kevlar::novel] Total time: 110.10 seconds
[mw501@login01 kevlar]$ kevlar filter -o novel_filtered.augfastq novel.augfastq 
[kevlar] running version 0.7+15.gebabd62
Traceback (most recent call last):
  File "/gpfs/ts0/home/mw501/.local/bin/kevlar", line 10, in <module>
    sys.exit(main())
  File "/gpfs/ts0/home/mw501/.local/lib/python3.7/site-packages/kevlar/__main__.py", line 30, in main
    mainmethod(args)
  File "/gpfs/ts0/home/mw501/.local/lib/python3.7/site-packages/kevlar/filter.py", line 100, in main
    mask = kevlar.sketch.load(args.mask)
  File "/gpfs/ts0/home/mw501/.local/lib/python3.7/site-packages/kevlar/sketch.py", line 87, in load
    if not filename.endswith(extensions):
AttributeError: 'NoneType' object has no attribute 'endswith'

So it is still failing with the same error message. Any clues?

exeter-matthew-wakeling commented 4 years ago

Wait, why does the error message mention files in ~/.local/lib/python3.7 when I'm using Python 3.6?

[mw501@login01 kevlar]$ module list
Currently Loaded Modulefiles:
  1) GCCcore/7.3.0                                6) hwloc/1.11.10-GCCcore-7.3.0                 11) ScaLAPACK/2.0.2-gompi-2018b-OpenBLAS-0.3.1  16) Tcl/8.6.8-GCCcore-7.3.0                     21) Python/3.6.6-foss-2018b
  2) binutils/2.30-GCCcore-7.3.0                  7) OpenMPI/3.1.1-GCC-7.3.0-2.30                12) foss/2018b                                  17) SQLite/3.24.0-GCCcore-7.3.0                 22) Miniconda3/4.7.10
  3) GCC/7.3.0-2.30                               8) OpenBLAS/0.3.1-GCC-7.3.0-2.30               13) bzip2/1.0.6-GCCcore-7.3.0                   18) XZ/5.2.4-GCCcore-7.3.0
  4) zlib/1.2.11-GCCcore-7.3.0                    9) gompi/2018b                                 14) ncurses/6.1-GCCcore-7.3.0                   19) GMP/6.1.2-GCCcore-7.3.0
  5) numactl/2.0.11-GCCcore-7.3.0                10) FFTW/3.3.8-gompi-2018b                      15) libreadline/7.0-GCCcore-7.3.0               20) libffi/3.2.1-GCCcore-7.3.0
[mw501@login01 kevlar]$ which python
/gpfs/ts0/shared/software/Miniconda3/4.7.10/bin/python

I have no idea what this means, but it doesn't look right.

Possibly false alarm:

[mw501@login01 kevlar]$ python --version
Python 3.7.3
[mw501@login01 kevlar]$ 
[mw501@login01 kevlar]$ pip --version
pip 19.1.1 from /gpfs/ts0/shared/software/Miniconda3/4.7.10/lib/python3.7/site-packages/pip (python 3.7)
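
When several Python installations are loaded at once (here a Python/3.6.6 module and Miniconda3), it can help to ask the interpreter itself what it is and where it would load a package from. The following is a small sketch using only the standard library; the helper name describe_environment is made up for this example.

```python
import importlib.util
import sys


def describe_environment(package_name: str) -> dict:
    """Report the running interpreter and where `package_name` would load from.

    `package_origin` is None when the package is not importable by this
    interpreter. The package is located but not imported.
    """
    spec = importlib.util.find_spec(package_name)
    return {
        "executable": sys.executable,
        "python_version": "%d.%d.%d" % sys.version_info[:3],
        "package_origin": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment("kevlar").items():
        print(f"{key}: {value}")
```

Running this with the same python that is first on the PATH shows whether the interpreter version matches the site-packages directory named in the traceback.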

standage commented 4 years ago

Whoops, I'm sorry. I failed to notice something very obvious: you are not providing any sequences with k-mers to mask. In part this is a bug: kevlar should either halt or at least issue a loud warning when this step is run without a mask. But it is also expected that the user will provide a mask to the kevlar filter command.
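
The traceback comes from calling .endswith on a filename that is None. As an illustration of the "halt loudly" behaviour described above, here is a minimal sketch of an explicit check. It is not kevlar's code: the function name check_mask_path and the extension tuple are hypothetical stand-ins.

```python
from typing import Optional

# Hypothetical stand-in for the list of accepted sketch file extensions.
SKETCH_EXTENSIONS = (".ct", ".nt")


def check_mask_path(filename: Optional[str]) -> str:
    """Return `filename` if it looks usable, otherwise raise a readable error."""
    if filename is None:
        # Without this check, filename.endswith(...) raises
        # AttributeError: 'NoneType' object has no attribute 'endswith'.
        raise ValueError("no mask file was provided; please supply a mask")
    if not filename.endswith(SKETCH_EXTENSIONS):
        raise ValueError(f"unexpected mask file extension: {filename!r}")
    return filename


try:
    check_mask_path(None)
except ValueError as error:
    print(f"error: {error}")
```

The point is only that a missing argument is caught before any string method is called on it, so the user sees what to fix instead of an internal AttributeError.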

In general, the mask should include k-mers that we're not interested in. For example, if a k-mer is highly abundant in the proband but absent from the parents, we're still not interested in that k-mer if it's present in the reference genome. We also want to ignore k-mers from contaminants, such as vector sequences or bacterial genomes. So I often create a mask (using kevlar count) from the sequences in the reference genome, UniVec, and E. coli. For this demo data, though, creating the mask from only the reference genome should suffice.
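
Conceptually, the mask is just a set of k-mers to ignore. The toy sketch below shows the idea with plain Python sets and made-up sequences. It is a simplification, not how kevlar works internally: it uses exact sets rather than a probabilistic sketch, and it does not treat a k-mer and its reverse complement as the same k-mer.

```python
def kmers(sequence: str, k: int) -> set:
    """Return the set of all substrings of length k in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_mask(sequences, k: int) -> set:
    """Collect the k-mers of every sequence we want to ignore."""
    mask = set()
    for sequence in sequences:
        mask |= kmers(sequence, k)
    return mask


def filter_novel(candidates: set, mask: set) -> set:
    """Keep only candidate k-mers that are absent from the mask."""
    return {kmer for kmer in candidates if kmer not in mask}


# Made-up toy sequences standing in for a reference genome and a contaminant.
reference = "ACGTACGTGA"
contaminant = "TTTTGGGGCC"
mask = build_mask([reference, contaminant], k=5)

# "ACGTA" occurs in the reference, "GGGGC" in the contaminant, "CATCA" in neither.
candidates = {"ACGTA", "GGGGC", "CATCA"}
print(sorted(filter_novel(candidates, mask)))  # prints ['CATCA']
```

Only the k-mer found in neither the reference nor the contaminant survives, which is the behaviour the mask is meant to provide for putatively novel k-mers.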

exeter-matthew-wakeling commented 4 years ago

Fabulous. It works now. Many thanks. I have managed to find the de novo variants in the quickstart trio. Now I'll see if I can get it working on our real WGS data.