sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

documentation for `sourmash signature kmers` is incorrect about exiting with bad kmers #2842

Open jessicalumian opened 10 months ago

jessicalumian commented 10 months ago

Hi! I'm using `sourmash signature kmers` to extract k-mers and FASTA sequences, given a signature containing hashes of interest and the original FASTA file.

My command:

sourmash sig kmers --signatures <sig file of matches> --sequences <fasta file> --save-sequences <output name> --save-kmers <output name2>

I got an error when sourmash came across a k-mer containing an N.

Traceback (most recent call last):
  File "/home/jupyter-jessica/.local/bin/sourmash", line 8, in <module>
    sys.exit(main())
  File "/home/jupyter-jessica/.local/lib/python3.8/site-packages/sourmash/__main__.py", line 13, in main
    return mainmethod(args)
  File "/home/jupyter-jessica/.local/lib/python3.8/site-packages/sourmash/cli/sig/kmers.py", line 91, in main
    return sourmash.sig.__main__.kmers(args)
  File "/home/jupyter-jessica/.local/lib/python3.8/site-packages/sourmash/sig/__main__.py", line 1148, in kmers
    for kmer, hashval in kh_iter:
  File "/home/jupyter-jessica/.local/lib/python3.8/site-packages/sourmash/minhash.py", line 387, in kmers_and_hashes
    hashvals = self.seq_to_hashes(sequence,
  File "/home/jupyter-jessica/.local/lib/python3.8/site-packages/sourmash/minhash.py", line 360, in seq_to_hashes
    hashes_ptr = self._methodcall(lib.kmerminhash_seq_to_hashes, to_bytes(sequence), len(sequence), force, bad_kmers_as_zeroes, is_protein, size)
  File "/home/jupyter-jessica/.local/lib/python3.8/site-packages/sourmash/utils.py", line 25, in _methodcall
    return rustcall(func, self._get_objptr(), *args)
  File "/home/jupyter-jessica/.local/lib/python3.8/site-packages/sourmash/utils.py", line 78, in rustcall
    raise exc
ValueError: invalid DNA character in input k-mer: <kmer with an N>

The documentation for `sourmash sig kmers` says: "By default, sig kmers ignores bad k-mers (e.g. non-ACGT characters in DNA). If --check-sequence is provided, sig kmers will error exit on the first bad k-mer."

Docs: https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-signature-kmers-extract-k-mers-and-or-sequences-that-match-to-signatures

So, the docs should be updated to say that, by default, non-ACGT characters will cause `sig kmers` to exit.
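
For reference, the two behaviors described in the quoted documentation can be sketched in plain Python. This is only an illustration under stated assumptions, not sourmash's implementation: `iter_kmers` and its `check_sequence` parameter are hypothetical names chosen to mirror the `--check-sequence` flag, and the error message is copied from the traceback above.

```python
VALID_DNA = set("ACGT")


def iter_kmers(sequence, ksize, check_sequence=False):
    """Yield the valid DNA k-mers of `sequence`.

    Hypothetical helper for illustration only (not part of sourmash).
    With check_sequence=False, k-mers containing non-ACGT characters
    are skipped; with check_sequence=True, the first such k-mer
    raises ValueError.
    """
    seq = sequence.upper()
    for start in range(len(seq) - ksize + 1):
        kmer = seq[start:start + ksize]
        if set(kmer) <= VALID_DNA:
            yield kmer
        elif check_sequence:
            raise ValueError(f"invalid DNA character in input k-mer: {kmer}")
        # otherwise: ignore the bad k-mer and keep going


if __name__ == "__main__":
    seq = "ACGTNACGT"

    # Documented default: bad k-mers are ignored.
    print(list(iter_kmers(seq, 4)))

    # Documented --check-sequence behavior: fail on the first bad k-mer.
    try:
        list(iter_kmers(seq, 4, check_sequence=True))
    except ValueError as exc:
        print(exc)
```

The traceback above corresponds to the second case occurring even though `--check-sequence` was not passed.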

ctb commented 10 months ago

aaaaactually I think the docs are the correct behavior 😆

Fixed in https://github.com/sourmash-bio/sourmash/pull/2856