marbl / parsnp

Parsnp was designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to a few hours. Input can be draft assemblies, finished genomes, or both, and output includes variant (SNP) calls, a core genome phylogeny and multi-alignments. Parsnp leverages contextual information provided by multi-alignments surrounding SNP sites for filtration/cleaning, in addition to existing tools for recombination detection/filtration and phylogenetic reconstruction.

UnicodeDecodeError when using a directory of input sequences (-d) #95

Closed joshwkearney closed 3 years ago

joshwkearney commented 3 years ago

I'm running ParSNP v1.5.4 (installed from conda) and ran into a behavior that seems to be a regression from v1.2.

What's expected: After running the command parsnp -r ./covid.fa -d ./genomes/ -o ./results/, ParSNP should align the genomes in ./genomes/ and output the results.

What happens: ParSNP fails with the following output:

|--Parsnp 1.5.4--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
15:34:45 - INFO - 
**************************
SETTINGS:
|-refgenome:    ./covid.fa
|-genomes:  
    ./genomes/myseq7.fa
    ./genomes/myseq5.fa
    ...8 more file(s)...
    ./genomes/myseq3.fa
    ./genomes/myseq4.fa
|-aligner:  muscle
|-outdir:   ./results/
|-OS:   Linux
|-threads:  1
**************************

15:34:45 - INFO - <<Parsnp started>>
15:34:45 - INFO - No genbank file provided for reference annotations, skipping..
Traceback (most recent call last):
  File "/home/software/miniconda3/envs/test_env/bin/parsnp", line 819, in <module>
    hdr = ff.readline()
  File "/home/software/miniconda3/envs/test_env/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 1078: invalid start byte
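
The last line of the traceback says that a byte read from one of the inputs cannot begin a UTF-8 sequence. The byte 0xff is never valid in UTF-8, so decoding fails on any input containing it, whatever the file is. A minimal sketch of that behavior, using a hypothetical helper name (decode_problem) and made-up byte strings rather than the attached data:

```python
def decode_problem(data):
    """Return (position, reason) of the first UTF-8 decoding error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the offending byte, err.reason the message
        return err.start, err.reason
    return None


# Plain FASTA text decodes cleanly; a stray 0xff byte does not.
print(decode_problem(b">seq1\nACGT\n"))
print(decode_problem(b"\x00\xff"))
```

This only illustrates the error class; it does not identify which file in ./genomes/ triggered it.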

Workarounds: This error can be avoided by using the command parsnp -r ./covid.fa -d ./genomes/*.fa -o ./results/, or by specifying each .fa file individually as a list, or by using ParSNP v1.2 instead of 1.5.4.

I'm unsure if specifying a directory for the -d option is supported in v1.5.4 because it isn't mentioned in the help menu, but it was supported in v1.2 and I would expect the two to be compatible. Not a big deal either way, but a more helpful error message would be appreciated. All my test data is attached in a zip. Thanks!
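
One way the more helpful error message requested above could look is sketched here. This is an assumption about a possible approach, not ParSNP's actual code: read_fasta_header and demo_error_message are hypothetical names, and the wording of the messages is invented for illustration.

```python
import os
import tempfile


def read_fasta_header(path):
    """Return the first line of a FASTA file, or raise ValueError naming the file."""
    try:
        with open(path, "r", encoding="utf-8") as ff:
            hdr = ff.readline()
    except UnicodeDecodeError as err:
        # Re-raise with the offending path so the user knows which input to remove.
        raise ValueError(
            f"{path} is not valid UTF-8 text and is probably not a FASTA file"
        ) from err
    if not hdr.startswith(">"):
        raise ValueError(f"{path} does not start with a FASTA header line ('>')")
    return hdr.rstrip("\n")


def demo_error_message():
    """Write a small binary file and return the message raised for it."""
    with tempfile.TemporaryDirectory() as tmp:
        bad = os.path.join(tmp, "not_a_genome.bin")
        with open(bad, "wb") as fh:
            fh.write(b"\x00\x01\xff\x00")
        try:
            read_fasta_header(bad)
        except ValueError as err:
            return str(err)
    return None
```

Naming the file in the message is the main point: the original traceback reports the byte offset but not which input caused it.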

data.zip

bkille commented 3 years ago

The -d flag will include everything in the directory. There appears to be a .DS_Store file in your genomes folder:

(base) bkille@tripp:~/temp$ ls -la genomes/
total 368
drwxr-xr-x 2 bkille bkille  4096 Jun  9 11:59 .
drwxrwxr-x 4 bkille bkille  4096 Jun 10 20:13 ..
-rw-r--r-- 1 bkille bkille  6148 Jun  9 11:59 .DS_Store
-rw-r--r-- 1 bkille bkille 30647 Jun  8 17:43 myseq10.fa
-rw-r--r-- 1 bkille bkille 30622 Jun  8 17:43 myseq11.fa
-rw-r--r-- 1 bkille bkille 30622 Jun  8 17:43 myseq1.fa
-rw-r--r-- 1 bkille bkille 30607 Jun  8 17:43 myseq2.fa
-rw-r--r-- 1 bkille bkille 30634 Jun  8 17:43 myseq3.fa
-rw-r--r-- 1 bkille bkille 30659 Jun  8 17:43 myseq4.fa
-rw-r--r-- 1 bkille bkille 30621 Jun  8 17:43 myseq5.fa
-rw-r--r-- 1 bkille bkille 30622 Jun  8 17:43 myseq6.fa
-rw-r--r-- 1 bkille bkille 30621 Jun  8 17:43 myseq7.fa
-rw-r--r-- 1 bkille bkille 30623 Jun  8 17:43 myseq8.fa
-rw-r--r-- 1 bkille bkille 30612 Jun  8 17:43 myseq9.fa
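
Since -d picks up everything in the directory, one defensive option is to filter the listing before passing files on. The sketch below is an illustration under stated assumptions, not ParSNP's implementation: list_genome_files and demo_listing are hypothetical names, and the rule used (skip dotfiles, skip anything whose first byte is not '>') is one possible choice.

```python
import os
import tempfile


def list_genome_files(directory):
    """Return sorted paths of regular, non-hidden files that start with '>'."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip hidden entries such as .DS_Store, and anything that is not a file.
        if name.startswith(".") or not os.path.isfile(path):
            continue
        # Peek at the first byte in binary mode so no decoding is attempted.
        with open(path, "rb") as fh:
            if fh.read(1) != b">":
                continue
        paths.append(path)
    return paths


def demo_listing():
    """Build a throwaway directory and return the file names that survive."""
    with tempfile.TemporaryDirectory() as tmp:
        contents = {
            ".DS_Store": b"\x00\x00\x00\x01Bud1\xff",  # made-up binary content
            "myseq1.fa": b">myseq1\nACGT\n",
            "myseq2.fa": b">myseq2\nTTGA\n",
            "notes.bin": b"\xff\x00",
        }
        for name, data in contents.items():
            with open(os.path.join(tmp, name), "wb") as fh:
                fh.write(data)
        return [os.path.basename(p) for p in list_genome_files(tmp)]
```

In this sketch, demo_listing() keeps only myseq1.fa and myseq2.fa. The shell glob ./genomes/*.fa from the workaround has a similar effect, since it matches only names ending in .fa and, by default, does not match dotfiles.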