kblin / ncbi-genome-download

Scripts to download genomes from the NCBI FTP servers
Apache License 2.0

UnicodeEncodeError when writing metadata #78

Closed danudwary closed 5 years ago

danudwary commented 5 years ago

Using ncbi-genome-download 0.2.7 from bioconda in a Python 2.7 Anaconda env. The error does not occur when -m is left out.

$ ncbi-genome-download -m ncbi-metadata.tsv -s refseq -F genbank -l complete -p 8 -r 3 bacteria
Traceback (most recent call last):
  File "/global/homes/d/dudwary/.conda/envs/datamng/bin/ncbi-genome-download", line 10, in <module>
    sys.exit(main())
  File "/global/homes/d/dudwary/.conda/envs/datamng/lib/python2.7/site-packages/ncbi_genome_download/__main__.py", line 24, in main
    ret = args_download(args)
  File "/global/homes/d/dudwary/.conda/envs/datamng/lib/python2.7/site-packages/ncbi_genome_download/core.py", line 144, in args_download
    return config_download(config)
  File "/global/homes/d/dudwary/.conda/envs/datamng/lib/python2.7/site-packages/ncbi_genome_download/core.py", line 197, in config_download
    table.write(handle)
  File "/global/homes/d/dudwary/.conda/envs/datamng/lib/python2.7/site-packages/ncbi_genome_download/metadata.py", line 93, in write
    row.write(handle)
  File "/global/homes/d/dudwary/.conda/envs/datamng/lib/python2.7/site-packages/ncbi_genome_download/metadata.py", line 70, in write
    handle.write(u"\t".join(values))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 179-180: ordinal not in range(128)

jrjhealey commented 5 years ago

I came across a similar issue with some other code recently.

What are your terminal locale settings?

danudwary commented 5 years ago

LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

jrjhealey commented 5 years ago

Does the issue persist if you switch to a UTF-8 locale? The locale name is something like en_US.utf8 or en_US.UTF-8. It might be that the metadata file has some unusual characters.
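
For anyone checking this on their own machine, here is a minimal sketch of how to ask Python which encoding it derives from the current locale. The helper name preferred_encoding is hypothetical and not part of ncbi-genome-download; the reported name depends on the Python version and platform, so no particular output is assumed.

```python
import codecs
import locale


def preferred_encoding():
    """Return the normalized codec name Python derives from the locale settings."""
    # False means: do not call setlocale(), just report what is configured.
    name = locale.getpreferredencoding(False)
    # codecs.lookup() normalizes aliases to the canonical codec name.
    return codecs.lookup(name).name


if __name__ == "__main__":
    print(preferred_encoding())
```

If this reports an ASCII codec, writing non-ASCII metadata through a default text handle is likely to fail in the way shown in the traceback.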

danudwary commented 5 years ago

That seems to be working. A little annoying to have to swap out of my cluster's defaults, but I don't do these downloads with much regularity, so it's workable.

jrjhealey commented 5 years ago

You could also try running it with Python 3, which handles text as Unicode natively instead of defaulting to ASCII. That might be sufficient.

Alternatively, Kai may be able to patch it to ensure it reads UTF8 internally.

kblin commented 5 years ago

Hi Dan,

I really hate github notifications somehow.

Anyway, the problem is that the file we're trying to write contains a non-7bit-ASCII character. If your environment is POSIX, we really don't know which of the 8bit tables we should be using for the unicode conversion.
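
A small sketch of that failure mode, independent of ncbi-genome-download: the same unicode row encodes fine as UTF-8 but not as ASCII. The row values and the helper try_encode are made up for illustration and are not taken from the NCBI data.

```python
# Hypothetical metadata row; the values are invented for illustration.
values = [u"GCF_EXAMPLE", u"Universit\u00e9 de Montr\u00e9al"]
row = u"\t".join(values)


def try_encode(text, encoding):
    """Encode text, returning the bytes or a short description of the failure."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as err:
        return "UnicodeEncodeError: " + err.reason


print(repr(try_encode(row, "utf-8")))
print(try_encode(row, "ascii"))
```

The ASCII attempt reports the same "ordinal not in range(128)" reason that appears in the traceback above.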

There's no magic to "make it use UTF8 internally" in this case, because we already are: https://github.com/kblin/ncbi-genome-download/blob/master/ncbi_genome_download/metadata.py#L70 is a unicode write. We could make this work regardless, but not for python2 and python3 from the same codebase.

Most reasonable systems use a UTF8 locale these days, and if you decide to instead use POSIX as a locale which enforces 7bit ASCII, I feel like it's your job to make sure you only handle ASCII inputs. NCBI downloads contain non-ASCII characters all the time.

I don't consider this a bug in ncbi-genome-download.

jananiravi commented 5 years ago

Hello, thanks for developing and supporting this really useful genome download package! I'm using UTF-8 encoding in my macOS (High Sierra) terminal, but I see the same error. Is there anything I can do to circumvent this? Thank you!

kblin commented 5 years ago

Just to check, what's the output of your locale command?

jananiravi commented 5 years ago

Here it is:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

kblin commented 5 years ago

Hm, ok, that should work. What exact command are you running?

jananiravi commented 5 years ago

These are two commands that I ran with and without -m to avoid the UTF error.

ncbi-genome-download -s 'refseq' -F 'gff' -l 'complete' --genus 'Mycobacterium tuberculosis variant bovis' -R 'all' -o mbov-genomes/ -N -m metatable_mbov -v bacteria

There should be 14 rows in the metadata table, but due to the error only 4 lines get written.

ncbi-genome-download -s 'refseq' -F 'gff' -l 'complete' --genus 'Mycobacterium tuberculosis' -R 'all' -o mtb-genomes/ -m metadata-mtb -N -v bacteria

I'm checking with other output formats to see if the metadata error still remains. Thanks!

kblin commented 5 years ago

Ah, I can reproduce this on python 2, but not on python 3. I'll see if I can fix this, but as a temporary workaround you might want to try running ncbi-genome-download in python 3.

kblin commented 5 years ago

Ah, I see the issue. I missed a plain open() call for opening the metadata file. On python 2 that causes the file to be opened in ASCII mode, and then the non-ASCII characters in record GCF_000234725.1 cause the crash.
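
As a rough sketch of the general technique (not necessarily the change that went into the release): opening the file through io.open with an explicit encoding makes unicode writes behave the same on Python 2 and Python 3, regardless of the locale. The function write_rows, the column names and the sample values are hypothetical.

```python
import io
import os
import tempfile


def write_rows(path, rows):
    """Write tab-separated unicode rows to path as UTF-8, whatever the locale is."""
    # io.open accepts an encoding on both Python 2 and Python 3,
    # unlike the plain open() builtin on Python 2.
    with io.open(path, "w", encoding="utf-8") as handle:
        for values in rows:
            handle.write(u"\t".join(values) + u"\n")


if __name__ == "__main__":
    # Hypothetical rows, invented for illustration.
    rows = [[u"assembly_accession", u"organism_name"],
            [u"GCF_EXAMPLE", u"caf\u00e9"]]
    path = os.path.join(tempfile.mkdtemp(), "metadata.tsv")
    write_rows(path, rows)
    with io.open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    # Check that the non-ASCII row survived the round trip.
    print(lines[1] == u"\t".join(rows[1]))
```

With the encoding fixed at open time, the locale no longer decides how the non-ASCII characters are written.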

The bad news is that whatever encoding that line is, it's not UTF-8 either, so the line is still corrupt. But at least it doesn't crash.

kblin commented 5 years ago

Fixed with the 0.2.8 release.