statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.
MIT License

gzip error on dbSNP download during 'pheweb add-rsids' #155

Closed ttbek closed 3 years ago

ttbek commented 3 years ago

When running: pheweb add-rsids

Downloading rsids from dbSNP
dbsnp will be stored at '/root/.pheweb/cache/rsids-150.vcf.gz'
Downloading dbsnp!
100% [....................................................................] 7541904638 / 7541904638
Done downloading.
Converting /Data/generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz -> /root/.pheweb/cache/rsids-150.vcf.gz
FAILED with status 1
output was:

gzip: /Data/generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz: invalid compressed data--crc error

gzip: /Data/generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz: invalid compressed data--length error

(Details in /Data/generated-by-pheweb/tmp/exception-2021-01-25T03-02-19.840870)

The exception says:

======= Exception ====
FAILED with status 1
output was:

gzip: /Data/generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz: invalid compressed data--crc error

gzip: /Data/generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz: invalid compressed data--length error

======= Traceback ====
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pheweb/command_line.py", line 148, in main
    run(sys.argv[1:])
  File "/usr/local/lib/python3.8/dist-packages/pheweb/command_line.py", line 142, in run
    handlers[subcommand](argv[1:])
  File "/usr/local/lib/python3.8/dist-packages/pheweb/command_line.py", line 63, in f
    module.run(argv)
  File "/usr/local/lib/python3.8/dist-packages/pheweb/load/add_rsids.py", line 102, in run
    download_rsids.run([])
  File "/usr/local/lib/python3.8/dist-packages/pheweb/load/download_rsids.py", line 31, in run
    run_script(r'''
  File "/usr/local/lib/python3.8/dist-packages/pheweb/load/load_utils.py", line 100, in run_script
    raise PheWebError(
pheweb.utils.PheWebError: FAILED with status 1
output was:

gzip: /Data/generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz: invalid compressed data--crc error

gzip: /Data/generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz: invalid compressed data--length error

Which doesn't tell me too much more. Taking a look at the file:


ls -la generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz
-rw-r--r-- 1 root root 8054833262 Jan 25 03:02 generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz

It's not a permissions issue: this runs in a Docker container and everything is root. The size of the downloaded file looks wrong, though: the progress bar reported 7541904638 bytes, but the file on disk is 8054833262 bytes. I tried downloading twice and got the same error. This error isn't related to the user input data, right? I was using a GRCh38-based file just to try things out, but I think the error concerns the downloaded dbSNP data.
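One way to separate a corrupt download from a pheweb problem is to decompress the file end to end on its own. The sketch below does that with Python's standard `gzip` module. The helper name `gzip_is_intact` is hypothetical and not part of pheweb.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip file at `path` decompresses cleanly.

    Hypothetical helper, not part of pheweb. Reading to the end makes the
    gzip module check the CRC32 and length stored in each member's trailer.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Discard the data; only the integrity check matters here.
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError includes gzip.BadGzipFile (bad header, CRC or length
        # mismatch) and also a missing file; EOFError means the stream was
        # truncated; zlib.error means the compressed data itself is corrupt.
        return False
    return True
```

Calling `gzip_is_intact("generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz")` on a file that triggers the errors above would be expected to return `False`. For a multi-gigabyte file the check takes a while, since it has to read everything.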

ttbek commented 3 years ago

The download probably is the issue: the first attempt also produced a file of a different size:

-rw-r--r-- 1 root root 7878918094 Jan 24 19:57 dbsnp-b150-GRCh37.gz

Is there a more robust way to download this file? I don't usually have any trouble with downloads via, e.g., Firefox or wget. Downloads from Python scripts have been a bane in the past. I'm not sure why they are so unreliable, but in other projects they've cost me weeks.

I found the URL in the code and am downloading with wget instead. I'll let you know how it goes.
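For a download that has to stay in Python, one common safeguard is to count the bytes written and compare them with the size the server announces. The sketch below uses only the standard library. The names `copy_and_check` and `download` are hypothetical, this is not how pheweb downloads files, and it does not explain why the original download came out the wrong size.

```python
import urllib.request


def copy_and_check(src, dest_path, expected_size, chunk_size=1 << 20):
    """Copy the binary stream `src` to `dest_path` and verify the byte count.

    Hypothetical helper. Raises IOError if `expected_size` is given and the
    number of bytes written differs from it. Returns the bytes written.
    """
    written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    if expected_size is not None and written != expected_size:
        raise IOError(f"expected {expected_size} bytes, wrote {written}")
    return written


def download(url, dest_path):
    """Download `url` to `dest_path`, checking Content-Length when present."""
    with urllib.request.urlopen(url) as resp:
        length = resp.headers.get("Content-Length")
        # Some servers omit the header; then no size check is possible.
        expected = int(length) if length is not None else None
        return copy_and_check(resp, dest_path, expected)
```

A size match only shows that the transfer was complete, not that the contents are valid, so decompressing the file afterwards is still a sensible second check.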

pjvandehaar commented 3 years ago

That method of downloading resources was unreliable. I've changed pheweb in the hg38 branch to download from our own server instead. I'm hoping to test that code a bit more and then merge those changes into master and make a new release in the next couple days. In the meantime I recommend using the new code, especially if you're on hg38.

ttbek commented 3 years ago

Thanks. We actually need hg37 for our data; I had just grabbed the FinnGen data quickly to test with. The file fetched with wget is the correct size, and the command seems to be proceeding now. Are there other resource downloads I should consider suspect?

For anyone who encounters the same problem, the file can be fetched with:

wget https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-All.vcf.gz

It then needs to be renamed:

mv 00-All.vcf.gz dbsnp-b150-GRCh37.gz

The file should be placed in ./generated-by-pheweb/sites/dbSNP/. Then the previously failing command can be run again.
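The rename-and-place step can also be scripted. The sketch below assumes the file has already been fetched (for example with the wget command above) and that `pheweb_dir` is the directory pheweb is run from. The function name `stage_dbsnp` is hypothetical; the target path and file name are the ones given in this thread.

```python
import shutil
from pathlib import Path


def stage_dbsnp(downloaded, pheweb_dir):
    """Move a manually downloaded 00-All.vcf.gz to
    <pheweb_dir>/generated-by-pheweb/sites/dbSNP/dbsnp-b150-GRCh37.gz.

    Hypothetical helper, not part of pheweb. Returns the target path.
    """
    target = (Path(pheweb_dir) / "generated-by-pheweb" / "sites"
              / "dbSNP" / "dbsnp-b150-GRCh37.gz")
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        # Remove the earlier, possibly corrupt download first.
        target.unlink()
    shutil.move(str(downloaded), str(target))
    return target
```

After `stage_dbsnp("00-All.vcf.gz", ".")`, running `pheweb add-rsids` again should pick up the staged file, as described above.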

I'll close the issue, since going forward hg38 will be the standard more and more often (well, until the next build, of course).