kevinhu / cancer_data

A unified downloader+preprocessor for cancer genomics datasets
https://cancer_data.kevinhu.io
MIT License

tcga_normalized_gene_expression fails to download due to md5sum mismatch #68

Closed by poneill 2 years ago

poneill commented 2 years ago

Hi, thanks for putting this repo together, it looks very handy.

On cancer_data version 0.1.0, I tried to download the tcga_normalized_gene_expression dataset via

cancer_data.download("tcga_normalized_gene_expression"),

but this failed with the message:

243iB [00:00, 58.9kiB/s]
EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz does not match provided md5sum. Attempting second download.
Downloading https://pancanatlas.xenahubs.net/download/EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz
243iB [00:00, 36.1kiB/s]
Second download of EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz failed. Recommend manual inspection.

Yet when I manually inspect the md5sum of the .gz, everything looks ok:

import hashlib

import cancer_data

# Expected checksum recorded in the cancer_data schema for this dataset.
schema_md5 = cancer_data.schema().loc['tcga_normalized_gene_expression']['downloaded_md5']

# Local copy of the .gz file.
fname = "/Users/pat/Downloads/EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz"

# Hash the file contents.
with open(fname, "rb") as f:
    observed_md5 = hashlib.md5(f.read()).hexdigest()

assert schema_md5 == observed_md5  # "5fbfb5a4854a2cfc8a95c3ada5379fd4"

Am I doing something silly? Thanks in advance.
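
The 243-byte transfers in the log above are far smaller than a pan-cancer expression matrix should be, so one thing worth checking in a case like this is whether the file that was fetched is gzip data at all. The following is a minimal sketch of such a check, not part of cancer_data: the helper name `describe_download` is hypothetical, and it only reports the file size, whether the file starts with the gzip magic bytes, and its MD5.

```python
import hashlib
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def describe_download(path, chunk_size=1 << 20):
    """Return (size_in_bytes, looks_like_gzip, md5_hexdigest) for a file.

    Hypothetical helper for illustration; not part of cancer_data.
    """
    path = Path(path)
    md5 = hashlib.md5()
    with path.open("rb") as f:
        head = f.read(2)  # enough to compare against the gzip magic bytes
        md5.update(head)
        # Hash the rest in chunks so large files are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return path.stat().st_size, head == GZIP_MAGIC, md5.hexdigest()
```

Running this on the file that the failed download left behind, and on the manually downloaded copy, would show whether the two files differ in size and type or only in checksum.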

kevinhu commented 2 years ago

It seems that Xena has changed its download paths. In the meantime, you can use the new download link: https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz

I'll update the URLs and re-download to make sure the contents are still consistent.
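
Until the package is updated, the file can be fetched from the new link by hand and compared against the checksum in the schema. The following is a rough sketch using only the standard library, not cancer_data's own download code: the function name `download_and_verify` is hypothetical, and whether the file at the new URL still matches the old MD5 is exactly what has yet to be confirmed.

```python
import hashlib
import shutil
import urllib.request


def download_and_verify(url, dest, expected_md5, chunk_size=1 << 20):
    """Stream `url` to `dest`, then return True if dest's MD5 equals `expected_md5`.

    Hypothetical helper for illustration; not part of cancer_data.
    """
    # Copy the response body to disk in chunks rather than reading it all at once.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)

    # Re-read the saved file and compute its MD5 in chunks.
    md5 = hashlib.md5()
    with open(dest, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5


# Example call (URL and checksum taken from this thread; large download):
# ok = download_and_verify(
#     "https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/"
#     "EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz",
#     "EB++AdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena.gz",
#     "5fbfb5a4854a2cfc8a95c3ada5379fd4",
# )
```

A `False` result would mean the saved file does not match the recorded checksum, either because the download was incomplete or because the hosted file has changed.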

kevinhu commented 2 years ago

Updated the package. You can check out 0.3.5, which should have the correct URL.

poneill commented 2 years ago

Thanks!