mdelcorvo / TOSCA

Snakemake workflow for somatic mutation detection without matched normal samples
MIT License

broken URL in config for ESP6500SI (download_database rule) #9

Open snizzo opened 8 months ago

Hi! I've reproduced the targeted analysis in the TOSCA paper, using data provided by PRJEB36436.

Issue: In rules/database.smk, the download_database rule fails because the URL for ESP6500SI returns a 404. The error is not reported at download time. It only surfaces when a later job tries to unpack the (broken) archive and tar complains that it cannot detect archive data.
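To illustrate how the failure could be caught right after the download instead of at the unpack step, here is a minimal Python sketch. It is not code from TOSCA: the function name `check_tar_gz` is hypothetical, and it assumes the downloaded file is supposed to be a gzip-compressed tar archive, as the `.tar.gz` extension suggests.

```python
import tarfile
from pathlib import Path

# First two bytes of any gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def check_tar_gz(path):
    """Raise ValueError if `path` is not a readable gzip-compressed tar archive.

    Hypothetical helper: meant to run right after a download so that a
    saved error page (e.g. from a 404) is rejected immediately.
    """
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        raise ValueError(f"{p}: missing or empty download")
    with p.open("rb") as fh:
        if fh.read(2) != GZIP_MAGIC:
            raise ValueError(f"{p}: not gzip data (possibly an HTML error page)")
    # is_tarfile opens the file and reads the first header, so it also
    # rejects gzip data that does not contain a tar archive.
    if not tarfile.is_tarfile(p):
        raise ValueError(f"{p}: gzip data, but not a tar archive")
    return p
```

Calling something like this at the end of the download step would make the job that fetched the file fail, rather than a downstream job.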

Possible solution: As a drop-in fix, I changed the hardcoded URL in config/<yourconfig>.yaml from:

ESP: "http://evs.gs.washington.edu/evs_bulk_data/ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf.tar.gz"

to:

ESP: "https://web.archive.org/web/20220419143751if_/https://evs.gs.washington.edu/evs_bulk_data/ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf.tar.gz"

This applies to both GRCh38 and GRCh37, and it works. The only downloadable copy of that file I have found is a snapshot on the Internet Archive. I'm not sure they are happy with people linking to or downloading from them directly (they suggest using their own CLI client).
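If hardcoding the archive snapshot feels too fragile, one alternative would be to try the original URL first and fall back to the snapshot only when it fails. The following is a rough sketch of that idea, not something TOSCA does today: `first_working_url` and `http_ok` are hypothetical names, and the HEAD-request probe is an assumption about how reachability might be checked.

```python
import urllib.request


def http_ok(url, timeout=30):
    """Return True if a HEAD request to `url` answers with HTTP 200."""
    req = urllib.request.Request(url, method="HEAD")
    # urlopen raises HTTPError (a subclass of OSError) for 4xx/5xx codes.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status == 200


def first_working_url(urls, probe=http_ok):
    """Return the first URL for which `probe` succeeds, or None."""
    for url in urls:
        try:
            if probe(url):
                return url
        except OSError:
            # Covers HTTP errors, DNS failures, and timeouts.
            continue
    return None


if __name__ == "__main__":
    # Both URLs are taken from this issue: original first, snapshot second.
    candidates = [
        "http://evs.gs.washington.edu/evs_bulk_data/ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf.tar.gz",
        "https://web.archive.org/web/20220419143751if_/https://evs.gs.washington.edu/evs_bulk_data/ESP6500SI-V2-SSA137.GRCh38-liftover.snps_indels.vcf.tar.gz",
    ]
    print(first_working_url(candidates))
```

The probe is passed in as a parameter so the selection logic can be exercised without network access.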