hammerlab / biokepi

Bioinformatics Ketrew Pipelines
Apache License 2.0

URL for B37 decoy has trailing bytes that annoy Gunzip #117

Open smondet opened 8 years ago

smondet commented 8 years ago

gunzip succeeds, but it prints "decompression OK, trailing garbage ignored" and exits with status 2.

-q silences the warning (see http://www.gzip.org/#faq8) but does not make it return 0.

(URL: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz)
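
To make the behaviour above concrete, here is a minimal Python sketch of a wrapper that accepts exit status 2 only when every line on stderr is the "trailing garbage ignored" warning. The function names are hypothetical and not part of biokepi, and the check relies on gunzip being run without -q so that the warning text is available. Matching on message text is fragile, so treat this as an illustration of the problem rather than a recommended fix.

```python
import subprocess


def is_benign_gunzip_result(returncode, stderr):
    """Return True if gunzip succeeded, or only warned about trailing garbage.

    Hypothetical helper: gzip exits with 2 for any warning, so the exit
    status alone cannot distinguish this warning from other ones.
    """
    if returncode == 0:
        return True
    lines = [line for line in stderr.splitlines() if line.strip()]
    return (
        returncode == 2
        and bool(lines)
        and all("trailing garbage ignored" in line for line in lines)
    )


def gunzip_tolerating_trailing_bytes(path):
    """Run gunzip on `path` (without -q) and raise on anything but the known warning."""
    result = subprocess.run(["gunzip", path], capture_output=True, text=True)
    if not is_benign_gunzip_result(result.returncode, result.stderr):
        raise RuntimeError(
            "gunzip failed with status %d: %s" % (result.returncode, result.stderr)
        )
```

Any other warning, or an error (status 1), still fails the step.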

arahuja commented 8 years ago

Did #118 fix this?

smondet commented 8 years ago

@arahuja no, I found no clean way to tell gunzip to ignore those errors without also ignoring other potential errors.

The options are:

What do you think?

arahuja commented 8 years ago

Hm, self-hosting seems like a solution that will have its own issues eventually, unless we just put the file in this repo?

Ignoring gzip errors and computing checksums is nice, but I'm not sure how manageable that is for all downloads.

Looks like 1000genomes acknowledges this issue with the file as well: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/README_human_reference_20110707

This file is compressed by razip from the samtools package for random access. Gzip may complain "decompression OK, trailing garbage ignored", but this does not affect the correctness of the decompressed file.

I think just putting that file here or in GitHub LFS is the easiest for now.

hammer commented 7 years ago

@smondet what is this blocked on?

smondet commented 7 years ago

@hammer it's blocked on either 1000genomes providing a proper gz file or us making a decision on how to bypass the problem :) (I'd like to implement the MD5 solution one day, but self-hosting seems to me like the fastest route)
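
For illustration, the MD5 idea could look roughly like the Python sketch below: decompress the gzip stream, stop at its end, count whatever follows as trailing bytes, and compare the MD5 of the decompressed data against a known value. This is only a sketch under assumptions: the function name is hypothetical, it assumes the payload is a single gzip member (it does not handle concatenated members), and the expected checksum would have to come from a trusted source.

```python
import hashlib
import zlib


def md5_ignoring_trailing_bytes(chunks):
    """Return (md5_hex, trailing_byte_count) for an iterable of compressed chunks.

    Assumes a single gzip member; anything after the end of that member is
    counted as trailing bytes and ignored. Corrupt data still raises
    zlib.error, and a truncated stream raises ValueError.
    """
    digest = hashlib.md5()
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    trailing = 0
    for chunk in chunks:
        if decomp.eof:
            # The gzip stream already ended; the rest is trailing data.
            trailing += len(chunk)
            continue
        digest.update(decomp.decompress(chunk))
        if decomp.eof:
            trailing += len(decomp.unused_data)
    if not decomp.eof:
        raise ValueError("truncated gzip stream")
    return digest.hexdigest(), trailing
```

A caller would compare the returned digest with the expected one and fail the download on a mismatch, so real corruption is still caught while the trailing bytes are tolerated.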

hammer commented 7 years ago

@smondet sounds like we're not blocked then; we should implement the self-hosting workaround.