openSNP / snpr

The sources of the openSNP website
http://opensnp.org
MIT License

Supported compression for genotype files? #427

Closed BenjaminHCCarr closed 6 years ago

BenjaminHCCarr commented 7 years ago

https://opensnp.org/genotypes/new says that:

Provide your genotyping file
*Zipped genotypings are fine too.*

It would be useful to know whether you strictly mean the ZIP format or whether gzip and bzip2 are also handled. For comparison, here are the sizes of one file under each compression:


du -hs masterVarBeta_23andme.vcf*
662M    masterVarBeta_23andme.vcf
 47M    masterVarBeta_23andme.vcf.bz2
 71M    masterVarBeta_23andme.vcf.gz
 42M    masterVarBeta_23andme.vcf.xz
 71M    masterVarBeta_23andme.vcf.zip

BenjaminHCCarr commented 7 years ago

I successfully uploaded a 23andme.vcf.gz without getting an error email and the file is populating the Genotypes field.

tsujigiri commented 7 years ago

Do the files come zipped when you download them from 23andMe? And if so, in which format?

BenjaminHCCarr commented 7 years ago

I don't have any 23andme files, though I am sure they are plain zip.

I have some Ancestry files, which are named: dna-data-YEAR-Month-day.zip

Not sure about 23andMe: https://www.openhumans.org/activity/23andme/upload/?next=/activity/23andme/ but @madprime might know off the top of her head.


The current code handled gzip fine. I haven't looked at the code yet, but I have a fork with openSNP/snpr as upstream (see openSNP/snpr#42), so I could look for that.

It may be that zip/gzip/bzip2 are already handled. XZ is unlikely to be handled by default, although it provides the best compression. Those who use it are "Power Users" and can uncompress/recompress to a supported file type. Whole-genome data, such as Illumina's personal genome files, is likely to be bzip2 or gzip.
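To make the four formats above concrete, here is a minimal sketch of how an upload could be classified by its leading "magic" bytes instead of its file extension. This is not taken from the snpr code base; `MAGIC_BYTES` and `detect_compression` are hypothetical names used only for illustration.

```ruby
# Hypothetical helper, NOT from the snpr code base.
# Each compression format starts with a fixed byte signature.
MAGIC_BYTES = {
  "PK\x03\x04".b   => :zip,   # ZIP local file header
  "\x1F\x8B".b     => :gzip,  # gzip member header
  "BZh".b          => :bzip2, # bzip2 stream header
  "\xFD7zXZ\x00".b => :xz     # xz stream header
}.freeze

# Returns a symbol naming the format, or :unknown for anything else
# (for example an uncompressed VCF or text file).
def detect_compression(header)
  bytes = header.to_s.b # tolerate nil from reading an empty file
  MAGIC_BYTES.each do |magic, format|
    return format if bytes.start_with?(magic)
  end
  :unknown
end
```

It could be called on the first few bytes of a file, for example `File.open(path, "rb") { |f| detect_compression(f.read(6)) }`, where `path` is whatever upload is being inspected.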

BenjaminHCCarr commented 7 years ago

Hmmm, I'm confused about why it was able to parse my 23andme-exome format file:

It appears you are using the rubyzip gem: https://rubygems.org/gems/rubyzip

https://github.com/openSNP/snpr/search?p=2&q=zip&type=&utf8=%E2%9C%93

And poring through the http://www.rubydoc.info/gems/rubyzip/1.2.1 docs, there is only support for *.zip files, no mention of gzip.

@gedankenstuecke pointed me to the preparsing code for something else, and I imagine that is where all the unzipping is done.

However, I sent up 23andme.vcf.gz and in my userid list it says 23andme-exome-vcf, so it appears to have parsed and I can download it:

When I download it, it comes down as bzip2:
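One possible explanation, offered only as an assumption and not confirmed against the snpr source, is that gzip does not need rubyzip at all: Ruby's standard library ships `Zlib::GzipReader`. The sketch below round-trips a string through gzip in memory; `gzip_string` and `gunzip_string` are hypothetical helper names.

```ruby
require "zlib"
require "stringio"

# Hypothetical helpers, NOT from the snpr code base.

# Compress a string into gzip format in memory.
def gzip_string(text)
  io = StringIO.new(String.new) # empty, mutable binary buffer
  gz = Zlib::GzipWriter.new(io)
  gz.write(text)
  gz.finish # flush the gzip trailer without closing the buffer
  io.string
end

# Decompress gzip data using only the standard library.
def gunzip_string(data)
  Zlib::GzipReader.new(StringIO.new(data)).read
end
```

For a file on disk, `Zlib::GzipReader.open(path) { |gz| gz.read }` does the same job, where `path` is a placeholder for the uploaded file.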

tyr:~/tmp benc$ file xxxx.23andme-exome-vcf.yyy
xxxx.23andme-exome-vcf.yyyy: bzip2 compressed data, block size = 900k