nemequ / squash-corpus

Designing a new corpus for lossless general-purpose compression

Genomics data #17

Open bknowles opened 7 years ago

bknowles commented 7 years ago

I actually found out about Squash through the page at http://jdlm.info/articles/2017/05/01/compression-pareto-docker-gnuplot.html and realized that the genomics dataset that is used for those tests would be an excellent addition to your corpus. They link to the page at http://hgdownload.cse.ucsc.edu/downloads.html if you want to download it directly.

nemequ commented 7 years ago

I'll leave this open for future discussion, but I'm quite hesitant about this idea; see the "Designed for the 99%" section of the README.

If someone is interested in putting together a genome compression benchmark using the squash-benchmark code, I'd be happy to help with the squash side of things. That includes accepting patches to squash-benchmark-web to pull configuration in from a separate configuration file, so it's easier to publish results using custom data. However, I don't think it would be appropriate to include genomics data in this corpus.

You might be interested in quixdb/squash-benchmark#35; I believe the data you link to is already covered, but if anything is missing I'd be happy to add it to the benchmark's Makefile to make it more easily testable.

bknowles commented 7 years ago

Designed for the 99%. I like that!

I can definitely see why genomics data would not be included in that corpus. However, as the algorithms get better and faster, you're going to have to select larger and harder targets to test against. The genomics data set would qualify as larger and harder, but then it wouldn't be in the 99%.

I'll be fascinated to see how your squash corpus evolves over time to deal with this issue.

Thanks again!

abcbarryn commented 6 years ago

I disagree that this isn't in the "99%", and I think a genomics FASTQ data file would be an excellent test data set.

nemequ commented 6 years ago

> I disagree that this isn't in the "99%", and I think a genomics FASTQ data file would be an excellent test data set.

This might be more persuasive if you explained your position.

Genomics data is obviously a huge user of compression, and an important use case for codec developers, but it's not really a useful data point for most people looking to choose a compression codec. IMHO that makes it a perfect fit for an additional genomics-specific corpus.
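As a rough illustration of why FASTQ data behaves so differently from general-purpose corpus data, here is a minimal sketch that builds a synthetic FASTQ-shaped file and measures its zlib ratio. The record layout (an `@` ID line, a sequence line over the A/C/G/T alphabet, a `+` separator, and a quality line) follows the real FASTQ convention, but the contents are random, not real genomic data, so this only demonstrates the small-alphabet effect; real reads contain repeats and biases that specialized genomics codecs exploit much further.

```python
import random
import zlib

random.seed(0)

def fake_fastq(n_reads=1000, read_len=100):
    """Build a synthetic FASTQ-shaped byte string (illustrative only)."""
    records = []
    for i in range(n_reads):
        # Sequence line: 4-symbol alphabet, ~2 bits of entropy per base.
        seq = "".join(random.choice("ACGT") for _ in range(read_len))
        # Quality line: a small made-up alphabet standing in for Phred scores.
        qual = "".join(random.choice("#FHIJ") for _ in range(read_len))
        records.append(f"@read_{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")

data = fake_fastq()
ratio = len(zlib.compress(data, 9)) / len(data)
print(f"zlib ratio on synthetic FASTQ: {ratio:.2f}")
```

Even with purely random bases, the tiny alphabets push the ratio well below 1; structured real-world reads compress further still, which is part of why genomics tends to get its own dedicated benchmarks rather than a slot in a general-purpose corpus.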

FWIW, now that WebAssembly has stabilized a bit I plan to finish putting together this corpus soon. Unless someone presents a good argument for including genomics data, I don't plan to include it.