Hey @wdecoster do you have any other datasets in your fast5 files? VBZ is only applied to the raw signal so that could explain it. HDF5 uses chunked compression so I would think it should be pretty robust to read length but I haven't investigated that in much detail myself.
I'm not sure which other datasets there could be; at least I didn't put anything extra in there myself. They're "fresh" from a PromethION run from last week. How can I check what's in there?
If all you've got is a fast5 file with raw data in it then HDF5 will add ~15 kB or so on top of the raw dataset, which can drag your total reduction down a bit depending on read length. I'm not entirely sure how much the total overhead goes up as you add more reads to a multi-read file, but on paper it should be pretty small, as we don't actually copy the shared metadata (UniqueGlobalKey, tracking_id) for each new read but soft-link it instead.
We see about the same real-world savings ourselves internally (20-30ish percent), with reads in the ~10kbase range.
You can check "what's in there" using h5dump or just opening the file yourself in HDFview.
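For example, from Python you could do something like this with h5py (a minimal sketch; "reads.fast5" is a placeholder for one of your multi-read files):

```python
# Minimal sketch: walk a fast5 (HDF5) file and print every group and dataset,
# so you can see whether anything beyond the raw signal is stored in it.
# Assumes h5py is installed; "reads.fast5" is a placeholder filename.
import h5py

def show(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(f"dataset: {name}  shape={obj.shape}  dtype={obj.dtype}")
    else:
        print(f"group:   {name}")

with h5py.File("reads.fast5", "r") as f:
    f.visititems(show)
```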
Hi, I have a similar question.
I used the compress_fast5 script from the ont-fast5-api tools, but my compression rate is only around 3%, and I'm wondering what the cause of this is. Perhaps it is related to the read length?
The command I used is:
compress_fast5 -i /5_workspace/basecalling/20191203_FAK34410_semina_2/fast5 -s /2_space/nanopore_raw/compress_fast5test/ -c vbz -t 32 --recursive
The size of my original folder is 134GB and the size of the folder with compressed fast5's is 130GB. The folder contains 1832 multi-read fast5 files with 4000 reads each. There aren't any other datasets in my fast5 files.
I've also added a histogram of the read length:
Hi @aroelo -- what version of MinKNOW did you use to generate your files? Might you have already enabled vbz compression? MinKNOW can produce vbz-compressed files as of version 19.12, and re-compressing them with ont-fast5-api will produce very small gains just like you're reporting (the next version of ont-fast5-api will have a tool to report the type of compression your files use, though that will probably not make it out until early next year).
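If you want to check a file yourself in the meantime, a minimal h5py sketch along these lines should work (the filename and the read-group layout are assumptions based on the usual multi-read fast5 structure, and 32020 is, as far as I know, the HDF5 filter ID registered for vbz):

```python
# Minimal sketch: report whether the raw signal of the first read in a
# multi-read fast5 file is already vbz-compressed. Assumes h5py is installed;
# "reads.fast5" is a placeholder filename.
import h5py

VBZ_FILTER_ID = 32020  # HDF5 filter ID registered for ONT's vbz plugin

with h5py.File("reads.fast5", "r") as f:
    first_read = next(iter(f.keys()))            # e.g. "read_<uuid>"
    signal = f[first_read]["Raw/Signal"]         # raw signal dataset
    plist = signal.id.get_create_plist()         # dataset creation property list
    filter_ids = [plist.get_filter(i)[0] for i in range(plist.get_nfilters())]
    print("vbz" if VBZ_FILTER_ID in filter_ids else "no vbz", "->", filter_ids)
```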
Hi @fbrennen. Sorry it took some time to get back to you over the holidays.
I got the following information from the report.md file:
Tracking ID
===========
{
"asic_id": "353824084",
"asic_id_eeprom": "4787110",
"asic_temp": "32.300030",
"asic_version": "IA02D",
"auto_update": "0",
"auto_update_source": "https://mirror.oxfordnanoportal.com/software/MinKNOW/",
"bream_is_standard": "0",
"device_id": "MN29264",
"device_type": "minion",
"distribution_status": "stable",
"distribution_version": "19.06.9",
"exp_script_name": "N/A",
"exp_script_purpose": "sequencing_run",
"exp_start_time": "2019-12-03T14:05:25Z",
"flow_cell_id": "FAK34410",
"guppy_version": "3.0.6+9999d81",
"heatsink_temp": "32.675781",
"hostname": "MT-110086",
"installation_type": "nc",
"local_firmware_file": "1",
"operating_system": "ubuntu 16.04",
"protocol_group_id": "20191203",
"protocol_run_id": "",
"protocols_version": "4.1.9",
"run_id": "a11a63f83334f9a9cc05e5fedda6440270de5d41",
"sample_id": "semina_2",
"usb_config": "MinION_fx3_1.1.1_ONT#MinION_fpga_1.1.0#ctrl#USB2",
"version": "3.4.8"
}
As you can see we used version 19.06.9, so I guess it's not possible that the files were already compressed.
Hi @aroelo -- thanks for the update. Could you give us one of the fast5 files to have a look at?
@fbrennen Sure, I've attached a wetransfer link to the first fast5 file (before using ont-fast5-api):
https://we.tl/t-ToApxXkhJI
Hi @aroelo -- thanks very much for that. It looks like ont-fast5-api is not correctly hardlinking the tracking_id and context_tags sections after we compress reads. Normally we only store a single copy of the tracking_id and context_tags sections (because they're usually repeated across every single read), and then every other read section just links to them (meaning they take up much less space):
```
read_0
    tracking_id
        <some data>
    context_tags
        <some data>
read_1
    tracking_id
        <links to the tracking_id for read_0>
    context_tags
        <links to the context_tags for read_0>
[...]
```
Instead of doing that we're incorrectly duplicating the tracking_id and context_tags sections, so when reads are particularly short (like yours are) the increase in the size of each read entry due to the duplication is larger than the savings you get by compressing the raw data with vbz. Your data is actually compressing perfectly well. =)
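To illustrate what that hardlinking looks like (just a sketch with made-up file and group names, not the actual ont-fast5-api code): in h5py, assigning an existing group to a new path creates an HDF5 hard link, so the metadata is stored only once and every other read simply points at it:

```python
# Minimal sketch of hardlinking shared metadata in an HDF5 file.
# Assumes h5py is installed; names and values are illustrative only.
import h5py

with h5py.File("example.fast5", "w") as f:
    src = f.create_group("read_0/tracking_id")
    src.attrs["run_id"] = "example-run-id"               # stored once

    f.create_group("read_1")
    f["read_1/tracking_id"] = f["read_0/tracking_id"]    # hard link, no copy

    # Both paths resolve to the same object, so no metadata is duplicated:
    print(f["read_1/tracking_id"].attrs["run_id"])
```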
We'll get this fixed as soon as we can.
@fbrennen Thanks for the explanation and looking into it. And it's great to hear that it can likely be fixed, will be looking forward to an update!
@fbrennen any updates on this?
Hi @thomasvangurp -- sorry, not at the moment. It's high on the list though. =) Are you seeing a significant difference in file sizes?
Hi @thomasvangurp @aroelo -- apologies for the delay in fixing this. We have just released ont_fast5_api version 3.1.1, which correctly hardlinks the tracking_id and context_tags sections. We've seen a noticeable decrease in file size after repacking (~10% on the test datasets we have, though this validation hasn't been exhaustive). Please have a try and let us know how you get on.
Hi,
I just tested this on some recent PromethION data (470 GB) with compress_fast5 from the ont_fast5_api. The output is 347 GB, so that's a 26% reduction. Quite impressive, but not the same as the 40% claimed in the README. Do you think it is due to differences in read lengths between my set and yours? Would fewer, longer reads compress better than lots of shorter reads?
Cheers, Wouter