hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

Malformed blow5 record - can they still be merged? #95

Closed hengjwj closed 1 year ago

hengjwj commented 1 year ago

Hi again @hasindu2008,

I encountered the following error when attempting to merge BLOW5s: [slow5_get_next_mem::ERROR] Malformed blow5 record. Failed to read the record size. Missing blow5 end of file marker. At src/slow5.c:3236

This data came from ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR470/ERR4708848/HEK293T-Mettl3-KO-rep2.tar.gz (I just saw that this was also mentioned in #89). I split the FAST5s into batches of 50 files per batch and ran f2s and saw that only 7 FAST5s were lost so I proceeded to delete the FAST5s.

While f2s was running, I did observe several of these errors but thought the BLOW5 file would still be intact: [f2s_child_worker::ERROR] Bad fast5: Could not read contents of the fast5 file 'FAK28957_4e4cc36706f219188246d6743803dc2e9ed55520_403.fast5'.

The command I used for f2s was: slow5tools f2s -p $numcore $i -d ${i}_blow5

Can these BLOW5s still be merged?

Joel

hasindu2008 commented 1 year ago

Hi

If there is an error, the slow5tools process terminates at that point, so I highly recommend not trying to merge such files (there are hacky ways to do it, but I strongly discourage them). I downloaded the tar.gz file you mentioned to give it a go, but when I try to extract it I get:

gzip: stdin: invalid compressed data--crc error
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Did you get a similar error? The downloaded tar.gz is around 124GB in my case.

Psy-Fer commented 1 year ago

Perhaps compare md5sum hash to check the download isn't doing something strange.

hasindu2008 commented 1 year ago

@Psy-Fer Do you know how to get the MD5 from ENA? The link is https://www.ebi.ac.uk/ena/browser/view/ERR4708848.

hengjwj commented 1 year ago

Yeah, I got a similar error, but the authors said that it should be fine as the FAST5 files are still extracted.

Psy-Fer commented 1 year ago

Hey @hasindu2008

You can find it in the xml file at the bottom.

Should be 4f3f118f5ba809da987bbaf69edb8860
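A minimal sketch of that check in Python, assuming the tarball was saved locally as HEK293T-Mettl3-KO-rep2.tar.gz (the local path is an assumption; the expected hash is the one quoted above). The file is read in chunks because it is around 124GB. The helper name md5_of_file is illustrative and not part of any tool mentioned in this thread.

```python
import hashlib
from pathlib import Path

# Expected value as quoted above from the ENA XML record.
EXPECTED_MD5 = "4f3f118f5ba809da987bbaf69edb8860"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks (1 MiB by default)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    tarball = Path("HEK293T-Mettl3-KO-rep2.tar.gz")  # assumed local file name
    if tarball.exists():
        actual = md5_of_file(tarball)
        print("match" if actual == EXPECTED_MD5 else f"mismatch: {actual}")
```

A matching hash only shows that the download is identical to what was uploaded; it does not rule out that the uploaded archive itself is damaged.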

hasindu2008 commented 1 year ago

Thanks @Psy-Fer, it seems they match.

@hengjwj I managed to convert that dataset. Here is how I did it:

  1. Extract the tarball into a directory named fast5.
  2. Run slow5tools f2s and note down the names of the FAST5 files that cause problems (from the error messages in f2s).
  3. Move those bad FAST5 files from the fast5 directory to a separate directory called quarantine.
  4. Delete the blow5 files generated in step 2 and relaunch slow5tools f2s on the cleaned-up fast5 directory.
  5. Merge the blow5 files from step 4 using slow5tools merge.
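Noting down the bad files and moving them to quarantine can be scripted. The following is only a sketch, assuming the f2s error output was saved as text and that bad files are reported in the same format as the "Bad fast5" message quoted earlier in this thread; other f2s errors may be worded differently and would need their own pattern. The function names are illustrative and not part of slow5tools.

```python
import re
import shutil
from pathlib import Path

# Matches the f2s message quoted earlier in this thread (assumed format):
# ... Bad fast5: Could not read contents of the fast5 file 'NAME.fast5'.
BAD_FAST5 = re.compile(r"Bad fast5: .*?'([^']+\.fast5)'")


def bad_fast5_names(log_text):
    """Return the unique FAST5 file names reported as bad, in first-seen order."""
    names = []
    for match in BAD_FAST5.finditer(log_text):
        name = Path(match.group(1)).name  # drop any leading directory
        if name not in names:
            names.append(name)
    return names


def quarantine(log_text, fast5_dir, quarantine_dir):
    """Move every reported bad FAST5 found under fast5_dir into quarantine_dir."""
    fast5_dir, quarantine_dir = Path(fast5_dir), Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in bad_fast5_names(log_text):
        # list() so the directory scan finishes before files are moved.
        for source in list(fast5_dir.rglob(name)):
            shutil.move(str(source), str(quarantine_dir / name))
            moved.append(name)
    return moved
```

This assumes the quarantine directory sits outside the fast5 directory and that file names are unique across subdirectories; a name that appears twice would overwrite the earlier copy in quarantine.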

There were only a handful of bad FAST5 files, and if we do not care about those few thousand reads, all is good. However, I wanted to rescue as much as possible from those corrupted FAST5 files, so I followed these additional steps:

  6. Convert the FAST5 files in quarantine into single-read FAST5 files using ONT's multi_to_single_fast5, into a directory called q_single.
  7. Run f2s on q_single and note down which single-read FAST5 files are bad.
  8. Delete those bad single-read FAST5 files.
  9. Delete the blow5 files from step 7 and relaunch slow5tools f2s on the cleaned-up q_single directory.
  10. Merge the blow5 files from step 9.
  11. Merge the two merged blow5 files from step 5 and step 10.
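The rescue pass over the quarantined files can be laid out as a command sequence. The sketch below only builds and prints the commands, it does not run them. The directory and file names other than quarantine and q_single are hypothetical, and the option names (--input_path and --save_path for multi_to_single_fast5, -o for slow5tools merge) are given as I understand them and should be checked against each tool's help output. Deleting the bad single-read files and the first f2s output remains a manual step between the runs.

```python
import shlex


def rescue_commands(quarantine="quarantine", q_single="q_single",
                    q_blow5="q_single_blow5", main_blow5="main.blow5",
                    rescued_blow5="rescued.blow5", final_blow5="final.blow5"):
    """Return the argv lists for the rescue pass, in the order they would run."""
    return [
        # Split the quarantined multi-read FAST5 files into single-read files.
        ["multi_to_single_fast5", "--input_path", quarantine, "--save_path", q_single],
        # Convert the single-read files to BLOW5 (re-run after removing bad files).
        ["slow5tools", "f2s", q_single, "-d", q_blow5],
        # Merge the rescued BLOW5 files into one file.
        ["slow5tools", "merge", q_blow5, "-o", rescued_blow5],
        # Merge the main conversion with the rescued reads.
        ["slow5tools", "merge", main_blow5, rescued_blow5, "-o", final_blow5],
    ]


if __name__ == "__main__":
    for argv in rescue_commands():
        print(shlex.join(argv))
```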

Anyway, I have temporarily uploaded the final BLOW5 file to https://slow5test.s3.amazonaws.com/HEK293T-Mettl3-KO-rep2.blow5, so you can download it and save some time.

Note that there are multiple ways to handle bad FAST5 files. The above method is what I felt like doing today. Some other ways are discussed at https://github.com/hasindu2008/slow5tools/issues/89, some of which are easier than above.

hasindu2008 commented 1 year ago

@hengjwj If you are planning to convert more datasets, first check whether they belong to the https://github.com/GoekeLab/sg-nex-data project, for which converted BLOW5 files are already available at http://sg-nex-data-blow5.s3-website-ap-southeast-1.amazonaws.com/.

hengjwj commented 1 year ago

https://slow5test.s3.amazonaws.com/HEK293T-Mettl3-KO-rep2.blow5

Will download asap. Thanks for generating the file and the guide!