hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

run_id issue in certain files #54

Closed: SAMtoBAM closed this issue 2 years ago

SAMtoBAM commented 2 years ago

Hi there,

I am testing slow5tools on a few datasets, converting FAST5 to BLOW5 (slow5tools f2s). For most datasets it works well (~37% smaller file sizes), but unfortunately for some datasets I am running into the following warnings and errors:

[search_and_warn::WARNING] slow5tools-v0.3.0: Weird fast5: Attribute file_version/read_001445a6-1ee9-4e18-90c3-7afee025d1b3 in AKH_1a/AKH_1a_8.fast5 is unexpected. This warning is suppressed now onwards.
[search_and_warn::WARNING] slow5tools-v0.3.0: Weird fast5: Attribute previous_read_id/PreviousReadInfo in AKH_1a/AKH_1a_8.fast5 is unexpected. This warning is suppressed now onwards.
[search_and_warn::WARNING] slow5tools-v0.3.0: Weird fast5: Attribute previous_read_number/PreviousReadInfo in AKH_1a/AKH_1a_8.fast5 is unexpected. This warning is suppressed now onwards.
[fast5_group_itr::ERROR] Bad fast5: run_id is missing in the AKH_1a/AKH_1a_8.fast5 in read_id 001445a6-1ee9-4e18-90c3-7afee025d1b3.
[read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file AKH_1a/AKH_1a_8.fast5.
[f2s_child_worker::ERROR] Bad fast5: Could not read contents of the fast5 file 'AKH_1a/AKH_1a_8.fast5'.

So it appears the first real issue is that the run_id is missing in these files. This only seems to occur in my older FAST5 datasets (data from around 2017-2018), where it appears the FAST5 format didn't contain run_id info. Even using the --allow option doesn't help. These files can be manipulated using h5dump and ont_fast5_api, and basecalled, etc., so it is not that the files are corrupted in some way. Plus, by manually checking the files with h5dump, I can see the difference in FAST5 structure between files that worked with slow5tools and those that didn't (including, but not limited to, the missing run_id information).

If there is no run_id present, could one simply be placed in its absence (e.g. a randomly generated name), considering the --allow option appears to choose the first run_id anyway? Or, more generally, could older fast5 files just be handled?

Thanks a lot

hasindu2008 commented 2 years ago

Hi @SAMtoBAM

Can I know if this is a single-fast5 or multi-fast5 file? Also, it would be great if you could share one or two of these files so we can have a look.

SAMtoBAM commented 2 years ago

These are multi-fast5 files; here is an example.

hiruna72 commented 2 years ago

Thanks @SAMtoBAM. In the fast5 file you shared, run_id is not available in the read group but in the tracking_id group. I will implement a patch for this soon.

SAMtoBAM commented 2 years ago

Excellent, this 'tracking_id' does appear to correspond to a run-specific tag. Thanks for fixing this.

hasindu2008 commented 2 years ago

Thanks for reporting this; this kind of community help is useful in making our tools better.

We are surprised, time and time again, by the number of inconsistencies these FAST5s have. To give an example, the file version in FAST5 can sometimes be a string, sometimes an int, and sometimes a float/double. How many million variants of the fast5 structure are out there is a mystery.

Also, if you are converting for archiving purposes, please do a sanity check before deleting any fast5 files by comparing the number of reads in SLOW5 and FAST5. We recently came across a dataset that had the same FAST5 file name inside both the pass and fail directories, which caused slow5tools to overwrite files while converting. A quick sanity check that we do in-house using bash:

# estimate the number of reads in the multi-fast5 files (~4000 reads per file)
NUM_FAST5=$(find fast5dir -name '*.fast5' | wc -l)
NUM_FAST5_READS=$((NUM_FAST5 * 4000))
echo $NUM_FAST5_READS

# get the number of SLOW5 reads
NUM_SLOW5_READS=$(slow5tools stats reads.blow5 | grep "number of records" | awk '{print $NF}')
echo $NUM_SLOW5_READS

For multi-fast5 files with 4000 reads each, these numbers should be close (they won't be exactly the same, as the last FAST5 could have fewer than 4000 reads). An added advantage is that running slow5tools stats reads through the whole file and will complain if something is malformed.
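If an exact count is preferred over the 4000-reads-per-file estimate, each top-level HDF5 group in a multi-fast5 corresponds to one read, so h5ls can count them directly. A minimal sketch, assuming the HDF5 command-line tools (h5ls) are installed and using the same placeholder directory name fast5dir as above:

```shell
#!/bin/sh

# count_fast5_reads: sum the top-level groups (one per read) across
# all multi-fast5 files in the given directory using h5ls
count_fast5_reads() {
    total=0
    for f in "$1"/*.fast5; do
        n=$(h5ls "$f" | wc -l)   # h5ls prints one line per top-level group
        total=$((total + n))
    done
    echo "$total"
}

# counts_match: compare the fast5 and slow5 read counts exactly
counts_match() {
    if [ "$1" -eq "$2" ]; then
        echo "OK: $1 reads"
    else
        echo "MISMATCH: $1 vs $2"
    fi
}
```

With an exact fast5 count, the comparison against slow5tools stats no longer needs a tolerance, e.g. counts_match "$(count_fast5_reads fast5dir)" "$NUM_SLOW5_READS".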

SAMtoBAM commented 2 years ago

I most certainly understand the pain of dealing with different fast5 formats. Having a dataset that was gathered between 2017 and 2020, I've seen the format change dramatically during this time, and your tool is not the first that has been unable to treat all the data the same!

Thanks for the heads up. I used slow5tools stats to check the files that converted properly with f2s, and all the reads appear to be there with no complaints! As a side note, would you recommend replacing several fast5 files, each containing 4000 reads, with a single blow5 file containing all reads? Or is it more appropriate to maintain a single slow5/blow5 file for each set of 4000 reads?

hasindu2008 commented 2 years ago

Actually, not just our tools; we have come across certain old datasets where even ONT's own latest Guppy basecaller fails due to FAST5 issues.

About the number of reads per SLOW5 - for our in-house datasets, I convert a whole sample into a single BLOW5 file. This way, we can just move the single file around without needing to tar anything. Also, a single index file then works seamlessly with f5c/nanopolish or any other tool that takes one BLOW5 file as input (just as for BAM and FASTA).

On the other hand, one may prefer 50 GB BLOW5 files if their file system prevents large files and/or they need to transfer the data in pieces.

We thought about these multiple scenarios and thus provide merge and split - users can choose to merge or split depending on their needs.
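As a sketch of the two workflows (the directory and file names here are placeholders; check slow5tools merge --help and slow5tools split --help for the exact options in your version):

```shell
# merge all per-run BLOW5 files for a sample into a single file
slow5tools merge blow5_dir/ -o sample.blow5

# split a merged BLOW5 back into files of 4000 reads each
slow5tools split -r 4000 sample.blow5 -d split_blow5_dir/
```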

hasindu2008 commented 2 years ago

@SAMtoBAM We have implemented support for these kinds of files, where the run_id is available only inside the tracking_id group. Please check the master branch and give it a go. It will work as long as the run_id across all the reads in the same multi-fast5 is the same.

The example you provided seems to have multiple run IDs in the same multi-fast5, though. Is it possibly something generated through single-to-multi fast5 conversion? For those, for now, you will need to specify the --allow option. --allow is not recommended if you are going to use the files for archiving purposes, as run_id-related metadata will be lost. If you have many such samples and plan to convert them for archiving, we can implement a new subtool to properly support multi-fast5 files with multiple run IDs.

SAMtoBAM commented 2 years ago

Excellent, I compiled the latest version (0.3.0-dirty) and ran it on some of the fast5 sets which had failed, and it appears to have worked perfectly, with only this warning:

slow5tools-v0.3.0-dirty: Weird or ancient fast5: converting the attribute Raw/read_number from H5T_STD_U32LE to SLOW5_INT32_T for consitency. This warning is suppressed now onwards.

I will be using the --allow option, as most of my datasets are a mixture of runs. I have many samples, many of which were sequenced across different flowcells, with different barcodes during multiplexing, different mixtures of strains, etc., so in the end the fast5 files were merged per sample, hence the mixture of run IDs. Do you see any problems with this downstream, particularly if only the read data, and not the metadata, is used?

hasindu2008 commented 2 years ago

Hi, I analysed the example multi-fast5 you provided, and it seems the metadata are very similar; see header.txt. If you are using it for read data (for nanopolish/f5c etc.), this metadata does not matter. However, one important piece of metadata is whether the experiment is DNA or RNA. Another is whether it is MinION/GridION vs PromethION, as this can affect basecalling model selection. As long as your data is not a mix of DNA and RNA, or of MinION/GridION and PromethION, it should be alright. However, if it is for archiving purposes, it is good to do it right. I wrote an example script showing how you could utilise ONT's multi-to-single fast5 converter, HDF5 tools and GNU command-line tools to first classify reads based on run IDs and then run slow5tools on those, so that the -a option is not needed.

Try this script https://github.com/hasindu2008/slow5tools/blob/dev/scripts/mixed-multi-fast5-to-blow5.sh

You can simply run as mixed-multi-fast5-to-blow5.sh

Note that I quickly wrote it today after seeing your response and have only tested it on two small datasets, so do a bit of testing in case there is a stupid bug. At the end, the script does some sanity checks by counting the number of reads in fast5 vs slow5 and also checking that read IDs are unique. The script is not very efficient, so it could take a bit of time to run; the bottleneck is using h5dump to grab the run_id and moving the files. It also requires a bit of space for temporary files in the current directory.
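The run_id extraction step mentioned above can be sketched in a few lines of shell. This is an illustration rather than the actual script, and it assumes each read group in the multi-fast5 carries a tracking_id/run_id string attribute that h5dump can print (here <read_id> is a placeholder):

```shell
#!/bin/sh

# extract_run_id: parse the quoted run_id value out of h5dump output,
# e.g. from: h5dump -a "/read_<read_id>/tracking_id/run_id" file.fast5
# h5dump prints the value inside a DATA { (0): "..." } block, so take
# the first quoted string after the DATA line and strip the quotes.
extract_run_id() {
    sed -n '/DATA {/,/}/p' | grep -o '"[^"]*"' | head -n 1 | tr -d '"'
}
```

Reads classified this way (one directory per run_id, then slow5tools f2s on each directory) end up with correct per-run headers, with no need for -a.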

hasindu2008 commented 2 years ago

I am closing this issue, if you have any more questions please feel free to reopen or open a new issue.