Allow to keep multiple run_ids when multi-fast5 contains them

waltergallegog commented 2 years ago

Hello, I have been learning about the fast5 and slow5 formats recently (thanks a lot for all the tools and info you have provided in your recent papers).

I have started using slow5tools to convert fast5 to slow5, and as an example dataset I'm using the .fast5 files from the NA12878 repository (https://github.com/nanopore-wgs-consortium/NA12878). For example, if you follow the RNA links you will come across download links like: http://s3.amazonaws.com/nanopore-human-wgs/rna/Multi_Fast5/Chip137_IVT_NA12878_Data_reads/Chip137_IVT_NA12878_Data_reads_0.fast5

When I try to convert this file I get an error like:

$ slow5tools f2s Chip137_IVT_NA12878_Data_reads_0.fast5 -o Chip137_IVT_NA12878_Data_reads_0.slow5

[fast5_attribute_itr::ERROR] Ancient fast5: Different run_ids found in an individual multi-fast5 file. Cannot create a single header slow5/blow5. Consider --allow option.

If I use the --allow option, then only the first run_id is used:

$ slow5tools f2s Chip137_IVT_NA12878_Data_reads_0.fast5 -o Chip137_IVT_NA12878_Data_reads_0.slow5 -a
[search_and_warn::WARNING] slow5tools-v0.3.0: Ancient fast5: Different run_ids found in an individual multi-fast5 file. First seen run_id will be set in slow5 header.

From what I understood in FAST5 Demystified, you expect the run_id to be unique across all reads in a multi-fast5 file. And from what I understood from the slow5 description, the slow5 format supports multiple read groups.
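To check how many run_ids a given multi-fast5 file actually carries, something like the following could be used. This is only a minimal sketch: it assumes h5py is installed and that the file follows the usual multi-fast5 layout (top-level groups named read_<uuid>, each with a tracking_id group that has a run_id attribute). The function names are hypothetical and are not part of slow5tools or ont_fast5_api.

```python
from collections import defaultdict


def group_reads_by_run_id(read_run_pairs):
    """Group read names by run_id, keeping run_ids in first-seen order."""
    groups = defaultdict(list)
    for read_name, run_id in read_run_pairs:
        groups[run_id].append(read_name)
    return dict(groups)


def iter_read_run_ids(fast5_path):
    """Yield (read_name, run_id) for each read in a multi-fast5 file.

    Assumption: every top-level read_<uuid> group has a tracking_id
    group with a run_id attribute.
    """
    # h5py is a third-party package; it is imported here so that the
    # grouping helper above can be used without it.
    import h5py

    with h5py.File(fast5_path, "r") as f5:
        for name, read in f5.items():
            if not name.startswith("read_"):
                continue
            run_id = read["tracking_id"].attrs["run_id"]
            # Attribute strings may come back as bytes, depending on how
            # the file was written.
            if isinstance(run_id, bytes):
                run_id = run_id.decode()
            yield name, run_id


# Example usage (the path is a placeholder):
# pairs = iter_read_run_ids("Chip137_IVT_NA12878_Data_reads_0.fast5")
# for run_id, reads in group_reads_by_run_id(pairs).items():
#     print(run_id, len(reads))
```

A file that converts without --allow should produce exactly one key in the returned dictionary.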

With that in mind, I have some questions that maybe you can help me with:

Do you know why, or how common it is, for multi-fast5 files to have multiple run_ids?
In the ERROR and WARNING messages you mention "Ancient fast5". Does that mean multiple run_ids were allowed in old fast5 versions? (The version of my example file is 2.0.)
Given that the slow5 format already supports multiple run_ids, would it be possible (or worth it) to add support for these fast5 files with multiple run_ids, by keeping all the original run_ids instead of just the first one?
If you have a small dataset with proper fast5 files that you can share as part of the repo, that could help a lot.

Thanks for your help.

hasindu2008 commented 2 years ago

Hi @waltergallegog

Thanks for checking out slow5tools. You are right that the slow5 format supports multiple read groups. However, the current implementation of slow5tools f2s requires that an individual fast5 file contains only one run ID, which is always the case for data produced by MinKNOW. My answers are below.

Do you know why, or how common it is, for multi-fast5 files to have multiple run_ids?

This only happens when the original files were single-fast5 files (generated around 3 years ago) that were later converted to multi-fast5 using ONT's fast5 API (https://github.com/nanoporetech/ont_fast5_api), which mixes run IDs together.

In the ERROR and WARNING messages you mention "Ancient fast5". Does that mean multiple run_ids were allowed in old fast5 versions? (The version of my example file is 2.0.)

Again, the original fast5 files were at least 3 years old and were converted by ONT's fast5 API, which is what makes them unusual. The FAST5 version here perhaps reflects the version written by ONT's fast5 API rather than by the original sequencer.

Given that the slow5 format already supports multiple run_ids, would it be possible (or worth it) to add support for these fast5 files with multiple run_ids, by keeping all the original run_ids instead of just the first one?

Yeah, given that this problem is not present in modern sequencing runs, we did not prioritise it. But yeah, we will provide at least a separate subtool in slow5tools or a script to handle this situation in the future. Another alternative, if the original fast5 files are available (which is the case for the dataset that you tried: https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md), is to feed those to slow5tools, so that multiple run IDs can be retained.
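To illustrate what keeping every run ID would involve, here is a minimal sketch of mapping run IDs to read-group indices. It is not how slow5tools is implemented. It assumes the SLOW5 text layout in which the header has a #num_read_groups line and one tab-separated value per read group on each @attribute line, and the function names are hypothetical.

```python
def assign_read_groups(read_run_pairs):
    """Give each distinct run_id a read-group index in first-seen order.

    Returns (run_ids, read_groups): run_ids[i] is the run_id of read
    group i, and read_groups maps each read ID to its group index.
    """
    index_of = {}
    read_groups = {}
    for read_id, run_id in read_run_pairs:
        if run_id not in index_of:
            index_of[run_id] = len(index_of)
        read_groups[read_id] = index_of[run_id]
    return list(index_of), read_groups


def run_id_header_lines(run_ids):
    """Render the header lines that depend on the read groups.

    Assumption: one tab-separated value per read group, in index order.
    """
    return [
        f"#num_read_groups\t{len(run_ids)}",
        "@run_id\t" + "\t".join(run_ids),
    ]
```

Each record would then carry its index from read_groups in the read_group column, instead of every read being attached to the first run ID seen.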

If you have a small dataset with proper fast5 files that you can share as part of the repo, that could help a lot.

The original FAST5 files for the NA12878 dataset used in the SLOW5 paper, including the subset, are here: https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA744329, but even the subset is 70 GB. Perhaps I should make a subsubset and add the link to the SLOW5 repository. For now, could you try this one: https://cloudstor.aarnet.edu.au/plus/s/srVo6NEicclqQNE/download which is a coronavirus dataset from a Flongle run that we used for interARTIC (https://github.com/Psy-Fer/interARTIC).

waltergallegog commented 2 years ago

Hi @hasindu2008, thanks for your quick and detailed reply. It is clear now what the issue was.

But yeah, we will provide at least a separate subtool in slow5tools or a script to handle this situation in the future.

That's great to hear.

For now I'll check out the two datasets you recommended.

hasindu2008 commented 2 years ago

@waltergallegog

We added a subsubset from the NA12878 dataset and relevant links. See Q3 under https://hasindu2008.github.io/slow5tools/faq.html. Thanks for the suggestion.

Hope you had no issues with the coronavirus dataset I provided earlier.

waltergallegog commented 2 years ago

Hello @hasindu2008, thanks for sharing the dataset and for the update. I've been working with it without any issues.

For me this ticket can be considered closed, but let me know if you want to keep it open for this point:

But yeah, we will provide at least a separate subtool in slow5tools or a script to handle this situation in the future.

hasindu2008 / slow5tools

Allow to keep multiple run_ids when multi-fast5 contains them #60