nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

basecalled pod5/fast5 #667

Closed GregorD96 closed 6 months ago

GregorD96 commented 6 months ago

Currently, nucleotide databases such as ENA (European Nucleotide Archive) only accept basecalled fast5 as a data format. This will likely change to pod5 or basecalled pod5 soon. However, the current version of dorado (v0.5.3) supports neither the output of basecalled fast5 nor a basecalled version of pod5. Implementing this would be a great addition imo, and would save time converting files once data upload is required.

Best,

Gregor

vellamike commented 6 months ago

Hi @GregorD96, ENA accepts FASTQ and BAM formats for basecalls. We think it makes sense to separate the raw data (POD5) format from the basecalls, where we prefer to use standard formats (BAM, SAM etc.). What is the reason that a "basecalled POD5" format would be beneficial?

lpryszcz commented 6 months ago

Hi @vellamike, we used to submit raw data (Fast5) for all our RNA projects because we believe it may be useful for many ppl. Currently, ENA only supports basecalled fast5 for submission. So we could upload BAM/FastQ, but we can't upload pod5 (raw data).

I've seen this being discussed in several places over the last year or so, but haven't seen any solution yet. Any idea how to navigate this?

vellamike commented 6 months ago

Hi, the core issue here seems to be that ENA doesn't support POD5, rather than the absence of a "basecalled POD5 format". I would suggest that you petition the ENA to add POD5 support.

However, I understand that this answer doesn't help you in the meantime! We do have a POD5 to FAST5 tool which you could use to convert your POD5 to FAST5 and upload this to ENA - I think this should solve your issue until ENA supports POD5?

lpryszcz commented 6 months ago

Not really, we tried to submit raw (unbasecalled) Fast5 produced this way and ENA didn't accept it... We had to re-basecall the Fast5 with an older version of guppy in order to be able to upload them to ENA (Fast5 has to contain a move table...). Ok, we'll ping ENA about this. Thanks!

lpryszcz commented 4 months ago

Hi, we got a response from ENA today regarding the topic - I'm pasting it below so others don't have to contact them multiple times regarding the same topic.

In brief, there isn't any alternative for basecalled Fast5 in the near future from their side.
Could maybe ONT push this forward? Or provide some alternative? Otherwise it won't be possible to share raw data.

Unfortunately we still only accept ONT data as tar.gz file containing both the
.fast5 file and base called .fastq file.

ONT doesn’t have consistent file format which is why we have remained with
fast5. Generally, we prefer raw read data in bam or fastq format but for ONT,
fast5 is our only accepted format.

We have been discussing the POD5 format for ONT in our team but are yet to
decide whether we will switch over and if we did then this switch would have a
longer timeline.

You can also submit the ONT data converted to bam or fastq format using a
third party tool (not provided by the ENA).

vellamike commented 4 months ago

Thank you for this detailed feedback @lpryszcz,

We will discuss this directly with ENA.

Mike

Psy-Fer commented 4 months ago

Hey Mike and others,

We have submitted files other than fast5. The requirement is that it's raw data and basecalled data. That could technically be any format.

For example, we have submitted to SRA and ENA. For ENA we just bundled a blow5 with the fastq as a tar.gz.

I don't see why you couldn't do the same with pod5 and fastq or bam.
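
A minimal sketch of that bundling step in Python, using only the standard-library `tarfile` module. The function names and the file names (`reads.pod5`, `reads.fastq`, `submission.tar.gz`) are hypothetical placeholders, and whether ENA accepts a pod5-based archive is an assumption that nobody in this thread has confirmed.

```python
import tarfile
from pathlib import Path


def bundle_for_submission(raw_path, basecall_path, archive_path):
    """Pack one raw-signal file and one basecall file into a .tar.gz archive.

    Hypothetical helper: the pairing of a raw file with a basecall file follows
    the blow5 + fastq example above, not any official ENA specification.
    """
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in (Path(raw_path), Path(basecall_path)):
            # Store only the file name so no local directory layout leaks into the archive.
            tar.add(path, arcname=path.name)
    return archive_path


def list_members(archive_path):
    """Return the sorted member names of a .tar.gz archive, as a quick sanity check."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling `bundle_for_submission("reads.pod5", "reads.fastq", "submission.tar.gz")` and then `list_members("submission.tar.gz")` should list both files; the same call works with a BAM file in place of the FASTQ.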

See Q7 here in our docs https://hasindu2008.github.io/slow5tools/faq.html

Or here for an example https://www.ebi.ac.uk/ena/browser/view/ERR11768584

Hope that helps

James