Dorado only basecalls 4000 reads?

Ge0rges commented 1 year ago

Hello,

I have a 296GB FAST5 file from a previous Nanopore run (R9). I am using dorado to call modified bases by doing ./dorado/bin/dorado basecaller dorado_models/dna_r9.4.1_e8_sup@v3.3 fast5_folder --modified-bases 5mCG -x "cpu" > fast5_folder/calls.bam where fast5_folder contains only one 296GB .fast5 file.

The resulting calls.bam file is 13MB and dorado output says it called 4000 reads.

Why is dorado only calling 4000 reads? This happens when I convert the fast5 file to pod5 and then try to basecall the pod5 file using the same method.

vellamike commented 1 year ago

Hi @Ge0rges - this sounds like an issue with the FAST5 rather than Dorado. Could you try installing the latest version of the pod5 Python tool using pip install pod5 and then use pod5 view input.pod5 to check how many reads are in your input POD5?

Ge0rges commented 1 year ago

pod5 view on the converted fast5 file using pod5 convert fast5 -o file.pod5 -t 20 input.fast5 gives a TSV with 4001 lines, corresponding to 4000 reads. During the conversion the progress bar shows 4000 reads as well. However the resulting pod5 is 107MB where the original fast5 is 296GB.

HalfPhoton commented 1 year ago

Hi @Ge0rges,

Using your pod5 environment, could you please run the following snippet and share with us the output?

I suspect there might be unknown keys in the file which might not be picked up by our tooling.

import h5py
from pathlib import Path

# Set your path here
path = Path("/path/to/your/file.fast5")
assert path.exists()

with h5py.File(path) as _h5:
    total_records = len(_h5)
    print(f"total records: {total_records}")

    reads = [k for k in _h5.keys() if str(k).startswith("read_")]
    print(f"total reads: {len(reads)}")
    print(f"example read keys: {reads[:5]}")

    unknown = [k for k in _h5.keys() if not str(k).startswith("read_")]
    print(f"total non-reads: {len(unknown)}")
    print(f"some unknown read keys: {unknown[:5]}")

Kind regards, Rich

Ge0rges commented 1 year ago

Here's the output, thank you for your help:

total records: 4000
total reads: 4000
example read keys: ['read_0001871a-724b-4863-b12b-f40f852f1d5a', 'read_00237550-3b29-4379-b67d-e367111e715e', 'read_00266f08-b04f-4fee-a592-757fdc78d66c', 'read_0027e226-3100-42d4-a5b4-a52915656cea', 'read_003b4185-bae4-4f9d-805b-ba801be9f49b']
total non-reads: 0
some unknown read keys: []

HalfPhoton commented 1 year ago

@Ge0rges, This report is showing us that your fast5 file only has 4000 reads and 0 unusual records.

This indicates that pod5 and dorado are doing all the work they can and that it's more likely an issue with your file.

Do you know how this file was created?

Kind regards, Rich

Ge0rges commented 1 year ago

I don't know the precise history of the file. Thank you for your help, I will investigate on my end.

Ge0rges commented 1 year ago

@HalfPhoton I can't seem to get clear answers on the history of this file. I think it might be a concatenated fast5 from one or multiple runs. Do you have a suggestion on tools that could help me understand the contents of this file?

HalfPhoton commented 1 year ago

@Ge0rges ,

HDFView might be able to help, but seeing as both H5py and the hdf5 reader in dorado cannot open more than 4000 records in your file I'm not sure how helpful that might be.

If the file was formed by concatenation of multiple fast5 files, a solution would be to search for the HDF header "magic bytes" defined in the specification and split the file into parts.

Decimal: | 137 | 72 | 68 | 70 | 13 | 10 | 26 | 10
Hexadecimal: | 89 | 48 | 44 | 46 | 0d | 0a | 1a | 0a
ASCII C Notation: | \211 | H | D | F | \r | \n | \032 | \n

ssscj commented 4 months ago

Hi, I met the same problem. Have you found the solution? Thanks.

nanoporetech / dorado

Dorado only basecalls 4000 reads? #293