nanoporetech / ont_fast5_api

Oxford Nanopore Technologies fast5 API software

the function get_latest_analysis fails after upgrading MinKNOW to version 21 and onward #72

Closed avielmiluz closed 2 years ago

avielmiluz commented 2 years ago

We use the ont_fast5_api package to convert FAST5 to FASTQ. After upgrading MinKNOW to version 21 or later, the code fails in the function get_latest_analysis, probably because the structure of the FAST5 files has changed in that MinKNOW version.

The error:

Traceback (most recent call last):
  File "/home/user/software/nanoDx/current/scripts/fast5_to_fastq.py", line 10, in <module>
    with Basecall1DTools(read, group_name=read.get_latest_analysis('Basecall_1D')) as basecall:
  File "/home/user/software/Fixed_conda_env_v031/39166fb3/lib/python3.9/site-packages/ont_fast5_api/analysis_tools/base_tool.py", line 55, in __init__
    raise KeyError('No group: {} found in file: {}'.format(group_name, self.filename))
KeyError: 'No group: None found in file: FAR34537_pass_barcode02_a3b28545_57.fast5'

The version of the package ont_fast5_api is 4.0.0 build pyhdfd78af_0.

Will you release a new version of ont_fast5_api that supports the new FAST5 format, or is there another solution?

Thank you.

fbrennen commented 2 years ago

Hi @avielmiluz -- this is because the "Analyses" sections are no longer being written to fast5 files, as was described in the 21.11 release note and the earlier announcement here. If you have enabled live basecalling in MinKNOW (which it sounds like you have if you're expecting Analyses sections to be present) then fastq files should already be generated during your experiment -- you will need to get them that way.

avielmiluz commented 2 years ago

Hi @fbrennen, thank you for replying. Since we started using the ont_fast5_api package to convert fast5 to fastq, we stopped enabling basecalling in MinKNOW on the Mk1c and request only fast5 output. This has saved us a lot of time and storage: the sequencing was much quicker, and the pipeline (which mainly needs the methylation information) was about 80% faster. Now we have to enable basecalling in MinKNOW as you suggested, which is a bit disappointing. Will there be a compatible version of ont_fast5_api that can restore this improvement?

fbrennen commented 2 years ago

Hi @avielmiluz -- if you don't enable basecalling in MinKNOW, then there will not be a Basecall_1D section for you to fetch out using ont-fast5-api. How have you been able to extract that Basecall_1D section and its fastq entry without performing basecalling somewhere?

avielmiluz commented 2 years ago

Hi @fbrennen, I will try to explain what we did before upgrading MinKNOW. We did not select the FASTQ option in the sequencing output settings on the Mk1c. We output only FAST5, which saves us a lot of time and storage, because our pipeline converts these fast5 files to fastq (required for copy number analysis) with the code below. The code accepts a fast5 file and applies the function basecall.get_called_sequence to convert it to fastq very quickly.

It now fails earlier, at the reading stage (Basecall1DTools), when get_latest_analysis looks for the group 'Basecall_1D'. With the previous fast5 structure (MinKNOW version 20) it worked beautifully; now it does not. Of course, if I enable the basecalling option and also output FASTQ files, this code will be skipped, but the process will take longer because of the basecalling on the device. The code below saved us a lot of basecalling time while still producing fastq files, and it is a pity that it does not work with the new fast5 structure.

the code:

import sys
from ont_fast5_api.fast5_interface import get_fast5_file
from ont_fast5_api.fast5_file import Fast5File
from ont_fast5_api.analysis_tools.basecall_1d import Basecall1DTools

f = sys.argv[1]

with get_fast5_file(f, mode="r") as f5:
    for read in f5.get_reads():
        with Basecall1DTools(read, group_name=read.get_latest_analysis('Basecall_1D')) as basecall:
            fq = basecall.get_called_sequence('template', fastq=True)
            print(fq)
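For illustration, here is a minimal sketch of the same loop with a guard for reads that carry no Basecall_1D group. The traceback above shows get_latest_analysis returning None in that case, and passing None on as group_name is what triggers the KeyError. The helper names extract_fastq and fastq_from_fast5 are hypothetical and not part of ont_fast5_api. The read_id attribute is assumed to be present on the read object, so it is fetched defensively.

```python
def extract_fastq(read, tools_cls):
    """Return the template FASTQ record for one read, or None if the
    read has no Basecall_1D analysis group (e.g. newer MinKNOW output)."""
    group = read.get_latest_analysis("Basecall_1D")
    if group is None:
        # No Analyses section was written for this read; nothing to extract.
        return None
    with tools_cls(read, group_name=group) as basecall:
        return basecall.get_called_sequence("template", fastq=True)


def fastq_from_fast5(path):
    """Yield (read_id, fastq_or_None) for every read in a fast5 file."""
    # Imported here so extract_fastq stays usable without the package installed.
    from ont_fast5_api.fast5_interface import get_fast5_file
    from ont_fast5_api.analysis_tools.basecall_1d import Basecall1DTools

    with get_fast5_file(path, mode="r") as f5:
        for read in f5.get_reads():
            # read_id is assumed to exist on the read object; fall back to None.
            read_id = getattr(read, "read_id", None)
            yield read_id, extract_fastq(read, Basecall1DTools)
```

A caller could iterate over fastq_from_fast5(sys.argv[1]), print the records that are not None, and count the rest. This only makes the failure explicit; it cannot recover a sequence that was never written to the file.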

fbrennen commented 2 years ago

Hi @avielmiluz -- it has never been possible for MinKNOW to generate a fast5 file with a FASTQ entry inside it without basecalling. This is because basecalling is required in order to generate the sequence in the FASTQ record. If you were previously able to extract a FASTQ record from your fast5 files after they were produced by MinKNOW, then basecalling must have been enabled. So I am certain that MinKNOW is already able to generate FASTQ files on your Mk1c with zero additional sequencing time -- all you need to do is check the appropriate box for additional FASTQ output -- and your overall process should therefore not take any longer.

I am very sorry that your old scripts don't work -- I hope it's not too much trouble to update them to use the FASTQ files generated by MinKNOW. Do note that MinKNOW can compress the FASTQ files if that would help, and can also output BAM files instead if that would be more useful in your downstream analysis.
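As a rough sketch of what updating a script to consume the FASTQ files written by MinKNOW might look like, the helper below reads records from a plain or gzip-compressed FASTQ file using only the Python standard library. The name read_fastq is hypothetical and not part of ont_fast5_api. The sketch assumes four-line records (header, sequence, '+', quality) and treats any path ending in .gz as gzip-compressed.

```python
import gzip
from pathlib import Path


def read_fastq(path):
    """Yield (header, sequence, quality) tuples from a FASTQ file.

    Assumes four-line records; paths ending in .gz are opened with gzip.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            # Drop the leading '@' from the header line.
            yield header[1:], sequence, quality
```

Because the function is a generator, a downstream step can process one record at a time without loading the whole file into memory.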

avielmiluz commented 2 years ago

Hi @fbrennen, I see. I am now also sure that we had basecalling enabled, just not the FASTQ output. The function get_called_sequence probably extracts the basecalled sequence from the fast5, so it does not work with the new structure. I will enable FASTQ output on the device. The only disadvantages will probably be the storage size and the time to copy and transfer files from the device to the pipeline, which resides on another computer. We do use VBZ compression.

Thank you for your help.