ktan8 / nanopore_telomere_basecall

14 stars 3 forks source link

Stuck on basecalling problematic reads, fast5 format issue? #3

Closed santiago-es closed 2 years ago

santiago-es commented 2 years ago

Hi,

Thanks for making this tool available to the community! I've been attempting to test this out on our own nanopore data and ran into an issue at the basecalling problematic reads step.

perl main.pl /media/santiago/SESUbuntu/MinION/Fast5s/BJ-f5/ ../2_identify_problematic_reads/bj-errors.repeatcounts.filtered.readname . bj-corrected.fastq.gz
INFO:Fast5Filter:overwriting filename mapping file filename_mapping.txt --:--:--
INFO:Fast5Filter:overwriting multiread file batch0.fast5
| 13 of 13|##################################################|100% Time: 0:00:00
INFO:Fast5Filter:0 reads extracted
WARNING:Fast5Filter:3 reads not found!
sh: 1: bonito: not found

above is the input and stdout for this step. The problematic readname file is bj-errors.repeatcounts.filtered.readname (output from step 1).

The fast5 folder contains multi-read fast5 files. There are certainly reads within them but none of the problematic reads identified are being extracted. Do the fast5 files in the input folder need to be in single-read format for this program to work? Or what else could be going wrong here?

Thank you for your help!

FYI here is an example line from the *.readname problematic read file: 9a4a30e3-635c-461c-883e-7c37d600ceb2 runid=f37187eff9704afa515007969c05694256d59a2f read=1828 ch=161 start_time=2022-04-19T19:44:24.496900-07:00 flow_cell_id=FAS04609 protocol_group_id=04-19-22_HelaPARNKO_BJ_TeloSeq sample_id=41922-3 barcode=barcode04 barcode_alias=barcode04

ktan8 commented 2 years ago

Hi Santiago,

From the errors, it seems that the bonito basecaller has not been installed in your environment.

I have updated the README to include more information on the required packages. It can be found at the following link. https://github.com/ktan8/nanopore_telomere_basecall#dependencies

Please let me know if you have further issues.

KT

santiago-es commented 2 years ago

Hey there,

Thanks for your help with this! I'm continuing to troubleshoot some steps in this pipeline. Now I'm on the 3basecalling problematic reads step and although I have 3 reads in the .readnames file marked for correction, the perl script from 3* does not seem to successfully basecall those reads even though the script runs without failure.

Here's the command line input and the STDOUT from the command

`perl 3_basecall_problematic_reads/main.pl /oak/stanford/projects/genomics/ses94/bj-fast5/ bj-errors.problematic.repeatcounts.filtered.readname . bj.corrected.fasta.gz INFO:Fast5Filter:overwriting filename mapping file filename_mapping.txt | 0% ETA: --:--:-- INFO:Fast5Filter:overwriting multiread file batch0.fast5 DEBUG:h5py._conv:Creating converter from 5 to 3 | 13 of 13|########################################################################################################################|100% Time: 0:00:00 INFO:Fast5Filter:0 reads extracted WARNING:Fast5Filter:3 reads not found!

loading model 3_basecall_problematic_reads/../1_bonito_basecalling_model/chm13_nanopore_trained_run225/ outputting unaligned fastq completed reads: 0
duration: 0:00:01 samples per second 0.0E+00 done`

the output file, bj.corrected.fasta.gz is empty. I suspect it has something to do with this warning "3 reads not found!". the same Fast5 directory was used, however, and this result is the same whether I run the 3_* perl script as a standalone, or try to run the fullpipeline perl script.

Any idea on what could be causing the issue? I'm running this now on the GPU compute node on the Stanford sherlock computing cluster thanks again!

ktan8 commented 2 years ago

Hi,

Given the "WARNING:Fast5Filter:3 reads not found!" error when there were only 3 candidate reads, it seems to me that the fast5 files that you are pointing to do not contain the same readnames as the initial fastq/fasta that you are using for your run? So one thing I would check is to see if the fast5 files indeed correspond to the initial fasta/fastq files. Further, you can double check manually if the read names in the "bj-errors.problematic.repeatcounts.filtered.readname" file can be found in the fast5 files you used. If these reads cannot be found, there may be an issue with the fast5 files you are using.

Hope this helps.

KT

k-Vartika commented 2 years ago

Hi @santiago-es and @ktan8,

I also encountered the same issue. It turns out that there was a problem in 2_identify_problematic_reads step. The .readname file has the full read details like @santiago-es mentioned,

9a4a30e3-635c-461c-883e-7c37d600ceb2 runid=f37187eff9704afa515007969c05694256d59a2f read=1828 ch=161 start_time=2022-04-19T19:44:24.496900-07:00 flow_cell_id=FAS04609 protocol_group_id=04-19-22_HelaPARNKO_BJ_TeloSeq sample_id=41922-3 barcode=barcode04 barcode_alias=barcode04

But in the fast5 file, the read name is just 9a4a30e3-635c-461c-883e-7c37d600ceb2. Since the program searches for the read name with "runid= ..barcode_alias=barcode04", it does not find the read in fast5 file.

A solution to over come the problem is to change the code at this step in 2_identify_problematic_reads step/main.pl file : system("cut -f1 $repeat_count_file_filtered > $repeat_count_file_filtered_readname"); The default delimiter is tab but in our case to get just the read name, specify space as a delimiter system("cut -d " " -f1 $repeat_count_file_filtered > $repeat_count_file_filtered_readname");

Hope this helps @santiago-es.

I not sure about how it works for @ktan8, maybe difference in the read names (the read details are delimited by tab)? This might be due the different initial base caller used (Guppy or Bonito).

Thanks & Regards, Vartika