Annotating raw file issues

sumin5784 commented 3 years ago

Hello,

I ran resquiggle and got errors, so I tried annotating the raw files with FASTQs instead, but I keep getting strange results. I tried twice: Tombo reads all the FAST5 read identifiers, but nothing gets annotated. I'm using a published dataset, so I'm not sure whether the files themselves are the problem. I downloaded the data from NCBI SRA. The FAST5 files are single-read FAST5s, but I only have one FASTQ file, which I extracted with fastq-dump from the SRA Toolkit.

If the FASTQ file is the problem, what can I do next? Do I need to do the basecalling myself? Any feedback would be appreciated.

GenomeFASTA="/workspace/nanopore_mRNA/AnnotationData/CDS/hg19/GENCODE_V38_hg19_Transcripts.fa"
Main="/workspace/nanopore_mRNA/Sumin/HEK293T_ERR4706161/WT-rep1-00"

tombo resquiggle --processes 128 ${Main}/fast5 ${GenomeFASTA} 
[04:55:20] Final unsuccessful reads summary (100.0% reads unsuccessfully processed; 1040661 total reads):
   100.0% (1040661 reads) : Fastq slot not present in --basecall-group                                      
[04:55:20] Saving Tombo reads index to file.

GenomeFASTA="/workspace/nanopore_mRNA/AnnotationData/CDS/hg19/GENCODE_V38_hg19_Transcripts.fa"
Main="/workspace/nanopore_mRNA/Sumin/HEK293T_ERR4706161/WT-rep1-00"
tombo preprocess annotate_raw_with_fastqs --fast5-basedir ${Main}/fast5 \
                                          --fastq-filenames ${Main}/fastq/WT-rep1.fastq \
                                          --processes 128

[09:34:03] Preparing reads and extracting read identifiers.
****** WARNING ****** Basecalls exsit in specified slot for some reads. Set --overwrite option to overwrite these basecalls.                                                               
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1040661/1040661 [02:56<00:00, 5903.13it/s]
[09:37:03] Annotating FAST5s with sequence from FASTQs.
****** WARNING ****** Some FASTQ records contain read identifiers not found in any FAST5 files or sequencing summary files.                                                                          
0it [00:28, ?it/s]                                                                                                                                                                                   
[09:37:32] Added sequences to a total of 0 reads.

GenomeFASTA="/workspace/nanopore_mRNA/AnnotationData/CDS/hg19/GENCODE_V38_hg19_Transcripts.fa"
Main="/workspace/nanopore_mRNA/Sumin/HEK293T_ERR4706161/WT-rep1-00"
tombo preprocess annotate_raw_with_fastqs --fast5-basedir ${Main}/fast5 \
                                          --fastq-filenames ${Main}/fastq/WT-rep1.fastq \
                                          --processes 128 --overwrite
[09:42:25] Preparing reads and extracting read identifiers.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1040661/1040661 [03:06<00:00, 5570.82it/s]
[09:45:58] Annotating FAST5s with sequence from FASTQs.
****** WARNING ****** Some FASTQ records contain read identifiers not found in any FAST5 files or sequencing summary files.                                                                          
  0%|                                                                                                                                                                    | 0/1040661 [00:25<?, ?it/s]
[09:46:26] Added sequences to a total of 0 reads.
******************** WARNING ********************
        Not all read ids from FAST5s or sequencing summary files were found in FASTQs.
                This can result from reads that failed basecalling or if full sets of FAST5s/sequence summaries are not processed with full sets of FASTQs.

Regards, Sumin
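
One thing worth checking before re-basecalling: the "read identifiers not found in any FAST5 files" warning together with "Added sequences to a total of 0 reads" would be consistent with the FASTQ headers not carrying the original nanopore read IDs (fastq-dump can rewrite deflines to accession-style names). This is only a guess from the logs above. A minimal sketch to test it, assuming plain 4-line FASTQ records and that FAST5 read IDs are UUID-shaped; the function name count_uuid_headers is made up for illustration:

```shell
# Count FASTQ records whose first header token looks like a nanopore read UUID.
# Assumption: 4 lines per record (no wrapped sequence or quality lines).
# Prints "<uuid_like_headers> <total_records>".
count_uuid_headers() {
    awk '
        NR % 4 == 1 {
            total++
            id = $1
            sub(/^@/, "", id)
            # UUIDs are 36 characters: five hex groups joined by hyphens.
            if (length(id) == 36 &&
                id ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/) {
                uuid++
            }
        }
        END { printf "%d %d\n", uuid, total }
    ' "$1"
}
```

For example, count_uuid_headers "${Main}/fastq/WT-rep1.fastq" prints two numbers. If the first is 0, the headers probably do not match the read IDs stored in the FAST5 files, and Tombo has nothing to join on. In that case it may be worth trying fastq-dump with -F (--origfmt), which keeps only the original sequence name in the defline, provided the submitter's original names were preserved in the SRA record.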

bhargava-morampalli commented 2 years ago

What I usually do is basecall the data again with the --fast5_out option. That way the FAST5 files already contain the sequence information. Then I convert them into single-read FAST5 files and proceed with the resquiggle command.

ky66 commented 2 years ago

> What I usually do is basecall the data again with the --fast5_out option. That way the FAST5 files already contain the sequence information. Then I convert them into single-read FAST5 files and proceed with the resquiggle command.

Not possible with guppy.

sahoo2000 commented 5 months ago

It is possible with Guppy, but you need an earlier version (I would recommend any version before 6.3), because the --fast5_out flag was deprecated in recent versions.
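
The re-basecall route described in this thread could be sketched roughly as below. This is an outline under several assumptions, not a tested pipeline: it assumes a Guppy release that still accepts --fast5_out, that Guppy writes the annotated FAST5 files to a workspace directory under its save path, and that multi_to_single_fast5 from ont_fast5_api is installed. The function names, directory layout, and the config argument are placeholders. By default the script only prints the commands (DRY_RUN=1) so they can be reviewed first.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: rebasecall_and_resquiggle RAW_DIR WORK_DIR REFERENCE_FASTA GUPPY_CONFIG
# GUPPY_CONFIG must match the flow cell and kit; it is not chosen here.
rebasecall_and_resquiggle() {
    raw_dir=$1
    work_dir=$2
    reference=$3
    guppy_config=$4

    # 1. Basecall again, writing basecalls back into FAST5 output.
    run guppy_basecaller --input_path "$raw_dir" \
        --save_path "$work_dir/basecalled" \
        --config "$guppy_config" --recursive --fast5_out

    # 2. Split the output into single-read FAST5 files, which Tombo expects.
    run multi_to_single_fast5 --input_path "$work_dir/basecalled/workspace" \
        --save_path "$work_dir/single_fast5" --recursive

    # 3. Resquiggle the annotated single-read files against the reference.
    run tombo resquiggle "$work_dir/single_fast5" "$reference" --processes 8
}
```

Running rebasecall_and_resquiggle "${Main}/fast5" "${Main}/rebasecalled" "${GenomeFASTA}" followed by the config name prints the three commands; setting DRY_RUN=0 executes them.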

marcus1487 commented 2 months ago

I would recommend switching to Remora for raw signal alignment. It takes standard POD5 and BAM files as input, which greatly simplifies these issues.

nanoporetech / tombo

Annotating raw file issues #358