Trouble with Tombo pipeline for RNA-Direct Sequencing

Hi, I'm trying to perform a m6a (on RNA) detection using tombo (and some other softwares). I'm having trouble performing the resquiggle, I've read many of the other opened issues and I can't find a way for it to work.

Here is what I tried: I started with the fast5 files from the Mk1c nanopore equipment of an RNA-Direct Sequencing, these are multiple files containing 4000 (by default) signals.

multi_to_single_fast5 --input_path /.../fast5_total/ --save_path /.../fast5_single_total --recursive -t 12

On the single fast5 files I basecalled using megalodon

megalodon /.../fast5_single_total \
    --guppy-config rna_r9.4.1_70bps_hac.cfg \
    --outputs basecalls mappings \
    --rna \
    --guppy-server-path /.../guppy_basecall_server \
    --reference /.../At_transcriptome.fa \
    --output-directory /.../basecall_single_local_megalodon \
    --devices 0 \
    --overwrite \
    --processes 12

After the basecalling was done I used the tombo preprocess annotate_raw_with_fastqs

tombo preprocess annotate_raw_with_fastqs --overwrite --fast5-basedir /.../fast5_single_total --fastq-filenames /.../basecall_single_local_megalodon/basecalls.fastq

Then the single fast5 files was created (using --batch_size 900000 )

single_to_multi_fast5 --input_path /.../fast5_single_total --save_path /.../fast5_multi_total --filename_base batch_output -t 12 --recursive --batch_size 900000

Finally the resquiggle

tombo resquiggle /.../fast5_multi_total /.../At_transcriptome.fa --processes 12 --num-most-common-errors 5

which throws the following message

[11:17:04] Loading minimap2 reference.
[11:17:06] Getting file list.
******************** ERROR ********************
    Reads do not to contain basecalls. Check --basecall-group option if basecalls are stored in non-standard location or use `tombo annotate_raw_with_fastqs` to add basecalls from FASTQ files to raw FAST5 files.

I have tried using tombo preprocess annotate_raw_with_fastqs again on the big multi fast5 file, and still doesn't work:

[11:36:13] Preparing reads and extracting read identifiers.
****** WARNING ****** Basecalls exsit in specified slot for some reads. Set --overwrite option to overwrite these basecalls.                        
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.87it/s]
[11:36:13] Annotating FAST5s with sequence from FASTQs.
****** WARNING ****** Some FASTQ records contain read identifiers not found in any FAST5 files or sequencing summary files.
0it [00:09, ?it/s]
[11:36:22] Added sequences to a total of 0 reads.

I'm really missing which is my mistake on the pipeline. I even tried this using a genome (rather than a transcriptome) and using guppy (rather than megalodon).

Perhaps something from the fast5 files I'm starting with? I did a regular basecall+mapping and the data looks perfect on the IGV

I would really appreciate your help.

Thank you

@Lucas-Servi Did you ever get this working? This is what I have done so far. I don't think you are converting fast5 files correctly. Tombo requires single-fast5 files (one read per file, but multiple files) so you must use the multi_to_single fast5 command. However, you are using single_to_multi_fast5 (we don't want multi-fast5).

Run guppy basecaller to obtain fastq files.

./guppy_basecaller --input_path /path/to/fast5_dir \
                                 --save_path /path/to/fastq_dir \
                                 --config rna_r9.4.1_70bps_hac.cfg \
                                 --compress_fastq \
                                 --records_per_fastq 0 \
                                 --disable_pings \
                                 --device cuda:all \
                                 --recursive \
                                 --chunks_per_runner 1024 \
                                 --min_qscore 10

Run multi_to_single_fast5. Guppy output is currently in multi-fast5 format (default) and we want single-fast5.

multi_to_single_fast5 --input_path /path/to/fast5_dir \
                      --save_path /path/to/single_fast5_dir \
                      --threads 16 \
                      --recursive

# Merge FASTQ files
cd /path/to/fastq_dir/pass
cat fastq_runid_*fastq.gz > sample1_pass.fastq.gz
gunzip sample1_pass.fastq.gz
cd /path/to/fastq/fail
cat fastq_runid_*fastq.gz > sample1_fail.fastq.gz
gunzip sample1_fail.fastq.gz

Fix the sequencing_summary file

#!/usr/bin/python3

with open('../fastq/sequencing_summary.txt') as file, open('../single_fast5_dir/sequencing_summary_fix.txt', 'w') as outfile:
    header = next(file)
    outfile.write(header)
    for line in file:
        line = line.split()
        line[0] = f"{line[1]}.fast5"
        line.append('\n')
        outfile.write("\t".join(line))

Run tombo preprocess to add basecalls from fastq file to single-fast5 files.

# Set variables
pass=/path/to/fastq_dir/pass/sample1_pass.fastq
fail=/path/to/fastq_dir/fail/sample1_fail.fastq

# Add basecalled sequence from FASTQs to single-FAST5s
tombo preprocess annotate_raw_with_fastqs --fast5-basedir /path/to/single_fast5_dir \
                                          --fastq-filenames $pass $fail \
                                          --overwrite \
                                          --sequencing-summary-filenames /path/to/single_fast5/sequencing_summary_fix.txt \
                                          --processes 50

I am currently running tombo resquiggle but am troubleshooting. This is the current command I have.

tombo resquiggle --rna \
                 --q-score 10 \
                 --processes 5 \
                 --num-most-common-errors 5 \
                 --ignore-read-locks \
                 --overwrite \
                 --threads-per-process 50 \
                 /path/to/single_fast5_dir \
                 /references/mouse/Mus_musculus.GRCm39.cdna.all.fa

nanoporetech / tombo

Trouble with Tombo pipeline for RNA-Direct Sequencing #420