nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.
Other
230 stars 54 forks source link

Trouble with Tombo pipeline for RNA-Direct Sequencing #420

Open Lucas-Servi opened 1 year ago

Lucas-Servi commented 1 year ago

Hi, I'm trying to perform a m6a (on RNA) detection using tombo (and some other softwares). I'm having trouble performing the resquiggle, I've read many of the other opened issues and I can't find a way for it to work.

Here is what I tried: I started with the fast5 files from the Mk1c nanopore equipment of an RNA-Direct Sequencing, these are multiple files containing 4000 (by default) signals.

multi_to_single_fast5 --input_path /.../fast5_total/ --save_path /.../fast5_single_total --recursive -t 12

On the single fast5 files I basecalled using megalodon

megalodon /.../fast5_single_total \
    --guppy-config rna_r9.4.1_70bps_hac.cfg \
    --outputs basecalls mappings \
    --rna \
    --guppy-server-path /.../guppy_basecall_server \
    --reference /.../At_transcriptome.fa \
    --output-directory /.../basecall_single_local_megalodon \
    --devices 0 \
    --overwrite \
    --processes 12

After the basecalling was done I used the tombo preprocess annotate_raw_with_fastqs

tombo preprocess annotate_raw_with_fastqs --overwrite --fast5-basedir /.../fast5_single_total --fastq-filenames /.../basecall_single_local_megalodon/basecalls.fastq

Then the single fast5 files was created (using --batch_size 900000 )

single_to_multi_fast5 --input_path /.../fast5_single_total --save_path /.../fast5_multi_total --filename_base batch_output -t 12 --recursive --batch_size 900000

Finally the resquiggle

tombo resquiggle /.../fast5_multi_total /.../At_transcriptome.fa --processes 12 --num-most-common-errors 5

which throws the following message

[11:17:04] Loading minimap2 reference.
[11:17:06] Getting file list.
******************** ERROR ********************
    Reads do not to contain basecalls. Check --basecall-group option if basecalls are stored in non-standard location or use `tombo annotate_raw_with_fastqs` to add basecalls from FASTQ files to raw FAST5 files.

I have tried using tombo preprocess annotate_raw_with_fastqs again on the big multi fast5 file, and still doesn't work:

[11:36:13] Preparing reads and extracting read identifiers.
****** WARNING ****** Basecalls exsit in specified slot for some reads. Set --overwrite option to overwrite these basecalls.                        
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.87it/s]
[11:36:13] Annotating FAST5s with sequence from FASTQs.
****** WARNING ****** Some FASTQ records contain read identifiers not found in any FAST5 files or sequencing summary files.
0it [00:09, ?it/s]
[11:36:22] Added sequences to a total of 0 reads.

I'm really missing which is my mistake on the pipeline. I even tried this using a genome (rather than a transcriptome) and using guppy (rather than megalodon).

Perhaps something from the fast5 files I'm starting with? I did a regular basecall+mapping and the data looks perfect on the IGV

I would really appreciate your help.

Thank you

kenneditodd commented 1 year ago

@Lucas-Servi Did you ever get this working? This is what I have done so far. I don't think you are converting fast5 files correctly. Tombo requires single-fast5 files (one read per file, but multiple files) so you must use the multi_to_single fast5 command. However, you are using single_to_multi_fast5 (we don't want multi-fast5).

  1. Run guppy basecaller to obtain fastq files.
    ./guppy_basecaller --input_path /path/to/fast5_dir \
                                     --save_path /path/to/fastq_dir \
                                     --config rna_r9.4.1_70bps_hac.cfg \
                                     --compress_fastq \
                                     --records_per_fastq 0 \
                                     --disable_pings \
                                     --device cuda:all \
                                     --recursive \
                                     --chunks_per_runner 1024 \
                                     --min_qscore 10
  2. Run multi_to_single_fast5. Guppy output is currently in multi-fast5 format (default) and we want single-fast5.

    multi_to_single_fast5 --input_path /path/to/fast5_dir \
                          --save_path /path/to/single_fast5_dir \
                          --threads 16 \
                          --recursive
    
    # Merge FASTQ files
    cd /path/to/fastq_dir/pass
    cat fastq_runid_*fastq.gz > sample1_pass.fastq.gz
    gunzip sample1_pass.fastq.gz
    cd /path/to/fastq/fail
    cat fastq_runid_*fastq.gz > sample1_fail.fastq.gz
    gunzip sample1_fail.fastq.gz
  3. Fix the sequencing_summary file

    #!/usr/bin/python3
    
    with open('../fastq/sequencing_summary.txt') as file, open('../single_fast5_dir/sequencing_summary_fix.txt', 'w') as outfile:
        header = next(file)
        outfile.write(header)
        for line in file:
            line = line.split()
            line[0] = f"{line[1]}.fast5"
            line.append('\n')
            outfile.write("\t".join(line))
  4. Run tombo preprocess to add basecalls from fastq file to single-fast5 files.

    # Set variables
    pass=/path/to/fastq_dir/pass/sample1_pass.fastq
    fail=/path/to/fastq_dir/fail/sample1_fail.fastq
    
    # Add basecalled sequence from FASTQs to single-FAST5s
    tombo preprocess annotate_raw_with_fastqs --fast5-basedir /path/to/single_fast5_dir \
                                              --fastq-filenames $pass $fail \
                                              --overwrite \
                                              --sequencing-summary-filenames /path/to/single_fast5/sequencing_summary_fix.txt \
                                              --processes 50
  5. I am currently running tombo resquiggle but am troubleshooting. This is the current command I have.
    tombo resquiggle --rna \
                     --q-score 10 \
                     --processes 5 \
                     --num-most-common-errors 5 \
                     --ignore-read-locks \
                     --overwrite \
                     --threads-per-process 50 \
                     /path/to/single_fast5_dir \
                     /references/mouse/Mus_musculus.GRCm39.cdna.all.fa