nanobaselib / NanoBaseLib

[NeurIPS 2024 D&B] A Multi-task Benchmark Dataset for Nanopore Sequencing.
https://nanobaselib.github.io
Apache License 2.0
3 stars 0 forks source link

<p align="center" "> Website | Paper (Coming Soon) | Poster | Tutorials

Demo Dataset

The processed demo dataset (1.1GB) can be downloaded from Zenodo.

NanoBaseLib/
    ├── base_calling/
    ├── dataprep/
    ├──  ... 
    ├── demo_dataset/
    │    ├── 0_reference/
    │    │        └── ref.fa
    │    ├── 1_raw_signal/
    │    │        ├── single_fast5/
    │    │        ├── multi_fast5/
    │    │        └── multi_pod5/
    │    ├── 2_base_called/
    │    │        ├── dorado/
    │    │        │     └── dorado.bam
    │    │        ├── guppy/
    │    │        │     ├── fail/
    │    │        │     ├── pass/
    │    │        │     └── workspace/
    │    │        ├── demo_dataset.fastq
    │    │        └── reads-ref.sorted.filter.bam
    │    ├── 3_tailfindr/
    │    │        └──  tails.csv
    │    ├── 4_nanopolish/
    │    │        ├── eventalign.txt
    │    │        ├── eventalign_combined.txt
    │    │        ├── polya.tsv
    │    │        └── polya-pass-only-with-head.tsv
    │    ├── 5_tombo/
    │    │        └── single_reads/
    │    │                ├── tombo_resquiggle.txt
    │    │                └── tombo_summary.txt
    │    └── 6_segpore/
    └── demo_dataset_base_calling/
        ├──  fast5/
        ├──  input/
        ├──  label/
        └──  output/  

Preprocessing Pipeline

git clone https://github.com/nanobaselib/NanoBaseLib.git
cd NanoBaseLib

Download the clean demo dataset (319.1 MB), unzip it, and place it in the NanoBaseLib folder. Your directory structure should look like the example above.

Step 1: Data Standardization

If the format of your data is single-fast5, convert it into multi-fast5. Not need for demo dataset.

cd demo_dataset
single_to_multi_fast5 --input_path 1_raw_signal/single_fast5 \
--save_path 1_raw_signal/multi_fast5 \
--filename_base demo --batch_size 4000 --recursive

If Dorado base caller is needed, convert multi-fast5 to pod5.

pod5 convert fast5 1_raw_signal/multi_fast5/*.fast5  \
--output 1_raw_signal/multi_pod5 \
--one-to-one 1_raw_signal/multi_fast5

Step 2: Base Calling

guppy_basecaller -c rna_r9.4.1_70bps_hac.cfg --num_callers 20 --cpu_threads_per_caller 20 \
                 -i 1_raw_signal/multi_fast5 -s 2_base_called/guppy  --fast5_out
dorado basecaller rna002_70bps_hac@v3 1_raw_signal/multi_pod5 \
                  --estimate-poly-a > 2_base_called/dorado/dorado.bam 

Step 3: Mapping

cd 2_base_called
find guppy -type f -name "*.fastq" -exec cat {} + > demo_dataset.fastq
nanopolish index --directory=../1_raw_signal/multi_fast5 demo_dataset.fastq
minimap2 -ax map-ont --MD -t 8 --secondary=no ../0_reference/ref.fa demo_dataset.fastq | samtools sort -o reads-ref.sorted.bam -T reads.tmp

image

samtools view -b -F 2324 reads-ref.sorted.bam > reads-ref.sorted.filter.bam
samtools index reads-ref.sorted.filter.bam
samtools quickcheck reads-ref.sorted.filter.bam
samtools view -h -o reads-ref.sorted.filter.sam reads-ref.sorted.filter.bam

Step 4: PolyA Detection

nanopolish polya --threads=32 --reads=demo_dataset.fastq --bam=reads-ref.sorted.filter.bam \
                 --genome=../0_reference/ref.fa > ../4_nanopolish/polya.tsv
grep -E 'PASS|readname' ../4_nanopolish/polya.tsv > ../4_nanopolish/polya-pass-only-with-head.tsv
library(tailfindr)
df <- find_tails(fast5_dir = '2_base_called/guppy/workspace',
                 save_dir = '3_tailfindr',
                 csv_filename = 'tails.csv',
                 num_cores = 20)

Step 5: Segmentation and Event Alignmnet

nanopolish eventalign --reads demo_dataset.fastq \
--bam reads-ref.sorted.filter.bam \
--genome ../0_reference/ref.fa \
--signal-index \
--scale-events \
--summary ../4_nanopolish/summary.txt \
--threads 32 > ../4_nanopolish/eventalign.txt
cd ..
mkdir 5_tombo/single_reads
for i in `ls 2_base_called/guppy/workspace`; do multi_to_single_fast5 --input_path 2_base_called/guppy/workspace/${i}  --save_path 5_tombo/single_reads --recursive -t 20; done
tombo resquiggle 5_tombo/single_reads 0_reference/ref.fa --overwrite --processes 20

image

NanoBaseLib Commands

combine_nanopolish_eventalign

# cd NanoBaseLib
python dataprep/combine_nanopolish_eventalign.py \
--input_file demo_dataset/4_nanopolish/eventalign.txt  \
--base_type rna --n_processes 10

extract_tombo_resquiggle

python dataprep/extract_tombo_resquiggle.py --root_dir demo_dataset/5_tombo/single_reads --folder 0

generate_rna_dataset

mkdir demo_dataset_base_calling
python base_calling/generate_rna_dataset.py \
--input_file demo_dataset/4_nanopolish/eventalign.txt \
--summary_file demo_dataset/4_nanopolish/summary.txt \
--reference_file demo_dataset/0_reference/ref.fa \
--fast5_folder demo_dataset/1_raw_signal/multi_fast5 \
--save_folder demo_dataset_base_calling

image

base_calling_dataloader

python base_calling/base_calling_dataloader.py --root_dir demo_dataset_base_calling

You can use the Dataloader to develop a new model for base calling and evaluate its performance using the base_calling_evaluation.py script.