Seongmin-Jang-1165 commented 1 day ago

Issue Report

Please describe the issue:

Hello, I am currently performing target-specific custom barcode demultiplexing using Direct RNA seq data.

In the options, there is a setting for mask1_front, and the explanation states:

(Required) The leading flank for the front barcode (applies to single and double ended barcodes). Can be an empty string.

From my understanding, this option is for specifying the flank sequence for the front-attached barcode.

In my Direct RNA seq library, adapters are only attached to the rear end of the read, and I have inserted barcodes into this adapter sequence.

In this case, how should I adjust this option? I am thinking of setting it as follows:

mask1_front = ""
mask1_rear = ""
mask2_front = ""
mask2_rear = "GGCC"

What do you think of this approach?

For reference, although this is target-specific, there are multiple targets, so it’s difficult to define a single flank sequence for the front barcode. However, the rear part is clear since I can identify the custom-specific adapter sequence from the Direct RNA seq manual.

Steps to reproduce the issue:

Please list any steps to reproduce the issue.

Run environment:

Dorado version: 0.8.0
Dorado command: /home/rnagenomics/sm/Nanopore/20240923_Histone_Direct_RNA_seq/dorado-0.8.0-linux-x64/bin/dorado basecaller sup --no-trim --barcode-arrangement barcode_arra.toml --barcode-sequences barcode_sequence.fastq /home/rnagenomics/sm/Nanopore/20240923_Histone_Direct_RNA_seq/rawdata/data/240923histone/HepG2/20240923_1720_P2S-01504-A_PAW01284_59f98b81/pod5_total/ > DORADO_Barcode_basecall_3.bam
Operating system:
Hardware (CPUs, Memory, GPUs) : A100
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): POD5
Source data location (on device or networked drive - NFS, etc.):
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)

malton-ont commented 1 day ago

Hi @Seongmin-Jang-1165,

You should specify your flanks in mask1_* and additionally set rear_only_barcodes = true. You may want a longer flank sequence to ensure the mask is found correctly given you have only one side specified, and you may need to tweak the scoring parameters.
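
To make the suggestion above concrete, here is a minimal Python sketch that renders the relevant part of an arrangement file. Only the keys named in this thread (mask1_front, mask1_rear, rear_only_barcodes) are shown. The flank strings are placeholders rather than real adapter sequences, and the helper names are hypothetical. A complete arrangement file needs further fields that are omitted here, so treat this as an illustration and not a working configuration.

```python
# Sketch only: flank values are placeholders, not real adapter sequences.

def toml_value(value):
    """Format a Python bool, int, or string as a TOML literal."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    return '"' + str(value) + '"'


def render_arrangement(settings):
    """Return the text of an [arrangement] table built from a dict."""
    lines = ["[arrangement]"]
    for key, value in settings.items():
        lines.append(f"{key} = {toml_value(value)}")
    return "\n".join(lines) + "\n"


arrangement = {
    # Flanks on either side of the barcode inside the rear adapter.
    "mask1_front": "LEFT_FLANK_SEQ",   # placeholder, replace with your adapter bases
    "mask1_rear": "RIGHT_FLANK_SEQ",   # placeholder, replace with your adapter bases
    # Tell dorado the barcode is only expected at the rear of the read.
    "rear_only_barcodes": True,
}

toml_text = render_arrangement(arrangement)
print(toml_text)
```

Keeping the flanks in one place like this makes it easier to try longer flank sequences and compare the classification results.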

Seongmin-Jang-1165 commented 20 hours ago

Hello @malton-ont, thank you for the advice!

Although this is not directly related to the previous topic, I have a question I would like to ask.

My current plan is to perform basecalling, then demultiplex using custom barcode analysis, and subsequently categorize the raw signal (POD5) according to the demultiplexing results.

When looking at the basecalled data, each read has a unique read_id, and I am thinking of using this to match it with the raw data for classification.

Would it be possible to do this? If so, how can it be done? Is there an already established method for this?

I would appreciate your advice on this matter. Thank you!

malton-ont commented 18 hours ago

Yes, this should be possible. Note that any reads that have been split will have new read-ids, so you'll need to look at the pi tag to get the id of the corresponding parent read that would be present in the pod5 file.

You'll probably want to take a look at https://pypi.org/project/pod5/, particularly the filter and subset commands, but that discussion may be better placed on the community forums as that isn't a dorado issue.
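
As a rough illustration of the read-id matching described above, the sketch below builds a parent read id to barcode table from SAM text records (for example, the output of `samtools view` on the basecalled BAM). It assumes the barcode is stored in the BC tag and uses the pi tag, when present, as the parent read id. The function names and the example records are hypothetical. Parents whose split reads received different barcodes are left out of the table so they can be inspected separately. A table like this could then be passed to the pod5 tools mentioned above; check the pod5 documentation for the exact options.

```python
import csv
import io


def parent_read_table(sam_lines):
    """Map each parent read id to the set of barcodes seen for it."""
    table = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        read_id = fields[0]
        tags = {}
        for field in fields[11:]:  # optional tags start at column 12
            parts = field.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        parent_id = tags.get("pi", read_id)      # split reads point at their parent
        barcode = tags.get("BC", "unclassified")  # assumption: barcode is in the BC tag
        table.setdefault(parent_id, set()).add(barcode)
    return table


def write_mapping(table, handle):
    """Write read_id,barcode rows for parents with exactly one barcode."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["read_id", "barcode"])
    for parent_id in sorted(table):
        barcodes = table[parent_id]
        if len(barcodes) == 1:
            writer.writerow([parent_id, next(iter(barcodes))])


# Made-up records: one unsplit read, and two parents that were each split in two.
UNALIGNED = "\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\t"
example_sam = [
    "@HD\tVN:1.6\tSO:unknown",
    "read-a" + UNALIGNED + "BC:Z:BC01",
    "child-1" + UNALIGNED + "pi:Z:read-b\tBC:Z:BC02",
    "child-2" + UNALIGNED + "pi:Z:read-b\tBC:Z:BC02",
    "child-3" + UNALIGNED + "pi:Z:read-c\tBC:Z:BC01",
    "child-4" + UNALIGNED + "pi:Z:read-c\tBC:Z:BC02",
]

table = parent_read_table(example_sam)
buffer = io.StringIO()
write_mapping(table, buffer)
print(buffer.getvalue())
```

In this example read-c is skipped because its two child reads were assigned different barcodes, so it cannot be placed in a single group without a further decision.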

billytcl commented 15 hours ago

Just jumping into this thread with a Q: do the barcodes in RNA004 have to be RNA, or can they be DNA? It's unclear to me what the basecaller would do for read trimming as I'm guessing it removes a DNA-associated signal.

E.g., if I have ADAPTER-BARCODE-AAAAA-RNA, does the barcode have to be RNA or can it be DNA?

Seongmin-Jang-1165 commented 5 hours ago

@malton-ont Thank you for the reply! I'll try it.

Hello @billytcl, the SQK-RNA004 Direct RNA sequencing kit provides an adapter for poly(A)+ RNA (the RTA), and its protocol also describes a target-specific custom adapter. I prepared my library with a custom adapter.

The custom adapter is made from two DNA primers with partially complementary sequences, so an annealing step is needed when preparing the library.

I don't know the details of the RTA, but it seems to be composed of DNA strands as well.

Checking the library protocol should help: https://nanoporetech.com/document/direct-rna-sequencing-sequence-specific-sqk-rna004

There are also a few articles about demultiplexing direct RNA-seq data. One reports that the raw signal is very different between the RNA region of the read and the adapter region, because one is RNA and the other is DNA (https://genome.cshlp.org/content/30/9/1345).

In addition, the Dorado manual says it detects the DNA adapter sequence, so I assume it trims the DNA adapter automatically. There is also a --no-trim option that disables adapter trimming.

So I think it is fine to make the barcode from DNA.

I had a similar question, and this is what I found when I looked into it. Please let me know if any of it is wrong or out of date.

nanoporetech / dorado

Question Regarding mask1_front and Barcode Demultiplexing in Direct RNA Seq #1060
