Provide more details about custom barcodes with unique Illumina barcodes for the start and end of the sequence

tijyojwad commented 11 months ago

@tijyojwad Could you provide more details about custom barcodes? I have a custom demultiplex solution that is very inefficient. I'm using a board with unique Illumina barcodes for the start and end of the sequence (i5 and i7 from the IDT UDI plate, they are unique, so any detection would put the sequence read into the correct demultiplexed file). But I don't understand if the barcode sequences should be in a toml file or a separate file. I also have multiple i5 and multiple i7, should I put them in the same fasta file? In which order?

Originally posted by @gearhq in https://github.com/nanoporetech/dorado/issues/495#issuecomment-1854844230

tijyojwad commented 11 months ago

Hi @gearhq

I created a new issue for your question. Sorry I missed this earlier!

The barcode sequences need to go into a separate FASTA file. This file will look something like

>i5_01
ACTGTG
>i5_02
TGCAG
...
>i7_01
TCACT
>i7_02
GACTACG
...

then in your arrangements file, you can see

barcode1_pattern = "i5_%02i"
barcode2_pattern = "i7_%02i"
first_index = 1
last_index = <N>

where N is the maximum barcode index from your FASTA file

gearhq commented 10 months ago

Hi @tijyojwad, thank you! I appreciate your time.

I tried demultiplexing my fastq file with these barcodes (I am using 10 pairs, but I put 3 here for brevity): barcodes_dorado.fasta

>i5_LIGH103656
CTTCAGCATA
>i5_LIGH103665
TCTAATCGCA
>i5_LIGH103686
TGATAATGCG
>i7_LIGH103656
AACCGAAGGA
>i7_LIGH103665
AAGTACTCGA
>i7_LIGH103686
GCTCAACCAA

I am using this arrangement file: arrangement_file_toml

[arrangement]
name = "custom_barcode"
kit = "BC"
mask_1_front = ""
mask_1_rear = ""
mask_2_front = ""
mask_2_rear = ""
barcode1_pattern = "i5_%02i"
barcode2_pattern = "i7_%02i"
first_index = 1
last_index = 3

## Scoring options
[scoring]
min_soft_barcode_threshold = 0.2
min_hard_barcode_threshold = 0.2
min_barcode_score_dist = 0.1

I used these parameters: dorado demux --output-dir demultiplex_dorado \ --kit-name custom_barcode -t 32 \ --barcode-sequences barcodes_dorado.fasta \ --barcode-arrangement arrangement_file_toml \ --emit-fastq \ --barcode-both-ends \ allreads_HQ.fastq.gz

Then, I got this error:

[2024-01-03 13:50:50.751] [info] Barcode for custom_barcode
[2024-01-03 13:50:50.751] [info] > starting barcode demuxing
terminate called after throwing an instance of 'std::out_of_range'
  what():  [error] key "mask1_front" not found
 --> arrangement_file_toml
   |
 1 | [arrangement]
   | ~~~~~~~~~~~~~ in this table
S01_demultiplex.sh: line 23: 958948 Aborted                 (core dumped) /projects/augusto_lab/04_tools/binaries/dorado-0.5.1-linux-x64/bin/dorado demux --output-dir demultiplex_dorado --kit-name custom_barcode -t "${N_THREADS}" --barcode-sequences barcodes_dorado.fasta --barcode-arrangement arrangement_file_toml --emit-fastq --barcode-both-ends allreads_HQ.fastq.gz

For this test I used the pre-compiled package dorado-0.5.1-linux-x64. Could you help me understand what is wrong?

tijyojwad commented 10 months ago

Hi @gearhq - you found a bug in our documentation! It needs to be mask1_front/rear and mask2_front/rear

gearhq commented 10 months ago

Hi @tijyojwad , that's great!

I changed the toml file and now I get a different error:

[2024-01-03 14:17:57.648] [info] Barcode for custom_barcode
[2024-01-03 14:17:57.648] [info] > starting barcode demuxing
terminate called after throwing an instance of 'std::runtime_error'
  what():  Could not find i5_01 in pre-built or custom barcode sequences
S01_demultiplex.sh: line 23: 959365 Aborted                 (core dumped) /projects/augusto_lab/04_tools/binaries/dorado-0.5.1-linux-x64/bin/dorado demux --output-dir demultiplex_dorado --kit-name custom_barcode -t "${N_THREADS}" --barcode-sequences barcodes_dorado.fasta --barcode-arrangement arrangement_file_toml --emit-fastq --barcode-both-ends allreads_HQ.fastq.gz

So I suppose I need to provide the indices with the numbers (i5_01, i5_02, i7_01, i7_02 and so on) and not the sample names (like I did i5_LIGH103656, i7_LIGH103656 and so on), is that correct? So, to demultiplex the fastq files and use the sample names as file names, do I need to manually rename them or provide the "--sample-sheet" parameter? I'm asking this because I did not find any sample sheet templates in the barcode section of the documentation.

tijyojwad commented 10 months ago

that's right, you need to name them as i5_01 and so on.

Here's some documentation on sample sheet aliasing - https://github.com/nanoporetech/dorado/blob/master/documentation/SampleSheets.md

gearhq commented 10 months ago

Alright, I tried adding this sample spreadsheet with minimal information:

kit,experiment_id,flow_cell_id,barcode,alias
BC,ligh_pool2,flow_cell_1,i5_01,LIGH103656
BC,ligh_pool2,flow_cell_1,i5_02,LIGH103665
BC,ligh_pool2,flow_cell_1,i5_03,LIGH103686
BC,ligh_pool2,flow_cell_1,i7_01,LIGH103656
BC,ligh_pool2,flow_cell_1,i7_02,LIGH103665
BC,ligh_pool2,flow_cell_1,i7_03,LIGH103686

I updated the barcode fasta headers:

>i5_01
CTTCAGCATA
>i5_02
TCTAATCGCA
>i5_03
TGATAATGCG
>i7_01
AACCGAAGGA
>i7_02
AAGTACTCGA
>i7_03

I also updated the dorado parameters with the tsv file:

dorado demux --output-dir demultiplex_dorado \
    --kit-name custom_barcode \
    --sample-sheet barcodes_dorado.tsv \
    -t 32 \
    --barcode-sequences barcodes_dorado.fasta \
    --barcode-arrangement arrangement_file_toml \
    --emit-fastq \
    --barcode-both-ends allreads_HQ.fastq.gz

The software worked!

Unfortunately all the reads were unclassified, I still need to find out what is going on. Is there a serious problem, since I don't have the mask1_front/rear and mask2_front/rear sequences?

tijyojwad commented 10 months ago

yeah it won't really work when you provide no masks. The current algorithm expects at least one of the masks to be provided - we need to add a check for that in the config parsing so we don't allow this mode. And that's because the heuristic we use to find barcodes depends on having a flank sequence to identify the approximate location of the barcode.

Alternatively we can look into adding a flank free search, but that would potentially be lower accuracy. Generally barcodes have flanks though - would you be able to find that info?

gearhq commented 10 months ago

Yes, after some alignments and grep searches I found that most of the flanking regions are similar to the TruSeq universal adapter sequence "AATGATACGGCGACCACCGAGATCTACAC" and the TruSeq index/reverse adapter sequence "TCTGAACTCCAGTCAC". I tried various combinations of these sequences in mask1_front/rear and mask2_front/rear, but the result was all unclassified sequences. Additionally, I tried shorter versions of these sequences (e.g. just "ACAC" or "CAC" at 5' end) and got nothing classified as well. I am not sure what could be wrong now.

I think the flank free implementation would be great, in addition you could reduce the number of input files as well. Probably only the table would be easier to configure. Furthermore, the table could have less mandatory information (perhaps just the sample name and the 5' and 3' barcodes), with all other information being optional.

In my custom demultiplex solution, I implemented a flank free search using python and biopython. I used the biopython alignment algorithm allowing a configurable number of mismatches. I then looked at the Hamming distances between my barcodes and configured the allowed mismatches accordingly. It worked very well, despite being very slow and not yet having the option for trimming the barcode, adapter and primer regions. The reason I did this implementation was the fact that other solutions did not work correctly according to the tests I performed on my data. Some of them did not work on high-quality base calls, others found barcodes and flanking regions that did not match my sequences.

The reason I did this implementation was the fact that other solutions did not work correctly according to the tests I performed on my data. Some of them did not work on high-quality base calls (not being able to classify any barcode as it is happening now on demux), others found barcodes and flanking regions that did not match my sequences.

tijyojwad commented 10 months ago

Hi @gearhq - that's unfortunate to hear. Would it be possible for you to share a few reads that you're confident is barcoded from your custom script, and share that? I can take a look to see what's not working with the barcode sequences and flanks you've provided.

I've made a note to add a flank free implementation into dorado. I will let you know when we prioritize it

gearhq commented 8 months ago

Hello @tijyojwad,

I was only able to get back to the demultiplex problem now. Unfortunately, I did not have better results than those I described previously in recent attempts to demultiplex this sequencing. If you could give me some guidance on how to use the Dorado demultiplexer with this data I would be very grateful.

Here is a dropbox link with our pilot data with high quality base calls to perform the demultiplex and a file with the barcodes we used: https://www.dropbox.com/scl/fo/nqunrbfhyiwnr62991nir/h?rlkey=jqg9ndmiw3uul4kutry8kczaa&dl=0

If you have any access problems, just let me know.

tijyojwad commented 8 months ago

I was able to get classifications with this setting

arrangement.toml

[arrangement]
name = "test_github_kit"
first_index = 1
last_index = 10
kit = "BC"
mask1_front = "GACCACCGAGATCAC"
mask1_rear = "ACACTCTTTCCC"
mask2_front = "GACGGCATACGAGAT"
mask2_rear = "GTGACTGGAGTTCAGA"
barcode1_pattern = "F%02i"
barcode2_pattern = "B%02i"

sequences.fasta

>F01
CTTCAGCATA
>F02
TCTAATCGCA
>F03
TGATAATGCG
>F04
ACAGTACCAA
>F05
TTAGGTGAAG
>F06
AACAGAGGAT
>F07
CAACATACTC
>F08
GCCAAGTCTA
>F09
ACGTCTATCT
>F10
ATCTCCTCAA
>B01
TCCTTCGGTT
>B02
TCGAGTACTT
>B03
TTGGTTGAGC
>B04
GTGCTCATAA
>B05
CTGTGAAGCG
>B06
TATGTCCAAC
>B07
CAGAGGATTA
>B08
TGTTCTGCCT
>B09
CACGCGCTTA
>B10
ATTGCTTCGG

gearhq commented 8 months ago

Hi @tijyojwad ,

With the configuration you provided, I was able to demultiplex by barcode. Now I have some questions.

So for Dorado to identify the barcodes, am I required to use this naming pattern for my barcodes in my fasta file? Thus how I supposed to map barcodes to sample names in the sample spreadsheet? Could you please provide an example?

Thank you!

andreott commented 8 months ago

Hi, I think this is the right discussion for me to join as I have a similar application.

If we have paired indices, will it only consider the combination of matching barcodes, so the first barcode_1 with the first barcode_2 and so on or all combinations?

gearhq commented 8 months ago

Hi @tijyojwad,

I am using this sample sheet: experiment_id,kit,flow_cell_id,alias,barcode test,BC,FAW74194,sample3656,barcode01 test,BC,FAW74194,sample3665,barcode02 test,BC,FAW74194,sample3686,barcode03 test,BC,FAW74194,sample3745,barcode04 test,BC,FAW74194,sample3781,barcode05 test,BC,FAW74194,sample3792,barcode06 test,BC,FAW74194,sample3796,barcode07 test,BC,FAW74194,sample4072,barcode08 test,BC,FAW74194,sample3813,barcode09 test,BC,FAW74194,sample6257,barcode10

Then, at my results folder, the file names still "test_github_kit_barcode01.fastq", "test_github_kit_barcode02.fastq", (...), instead of "test_github_kit_sample3656.fastq", "test_github_kit_sample3665.fastq", (...). Could you figure out what am I doing wrong?

tijyojwad commented 8 months ago

Hi @gearhq - this is a subtlety that I don't think we've mentioned in our docs (or have proper checks for) so I made a mistake. The kit column in the sample sheets corresponds to the name entry in the CustomBarcodes file... and that name cannot have _. I'll work on getting those ironed out for the next release.

So if you update your toml to

[arrangement]
name = "test-github-kit"
first_index = 1
last_index = 10
kit = "BC"
mask1_front = "GACCACCGAGATCAC"
mask1_rear = "ACACTCTTTCCC"
mask2_front = "GACGGCATACGAGAT"
mask2_rear = "GTGACTGGAGTTCAGA"
barcode1_pattern = "F%02i"
barcode2_pattern = "B%02i"

and your sample sheets to

experiment_id,kit,flow_cell_id,alias,barcode
test,test-github-kit,FAW74194,sample3656,barcode01
test,test-github-kit,FAW74194,sample3665,barcode02
test,test-github-kit,FAW74194,sample3686,barcode03
test,test-github-kit,FAW74194,sample3745,barcode04
test,test-github-kit,FAW74194,sample3781,barcode05
test,test-github-kit,FAW74194,sample3792,barcode06
test,test-github-kit,FAW74194,sample3796,barcode07
test,test-github-kit,FAW74194,sample4072,barcode08
test,test-github-kit,FAW74194,sample3813,barcode09
test,test-github-kit,FAW74194,sample6257,barcode10

it'll work

tijyojwad commented 8 months ago

Hi, I think this is the right discussion for me to join as I have a similar application.

If we have paired indices, will it only consider the combination of matching barcodes, so the first barcode_1 with the first barcode_2 and so on or all combinations?

Hi @andreott - it will check pairwise (i.e. first barcode_1 with first barcode_2)

andreott commented 8 months ago

Hi, I think this is the right discussion for me to join as I have a similar application. If we have paired indices, will it only consider the combination of matching barcodes, so the first barcode_1 with the first barcode_2 and so on or all combinations?

Hi @andreott - it will check pairwise (i.e. first barcode_1 with first barcode_2)

Thanks! So if we have non unique adapters (same p5 index combined with multiple p7), it is no problem to have repeated barcode sequences with different ids, right?

However it seems like it will not work in that case. If I provide a second occurrence of the same p5 barcode with a different p7, the first combination is not detected/reported anymore.

tijyojwad commented 8 months ago

the way the algorithm works currently, having repeated sequences will cause the demux algorithm to think that 2 different barcodes have a confident hit for the read (since we check for at least barcode_1 or barcode_2 to be found for each barcode pair). That will then cause the read to be unclassified.

This particular use case has come up a couple of times now, so I think I will add support for it.

nanoporetech / dorado

Provide more details about custom barcodes with unique Illumina barcodes for the start and end of the sequence #559