--estimate-polyA doesn't find tails with custom sequencing kit

ArnaudStigliani commented 10 months ago

Dear developers, We have developed our own Nanopore protocol which involves custom barcode sequences. They are longer and differ from the Oxford Nanopore official barcodes. We are starting to use Dorado and the possibility to find polyA tails is a feature in which we are very interested in. Until now, we used tailfindR but it is really slow compared to Dorado. The problem is that tailfindR and Dorado outputs differ a lot (much more than what other users observed): basically, Dorado finds a 100 tails while tailfindR finds 10^6 in a single library.

From my understanding of Dorado algorithm, it uses the sequencing adapters as boundaries to find the end of the reads, but I don't know how the barcodes are taken into account. Is it possible that our custom barcodes clash with the dorado polyA tail estimate algorithm ? And is there an easy fix for that by changing some lines in the PolyACalculator.cpp code (or something else) ? Thank you !

Best, Arnaud

smaegol commented 10 months ago

I have the same observation. Dorado outputs much less poly(A) tails than tailfindr or nanopolish. What's more, for cDNA libraries I observed that for standard SQK-PCS111 the numbers are no even as bad, but for barcoded libraries (SQK-PCB111-24) it finds poly(A) in no more than 10% of reads. So I agree that probably the initial step of looking for adaptors is problematic here. As dorado is able to locate adapters/barcodes itself (for trimming), could this be used for location of poly(A) tails? From what I see in the PolyACalculator.cpp, for poly(A) estimation the separate function using edlib and predefined adapter sequences is used https://github.com/nanoporetech/dorado/blob/a7fb3e3d4afa7a11cb52422e7eecb1a2cdb7860f/dorado/read_pipeline/PolyACalculator.cpp#L214-L240

best, Pawel

tijyojwad commented 10 months ago

Hi @ArnaudStigliani and @smaegol - as you've rightly pointed out, dorado only uses specific adapter sequences right now to look polyA locations. In addition, polyA detection with barcodes is also not optimized. Both are something we plan to work on, and it's good to see there's interest in them!

We can certainly use additional adapters to look for the polyA signal, and also provide support for custom adapters (that's our plan at least). We won't be able to have this available for a few weeks at least though.

@smaegol As a workaround if you are able to build locally you can try to change the adapter sequences in that file. That should improve things. Similarly for @ArnaudStigliani , if you have the barcode flank sequences that can be used as the adapter strings and the algorithm should work (although you might need to increase the window size to something larger than 150).

ArnaudStigliani commented 10 months ago

Hi @tijyojwad, Thanks a lot for your quick answer ! I will try it out when I have time. And I am looking forward to the next updates ;). Thanks also @smaegol for your input. Best, Arnaud

MustafaElshani commented 10 months ago

Currently I have a few hundred millions of reads that I have done with SQK-PCB111-24, I have tailfindr and dorado PolyA tail lengths for them, trying to analyse them.

Unlike @smaegol I am getting poly A tails with dorado in 98%+ reads.

There seems to be a lot of 400+ Poly A/T tail lengths from dorado ePolyA which seems suspicious. Also there seem to be subset of near 0 poly A tail while tailfinder is into the 0-130 There also appears to be a subset of data where Tailfindr reports tail lengths around 100, yet dorado records significantly higher lengths, 300+

Attached is preliminary subset of data plot, comparing the tailfindr and dorado,matched by their read_id

Is there any QC on dorado in terms of validating polyA tail reads site part ie if it is internal homopolymer or premature termination of reads.

tijyojwad commented 8 months ago

Hi @MustafaElshani - apologies for the super delayed response. Would you be willing to share a subset of your pod5 data that show the overestimated tails? Is this R9 or R10 data?

smaegol commented 8 months ago

Hi,

@MustafaElshani @tijyojwad

I've found a reason why I'm missing most of the reads on barcoded libraries. Although it's stated that adapter/primer/barcode trimming is automatically disabled for DNA when --estimate-poly-a is switched on, in fact it's not true. When using --kit-name option remember to add --no-trim to get poly(A) results.

Dear developers, can you fix this in the next release please so trimming is automatically switched-off when performing poly(A) estimation?

HalfPhoton commented 8 months ago

@smaegol, I'll raise this internally and reply with a decision on this.

tijyojwad commented 8 months ago

good catch @smaegol ! we are turning off primer and adapter trimming but didn't account for barcodes. Will get that fixed in the next release!

tijyojwad commented 8 months ago

@ArnaudStigliani @MustafaElshani - are you able to redo your runs with --no-trim added as well to see if that addresses your issues?

MustafaElshani commented 8 months ago

Hi @tijyojwad

The samples were run with '--no-trim'. These were samples run on R9 flowcells saved as fast5. I converted these with 'pod5 convert fast5 ' using using 'pod5'.

After converting to pod5 exact script used were

# Run dorado basecaller for the current batch
../../dorado-0.5.0-linux-x64/bin/dorado basecaller \
  ../../dorado-0.5.0-linux-x64/models/dna_r9.4.1_e8_sup@v3.6 \
  ./dorado/pod5/ \
  --sample-sheet ./samples.csv \
  --kit-name SQK-PCB111-24 \
  --no-trim \
  --estimate-poly-a \
  --min-qscore 7 \
  --device cuda:all \
  --emit-moves \
  > ./dorado/basecalled/sample.bam

echo "Finished Basecalling"
echo "Starting demux"

# Demultiplexing step
../../dorado-0.5.0-linux-x64/bin/dorado demux \
  --sample-sheet ./samples.csv \
  --output-dir ./dorado/basecalled/ \
  --no-trim \
  --no-classify \
  ./dorado/basecalled/sample.bam

Then I followed by extracting the read_id and its 'pt:i' and compare it to tailfindr 'tail_length' for the same read_id Is it worth running with version 0.5.3? Could the conversion from fast5 cause an issue?

MustafaElshani commented 8 months ago

Below another seperate dataset comparing tailfindr with dorado polyA tail

tijyojwad commented 8 months ago

Hi @MustafaElshani - rerunning with 0.5.3 shouldn't be required for your case, and I doubt the conversion of fast5 to pod5 is causing this issue. Would you be open to sharing a subset of reads, especially ones where the lengths differ a lot between tailfindr and dorado?

MustafaElshani commented 7 months ago

Hi @tijyojwad

Here I have included couple subsets of .pod5 for poly(A) >500 in dorado, and poly(A) 10 in dorado but >100 in tailfindr.

Appreciate you looking at this. Please do let me know if any thing further is required

tijyojwad commented 7 months ago

thanks you! I'll have a look at this right away

MustafaElshani commented 6 months ago

thanks you! I'll have a look at this right away

Hi @tijyojwad Any further updates on this?

Mustafa

tijyojwad commented 6 months ago

Hi @MustafaElshani - sorry I forgot to get back to you on this thread! I was able to reproduce what you're seeing and I've been able to find some improvements that should help. Those should be available in the next release!

ArnaudStigliani commented 4 months ago

Hi everyone, I apologize for coming back to this so late, I eventually found the time to look into it.

First, I tried last Dorado release, as one can feed their own primer sequences. Unfortunately, it did not succeed in finding any tails.

So I came back to the previous release (0.5.1+a7fb3e3d+dirty) as I could understand the code better. My apologies to the developper, I am aware that they put a lot of effort into this, I just wished to share somthing that I found interesting and that is related to @MustafaElshani 's plot (that the linearity kind of stops when tailfindR tails are > 250nt, and dorado finds suddenly extremely long tails).

In our protocol, we demultiplex libraries on the SSP side (we have protocol that doesn't use SSP, so instead, we add a demultiplexing barcode).

So I basically

put our custom VNP sequence
removed everything related to SSP (commented in the code)
changed few other distance thresholds

SignalAnchorInfo determine_signal_anchor_and_strand_cdna(const dorado::SimplexRead& read) {
    // const std::string SSP = "TTTCTGTTGGTGCTGATATTGCTTT";
    // const std::string SSP_rc = dorado::utils::reverse_complement(SSP);
    // const std::string VNP = "ACTTGCCTGTCGCTCTATCTTCAGAGGAGAGTCCGCCGCCCGCAAGTTTT";
    const std::string VNP = "GACATACCGTGAATCGCTATCCTATGTTCATCCGCACAAAGACACCGACAACTTTCTT";
    const std::string VNP_rc = dorado::utils::reverse_complement(VNP);
    int trailing_Ts = dorado::utils::count_trailing_chars(VNP, 'T');

    const int kWindowSize = 200; // increase window size to 200
    std::string read_top = read.read_common.seq.substr(0, kWindowSize);
    // spdlog::debug("read_top {}", read_top);

    auto bottom_start = std::max(0, (int)read.read_common.seq.length() - kWindowSize);
    std::string read_bottom = read.read_common.seq.substr(bottom_start, kWindowSize);
    // spdlog::debug("read_bottom {}", read_bottom);

    EdlibAlignConfig align_config = edlibDefaultAlignConfig();
    align_config.task = EDLIB_TASK_LOC;
    align_config.mode = EDLIB_MODE_HW;

    // Check for forward strand, let's only use VNP
    // EdlibAlignResult top_v1 = edlibAlign(SSP.data(), int(SSP.length()), read_top.data(),
                                         int(read_top.length()), align_config); // We don't need this 
    EdlibAlignResult bottom_v1 = edlibAlign(VNP_rc.data(), int(VNP_rc.length()), read_bottom.data(),
                                            int(read_bottom.length()), align_config);

    // int dist_v1 = top_v1.editDistance + bottom_v1.editDistance; 
    int dist_v1 =  bottom_v1.editDistance; //only keep the distance related to VNP
    // Check for reverse strand, again, let's forget about SSP
    EdlibAlignResult top_v2 = edlibAlign(VNP.data(), int(VNP.length()), read_top.data(),
                                         int(read_top.length()), align_config);
    // EdlibAlignResult bottom_v2 = edlibAlign(SSP_rc.data(), int(SSP_rc.length()), read_bottom.data(),
                                            int(read_bottom.length()), align_config);

    int dist_v2 = top_v2.editDistance; // only keep the distance to VNP and not SSP
    // int dist_v2 = top_v2.editDistance + bottom_v2.editDistance;
    // spdlog::trace("v1 dist {}, v2 dist {}", dist_v1, dist_v2);
    spdlog::debug("top_v1 {}, bottom_v1 {}, top_v2 {}, bottom_v2 {}", top_v1.editDistance, bottom_v1.editDistance, top_v2.editDistance, bottom_v2.editDistance);

    bool fwd = dist_v1 < dist_v2;
    // bool proceed = std::min(dist_v1, dist_v2) < 30 && std::abs(dist_v1 - dist_v2) > 10;
    bool proceed = std::min(dist_v1, dist_v2) < 7; // reduce the distance as we wan't to make sure that we have the right strand and we only have one adapter
    SignalAnchorInfo result = {false, -1, trailing_Ts};

After running it, it appears that it works pretty well (see image below) !

However, after some investigation, it appears that when the reverse strand is sequenced (see below)

5' ---- ADAPTER ---- FRONT_PRIMER ---- cDNA ---- poly(A) ---- RC(REAR_PRIMER) ---- 3'

We can see that Dorado estimates a lot of long tails on the reverse strand (in green on the image) while this doesn't happen on the forward strand.

I find it weird that this only happens on one side and not the other. Hope this was worth mentioning.

Note:

Maybe there is a a reason why I can't make the last Dorado release find tails ? Below is my config file.

[anchors]
front_primer = ""
rear_primer = "GACATACCGTGAATCGCTATCCTATGTTCATCCGCACAAAGACACCGACAACTTTCTT"
plasmid_front_flank = ""
plasmid_rear_flank = ""

[threshold]
flank_threshold = 0.6

[tail]
tail_interrupt_length = 0

Arnaud

ArnaudStigliani commented 4 months ago

Hi everyone, As a follow up on this, it seems that for aforementioned dots (where the pA is longer for Dorado than for tailfindR), Dorado offers a more accurate estimate than tailfindR (see link below).

https://github.com/adnaniazi/tailfindr/issues/71#issuecomment-2163141524

Best, Arnaud

nanoporetech / dorado

--estimate-polyA doesn't find tails with custom sequencing kit #562