jfjlaros / demultiplex

Versatile FASTA/FASTQ demultiplexer.
MIT License
32 stars 5 forks source link

Demultiplex v1.2.2 misses barcodes starting with a (series of) DELetions aka supposedly incompletely amplified PCR products #33

Closed mmokrejs closed 10 months ago

mmokrejs commented 10 months ago

Hi, when analyzing why 55% of the barcodes in PCR-amplicons are not recognized with demultiplex demux -d -m 1 -r -e 8 A.tsv foo_R1.fastq.gz foo_R2.fastq.gz I see an algorithmic bug. The PCR-amplicons were possibly obtained during too short PCR amplification time, IMO. I will need to check that with wet lab. Edit: The extension time was 17 seconds, Kapa High Fidelity polymerase, PCR product size ~455 nt.

Nevertheless, at least the leading DELetion should have been caught by -d -m 1 -r -e 8. There are 3 1-nt spanning DELetions inside of the barcode too, which should have been caught by the Levenshtein distance (see rows with counts 27, 35, 145 below).

leading_DELetions_in_fwd_barcodes

$ cat A.tsv
fwd_barcode_name fwd_barcode_sequence
JZ71_WU_F GCCCCTCT
JZ73_WUA_F CCGTGGGT
JZ74_WUME_F TAAGTAGA
JZ75_WUMN_F GCAACGTG
$

BTW, by search for a REV barcode in R2 read can rescue 5% of the data, but still 50% of reads are left in UNKNOWN_UNKNOWN group. I feel the PCR products will be in the other half of the data rotated incl. the barcodes or something like that. Need to dig into that still. Edit: Indeed, the Kapa High Fidelity polymerase is a proof-reading polymerase so the ~455 PCR products have blunt ends, hence the sequencing library inserts are in both orientations. That supposedly explains the remaining 50% of the cases.

jfjlaros commented 10 months ago

Can you provide the reads in the form of text instead of a picture? Then I can do some debugging on my side as well.

By the way, there are two lines that are numbered 27.

mmokrejs commented 10 months ago

Is this enough? Basically you can chop the count values and drop lowercased letters to make the testcases.

     24 GCCCCtCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     24 gcaacgTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     27 gcaACGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     27 GCAaCGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     29 ccgtgGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     29 gcaacGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     34 ccgtGGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     35 CCGTGGgTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     42 gcccctCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     44 taagtAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     45 gcaaCGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     51 ccgTGGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     57 taagTAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     63 gcAACGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     66 gcccCTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     70 nnnnnnGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
     84 nnnnnnnAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    104 nnnnnnnTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    115 ccGTGGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    122 taAGTAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    122 gcCCCTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    143 gCAACGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    145 GCCCcTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    150 gccCCTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    150 taaGTAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    176 gCCCCTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    195 cCGTGGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    228 nnnnnnnnGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
    256 tAAGTAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG

Yeah, I realized too late there are two cases with 27 , but we understand each other which lines I meant.

mmokrejs commented 10 months ago

The incomplete PCR products are also on the 3'-end of the top-strand, resulting in incomplete barcode at the very beginning of _R2 read (17 seconds PCR amplification time of ~455 nt PCR product, Kapa High Fidelity polymerase). So I think my theory with incomplete PCR products is correct albeit not explaining all issues.

jfjlaros commented 10 months ago

I think I know what the source of the confusion is. In the following line,

27 GCAaCGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG

the actual read looks as follows:

GCACGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG

When we select the first 8 nucleotides, we get,

 GCACGTGG

which is at edit distance 2 of barcode GCAACGTG (JZ75_WUMN_F).


Indeed, the following test works as expected:

test.fa:

> read
GCACGTGGTACAAGT

test.tsv:

JZ75_WUMN_F GCAACGTG

Command:

demultiplex demux -d -m 2 -e 8 test.tsv test.fa

Output:

$ ls
test.fa  test_JZ75_WUMN_F.fa  test.tsv
mmokrejs commented 10 months ago

So maybe I was not clear. I added the lower-cased letters to rescue the sequence to the deemed error-free original and to emphasize which nucleotide is missing, hence the picture with letters crossed-though.

My point is that these sequences are from UNKNOWN file. I thought I enabled Levenshtein distance and I thought one DELetion event results in distance 1.

jfjlaros commented 10 months ago

I thought I enabled Levenshtein distance and I thought one DELetion event results in distance 1.

Yes, that is the point I am trying to make, because the barcode is missing 1 nucleotide, but we select 8 nucleotides from the read, the resulting candidate barcode contains an additional nucleotide from the part you highlighted in red. Hence we need an edit distance of 2 to correct for one deletion.

We could also select 7 nucleotides from the read, then an edit distance of 1 will suffice, but since we do not know the number of deletions beforehand, this is not a generic solution.

mmokrejs commented 10 months ago

Aha, now I get your point from the first paragraph. I wondered whether you grab also 8 chars plus up to N chars on each side to accommodate for a series of N InDels. That is how I would expect it to behave.

jfjlaros commented 10 months ago

The algorithm first selects the number of given nucleotides and then tests that selection against all barcodes.

You seem to be expecting some sort of asymmetric alignment e.g., the semi-global alignment. I am not sure if this will work but I can give it some thought.

mmokrejs commented 10 months ago

Thank you. That is why I did not bother tostitch my own code to handle the Levenshtein distance as it was clear I would need to work with a bit wider sequence to accommodate for the errors. Would be really helpful.

BTW: Seems in my data the REV (_R2.starstwith(rev_bc)) barcode is in the same orientation also in the very beginning of _R1 read so I will need to check for it too (_R1.startswith(rev_bc)), it is not even reverse-complemented there, it is just exactly same sequence with its follow-up 3'-sequence. How that happened I do not know yet. It feels ATM this will explain the other 50% of my data as the counts of cases are huge. Edit: Indeed, these are Kapa High Fidelity polymerase ~455nt PCR products, so they have blunt ends due to its proof-reading activity.

jfjlaros commented 10 months ago

Strange.

Perhaps you can simply merge the barcodes file and use the same file to demultiplex the first and second (UNKNOWN) round.

jfjlaros commented 10 months ago

I will close this issue for now. Feel free to reopen it if there are any other related issues.