Closed mmokrejs closed 10 months ago
Can you provide the reads in the form of text instead of a picture? Then I can do some debugging on my side as well.
By the way, there are two lines that are numbered 27.
Is this enough? Basically you can chop the count values and drop lowercased letters to make the testcases.
24 GCCCCtCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
24 gcaacgTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
27 gcaACGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
27 GCAaCGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
29 ccgtgGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
29 gcaacGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
34 ccgtGGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
35 CCGTGGgTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
42 gcccctCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
44 taagtAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
45 gcaaCGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
51 ccgTGGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
57 taagTAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
63 gcAACGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
66 gcccCTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
70 nnnnnnGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
84 nnnnnnnAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
104 nnnnnnnTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
115 ccGTGGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
122 taAGTAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
122 gcCCCTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
143 gCAACGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
145 GCCCcTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
150 gccCCTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
150 taaGTAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
176 gCCCCTCTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
195 cCGTGGGTGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
228 nnnnnnnnGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
256 tAAGTAGAGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
Yeah, I realized too late there are two cases with 27
, but we understand each other which lines I meant.
The incomplete PCR products are also on the 3'-end of the top-strand, resulting in incomplete barcode at the very beginning of _R2
read (17 seconds PCR amplification time of ~455 nt PCR product, Kapa High Fidelity polymerase). So I think my theory with incomplete PCR products is correct albeit not explaining all issues.
I think I know what the source of the confusion is. In the following line,
27 GCAaCGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
the actual read looks as follows:
GCACGTGGTACAAGTGCTAGCCATATGGGTTGCCCTTTTG
When we select the first 8 nucleotides, we get,
GCACGTGG
which is at edit distance 2 of barcode GCAACGTG
(JZ75_WUMN_F).
Indeed, the following test works as expected:
test.fa
:
> read
GCACGTGGTACAAGT
test.tsv
:
JZ75_WUMN_F GCAACGTG
Command:
demultiplex demux -d -m 2 -e 8 test.tsv test.fa
Output:
$ ls
test.fa test_JZ75_WUMN_F.fa test.tsv
So maybe I was not clear. I added the lower-cased letters to rescue the sequence to the deemed error-free original and to emphasize which nucleotide is missing, hence the picture with letters crossed-though.
My point is that these sequences are from UNKNOWN file. I thought I enabled Levenshtein distance and I thought one DELetion event results in distance 1.
I thought I enabled Levenshtein distance and I thought one DELetion event results in distance 1.
Yes, that is the point I am trying to make, because the barcode is missing 1 nucleotide, but we select 8 nucleotides from the read, the resulting candidate barcode contains an additional nucleotide from the part you highlighted in red. Hence we need an edit distance of 2 to correct for one deletion.
We could also select 7 nucleotides from the read, then an edit distance of 1 will suffice, but since we do not know the number of deletions beforehand, this is not a generic solution.
Aha, now I get your point from the first paragraph. I wondered whether you grab also 8
chars plus up to N
chars on each side to accommodate for a series of N
InDels. That is how I would expect it to behave.
The algorithm first selects the number of given nucleotides and then tests that selection against all barcodes.
You seem to be expecting some sort of asymmetric alignment e.g., the semi-global alignment. I am not sure if this will work but I can give it some thought.
Thank you. That is why I did not bother tostitch my own code to handle the Levenshtein distance as it was clear I would need to work with a bit wider sequence to accommodate for the errors. Would be really helpful.
BTW: Seems in my data the REV (_R2.starstwith(rev_bc)
) barcode is in the same orientation also in the very beginning of _R1
read so I will need to check for it too (_R1.startswith(rev_bc)
), it is not even reverse-complemented there, it is just exactly same sequence with its follow-up 3'-sequence. How that happened I do not know yet. It feels ATM this will explain the other 50% of my data as the counts of cases are huge.
Edit: Indeed, these are Kapa High Fidelity polymerase ~455nt PCR products, so they have blunt ends due to its proof-reading activity.
Strange.
Perhaps you can simply merge the barcodes file and use the same file to demultiplex the first and second (UNKNOWN) round.
I will close this issue for now. Feel free to reopen it if there are any other related issues.
Hi, when analyzing why 55% of the barcodes in PCR-amplicons are not recognized with
demultiplex demux -d -m 1 -r -e 8 A.tsv foo_R1.fastq.gz foo_R2.fastq.gz
I see an algorithmic bug. The PCR-amplicons were possibly obtained during too short PCR amplification time, IMO. I will need to check that with wet lab. Edit: The extension time was 17 seconds, Kapa High Fidelity polymerase, PCR product size ~455 nt.Nevertheless, at least the leading DELetion should have been caught by
-d -m 1 -r -e 8
. There are 3 1-nt spanning DELetions inside of the barcode too, which should have been caught by the Levenshtein distance (see rows with counts27
,35
,145
below).BTW, by search for a REV barcode in R2 read can rescue 5% of the data, but still 50% of reads are left in
UNKNOWN_UNKNOWN
group. I feel the PCR products will be in the other half of the data rotated incl. the barcodes or something like that. Need to dig into that still. Edit: Indeed, the Kapa High Fidelity polymerase is a proof-reading polymerase so the ~455 PCR products have blunt ends, hence the sequencing library inserts are in both orientations. That supposedly explains the remaining 50% of the cases.