Closed mmokrejs closed 8 months ago
If I understand you correctly, you have a dataset in which dual barcoding is used. One barcode resides in the header of R1, the other in the header of R2. If this is the case, then a two step approach, as described here should be followed.
I am a bit confused about the large amount of reads that end up in the UNKNOWN files. This would mean that not all possible barcodes for R1 were provided. If the approach is more convoluted than simple dual barcoding (e.g., if you are indeed interested in the barcodes of R1 or R2, instead of R1 and R2), then a similar approach as the one above could be followed, with the addition that the UNKNOWN files from the first stage should also serve as input for the second stage.
If I understand you correctly, you have a dataset in which dual barcoding is used. One barcode resides in the header of R1, the other in the header of R2. If this is the case, then a two step approach, as described here should be followed.
No, I want R1.seq.startswith('fwd_barcode1')
or R2.seq.startswith('rev_barcode2')
to happen, ideally at once. The FASTA/Q header contain the Illumina barcodes which practically say nothing (this is an amplicon project and the amplicon products were scattered over 5 flowcells and 1 or 2 lanes on each to dillute them enough with samples of other customers), I am after the custom barcodes which are the very begining of the sequences.
I am a bit confused about the large amount of reads that end up in the UNKNOWN files. This would mean that not all possible barcodes for R1 were provided. If the approach is more convoluted than simple dual barcoding (e.g., if you are indeed interested in the barcodes of R1 or R2, instead of R1 and R2), then a similar approach as the one above could be followed, with the addition that the UNKNOWN files from the first stage should also serve as input for the second stage.
I want OR
in R1.seq.startswith('fwd_barcode1') or R2.seq.startswith('rev_barcode2')
. I am running it in a new attempt now as two commands but then I will basically throw away the just created FASTQs after I extract the readnames from each, concatenate both readnames.txt
files, sort
, uniq` and ask an external tool to extract those read-pairs from the original FASTQ file.
I am also worried that so many reads are in the UNKNOWN
group, haven't realized any explanation yet.
And as I wrote, I am getting lost in the https://demultiplex.readthedocs.io/en/latest/usage.html#multiple-barcodes section. Please try to rewrite it. Ideally, accept on the commandline A.tsv
and B.tsv
simultaneously.
And use read_R1.fq, read_R2.fq
instead of read_1_1.fq, read_2_1.fq
as the startup example. I am lost in which _"$num"
is which. And include cat A.tsv
and cat B.tsv
contents, then I could probably figure out what you mean.
I am running it in a new attempt now as two commands but then I will basically throw away the just created FASTQs after I extract the readnames from each, concatenate both readnames.txt files, sort, uniq` and ask an external tool to extract those read-pairs from the original FASTQ file.
Obviously this was naive approach, some reads get assigned to two samples because each of the barcodes points to a different sample. demultiplex
could resolve the conflicts and prefer hints either of the two barcodes which hopefully was error-free or had simply lower distance from its original barcode than the other one.
I want OR in
R1.seq.startswith('fwd_barcode1')
orR2.seq.startswith('rev_barcode2')
.
I see, this is indeed different from the procedure described in the documentation.
Ideally, accept on the commandline
A.tsv
andB.tsv
simultaneously.
There are too many combinations to capture in a CLI. That is why we suggest multi step approaches when multiple barcodes are involved.
Anyway, what about the following idea?
demultiplex demux -r -e 8 A.tsv foo_R1.fastq.gz foo_R2.fastq
After this step, all reads that could be demultiplexed with the barcodes in R1 are now done.
demultiplex demux -r -e 8 B.tsv foo_R2_UNKNOWN.fastq.gz foo_R1_UNKNOWN.fastq
Note that the order of R1 and R2 have been reversed, the first file in the list is used for demultiplexing, the rest of the files follow the first one.
Ideally, accept on the commandline
A.tsv
andB.tsv
simultaneously.There are too many combinations to capture in a CLI. That is why we suggest multi step approaches when multiple barcodes are involved.
You can accept 4-column TSV then. I don't think either is too complicated.
Anyway, what about the following idea?
1. Demultiplex using the barcode in R1 first. E.g., `demultiplex demux -r -e 8 A.tsv foo_R1.fastq.gz foo_R2.fastq`
After this step, all reads that could be demultiplexed with the barcodes in R1 are now done.
2. Demultiplex the UNKNOWN files using the barcode in R2 next, E.g., `demultiplex demux -r -e 8 B.tsv foo_R2_UNKNOWN.fastq.gz foo_R1_UNKNOWN.fastq`
Note that the order of R1 and R2 have been reversed, the first file in the list is used for demultiplexing, the rest of the files follow the first one.
I get the point. The "problem" is that this solution does not resolve barcode conflicts.
demultiplex
could prefer the lower distance, it has the values at hand while parsing. That is something I cannot do as an addon. And if the distances are same, let's throw the read-pair into UNKNOWN
.
3. Optionally, you could concatenate the appropriate files from the first step with those from the second step.
Well, the https://github.com/jfjlaros/demultiplex/issues/32#issuecomment-1858016790 approach yields reads assigned to two samples occassionally while your approach is indeed different and at elast ensures a read-pair ends up in just a single sample file.
You can accept 4-column TSV then. I don't think either is too complicated.
That is not the problem. It gets complex when the first barcode is in the first part of the header of R1, the second one is in the read of R2, the third one is in the second part of R3, etc. And then there are the logical operations, e.g., we want either R1 and R2 to match or R3, etc.
The "problem" is that this solution does not resolve barcode conflicts.
If there are barcode conflicts, then the set of barcodes was not designed properly. All barcode pairs should at least have a distance of 3 in order to correct one sequencing error and a distance of 5 to correct two errors, etc. In this case, my advice would be to allow for 0 errors.
I have the feeling I still not fully understand the problem.
The barcodes were designed (not by me) to have a Hamming distance 5 but this is not enough because InDels occur in real. Cannot demultiplex
really consider the distance values of each barcode while parsing both R1 and R2 input stream concurrently? That would be a tremendous help. For example, let's say fwd barcode might have some error while rev-barcode will could be error-free.
this is not enough because InDels occur in real
Is this Illumina data? This contains very few indels in general.
Cannot
demultiplex
really consider the distance values of each barcode while parsing both R1 and R2 input stream con-currently?
As explained above, I cannot see how that would result in a usable CLI as all the options should be duplicated for each input file.
fwd barcode might have some error while rev-barcode will could be error-free
This is why I suggested to demultiplex the UNKNOWN files from the first round. If no forward barcodes could be assigned, then we try again with the reverse barcodes. You may have to merge the files associated with the forward and reverse barcode pairs afterwards, depending on your downstream analysis.
I am closing this issue for now. Please feel free to reopen it if there are additional questions.
Hi, I tried to demultiplex at once both R1 and R2 files but then after checking that R2 barcodes are NOT in the UNKNOWN files it turned out they are. It seems
demultiplex
throws the who read -pair into UNKNOWN if R1 barcode is not found, possibly not even checking for barcode in R2.What I am after is to get into UNKNOWN files only read for which neither R1 barcode nor R2 barcode have been found. I assume this should be a runtime option. Some users may be more happy with only reads which have readable both R1 and R2 barcodes, but that is not my case. Either R1 barcode or R2 barcode detection is enough to assign the sample.
The documentation at https://demultiplex.readthedocs.io/en/latest/usage.html is somewhat difficult to grasp. I had to figure out I should follow "Other files" section with
-r
argument for other platform, to force search for the barcode in the read sequence itself, not in just the FASTA/Q header.Further, I failed to understand the "Multiple barcodes" section, seems I cannot provide in the CSV/TSV file for barcode1 and barcode2 barcodes for the R1 and R2 files at once.
So in the end my above problem stems from the fact that I did not provide the reverse read R2 barcodes to
demultiplex
at all. My bad. But is there a way for that at all?Also, for me it is more convenient to name the output files:
or at least
so in brief, not renaming them using the barcode names parsed from the CSV/TSV file(s) would be better.
Also to say, it is maybe not even written in the docs that the barcode sequences in the "other" mode are searched at the very beginning of the particular read, assumingly
record.seq.startswith(some_barcode_in_verbatim_as_provided_in_TSV_file)
. Fortunately the behavior matches my expectations. However, please clarify that in the docs.Thank you.