jfjlaros / demultiplex

Versatile FASTA/FASTQ demultiplexer.
MIT License
32 stars 5 forks source link

Demultiplex cannot separate read-pairs by R1 barcode and R2 barcode simultaneously if at least either of the two is found #32

Closed mmokrejs closed 8 months ago

mmokrejs commented 10 months ago

Hi, I tried to demultiplex at once both R1 and R2 files but then after checking that R2 barcodes are NOT in the UNKNOWN files it turned out they are. It seems demultiplex throws the who read -pair into UNKNOWN if R1 barcode is not found, possibly not even checking for barcode in R2.

What I am after is to get into UNKNOWN files only read for which neither R1 barcode nor R2 barcode have been found. I assume this should be a runtime option. Some users may be more happy with only reads which have readable both R1 and R2 barcodes, but that is not my case. Either R1 barcode or R2 barcode detection is enough to assign the sample.

demultiplex demux -d -m 1 -r -e 8 two-column-barcode-names-barcode-seqs.tsv ../foo_R1.fastq.gz ../foo_R2.fastq.gz

The documentation at https://demultiplex.readthedocs.io/en/latest/usage.html is somewhat difficult to grasp. I had to figure out I should follow "Other files" section with -r argument for other platform, to force search for the barcode in the read sequence itself, not in just the FASTA/Q header.

Further, I failed to understand the "Multiple barcodes" section, seems I cannot provide in the CSV/TSV file for barcode1 and barcode2 barcodes for the R1 and R2 files at once.

# fwd_barcode_names fwd_barcode_sequences rev_barcode_names rev_barcode_sequences
barcode1 ATCG barcode2 TGCA

So in the end my above problem stems from the fact that I did not provide the reverse read R2 barcodes to demultiplex at all. My bad. But is there a way for that at all?

Also, for me it is more convenient to name the output files:

foo_"$fwd_barcode_sequence"+"$rev_barcode_sequence"_R1.fastq.gz
foo_"$fwd_barcode_sequence"+"$rev_barcode_sequence"_R2.fastq.gz

or at least

foo_"$fwd_barcode_sequence"_R1.fastq.gz
foo_"$rev_barcode_sequence"_R2.fastq.gz

so in brief, not renaming them using the barcode names parsed from the CSV/TSV file(s) would be better.

Also to say, it is maybe not even written in the docs that the barcode sequences in the "other" mode are searched at the very beginning of the particular read, assumingly record.seq.startswith(some_barcode_in_verbatim_as_provided_in_TSV_file). Fortunately the behavior matches my expectations. However, please clarify that in the docs.

Thank you.

jfjlaros commented 10 months ago

If I understand you correctly, you have a dataset in which dual barcoding is used. One barcode resides in the header of R1, the other in the header of R2. If this is the case, then a two step approach, as described here should be followed.

I am a bit confused about the large amount of reads that end up in the UNKNOWN files. This would mean that not all possible barcodes for R1 were provided. If the approach is more convoluted than simple dual barcoding (e.g., if you are indeed interested in the barcodes of R1 or R2, instead of R1 and R2), then a similar approach as the one above could be followed, with the addition that the UNKNOWN files from the first stage should also serve as input for the second stage.

mmokrejs commented 10 months ago

If I understand you correctly, you have a dataset in which dual barcoding is used. One barcode resides in the header of R1, the other in the header of R2. If this is the case, then a two step approach, as described here should be followed.

No, I want R1.seq.startswith('fwd_barcode1') or R2.seq.startswith('rev_barcode2') to happen, ideally at once. The FASTA/Q header contain the Illumina barcodes which practically say nothing (this is an amplicon project and the amplicon products were scattered over 5 flowcells and 1 or 2 lanes on each to dillute them enough with samples of other customers), I am after the custom barcodes which are the very begining of the sequences.

I am a bit confused about the large amount of reads that end up in the UNKNOWN files. This would mean that not all possible barcodes for R1 were provided. If the approach is more convoluted than simple dual barcoding (e.g., if you are indeed interested in the barcodes of R1 or R2, instead of R1 and R2), then a similar approach as the one above could be followed, with the addition that the UNKNOWN files from the first stage should also serve as input for the second stage.

I want OR in R1.seq.startswith('fwd_barcode1') or R2.seq.startswith('rev_barcode2'). I am running it in a new attempt now as two commands but then I will basically throw away the just created FASTQs after I extract the readnames from each, concatenate both readnames.txt files, sort, uniq` and ask an external tool to extract those read-pairs from the original FASTQ file.

I am also worried that so many reads are in the UNKNOWN group, haven't realized any explanation yet.

mmokrejs commented 10 months ago

And as I wrote, I am getting lost in the https://demultiplex.readthedocs.io/en/latest/usage.html#multiple-barcodes section. Please try to rewrite it. Ideally, accept on the commandline A.tsv and B.tsv simultaneously.

And use read_R1.fq, read_R2.fq instead of read_1_1.fq, read_2_1.fq as the startup example. I am lost in which _"$num" is which. And include cat A.tsv and cat B.tsv contents, then I could probably figure out what you mean.

mmokrejs commented 10 months ago

I am running it in a new attempt now as two commands but then I will basically throw away the just created FASTQs after I extract the readnames from each, concatenate both readnames.txt files, sort, uniq` and ask an external tool to extract those read-pairs from the original FASTQ file.

Obviously this was naive approach, some reads get assigned to two samples because each of the barcodes points to a different sample. demultiplex could resolve the conflicts and prefer hints either of the two barcodes which hopefully was error-free or had simply lower distance from its original barcode than the other one.

jfjlaros commented 10 months ago

I want OR in R1.seq.startswith('fwd_barcode1') or R2.seq.startswith('rev_barcode2').

I see, this is indeed different from the procedure described in the documentation.

Ideally, accept on the commandline A.tsv and B.tsv simultaneously.

There are too many combinations to capture in a CLI. That is why we suggest multi step approaches when multiple barcodes are involved.

Anyway, what about the following idea?

  1. Demultiplex using the barcode in R1 first. E.g., demultiplex demux -r -e 8 A.tsv foo_R1.fastq.gz foo_R2.fastq

After this step, all reads that could be demultiplexed with the barcodes in R1 are now done.

  1. Demultiplex the UNKNOWN files using the barcode in R2 next, E.g., demultiplex demux -r -e 8 B.tsv foo_R2_UNKNOWN.fastq.gz foo_R1_UNKNOWN.fastq

Note that the order of R1 and R2 have been reversed, the first file in the list is used for demultiplexing, the rest of the files follow the first one.

  1. Optionally, you could concatenate the appropriate files from the first step with those from the second step.
mmokrejs commented 10 months ago

Ideally, accept on the commandline A.tsv and B.tsv simultaneously.

There are too many combinations to capture in a CLI. That is why we suggest multi step approaches when multiple barcodes are involved.

You can accept 4-column TSV then. I don't think either is too complicated.

Anyway, what about the following idea?

1. Demultiplex using the barcode in R1 first. E.g.,
   `demultiplex demux -r -e 8 A.tsv foo_R1.fastq.gz foo_R2.fastq`

After this step, all reads that could be demultiplexed with the barcodes in R1 are now done.

2. Demultiplex the UNKNOWN files using the barcode in R2 next, E.g.,
   `demultiplex demux -r -e 8 B.tsv foo_R2_UNKNOWN.fastq.gz foo_R1_UNKNOWN.fastq`

Note that the order of R1 and R2 have been reversed, the first file in the list is used for demultiplexing, the rest of the files follow the first one.

I get the point. The "problem" is that this solution does not resolve barcode conflicts.

demultiplex could prefer the lower distance, it has the values at hand while parsing. That is something I cannot do as an addon. And if the distances are same, let's throw the read-pair into UNKNOWN.

3. Optionally, you could concatenate the appropriate files from the first step with those from the second step.

Well, the https://github.com/jfjlaros/demultiplex/issues/32#issuecomment-1858016790 approach yields reads assigned to two samples occassionally while your approach is indeed different and at elast ensures a read-pair ends up in just a single sample file.

jfjlaros commented 10 months ago

You can accept 4-column TSV then. I don't think either is too complicated.

That is not the problem. It gets complex when the first barcode is in the first part of the header of R1, the second one is in the read of R2, the third one is in the second part of R3, etc. And then there are the logical operations, e.g., we want either R1 and R2 to match or R3, etc.

The "problem" is that this solution does not resolve barcode conflicts.

If there are barcode conflicts, then the set of barcodes was not designed properly. All barcode pairs should at least have a distance of 3 in order to correct one sequencing error and a distance of 5 to correct two errors, etc. In this case, my advice would be to allow for 0 errors.

I have the feeling I still not fully understand the problem.

mmokrejs commented 10 months ago

The barcodes were designed (not by me) to have a Hamming distance 5 but this is not enough because InDels occur in real. Cannot demultiplex really consider the distance values of each barcode while parsing both R1 and R2 input stream concurrently? That would be a tremendous help. For example, let's say fwd barcode might have some error while rev-barcode will could be error-free.

jfjlaros commented 10 months ago

this is not enough because InDels occur in real

Is this Illumina data? This contains very few indels in general.

Cannot demultiplex really consider the distance values of each barcode while parsing both R1 and R2 input stream con-currently?

As explained above, I cannot see how that would result in a usable CLI as all the options should be duplicated for each input file.

fwd barcode might have some error while rev-barcode will could be error-free

This is why I suggested to demultiplex the UNKNOWN files from the first round. If no forward barcodes could be assigned, then we try again with the reverse barcodes. You may have to merge the files associated with the forward and reverse barcode pairs afterwards, depending on your downstream analysis.

jfjlaros commented 8 months ago

I am closing this issue for now. Please feel free to reopen it if there are additional questions.