Closed wenweixiong closed 4 years ago
It looks like you have two barcodes in the header. Can you try using the option -e 6 to select the relevant part?
On closer inspection, according to the specifications, the barcode for this read should be GTATCGTCGT. The extra :AACGGT+AAGCCT in the header of this read does not seem to follow the standard.
Do you have more information about this data set?
GTATCGTCGT is the index on R1, which is used to demultiplex with bcl2fastq. AACGGT is the cell barcode on R1 and AAGCCT is the cell barcode on R2. Can the demultiplex program be used to demultiplex based on the AACGGT+AAGCCT sequence?
With some minor modifications this should be possible, but I need to know how this FastQ file was generated. Can you perhaps give the version of bcl2fastq and the actual command line used to generate this data set?
Also, are you sure that you are not confusing the cell barcode with the indexes? It seems to me that the dual index AACGGT+AAGCCT identifies the sample and that GTATCGTCGT identifies the cell.
In anticipation of your answer, I have implemented an option to select barcodes from a different part of the header.
Assuming that barcodes.csv only contains the first barcode:
1 AACGGT
The following command can be used:
demultiplex demux --format=umi -e 6 barcodes.csv reads.fq
(the -e option selects the first part of AACGGT+AAGCCT).
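To illustrate what selecting "the first part" of the header barcode means, here is a minimal Python sketch. It is not the demultiplex implementation: the function name header_barcode is hypothetical, and it assumes that the barcode is the last colon-separated field of the read ID and that the end value acts as the end index of a slice.

```python
def header_barcode(header, start=None, end=None):
    """Return the last colon-separated field of the read ID, optionally sliced.

    Assumption (for illustration only): the barcode sits at the end of the
    read ID, before the first space, as in the example read in this thread.
    """
    read_id = header.lstrip("@").split()[0]
    field = read_id.split(":")[-1]
    return field[start:end]


header = "@NB502048:452:HVVMCAFXY:1:11101:2591:1020:AACGGT+AAGCCT 1:N:0:GTATCGTCGT"

print(header_barcode(header))         # AACGGT+AAGCCT
print(header_barcode(header, end=6))  # AACGGT
```

With an end value of 6, only the first six characters (AACGGT) are compared against barcodes.csv, so the second half of the pair is ignored.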
Apologies for the late reply! Yes, this will be very helpful. The combination of AACGGT and AAGCCT informs me of the final identity. Is it possible to pull reads by "AACGGT+AAGCCT", rather than just AACGGT or AAGCCT?
Just a suggestion: perhaps flexibility could be implemented so that the user can pull barcode sequences from any part (position) of the header?
Is it possible to pull reads by "AACGGT+AAGCCT", rather than just AACGGT or AAGCCT?
Yes, that is possible, but note that when you use the Levenshtein distance function for matching you may get some counterintuitive results.
The barcodes.csv will look something like this:
1 AACGGT+AAGCCT
The following command should work:
demultiplex demux --format=umi barcodes.csv reads.fq
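The Levenshtein caveat mentioned above can be made concrete with a small Python sketch. This is a textbook edit-distance function written for illustration, not the one demultiplex uses, and the observed barcodes are made-up examples. It shows that errors in the two halves add up when the pair is matched as one string, and that the "+" separator counts as an ordinary character.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


expected = "AACGGT+AAGCCT"

print(levenshtein(expected, "AACGGA+AAGCCT"))  # 1: one mismatch in the first half
print(levenshtein(expected, "AACGGA+TAGCCT"))  # 2: one mismatch in each half
print(levenshtein(expected, "AACGGTAAGCCT"))   # 1: the "+" itself is missing
print(levenshtein(expected, "ACGGT+AAGCCTA"))  # 2: shifted by one base
```

So a mismatch allowance that seems reasonable for a single 6-base barcode may behave differently on the 13-character combined string: a read with one error in each half needs a distance of 2, while a frame-shifted pair that differs at many positions is also only 2 edits away.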
Just a suggestion: perhaps flexibility could be implemented so that the user can pull barcode sequences from any part (position) of the header?
I have been considering this, but the Illumina header format is rather strict, so it may make the interface more complicated than needed. If more non-standard headers pop up, I will reconsider.
If you can confirm that this addition works, I will release a new version.
I reinstalled using pip install demultiplex and tried the new argument.
It returned the following error:
demultiplex: error: unrecognized arguments: --format=umi
That is because the changes have not been released yet. You can install the patch with the following commands:
git clone https://github.com/jfjlaros/demultiplex.git
cd demultiplex
git checkout umi
pip install --upgrade .
I can confirm that the argument --format=umi works for this purpose!
Thank you.
I just released version 1.1.0.
Hi, I have a barcode sequence in the header that I would like to use to split my reads accordingly, but it seems that all my reads are categorized as UNKNOWN. In the example below, AACGGT is the forward barcode sequence, but demultiplex isn't able to recognize it.
@NB502048:452:HVVMCAFXY:1:11101:2591:1020:AACGGT+AAGCCT 1:N:0:GTATCGTCGT
TTATGGACAACAGTCAAACAACAATTCTTTGTACTTTTTTTTTCCTTAGTCTTTCTTTGAAGCAGCAAGTATGATGAGCAAGCTTTCTCACAAGCATTTGGTTTTAAATTATGGAGTATGTTTCTGTGGAGACGAGAGTAGGT
+
Please advise? Thanks!