All reads categorized as UNKNOWN

wenweixiong commented 4 years ago

Hi, I have a barcode sequence in the header that I would like to use to split my reads. However, all of my reads are categorized as UNKNOWN. In the example below, AACGGT is the forward barcode sequence, but demultiplex does not recognize it.

@NB502048:452:HVVMCAFXY:1:11101:2591:1020:AACGGT+AAGCCT 1:N:0:GTATCGTCGT
TTATGGACAACAGTCAAACAACAATTCTTTGTACTTTTTTTTTCCTTAGTCTTTCTTTGAAGCAGCAAGTATGATGAGCAAGCTTTCTCACAAGCATTTGGTTTTAAATTATGGAGTATGTTTCTGTGGAGACGAGAGTAGGT
+

Please advise? Thanks!

jfjlaros commented 4 years ago

~~It looks like you have two barcodes in the header. Can you try to use the option -e 6 to select the relevant part?~~

jfjlaros commented 4 years ago

On closer inspection, according to the specifications, the barcode for this read should be GTATCGTCGT. The extra :AACGGT+AAGCCT in the header of this read does not seem to follow the standard.

Do you have more information about this data set?
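To make the two candidate barcode locations concrete, here is a minimal Python sketch that splits the example header into its colon-separated fields. The helper `split_illumina_header` is hypothetical and written only for illustration; it is not part of demultiplex, and it assumes the header has a space between the read name and the comment part.

```python
def split_illumina_header(header):
    """Return the index from the comment part and any extra eighth name field.

    Hypothetical helper for illustration; not part of demultiplex.
    """
    # The read name and the comment part are separated by a single space.
    name, _, comment = header.lstrip("@").partition(" ")
    name_fields = name.split(":")
    comment_fields = comment.split(":")
    return {
        # Last field of the comment part, e.g. "1:N:0:GTATCGTCGT".
        "index": comment_fields[-1] if comment else None,
        # Eighth field of the read name, present only in some headers.
        "extra": name_fields[7] if len(name_fields) > 7 else None,
    }


header = "@NB502048:452:HVVMCAFXY:1:11101:2591:1020:AACGGT+AAGCCT 1:N:0:GTATCGTCGT"
fields = split_illumina_header(header)
print(fields["index"])
print(fields["extra"])
```

For the read above, `index` holds the barcode from the end of the header and `extra` holds the additional `AACGGT+AAGCCT` field.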

wenweixiong commented 4 years ago

GTATCGTCGT is the index on R1. It is used to demultiplex with bcl2fastq. AACGGT is the cell barcode on R1, and AAGCCT is the cell barcode on R2. Can the demultiplex program be used to demultiplex based on the AACGGT+AAGCCT sequence?

jfjlaros commented 4 years ago

With some minor modifications this should be possible, but I need to know how this FastQ file was generated. Can you perhaps give the version of bcl2fastq and the actual command line used to generate this data set?

Also, are you sure that you do not confuse the cell barcode with the indexes? It seems to me that the dual index AACGGT+AAGCCT identifies the sample and that GTATCGTCGT identifies the cell.

jfjlaros commented 4 years ago

In anticipation of your answer, I have implemented an option to select barcodes from a different part of the header.

Assuming that the barcodes.csv only contains the first barcode:

1 AACGGT

The following command can be used: demultiplex demux --format=umi -e 6 barcodes.csv reads.fq (the -e option selects the first part of AACGGT+AAGCCT).
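As a rough illustration of what selecting the first part of the field amounts to, the Python sketch below slices the first six characters and looks them up in a table built from the barcodes.csv line above. The function `select_barcode` and the 1-based, inclusive positions are assumptions made for this example; the actual implementation in demultiplex may differ.

```python
def select_barcode(field, start=1, end=None):
    """Return the part of a header field between two positions.

    Hypothetical helper; positions are assumed to be 1-based and inclusive.
    """
    return field[start - 1:end]


# Mapping built from the barcodes.csv line "1 AACGGT" (barcode -> sample name).
barcodes = {"AACGGT": "1"}

field = "AACGGT+AAGCCT"
selected = select_barcode(field, end=6)

# Reads whose selected barcode is not in the table fall back to UNKNOWN.
sample = barcodes.get(selected, "UNKNOWN")
print(selected, sample)
```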

wenweixiong commented 4 years ago

Apologies for the late reply! Yes, this will be very helpful. The combination of AACGGT and AAGCCT informs me of the final identity. Is it possible to pull reads by "AACGGT+AAGCCT", rather than just AACGGT or AAGCCT?

wenweixiong commented 4 years ago

Just a suggestion: perhaps flexibility can be implemented so that the user can pull barcode sequences from any part (position) of the header?

jfjlaros commented 4 years ago

> Is it possible to pull reads by "AACGGT+AAGCCT", rather than just AACGGT or AAGCCT?

Yes, that is possible, but note that when you use the Levenshtein distance function for matching you may get some counterintuitive results.

The barcodes.csv will look something like this:

1 AACGGT+AAGCCT

The following command should work: demultiplex demux --format=umi barcodes.csv reads.fq.
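One possible reading of the caveat about the Levenshtein distance is sketched below in Python. When both barcodes are matched as a single string, the distance does not say which half the differences fall in, and the `+` separator counts as an ordinary character. The `levenshtein` function is a textbook dynamic-programming implementation written for this example, not the one demultiplex uses, and the mutated barcodes are made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance between a and b (substitutions, insertions, deletions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


reference = "AACGGT+AAGCCT"

# Hypothetical observed barcodes, made up for illustration.
both_in_second = "AACGGT+TTGCCT"  # two substitutions, both in the second half
one_in_each = "ATCGGT+ATGCCT"     # one substitution in each half
separator_hit = "AACGGTAAAGCCT"   # the '+' itself replaced by a base

for observed in (both_in_second, one_in_each, separator_hit):
    print(observed, levenshtein(reference, observed))
```

The first two observed strings end up at the same distance from the reference even though their differences are distributed differently across the two barcodes, so a single mismatch threshold cannot tell these cases apart.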

jfjlaros commented 4 years ago

> Just a suggestion: perhaps flexibility can be implemented so that the user can pull barcode sequences from any part (position) of the header?

I have been considering this, but the Illumina header format is rather strict, so it may make the interface more complicated than needed. If more non-standard headers pop up, I will reconsider.

jfjlaros commented 4 years ago

If you can confirm that this addition works, I will release a new version.

wenweixiong commented 4 years ago

I reinstalled using pip install demultiplex and tried the new argument.

Returned the following error:

demultiplex: error: unrecognized arguments: --format=umi

jfjlaros commented 4 years ago

That is because the changes have not been released yet. You can install the patch with the following commands:

git clone https://github.com/jfjlaros/demultiplex.git
cd demultiplex
git checkout umi
pip install --upgrade .

wenweixiong commented 4 years ago

I can confirm that the argument --format=umi works for this purpose!

jfjlaros commented 4 years ago

Thank you.

I just released version 1.1.0.
