grenaud / deML

Maximum likelihood demultiplexing
GNU General Public License v3.0

Do I need to trim the indices? #11

Closed ahwanpandey closed 5 years ago

ahwanpandey commented 5 years ago

Hi again,

I have questions for a couple of scenarios that are similar:

1) I have an index file where the first 11 bases are the barcode, followed by the UMI bases. I only need the barcode to demultiplex. Is it better to trim the index reads down to the first 11 bases, or does it not matter? Does deML look across the whole index read for matches?

2) In a more extreme scenario, the index read is 125 bp in total: an 11-base barcode followed by bases that read into the adapter. How would you advise proceeding when the index reads contain more information than just the barcode?

Thanks!

grenaud commented 5 years ago

  1. No, it does not look across the read. If you use FASTQ as your input format, I suggest splitting each read so that the input:
[barcode][rest of read]

is split such that [rest of read] goes into read1.fq.gz and [barcode] goes into index1.fq.gz.

  2. I am confused: you have 11 bases of barcode + 114 bases of adapter? Why would you sequence adapters?

ahwanpandey commented 5 years ago

Sorry I should have been more specific. I have a couple of different scenarios

1) 
R1.fastq.gz -> read
I1.fastq.gz -> [Barcode 11bp] [UMI] -> this is the index read

2)
R1.fastq.gz -> read
I1.fastq.gz -> [Barcode 11bp] [114bp extra] -> this is the index read

This is single-cell RNA-seq data that I am processing for a collaborator, and I'm not sure why their core sequenced it that way.

I am only interested in the 11bp barcode in order to demultiplex.

Do you recommend I trim off the extra sequences before using deML or is it OK to leave them in there? The files are rather large so if I could process them without trimming it would save some time and space.

grenaud commented 5 years ago

No, you need to take the first 11 bp, put them in a separate file, and use those in deML.
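
For anyone who wants to do this trimming step, here is a minimal Python sketch. It assumes a gzipped FASTQ with exactly four lines per record (no wrapped sequences) and the 11 bp barcode length mentioned in this thread. The function name `trim_index_fastq` and the output file name in the usage note are made up for illustration and are not part of deML.

```python
import gzip
from itertools import islice

BARCODE_LEN = 11  # barcode length taken from this thread; adjust as needed


def trim_index_fastq(src, dst, n=BARCODE_LEN):
    """Copy a gzipped FASTQ, keeping only the first n bases and qualities.

    Assumes plain 4-line FASTQ records. Returns the number of records written.
    """
    count = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        while True:
            record = list(islice(fin, 4))  # header, sequence, '+', qualities
            if not record:
                break
            if len(record) != 4:
                raise ValueError("truncated FASTQ record at end of file")
            header, seq, plus, qual = (line.rstrip("\n") for line in record)
            # Trim sequence and quality string together so they stay the same length.
            fout.write(f"{header}\n{seq[:n]}\n{plus}\n{qual[:n]}\n")
            count += 1
    return count
```

Calling `trim_index_fastq("I1.fastq.gz", "I1.barcode.fastq.gz")` would write a barcode-only index file (the output name is hypothetical) that can then be passed to deML in place of the original index file.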

I will close this issue but let me know if this is still an issue and we will reopen.

ahwanpandey commented 5 years ago

Just wanted to update.

I did a quick test with a subset of reads, and the results are the same in both of the following scenarios:

1) 
R1.fastq.gz -> read
I1.fastq.gz -> [Barcode 11bp] [extra bases] -> this is the index read
ASSIGNED:       8364    83.64%
PROBLEMS:       1636    16.36%
TOTAL:  10000   100.0%

2)
R1.fastq.gz -> read
I1.fastq.gz -> [Barcode 11bp] -> this is the index read
ASSIGNED:       8364    83.64%
PROBLEMS:       1636    16.36%
TOTAL:  10000   100.0%

grenaud commented 5 years ago

I forgot what the code does, but if your index reads are longer than the indices in the index list, it probably ignores the rest. Thank you for testing and reporting it :-)
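
To illustrate why the two test runs could give identical counts, here is a small Python sketch of prefix-only comparison. This is not deML's code: deML assigns reads by likelihood rather than by counting mismatches, and the function names and sequences below are invented for illustration. The only point is that if just the first `len(barcode)` bases of the index read are compared, any trailing bases cannot change the result.

```python
def mismatches_prefix(index_read, barcode):
    """Count mismatches between barcode and the first len(barcode) bases of index_read."""
    prefix = index_read[:len(barcode)]  # anything past the barcode length is ignored
    return sum(a != b for a, b in zip(prefix, barcode))


def best_barcode(index_read, barcodes):
    """Return the barcode with the fewest prefix mismatches (hypothetical helper)."""
    return min(barcodes, key=lambda bc: mismatches_prefix(index_read, bc))


# Made-up 11 bp barcodes and a 125 bp index read (11 bp barcode + 114 extra bases).
barcodes = ["ACGTACGTACG", "TTTTTTTTTTT"]
trimmed = "ACGTACGTACG"
untrimmed = trimmed + "T" * 114

# Trimmed and untrimmed reads are scored identically under prefix-only comparison.
same_score = mismatches_prefix(trimmed, barcodes[0]) == mismatches_prefix(untrimmed, barcodes[0])
same_choice = best_barcode(trimmed, barcodes) == best_barcode(untrimmed, barcodes)
```

Whether deML behaves this way for every input format is not confirmed in this thread, so comparing a trimmed and an untrimmed run on a subset of reads, as done above, remains the safer check.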

ahwanpandey commented 5 years ago

Of course, thanks for sharing this tool!