Demultiplexing indexed plasmid barcode libraries

rahmas36 commented 1 year ago

Hi, I would like to use lentiviral plasmid barcode libraries for lineage tracing, following the STICR approach published in a Nature paper last year. I have ordered two plasmid pools from Addgene. Each pool contains 50-60 M unique barcodes and carries a 3-nucleotide index adjacent to the barcode region. I have attached the plasmid maps, zoomed in on the barcode region and index sequence of each. The plasmids were digested with NotI and XhoI, and the 700 bp fragment containing the barcode was purified. A UMI-containing adaptor was ligated to the XhoI site, and the DNA fragments were amplified with primers carrying Illumina indices, one annealing to the UMI-containing adaptor and the other to the "sequencing primer site" indicated on the map. These libraries are currently being sequenced, and both have the same Illumina indices, so I will have to demultiplex the fastq file using the 3-nucleotide index sequences (ATC and GAG). Would this be possible with ultraplex? Looking forward to hearing back. Thanks

Delayed-Gitification commented 1 year ago

Thanks for your detailed message. So you wish to demultiplex based on just the 3 nt barcode? That should certainly be possible, yes!

Delayed-Gitification commented 1 year ago

With such short barcodes, I recommend setting the mismatch number to 0. The default is 1 (for historic, and perhaps not terribly good, reasons), which would likely be too permissive here.
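To make the "too permissive" point concrete, here is a minimal sketch that counts how many of the 64 possible 3-mers would be accepted for the two indices in this thread (ATC and GAG) at 0 and 1 allowed mismatches. It assumes plain Hamming-distance matching; the function names are made up for illustration and this is not ultraplex's own matching code.

```python
from itertools import product

INDICES = ("ATC", "GAG")  # the two pool indices mentioned in this thread


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def accepted_kmers(index, max_mismatches):
    """All k-mers within max_mismatches of index (assumed Hamming matching)."""
    kmers = ("".join(p) for p in product("ACGT", repeat=len(index)))
    return {k for k in kmers if hamming(k, index) <= max_mismatches}


for m in (0, 1):
    accepted = set()
    for index in INDICES:
        accepted |= accepted_kmers(index, m)
    print(f"mismatches={m}: {len(accepted)} of 64 possible 3-mers accepted")
```

Under that assumption, 0 mismatches accepts only the 2 exact indices, while 1 mismatch accepts 20 of the 64 possible 3-mers, so nearly a third of arbitrary sequences at that position would be assigned to one pool or the other.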

rahmas36 commented 1 year ago

Hi, thanks for your reply! Would it be better to include a few more bp around the 3 nucleotide index to give better context? Maybe 10-20 bp?

Delayed-Gitification commented 1 year ago

3 nt should be fine. Your primer sequences (which Illumina uses to prime the sequencing-by-synthesis reaction) sit at a defined position relative to the barcode, so the barcode sequence will always be present at the same position in each read.

That said, if you want to be extra careful you could indeed include a single flanking nucleotide either side (perhaps even 2 nt each side). I wouldn't add 10-20 though - that would be too much :)
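The fixed-position argument can be sketched in a few lines of Python. The read, the offset of 2, and the function names below are hypothetical and chosen only for illustration; the point is that looking at a fixed position gives an unambiguous answer even when the other index happens to occur elsewhere in the read.

```python
INDICES = ("ATC", "GAG")  # the two pool indices from this thread


def index_at(read, offset, length=3):
    """Return the bases found at a fixed position in the read."""
    return read[offset:offset + length]


def assign(read, offset, indices=INDICES):
    """Assign a read to an index by exact match at a fixed offset, else None."""
    observed = index_at(read, offset, len(indices[0]))
    return observed if observed in indices else None


# Hypothetical read: GAG sits at offset 2, and ATC also occurs by chance later on.
read = "TTGAGGAATCCA"
print(assign(read, offset=2))        # position-based lookup
print([i for i in INDICES if i in read])  # naive substring search finds both
```

The position-based lookup reports GAG only, whereas the substring search finds both indices in the same read. That is why a few flanking bases are enough and 10-20 nt of context is not needed.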

rahmas36 commented 1 year ago

Thanks! Could I run ultraplex on my Mac? I expect the fastq file to have around 300-400 M reads. My processor is a 3 GHz quad-core Intel Core i5 and I have 8 GB of RAM. I just download from the GitHub link and run pip install from the terminal, right?

For the actual run, should I specify both the 5' and 3' barcodes? As I mentioned, I ligated a UMI-containing adaptor at the XhoI site, which will be read in Read1, and the 3-nucleotide index will be read from the other end, in Read2. The UMI is 16 nucleotides long. So I am thinking of making a CSV file with this configuration, where in the first column I include 16 Ns flanked by 4 constant nucleotides on either side, and in the second column I write the two 3-nucleotide indices, each flanked by two constant nucleotides on either side:

ATCTNNNNNNNNNNNNNNNNAGCA, TTATCGA, TTGAGGA

Delayed-Gitification commented 1 year ago

We tested ultraplex on Linux but I believe it should also work on Mac.

Those specs will be ok.

Installing via pip or conda is recommended. You shouldn't need to download from GitHub.

I believe your suggested CSV would work. Good luck!
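For anyone unsure what a pattern such as ATCTNNNNNNNNNNNNNNNNAGCA is meant to express, here is a minimal sketch. It assumes that N positions accept any base and are collected as the UMI, and that all other positions must match exactly at the start of the read. The function name and the example read are hypothetical, and this only illustrates the intent of the pattern, not how ultraplex implements it.

```python
UMI_PATTERN = "ATCT" + "N" * 16 + "AGCA"  # the first-column pattern proposed above


def extract_umi(read, pattern=UMI_PATTERN):
    """Return the bases under the N positions if the read start fits the pattern.

    Assumption: N means "any base, part of the UMI"; every other position
    must match exactly. Returns None when the read does not fit.
    """
    if len(read) < len(pattern):
        return None
    umi = []
    for base, expected in zip(read, pattern):
        if expected == "N":
            umi.append(base)
        elif base != expected:
            return None
    return "".join(umi)


# Hypothetical read: constant ATCT, a 16 nt UMI, constant AGCA, then insert sequence.
read = "ATCT" + "ACGTACGTACGTACGT" + "AGCA" + "GGGTTT"
print(extract_umi(read))
```

A read whose constant flanks differ from the pattern returns None here, which is the kind of mismatch that would leave reads unassigned.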

rahmas36 commented 1 year ago

Hi, thanks for your reply. I have paired-end reads, so I have two gzipped fastq files (R1, R2). R1 contains the UMI, and R2 contains the indexed barcode read. Could I specify both fastq files on the command line?

rahmas36 commented 1 year ago

Hi, it looks like I was able to demultiplex my reads, but I have a few questions. There was a mix-up at the sequencing center: the 3-nucleotide index ended up in read 1 and the UMI in read 2, the opposite of what I intended. Anyway, I set up my CSV like this:

TAATAG NNNNNNNNNNNNNNNNAGCACG
TAACTC NNNNNNNNNNNNNNNNAGCACG

The first column contains the sequences with the index (marked in bold) and the second column contains the UMIs.

What surprises me is that the fastq.gz read 1 file was 6.23 GB and the fastq.gz read 2 file was 15.32 GB but the demultiplexed files are very small. I attached a screenshot of the output. It would surprise me if there were many "no matches".

This was my command:

ultraplex -i ROU-15478-STICR-1_TAAGGCGA-GCGATCTA_HVLFWDRX2_L002_001.R1.fastq.gz -i2 ROU-15478-STICR-1_TAAGGCGA-GCGATCTA_HVLFWDRX2_L002_001.R2.fastq.gz -b STICR_indices.csv -dbr -inm

This was the final printout:

382 million reads processed
Demultiplexing complete!
383597392 reads processed in 5656.0 seconds

Delayed-Gitification commented 1 year ago

Yes, it definitely looks like there is an issue with your barcodes CSV, as the output files here are tiny.
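One way to check a barcodes CSV against the data is to look at what the reads actually start with. The sketch below tallies the leading k-mers of the first reads in a gzipped fastq file using only the Python standard library. The function name, the default k of 6, and the read limit are arbitrary choices for illustration, and it assumes standard four-line fastq records.

```python
import gzip
from collections import Counter
from itertools import islice


def leading_kmers(fastq_gz_path, k=6, max_reads=100_000):
    """Count the first k bases of up to max_reads reads in a gzipped fastq file."""
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        # Assumes four-line records; the sequence is the second line of each.
        sequences = islice(handle, 1, None, 4)
        for seq in islice(sequences, max_reads):
            counts[seq.strip()[:k]] += 1
    return counts
```

Calling leading_kmers on each input file and printing .most_common(10) shows the dominant prefixes. If the sequences in the CSV are not among them (for example because they are complemented, reversed, or belong to the other read), very few reads will be assigned.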

rahmas36 commented 1 year ago

Hi, it turns out the sequencing was in the correct direction after all. The adaptors were already removed. Can I disable the adaptor removal function? I wonder if that's causing problems. I think I set up the barcodes CSV file correctly. Read 1 contains a UMI followed by the constant sequence (NNNNNNNNNNNNNNNNAGCACG), and read 2 contains either index ATC (ATTATC) or GAG (ATTGAG).

I set up the CSV file like this:

NNNNNNNNNNNNNNNNAGCACG, ATTATC;ATTGAG

The script does not output any demultiplexed files. Maybe because it searches for the universal adaptor sequence, which is not present?
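As a cross-check that does not depend on ultraplex, the read structure described above can be tallied directly. The sketch below takes (read 1, read 2) sequence pairs and counts, per index, how many pairs also carry the constant AGCACG after a 16 nt UMI. It assumes the UMI starts at the first base of read 1 and the 6-mer at the first base of read 2; those offsets, the names, and the toy pairs are assumptions for illustration. It says nothing about adaptor handling.

```python
from collections import Counter

R2_PREFIXES = {"ATTATC": "ATC", "ATTGAG": "GAG"}  # 6-mers quoted in the comment above
UMI_LENGTH = 16
R1_CONSTANT = "AGCACG"


def tally_pairs(pairs):
    """Count read pairs by (index from read 2, read 1 has expected constant)."""
    counts = Counter()
    for r1, r2 in pairs:
        # Assumed offsets: index 6-mer at the start of read 2, UMI at the start of read 1.
        index = R2_PREFIXES.get(r2[:6], "no_match")
        constant = r1[UMI_LENGTH:UMI_LENGTH + len(R1_CONSTANT)]
        counts[(index, constant == R1_CONSTANT)] += 1
    return counts


# Toy pairs, made up for illustration.
pairs = [
    ("ACGTACGTACGTACGT" + "AGCACG" + "TT", "ATTATC" + "GGGG"),
    ("TTTTCCCCAAAAGGGG" + "AGCACG" + "TT", "ATTGAG" + "GGGG"),
    ("ACGTACGTACGTACGT" + "TTTTTT" + "TT", "CCCCCC" + "GGGG"),
]
for key, n in sorted(tally_pairs(pairs).items()):
    print(key, n)
```

If most real pairs land in ("no_match", False), the sequences in the CSV do not describe the reads as sequenced. If most land under ATC or GAG with True, the structure is as expected and the problem lies elsewhere.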

rahmas36 commented 1 year ago

Hi, I just wanted to follow up on my previous query. Looking forward to advice. Thanks!

rahmas36 commented 1 year ago

Hi, I just wanted to bump this up again. Looking forward to assistance. Thanks!

ulelab / ultraplex

Demultiplexing indexed plasmid barcode libraries #41