paulranum11 / SPLiT-Seq_demultiplexing

An unofficial demultiplexing strategy for SPLiT-seq RNA-Seq data
MIT License

Version of Split-seq #8

Closed: geneticsmcgill closed this issue 4 years ago

geneticsmcgill commented 4 years ago

Hi there,

I noticed that barcode 1 seems to be missing from a lot of our reads, which reduces the cell counts and the reads per cell. Could this be because we are using v3 of SPLiT-seq, and does this pipeline support that version? The positioning of barcode 1 changed in v3.

paulranum11 commented 4 years ago

Hi geneticsmcgill,

SPLiT-Seq_demultiplexing should be robust to differently positioned SPLiT-seq barcodes. The barcode position is not a fixed parameter: instead of extracting sequences at a fixed position, SPLiT-Seq_demultiplexing searches for sequence matches corresponding to each SPLiT-seq barcode. Because of this architecture, the barcode sequence and flanking sequence (but not the position) affect barcode 1 identification. Have you confirmed that the barcodes you are using in read1 match the barcode and flanking sequences in the Round1_barcodes_new5.txt file?
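To illustrate the difference between position-based and match-based extraction, here is a minimal sketch. It is not the actual SPLiT-Seq_demultiplexing code; the function name `find_barcode`, the barcodes, and the flanking sequence are all made up for the example.

```python
def find_barcode(read, barcodes, left_flank="", right_flank=""):
    """Search the whole read for any barcode together with its flanks.

    Returns (barcode, start) where start is the 0-based index of the
    barcode itself, or (None, None) if nothing matches.
    """
    for barcode in barcodes:
        target = left_flank + barcode + right_flank
        hit = read.find(target)
        if hit != -1:
            return barcode, hit + len(left_flank)
    return None, None


# Toy sequences, invented for illustration only.
barcodes = ["AACGTGAT", "AAACATCG"]
flank = "GTGGCC"

# The same barcode at two different offsets is found in both reads.
read_a = "TTTTTTTTTT" + "AACGTGAT" + flank + "TTTT"
read_b = "TT" + "AACGTGAT" + flank + "TTTTTTTTTTTT"
# The barcode is present here, but the flank is wrong, so the read fails.
read_c = "TT" + "AACGTGAT" + "CCCCCC" + "TTTTTTTTTTTT"

for read in (read_a, read_b, read_c):
    print(find_barcode(read, barcodes, right_flank=flank))
```

In this sketch a shifted barcode is still found, while a barcode with a non-matching flank is not, which is why checking the flanking sequences in the barcode file matters more than checking the position.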

Another consideration is the use of oligo(dT) and random hexamer RT primers. The --collapse setting needs to be set to true or false based on your configuration and on whether you want to collapse reads obtained from barcodes in the same round 1 well. Unwanted collapse of barcodes could result in the loss of half of the expected round 1 barcodes.
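As a rough picture of what collapsing does (a sketch only, assuming each round 1 well carries one oligo(dT) barcode and one random hexamer barcode; the mapping, the sequences, and the helper `round1_label` are hypothetical and not taken from the pipeline):

```python
# Hypothetical round 1 layout: each well holds an oligo(dT) barcode and a
# random hexamer barcode. All sequences here are invented for illustration.
HEXAMER_TO_DT = {
    "GGTTAACC": "AACCGGTT",  # well 1: hexamer barcode -> oligo(dT) barcode
    "TTGGCCAA": "CCAATTGG",  # well 2
}


def round1_label(barcode, collapse):
    """Return the label used to group reads by round 1 barcode."""
    if collapse:
        # Reads primed by the random hexamer are relabelled with the
        # oligo(dT) barcode from the same well.
        return HEXAMER_TO_DT.get(barcode, barcode)
    return barcode


observed = ["AACCGGTT", "GGTTAACC", "CCAATTGG", "TTGGCCAA"]
labels_collapsed = {round1_label(b, collapse=True) for b in observed}
labels_separate = {round1_label(b, collapse=False) for b in observed}
print(len(labels_collapsed), len(labels_separate))
```

With collapsing on, the four observed barcodes reduce to two labels, which is the halving described above.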

I hope this helps... let me know if you have further questions. If you can provide me with an example of your read2.fastq file and your expected barcode configuration, I may be able to look further into the issue.

-Paul

paulranum11 commented 4 years ago

You may find this conversation from a previous issue helpful when checking that the sequences in the Round1_barcodes_new5.txt match the sequences you used in your experiment.

https://github.com/paulranum11/SPLiT-Seq_demultiplexing/issues/3

geneticsmcgill commented 4 years ago

Hi Paul,

Thank you so much. Here is a short file of reads that failed and others that passed. I used the same barcode files listed in the pipeline, since they seem to match version 3 of SPLiT-seq. I set --collapse to true.

paulranum11 commented 4 years ago

Hi geneticsmcgill,

Could you also point me to the SPLiT-Seq V3 documentation that you based your library prep on?

Thanks, Paul

paulranum11 commented 4 years ago

Also, I was unable to see any attached files. If you comment directly from GitHub you should be able to add attachments.

geneticsmcgill commented 4 years ago

Hi Paul,

Sorry about that! Let me know if this is any better. Appreciate the help. Here are the files:

reads.zip SPLiTseqV3.0_OligonucleotideSequences (1).xlsx SPLiT-seq Protocol V3.0 (4).pdf

paulranum11 commented 4 years ago

Hi Geneticsmcgill,

I took a look at your reads. The reads in your passing file contain the predicted amplicon structure, with both the heterogeneous barcode-containing positions and the static connecting sequences (see attached image). However, the majority of reads in the failing file bear little resemblance to the predicted SPLiT-Seq amplicon sequence at all (see attached image). The static intervening sequence between barcodes 2 and 3 is detected in only one of these reads. So, from a bioinformatic perspective, they are correctly failed.
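A check along these lines can be sketched as follows. This is only an illustration of screening reads for the static connecting sequences with a small mismatch tolerance; the linker names and sequences are placeholders, not the real SPLiT-Seq linkers, and the helpers are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def contains_approx(read, pattern, max_mismatches=1):
    """True if any window of the read matches pattern within the tolerance."""
    n = len(pattern)
    return any(
        hamming(read[i:i + n], pattern) <= max_mismatches
        for i in range(len(read) - n + 1)
    )


def linkers_found(read, linkers, max_mismatches=1):
    """Names of the static linker sequences detected in the read."""
    return [
        name for name, seq in linkers.items()
        if contains_approx(read, seq, max_mismatches)
    ]


# Placeholder linkers; substitute the static sequences from your own protocol.
linkers = {"linker_1_2": "GGATCCGGTA", "linker_2_3": "CTTAAGCCAT"}

good_read = "TTTT" + "GGATCCGGTA" + "AAAAAAAA" + "CTTAAGCGAT" + "TT"
junk_read = "ACACACACACACACACACAC"

print(linkers_found(good_read, linkers))
print(linkers_found(junk_read, linkers))
```

Reads in which none or only one of the expected static sequences is found would be candidates for the non-SPLiT-Seq products listed below.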

There could be several explanations for the presence of these reads in your data including:

  1. Poor sequencing quality.
  2. Nextera XT adapter ligation of non-SPLiT-Seq fragments.
  3. Amplification of non-SPLiT-Seq-barcoded fragments.
  4. Amplification of tagmented products using Nextera XT primers instead of the SPLiT-Seq specific primers, which control the start position of barcode sequencing.

It is a very complex workflow so there are many places you could troubleshoot.

This probably isn't what you wanted to hear, but... I hope this helps.

[Attached image: Screen Shot 2020-11-16 at 4 44 39 PM]

paulranum11 commented 4 years ago

You also may have a concatemer issue. About 25% of the reads in the 1000.failed.sam.read2.fastq file contain all or part of the repeating sequence TGATACCACTGCTTCCCATTCACTCTGCGT. The reads below are composed almost exclusively of this repeating sequence.

GCGTTGATACCACTGCTTCCCATTCACTCTGCGTTGATACCACTGCTTCCCATTCACTCTGCGTTGATACCACTGCTTCCCATTCACTCTGCGT
GCGTTGATACCACTGCTTCCCATTCACTCTGCGTTGATACCACTGCTTCCCATTCACTCTGCGTTGCTACCACTGCTTCCCATTCACTCTGCGT
GCGTTGATACCACTGCTTCCCATTCACTCTGCGTTGATACCACTGCTTCCCATTCACTCTGCGTTGATACCACTGCTTCCCATTCACTCTGCGT
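One way to estimate how widespread the repeat is could look like the sketch below. It counts only reads with at least one complete copy of the repeat unit, so it will undercount reads that contain just part of it; the function names and the toy reads are illustrative assumptions, not part of the pipeline.

```python
REPEAT = "TGATACCACTGCTTCCCATTCACTCTGCGT"  # repeat unit quoted above


def copies_of_repeat(read, unit=REPEAT):
    """Number of non-overlapping complete copies of the unit in a read."""
    return read.count(unit)


def repeat_fraction(reads, unit=REPEAT):
    """Fraction of reads containing at least one complete copy of the unit."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(unit in read for read in reads) / len(reads)


# Toy input: one read built in the same shape as the reads above,
# and one unrelated read.
toy_reads = ["GCGT" + REPEAT * 3, "ACGT" * 10]

print(copies_of_repeat(toy_reads[0]))
print(repeat_fraction(toy_reads))
```

For a real FASTQ file you would feed in the sequence line of each record (every fourth line, starting from the second) instead of the toy list.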

geneticsmcgill commented 4 years ago

Thank you Paul. I was wondering what you meant by #4, amplification of tagmented products using Nextera XT primers instead of the SPLiT-Seq primers? Do you mean amplification during sequencing, and that the sequencing centre may not have used the TruSeq P7 + Nextera N501?

paulranum11 commented 4 years ago

Amplification of the tagmented products should be performed with "BC_0118" AATGATACGGCGACCACCGAGATCTACACTAGATCGCTCGTCGGCAGCGTCAGATGTGTATAAGAGACAG and one of "BC_0076 through BC_0083" (see the first tab of the Excel document you sent).

In my comment above, by "4." I meant that using the primers that come with the Nextera XT kit instead of the SPLiT-Seq provided primers can cause problems, because the Nextera XT primers randomly set the read2 start position at the position of the Nextera transposase sequence. In contrast, the SPLiT-Seq BC_0118 primer positions the start of the read2 sequence at the beginning of the UMI, such that it is in the correct position to read through all the SPLiT-Seq barcodes.

geneticsmcgill commented 4 years ago

Thanks Paul. The repeating sequence you found in the failed reads (TGATACCACTGCTTCCCATTCACTCTGCGT) seems to be a concatemer from sequencing primers? I ran FastQC on the failed vs. passed reads. Do you have any insight into why these concatemers make up such a high proportion of read2? I appreciate the help.

fastqc.zip

paulranum11 commented 4 years ago

One factor that can contribute to primer concatemer formation is the availability of template sequence. It may be that you have a lower than ideal amount of SPLiT-Seq barcoded template available for amplification at this stage of the library prep. This could result from non-optimal completion of any of the previous steps (bead purification, template switching, low numbers of nuclei...). One reagent that I would check if I were you is the Template Switching Oligo, as it contains RNA bases and degradation of these bases can impede function. Is it stored in aliquots at -80 °C? If so, take out a new aliquot. If not, you may want to reorder it.

You may be able to get a sense for the extent of your issue by looking at the data that was successfully generated. Do you have a high number of genes and UMIs for the cells that were successfully identified? Or is it very low? High numbers of reads but low UMI counts would support the idea that you don't have much successfully barcoded template and that you need to troubleshoot your library prep.
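If it helps, the comparison of reads to UMIs can be sketched like this. It assumes you already have one (cell barcode, UMI) pair per successfully demultiplexed read; that input format and the function name `umi_summary` are assumptions for the example, not the pipeline's output format.

```python
from collections import defaultdict


def umi_summary(read_tags):
    """Summarise reads and distinct UMIs per cell.

    read_tags: iterable of (cell_barcode, umi) pairs, one per read.
    Returns {cell: (n_reads, n_unique_umis, reads_per_umi)}.
    """
    reads = defaultdict(int)
    umis = defaultdict(set)
    for cell, umi in read_tags:
        reads[cell] += 1
        umis[cell].add(umi)
    return {
        cell: (reads[cell], len(umis[cell]), reads[cell] / len(umis[cell]))
        for cell in reads
    }


# Toy data: cell_A has many reads but few distinct UMIs (heavy duplication),
# cell_B has one read per UMI.
toy_tags = (
    [("cell_A", "AAAA")] * 6
    + [("cell_A", "CCCC")] * 4
    + [("cell_B", "AAAA"), ("cell_B", "GGGG"), ("cell_B", "TTTT")]
)

for cell, stats in umi_summary(toy_tags).items():
    print(cell, stats)
```

A high reads-per-UMI ratio across most cells would point in the direction described above: plenty of sequencing depth, but little distinct barcoded template.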

paulranum11 commented 4 years ago

If you have found this advice or the software package useful, please consider starring the repository.

Thanks, Paul