pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
155 stars 23 forks source link

SmartSeq3 Demultiplexed Fastq files #235

Closed AlisonKennedy closed 9 months ago

AlisonKennedy commented 10 months ago

Hi, I have SmartSeq3 data that has been de-multiplexed to produce 192 pairs of fastq files with no fastq files provided for the indices. Is it possible to run 'kb count' on this data. I do not have access to the bcl files in order to re-run bcl2fastq.

I have tried reconstructing the index fastq read for a single cell using the following command:

zcat SLX-21610.i701_i502.HTFMWDRX3_R2_001.fastq.gz | cutadapt -u 8 - > i701_i502_I2.fastq zcat SLX-21610.i701_i502.HTFMWDRX3_R2_001.fastq.gz | cutadapt -u 8 - > i701_i502_I2.fastq gzip i701_i502_I*

I then proceed to run the following command:

kb count -i /path/to/mm_v93_ERCC_kallisto.idx -g /path/to//t2g.txt -o /path/to/kallisto_ouput/v93_output -x SMARTSEQ3 --h5ad --verbose ../../fastq/test/i701_i502_I1.fastq.gz ../../fastq/test/i701_i502_I2.fastq.gz ../../fastq/test/SLX-21610.i701_i502.HTFMWDRX3_R1_001.fastq.gz ../../fastq/test/SLX-21610.i701_i502.HTFMWDRX3_R2_001.fastq.gz 2> error_log.txt

I have attached the error_log.txt

Any help would be appreciated or also if you can think of a better way of analysing the data given the format of my fastq files.

Alison error_log.txt

Yenaled commented 10 months ago

Oof, that's tricky because typically a "list of barcodes" is generated from the I1 and I2 files to figure out what read belongs to what cell. But since it's already demultiplexed, there's no need for that "list of barcodes".

Are you sure this is smart-seq3? Because it looks like smart-seq2 data (and you can just use the smart-seq2 option). And are you sure the index sequences are the first 8-bp's of the reads?

Please double check. If it really is smart-seq3, I'll come up with a solution for you.

AlisonKennedy commented 9 months ago

Hi,

Yes, these are definitely SmartSeq3 libraries but unfortunately have been sequenced using a standard SmartSeq2 pipeline (PE50).

As they've been demultiplexed and there are no fastq files for the index reads, the cellular dual index information is contained in the FASTQ read header and when grepping for the SS3 TAG sequence (ATTGCGCAATG) using the below command:

zcat SLX-21610.i701_i502.HTFMWDRX3_R1_001.fastq.gz | grep -A 3 ATTGCGCAATG | head -n 80 > SmartSeq3_TAG_sequence.txt

You can see that the SS3 TAG is at the beginning of the read as expected and would be followed by the 8bp UMI. I have attached SmartSeq3_TAG_sequence.txt

Any help you can offer for analysing this data would be much appreciated!!

Alison SmartSeq3_TAG_sequence.txt

Yenaled commented 9 months ago

OK, I see.

First, I'd recommend "remultiplexing" your data. You can do so by this tool that I developed: https://splitcode.readthedocs.io/en/latest/remultiplex_guide.html

^Just follow the example there but set --bclen=8 (since smart-seq3 barcodes are 8-bp long each).

This will get you three files: Your R1 files all concatenated together (out_R1.fastq.gz), your R2 files all concatenated together (out_R2.fastq.gz), and a 8-bp barcode file (barcodes.fastq.gz). The smart-seq3 format requires two barcode files so you can just make a copy of that 8-bp barcode file (i.e. cp barcodes.fastq.gz barcodes2.fastq.gz), and now you have your four files that you can use!

Now, run your usual kb count -x smartseq3 on those four files while specifying: -w None. (since we don't have an "on list" of barcodes that we need to correct to).

Then you should be good to go. Let me know how this works out for you!

AlisonKennedy commented 9 months ago

Hi, This has appeared to work and I am now analysis the downstream files. Thank you for your help!