minoda-lab / universc

UniverSC: a flexible cross-platform single-cell data processing pipeline
https://genomec.gsc.riken.jp/gerg/UniverSC/UniverSC_app_release/
GNU General Public License v3.0

Bug in SmartSeq3 with compressed fastq.gz inputs #4

Closed TomKellyGenetics closed 2 years ago

TomKellyGenetics commented 2 years ago

Problem

SmartSeq3 test jobs run successfully with UniverSC v1.2.0 when the input fastq files are decompressed plain text. With fastq.gz inputs, however, these jobs fail at the Cell Ranger "CHUNK READS" step. The I1 and I2 files in input4cellranger have mismatched numbers of reads.

It appears that the I1 and I2 files are renamed but not decompressed when they are copied to input4cellranger. As a result, the filtering subroutine omits some rows from the output I1 and I2 fastq files, and binary gzip data is pasted into the top reads of R1 in place of the plain-text barcodes expected in a fastq file.

R1 and R2 appear to be decompressed correctly.
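
To confirm the symptoms described above, the files in input4cellranger can be inspected directly. The following is a minimal diagnostic sketch in Python, not part of UniverSC: it detects gzip data by its magic bytes rather than by file suffix (so a compressed file that was merely renamed to .fastq is still caught) and compares read counts across files. The function names are hypothetical and the file paths in the usage note are placeholders.

```python
import gzip

# The first two bytes of any gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file starts with the gzip magic bytes, whatever its suffix."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def count_reads(path):
    """Count FASTQ records (assumed to be 4 lines each), reading through gzip if needed."""
    opener = gzip.open if is_gzipped(path) else open
    with opener(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def report_read_counts(paths):
    """Return (counts per file, True if all files hold the same number of reads)."""
    counts = {str(p): count_reads(p) for p in paths}
    consistent = len(set(counts.values())) <= 1
    return counts, consistent
```

For example, `report_read_counts(["input4cellranger/sample_I1.fastq", "input4cellranger/sample_I2.fastq"])` (placeholder paths) returns the per-file counts and a flag showing whether they agree. A file for which `is_gzipped` returns True despite a plain .fastq suffix was renamed without being decompressed.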

Affected technologies

SmartSeq3 is affected. The bug may also occur with other technologies that require I1, I2, R3, or R4 files, such as 10x v1, inDrops v3, or SmartSeq2. It appears to be specific to these files and should not affect technologies that use only R1 and R2 inputs for samples that have already been demultiplexed.

Solution

It is possible to check for the fastq.gz file format before calling conversion steps or Perl scripts that require fastq files in plain text. Files should be decompressed in the input4cellranger directory to avoid affecting the raw input files. Additionally, these checks could be made specific to the technologies that require them. Files should never have their suffixes renamed without being decompressed.
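
The proposed check can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the UniverSC implementation (the pipeline itself may handle this differently): the helper `stage_fastq` is a hypothetical name, and the working directory stands in for input4cellranger. It tests the file content for the gzip magic bytes, writes a decompressed copy into the working directory, and never modifies the raw input.

```python
import gzip
import shutil
from pathlib import Path

# The first two bytes of any gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Detect gzip data by content, not by file suffix."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def stage_fastq(source, workdir):
    """Place a plain-text copy of `source` in `workdir`; the raw input is left untouched.

    Hypothetical helper: `workdir` plays the role of input4cellranger.
    """
    source = Path(source)
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # Drop a trailing ".gz" from the name only; the content is handled separately.
    name = source.name[:-3] if source.name.endswith(".gz") else source.name
    target = workdir / name

    if is_gzipped(source):
        # Stream-decompress so large files are not loaded into memory.
        with gzip.open(source, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        # Already plain text: copy as-is.
        shutil.copyfile(source, target)
    return target
```

Because the decision is based on file content, a compressed file that carries a .fastq suffix is still decompressed, which is the case that triggers this bug. The sketch assumes the working directory differs from the directory holding the raw inputs; `shutil.copyfile` raises an error if source and target are the same file.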

Workaround

In the meantime, it is recommended to decompress fastq.gz files with gunzip or unpigz before running UniverSC on these technologies.