nf-core / scrnaseq

A single-cell RNAseq pipeline for 10X genomics data
https://nf-co.re/scrnaseq
MIT License
214 stars 172 forks source link

Trimming R1 fast files in all aligners #333

Open ChristopherMancuso opened 6 months ago

ChristopherMancuso commented 6 months ago

Description of feature

I'm following up on a slack post that I put out 2 months ago at https://nfcore.slack.com/archives/CHN5BV5DW/p1712178056321859

I had a question about the fastq format needed for the different aligners. For everything I’m using nextflow v23.10.1 and scrnaseq v2.5.1. From the core at my work place both the R1 and R2 fastq files each have a length of 151 for the reads, instead of R1 being “trimmed” to just be only the barcode and umi (so like 28-ish bps depending on the protocol). When using --aligner cellranger this seems to be handled fine. However, when only switching --aligner to either alevin or star it doesn’t seem to handle that R1 read format well. For alevin the pipeline completes but the number of barcodes in barcodes.tsv is ~200k, which is roughly the number of reads, whereas the expected number of cells is ~5k. For star the pipeline fails at NFCORE_SCRNASEQ:SCRNASEQ:STARSOLO:STAR_ALIGN with the error EXITING because of FATAL ERROR in input read file: the total length of barcode sequence is 151 not equal to expected 28. My questions are, is this known behavior of the pipeline? I would like to use alevin or star in the future, do I need preprocess R1 and if so, any help in doing that? Thanks!

the run command I use looks like this, just only changing the aligner argument

nextflow run nf-core/scrnaseq --input samplesheet.csv --fasta /biostats_share/mancchri/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa --gtf /biostats_share/mancchri/genomes/Homo_sapiens.GRCh38.111.gtf --aligner cellranger --protocol 10XV3 --outdir . -work-dir ./work -c config.conf -profile singularity -r 2.5.1
grst commented 6 months ago

I would have totally expected both alevin and star to only use the number of nucleotides from R1 that is specified in the chemistry definition. In this case you run with --protocol 10XV3, which should lead to the following arguments passed to the tools:

STARsolo https://github.com/nf-core/scrnaseq/blob/c5a64459fec1506304f7064639f3f8e12d971269/assets/protocols.json#L49-L52

simpleaf https://github.com/nf-core/scrnaseq/blob/c5a64459fec1506304f7064639f3f8e12d971269/assets/protocols.json#L11-L13

@rob-p, is there additional trimming needed before running simpleaf?

ChristopherMancuso commented 5 months ago

@grst Thanks for taking the time to comment! I'm a nextflow novice but I tried to follow how adding --protocol 10XV3 to the main run command changed the aligner options in the assets folder too and it looked like it should work, but didn't for the reasons above. I did enough googling and I think if I downloaded alevin or star as standalone software I could get it to work, but to be honest I like nextflow so much I'll just take the speed hit and use nextflow with cell ranger if this problem persists.

One extra thing is that I did try this pipeline on the test data for every aligner and it works just fine. I manually checked the fastq files used in the test run and R1 there was already "pre-trimmed" to just the umis and barcodes.

grst commented 5 months ago

This should totally be supported. For me the question is if we can make alevin/starsolo directly work with the untrimmed reads (and what would be the appropriate command line options for that) or if we'd require an additional step that hard-trims the reads to the required length.

ChristopherMancuso commented 5 months ago

I will try to look into this and post what I find out here.