s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0
49 stars 9 forks source link

Better 10X support (was Possibility to do QC only on one pair?) #172

Open jenzopr opened 4 years ago

jenzopr commented 4 years ago

Hi HTStream-Team,

is it possible to use your toolkit in paired-end mode, but actually apply QC functions (polyATrim, QWindowTrim) only on one pair (e.g. R2)? I'm asking because I tried cleaning paired-end FASTQ files from e.g. 10X Genomics (R1 = cell barcode, R2 = transcriptomic read) and noticed that many pairs are discarded because they fail in R1. In this setup, I don't care about QC of R1, because it contains only cell barcodes that have to be matched perfectly at a later time anyway.

A workaround could be to process R2 in single-end mode, but then I will lose the proper pairing because of discarded reads. Any ideas?

Thanks a lot! Jens

samhunter commented 4 years ago

Hi Jens, That is an interesting question/idea. We don't have that option currently, and I can't think of any great options for processing data in the way you suggest with HTS as is currently is.

Maybe it would be possible to write a quick hacky python script to store enough R1 prefix in the R2 description to keep barcodes, then stream R2 with the modified description in as SE data, finally streaming it back out to another python script that restored R1 from the header to make processing barcodes possible? Certainly not a great solution though.

@msettles has been looking into something similar for 10x data, and we have discussed the idea of having a "hts_10X" tool, maybe with additional features. Unfortunately we don't have a development timeline at this point.

jenzopr commented 4 years ago

Hi Sam,

any news on this? Maybe it would be enough to add an option to not discard reads in SE? This would keep the order (and pairing) with a second (non-filtered) file intact..

Best, Jens

samhunter commented 4 years ago

Hi Jens,

We are still thinking. But we think it would be better to process reads as pairs, but first extracting the 10x barcode and UMI.

How are you currently processing read1?

Sam

jenzopr commented 4 years ago

Currently I skip QC filtering entirely for 10X reads and rely on STAR-solo to do mapping, BC and UMI extraction and counting. It is thought to be an in-drop replacement for cell ranger. I also don't know if and how much read filtering improves the mapping in case of 10X...

msettles commented 4 years ago

Jenz, what are your read sizes? R1/R2? Are they the shortened form? 100x30(ish) or two full length reads 100x100 (or larger)?

jenzopr commented 4 years ago

Hi Matt! Typically R1 contains the cell barcode and the UMI, so its more ~30ish. The R2 contains cDNA and is full length (75 / 100bp). Does this help?

msettles commented 4 years ago

Jens,

We do alot of single cell and I get a mix, sometimes we do PE150, etc, sometimes the cheaper, shorter 100x30 like you say below. I recently make a HTS workflow and hack (for a sc qiagen UPX experiment) that will meet your needs as well (short libraries, won’t work for PE150.)

The script upx_get_bc_umi.py -h

Usage: upx_get_bc_umi.py [options] -o output_base inputfile.SAM

Options: -h, --help show this help message and exit --length=LENGTH length of read 2 --insert insert extracted tags into library as read2 --extract extract to tags and generate a SE read

Will basically extract the R2 and add it as a tag to R1 (so after upx_get_bc_umi.py —extract , you only have R1)

Can do more processing with HTS (they all look like SE reads now) and then upx_get_bc_umi.py —insert to reconstitute both pairs again.

My HTStream workflow looked like this

hts_Stats -L scRNA.log -1 file_R1.fastq.gz file_R2.fastq.gz -N “compute stats on original dataset” | hts_SeqScreener -A scRNA.log -N “screen for PhiX because I always do” | hts_Overlapper -o 5 -A scRNA.log -N “overlap reads” | upx_get_bc_umi.py --extract —length 27 | hts_PolyATTrim --skip_polyT --no-left -A scRNA.log -N “trim 3’ plolyA” hts_NTrimmer -A scRNA.log -N “Remove any N characters” | hts_QWindowTrim -A scRNA.log -N “Quality window trim” | hts_SeqScreener -s screen.fa -A scRNA.log -N “Screen out any potential adapter dimers” | hts_LengthFilter -m 50 -A scRNA.log -N “Removed any read shorter than 50bp” | upx_get_bc_umi.py --insert | hts_Stats -A scRNA.log -N “final stats” -f file_preproc

My screen.fa file looks like, UPX adds a Template switching oligo that is followed by the truseq adapter

TSO_Truseq GTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCA

NOTE: Use the most current HTStream from the repo, lots of changes there, pull and compile from master, the tag change and lots of others will end up in release version 1.3

Let me know! Matt

On May 6, 2020 at 11:47:19 PM, Jens Preußner (notifications@github.com) wrote:

Hi Matt! Typically R1 contains the cell barcode and the UMI, so its more ~30ish. The R2 contains cDNA and is full length (75 / 100bp). Does this help?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ibest/HTStream/issues/172#issuecomment-625063829, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE6RZVKR42FFPWYTKRO3KLRQJKPNANCNFSM4JR2XEJA .