liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

run trust4 with 5' sorted T-cell data #268

Open guillemsanchezsanchez1996 opened 4 months ago

guillemsanchezsanchez1996 commented 4 months ago

Hello,

Thanks again for creating this nice package! I want to let you know that it has been instrumental in building a paper that will soon be published in Nature Communications (and your software has been cited accordingly :) ).

I have the following question. I want to extract TCR sequences from a 10x 5' GEX library. The SRA data shows the following structure of the FASTQ files:

[image: structure of the FASTQ files on SRA]

R1-R2 are technical reads. I have identified the 10x barcode in the first 16 nt of R3, but my question is: since R4 is only 100 bp (compared with prior runs, where R4 or R2, depending on the experiment, was 120 bp long), will I lose a lot of potential sequences because of this FASTQ structure?

Which command should I use to handle this data properly? I tried running it with this one:

./run-trust4 -f human_IMGT+C.fa --ref human_IMGT+C.fa -u /mnt/f/data/_4 -t 10 --barcode /mnt/f/data/_3 --barcodeRange 0 15 +

but it gave me a lot of B cell sequences, even though the data was apparently sorted for ab T cells and the GEX data looks that way too, full of ab T cells.

Thanks a lot for your advice,

Guillem

mourisl commented 4 months ago

Thank you for using TRUST4, and glad it helps with your research! As for the format, I guess there is some read sequence in read1 as well, so this is a paired-end data set. Since read1 contains both the barcode and read sequence information, you may need to specify the read1Range as well. For example, there might be UMI sequences in read1. One way to check: if you have cellranger's BAM file, you can infer some of the format information from it, such as whether the data is paired-end and the length of each read end.

As for getting too many B cells, they could come from ambient cells or some other noise. Do you also see many T cells? I usually see a lot of noise too, but after overlaying with Seurat/Scanpy QC'ed barcodes, the results are much cleaner.
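The barcode overlay mentioned above could look roughly like the sketch below. It is only an illustration: the function names, the sample rows, and the barcodes are made up, and the only thing assumed about the report is that it is tab-separated with the cell barcode in the first column. The "-1" suffix handling is there because barcodes exported from Seurat/Scanpy often carry such a suffix while barcodes taken straight from the FASTQ do not.

```python
import csv


def normalize(barcode):
    """Drop a trailing '-1' style suffix so both barcode sources compare equal."""
    return barcode.split("-")[0]


def filter_report(report_lines, qc_barcodes):
    """Keep only the rows whose barcode (first column) passed expression QC.

    report_lines: iterable of tab-separated lines, first line is the header.
    qc_barcodes:  barcodes kept after Seurat/Scanpy QC (hypothetical input).
    """
    keep = {normalize(b) for b in qc_barcodes}
    reader = csv.reader(report_lines, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row and normalize(row[0]) in keep]
    return header, rows


# Made-up example data, for illustration only.
sample_report = [
    "#barcode\tcell_type\tchain1\tchain2",
    "AAACCTGAGAAACCAT\tabT\tplaceholder\tplaceholder",
    "AAACCTGAGAAACCGC\tB\tplaceholder\tplaceholder",
    "AAACCTGAGAAACCTA\tabT\tplaceholder\tplaceholder",
]
qc_passed = ["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCTA-1"]

header, kept = filter_report(sample_report, qc_passed)
for row in kept:
    print("\t".join(row))
```

With a real run, the lines would come from the report file and the barcode list from the QC'ed expression object; rows for barcodes that never passed QC (for example ambient droplets) are simply dropped.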

Hope this helps. Looking forward to your publication! Congratulations!

guillemsanchezsanchez1996 commented 4 months ago

Sadly, I do not have access to the BAM data (the FASTQ files were downloaded from GEO). I have also seen some background noise, especially in non-FACS-sorted data. But in FACS-sorted data from others, and in my own experiments, the TCR reads extracted by TRUST4 are quite pure (few B sequences). I was wondering whether the read issue you mentioned could somehow affect the alignment and assembly and produce false positives?

Thanks a lot for your quick feedback,

Guillem

mourisl commented 4 months ago

That could be the reason. If most of the CDR3 information is on read1, then the assembly results may capture some artifacts. Could you please share some of the reads? I can take a look. Thank you.

guillemsanchezsanchez1996 commented 4 months ago

Here are some of the R3 reads:

@SRR25644424.295796 A00521:279:HTWJ3DSX2:1:1102:13774:16235 length=101
TCACAAGTCATGGTCACGTGCTCGGTTTTCTTATATGGGGAGTCGCGCGTCTCCCCCGCAGTAGCGGTAAAGCGGAAGTTATGCTGCAGCCGGAGCCCGGG
+SRR25644424.295796 A00521:279:HTWJ3DSX2:1:1102:13774:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295797 A00521:279:HTWJ3DSX2:1:1102:13829:16235 length=101
GTCCTCAAGTTAGGTATTCTAAGGTGTTTTCTTATATGGGGTCCTGAAATTCTGCCAGATGAATCTAGTAGTGATGAAGATGAAAAGAAAAACAAGGAAGA
+SRR25644424.295797 A00521:279:HTWJ3DSX2:1:1102:13829:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF,FFFFFFFFFFFFFFFFFFFF
@SRR25644424.295798 A00521:279:HTWJ3DSX2:1:1102:14082:16235 length=101
CTCGAGGGATATGAAGCACCGCCAGGTCCTTTGAGTTTTAAGCTGTGGCTCGTAGTGTTCTGGCGAGCAGTTTTGTTGATTTAACTGTTGAGGTTTAGGGC
+SRR25644424.295798 A00521:279:HTWJ3DSX2:1:1102:14082:16235 length=101
,FFFFF:FFFFFFF:F:FF,,::FFFFF:FFF,FFF::FFF::FF:FF:FFFFFFF:F:FFFFFFFF,FFF:FFF,:FFFFFF:FFF:FFFFF:FFFFFF:
@SRR25644424.295799 A00521:279:HTWJ3DSX2:1:1102:14479:16235 length=101
TGATGTGTTATGCCCGCCTCTTCACGGGCAGGTCAATTTCACTGGTTAAAAGTAAGAGACAGCTGAACCCTCGTGGAGCCATTCATACAGGTCCCTATTTA
+SRR25644424.295799 A00521:279:HTWJ3DSX2:1:1102:14479:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FF:,:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF,FFF
@SRR25644424.295800 A00521:279:HTWJ3DSX2:1:1102:15293:16235 length=101
TACCTTATCACTGGGCAGGTGAGCAGTTTCTTATATGGGAGAGGGCGCCGGCTGCGGAGCCGCCCTCAGAGTCGCGAGGCCGGACGCAGCGCGGCGCCGCC
+SRR25644424.295800 A00521:279:HTWJ3DSX2:1:1102:15293:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:FFF::FFFFFFFFFFFFFFFF
@SRR25644424.295801 A00521:279:HTWJ3DSX2:1:1102:15402:16235 length=101
CGGGTCACATGGGAACCTCTTCGGCGTTTCTTATATGGGGCTTTTCCAAGCGGCTGCCGAAGATGGCGGAGGTGCAGGTCCTGGTGCTTGATGGTCGAGGC
+SRR25644424.295801 A00521:279:HTWJ3DSX2:1:1102:15402:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFF,F,::,F:,F,FF:FFFF,FFF,FFFFFFFFF:FF:FFF,FFFFFFFF:FFFFF:FFFFFFFF,FFFFF:,FFF
@SRR25644424.295802 A00521:279:HTWJ3DSX2:1:1102:15781:16235 length=101
TACTCATAGAAGGTTTGATACACCATTTTCTTATATGGGGGGCGCGCCCAGCCTGCCAGCCGCGCTGCTGCTGCTCCTCCTGCTGTGGGACCGCTGACCGC
+SRR25644424.295802 A00521:279:HTWJ3DSX2:1:1102:15781:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295803 A00521:279:HTWJ3DSX2:1:1102:16052:16235 length=101
AGGTCCGTCGATCCCTCTGAGTTGAGTTTCTTATATGGGGACCGCCGAGACCGCGTCCGCCCCGCGAGCACAGAGCCTCGCCTTTGCCGATCCGCCGCCCG
+SRR25644424.295803 A00521:279:HTWJ3DSX2:1:1102:16052:16235 length=101
FFFF::FF:FF:FF:FFFF:F,F,F:FFFF,FFFFF::FFFFFFFF:F:FFFFF,FFFFFFFF:FFFFFFFF,FFFF:F,:F:F:,FFFFF,F::FFFFF,
@SRR25644424.295804 A00521:279:HTWJ3DSX2:1:1102:16251:16235 length=101
CTGAAACGTCAGTGGATTGTTGTATCTTTCTTATATGGGGCTCTTTCCCTGCCGCCGCCGAGTCGCGCGGAGGCGGAGGCTTGGGTGCGTTCAAGATTCAG
+SRR25644424.295804 A00521:279:HTWJ3DSX2:1:1102:16251:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295805 A00521:279:HTWJ3DSX2:1:1102:16468:16235 length=101
AGAATAGTCTGAAAGACTATTTTGGTTTTCTTATATGGGATGACCCACCAATCACATGCCTATCATATAGTAAAACCCAGCCCATGACCCCTAACAGGGGC
+SRR25644424.295805 A00521:279:HTWJ3DSX2:1:1102:16468:16235 length=101
:FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295806 A00521:279:HTWJ3DSX2:1:1102:16758:16235 length=101
TGAGAGGAGTGCGTGACTCTTCACCGTTTCTTATATGGGGCTAAACCTAGCCCCAAACCCACTCCACCTTACTACCAGACAACCTTAGCCAAACCATTTAC
+SRR25644424.295806 A00521:279:HTWJ3DSX2:1:1102:16758:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295807 A00521:279:HTWJ3DSX2:1:1102:18186:16235 length=101
AGTAGTCAGACGCTTTAATTCCACTTTTTCTTATATGGAAAGATTTGTAAGAAATTACTGGCTACTCAGCTTTGTGGGAGCAGCTGGTGACCCCAGGCAGA
+SRR25644424.295807 A00521:279:HTWJ3DSX2:1:1102:18186:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFF,FFFF:,FFF,,,FFFFFFFFF:F:FF:FFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295808 A00521:279:HTWJ3DSX2:1:1102:18801:16235 length=101
AACACGTTCCTATGTTTTTCGGCCGTTTTCGTATATGGGGATTCCTGAAGCTGACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCG
+SRR25644424.295808 A00521:279:HTWJ3DSX2:1:1102:18801:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFF,FF,FF:,F,,,,FF,F:F:,FF,F,,F,:,FF,:,,FFFF,,:F:,F,F::F::FF:FFFF,FFFFFFF::F,F
@SRR25644424.295809 A00521:279:HTWJ3DSX2:1:1102:19018:16235 length=101
CTACGTCCAGCAGTTTTGACTTTATTTTTTCTTATATGGGGGCAGCCGTGGCTGAGGAGCCTGTGGCGGCAGCGGCGATGGAACCAGCGGAGCAGCCGAGC
+SRR25644424.295809 A00521:279:HTWJ3DSX2:1:1102:19018:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:,,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295810 A00521:279:HTWJ3DSX2:1:1102:19723:16235 length=101
CAGTAACCAAACAACAACTTCAGGGATTTCTTATATGGGGCTTCTTTCTCGCCTAACGCTGCCAACATGGTGTTCAGGCGCTTCGTGGAGGTTGGCCGGGT
+SRR25644424.295810 A00521:279:HTWJ3DSX2:1:1102:19723:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295811 A00521:279:HTWJ3DSX2:1:1102:20121:16235 length=101
GGTGAAGTCTAACGGTAGTCAACGCTTTTCTTATATGGGGAGCTACGGCGGCGGCAGCGGCGGCGCGGGTGCGATTCCGAGCCGTTGAGACGCCTCTGCGG
+SRR25644424.295811 A00521:279:HTWJ3DSX2:1:1102:20121:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:FFF,FF,:FFF,FFFFFFFF::,:F,F,,,FF,:,,:FF
@SRR25644424.295812 A00521:279:HTWJ3DSX2:1:1102:21025:16235 length=101
GATCAGTTCAGAAATGCGCGGGCTAGTTTCTTATATGGGGAGAGCCCGAGCAGCGGCCAGGGTAACGCTGTCTTGTGGACCCGCACTTCCCACCAGAGACC
+SRR25644424.295812 A00521:279:HTWJ3DSX2:1:1102:21025:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295813 A00521:279:HTWJ3DSX2:1:1102:23122:16235 length=101
AGCATACAGTGAACATCATTGCTGCTTTTCTTATATGGGGGCCGGGGGACGGCGACAGCGGGTCGGCGGGCCGCAGGAGGGGGTCATGGGTAAAGACTACT
+SRR25644424.295813 A00521:279:HTWJ3DSX2:1:1102:23122:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF
@SRR25644424.295814 A00521:279:HTWJ3DSX2:1:1102:23140:16235 length=101
TCACAAGCAGATAATGCGTGTTAGGTTTTCTTATATGGGGAGTCTCCGGGATCCCCAGGCCTGGAGGGGGGTCTGTGCGCGGCCGGCTGGCTCTGCCCCGC
+SRR25644424.295814 A00521:279:HTWJ3DSX2:1:1102:23140:16235 length=101
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295815 A00521:279:HTWJ3DSX2:1:1102:23249:16235 length=101
CTAATGGCACCAACCGGTGTCAAATATTTCTTATATGGGGGCGCAATAGATATAGTACCGCAAGGGAAAGATGAAAAATTATAACCAAGCATAATATAGCA
+SRR25644424.295815 A00521:279:HTWJ3DSX2:1:1102:23249:16235 length=101
FFFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295816 A00521:279:HTWJ3DSX2:1:1102:24117:16235 length=101
GTTACAGTCGGAAACGGCTTGTTACATTTCTTATATGGGATGACCCCAATACGCAAAACTAACCCCCTAATAAAATTAATTAACCACTCATTCATCGACCT
+SRR25644424.295816 A00521:279:HTWJ3DSX2:1:1102:24117:16235 length=101
FFFFFFFFFFFFFF:FFFFFFF:FFFFFFFF:FFFF:FFF:FF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@SRR25644424.295817 A00521:279:HTWJ3DSX2:1:1102:24514:16235 length=101
ACGATACTCTGTCTCGAGTCTTCATTTTTCTTATATGGGGGAGCCAGGGCTTGGCGCGGCGGCCGTGGTTGCGGCGCGGGAAGTTTGGATCCTGGTTCCGT
+SRR25644424.295817 A00521:279:HTWJ3DSX2:1:1102:24514:16235 length=101

mourisl commented 4 months ago

Thank you for sharing the data. I've BLASTed some of the reads and they definitely hit some transcripts. Based on the usual 10x convention, I think the bases after the first 26 bp (16 bp barcode + 10 bp UMI) should be the transcriptomic sequence, though some reads hit a transcript across all 101 bp. I think you can try running TRUST4 with an option like "--readFormat bc:0:15,r1:26:-1" (a simplified read format specification available starting from TRUST4 v1.0.10; the old arguments also work) to ignore the first 26 bp of read1 (the R3 file).
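To make the layout concrete, here is a small sketch of what a specification like "bc:0:15,r1:26:-1" selects from one of the posted R3 reads. It is not TRUST4's own parser: `slice_inclusive` and `split_read` are hypothetical helpers, and they assume the ranges are 0-based and inclusive with -1 meaning "to the end of the read", which is how the example string above reads (positions 0 to 15 for a 16 bp barcode).

```python
def slice_inclusive(seq, start, end):
    """Return seq[start..end] using 0-based inclusive coordinates.

    end == -1 is treated as 'up to the last base' (assumed convention).
    """
    return seq[start:] if end == -1 else seq[start:end + 1]


def split_read(seq, spec="bc:0:15,r1:26:-1"):
    """Cut a read into named segments according to a 'name:start:end' list."""
    parts = {}
    for field in spec.split(","):
        name, start, end = field.split(":")
        parts[name] = slice_inclusive(seq, int(start), int(end))
    return parts


# First R3 read posted in this thread (SRR25644424.295796).
read = (
    "TCACAAGTCATGGTCACGTGCTCGGTTTTCTTATATGGGGAGTCGCGCGT"
    "CTCCCCCGCAGTAGCGGTAAAGCGGAAGTTATGCTGCAGCCGGAGCCCGGG"
)

parts = split_read(read)
print("barcode:", parts["bc"])
print("skipped (presumed UMI):", slice_inclusive(read, 16, 25))
print("sequence kept as read1:", parts["r1"])
```

Printing the three pieces for a handful of reads is a quick way to check whether the 16 + 10 layout looks plausible for this data set before running the full assembly.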