liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

--barcodeRange #20

Closed lishashali closed 3 years ago

lishashali commented 4 years ago

Dear Developers, I have a question about the option '--barcodeRange INT INT CHAR'. R1 of my single-cell 5′ data contains the barcode (bases 1-16) and the UMI sequences. When I analyzed this data, I used the following command: run-trust4 -1 XXX_5_S7_L003_R1_001.fastq.gz -2 XXX_5_S7_L003_R2_001.fastq.gz -f hg38_bcrtcr.fa --ref human_IMGT+C.fa --barcodeRange 0 -16 + Is this right?

mourisl commented 4 years ago

Does R1 contain read information or just barcode+UMI sequences? If it contains only barcode+UMI information, the command line should be: run-trust4 -u XXX_5_S7_L003_R2_001.fastq.gz -f hg38_bcrtcr.fa --ref human_IMGT+C.fa --barcode XXX_5_S7_L003_R1_001.fastq.gz --barcodeRange 0 15 +

Otherwise, you need to preprocess the R1 file before running TRUST4.
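To illustrate what a range such as `0 15 +` selects, here is a minimal Python sketch. It assumes the range is 0-based and inclusive (which is what `0 15` for a 16-nt barcode suggests), that an end of `-1` means "through the last base", and that `-` means the reverse complement; the last two points are assumptions for illustration, not statements about TRUST4 internals. The function name `extract_barcode` and the 10-nt UMI are made up.

```python
def extract_barcode(seq, start, end, strand="+"):
    """Return seq[start..end], 0-based and inclusive (assumed convention).

    end == -1 is treated as "through the last base" (assumption).
    strand == "-" returns the reverse complement (assumption).
    """
    if end == -1:
        end = len(seq) - 1
    sub = seq[start:end + 1]
    if strand == "-":
        complement = str.maketrans("ACGTN", "TGCAN")
        sub = sub.translate(complement)[::-1]
    return sub


# 16-nt cell barcode followed by a made-up 10-nt UMI
r1 = "ATTGGTGGTGTCGCTG" + "AAACCCGGGT"
barcode = extract_barcode(r1, 0, 15, "+")
print(barcode, len(barcode))
```

Under these assumptions, `0 15` covers exactly 16 bases, so the UMI that follows the barcode in R1 is left out.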

lishashali commented 4 years ago

OK, I get it. Thank you for your reply.

lishashali commented 4 years ago

Hi,

/data/lishasha/TRUST4/TRUST4/run-trust4 -u /data/lishasha/TRUST4/liweizhong/fq2/lng/fq2/liweizhong_5_S7_L003_R2_001.fastq.gz -f /data/lishasha/TRUST4/TRUST4/hg38_bcrtcr.fa --ref /data/lishasha/TRUST4/TRUST4/human_IMGT+C.fa --barcodeRange 0 15 + --barcode /data/lishasha/TRUST4/liweizhong/fq2/lng/fq2/liweizhong_5_S7_L003_R1_001.fastq.gz
[Tue Jun 30 02:10:44 2020] TRUST4 begins.
[Tue Jun 30 02:10:44 2020] SYSTEM CALL: /data/lishasha/TRUST4/TRUST4/fastq-extractor -u /data/lishasha/TRUST4/liweizhong/fq2/lng/fq2/liweizhong_5_S7_L003_R2_001.fastq.gz -t 1 -f /data/lishasha/TRUST4/TRUST4/hg38_bcrtcr.fa -o TRUST_liweizhong_5_S7_L003_R2_001_toassemble --barcodeStart 0 --barcodeEnd 15 --barcode /data/lishasha/TRUST4/liweizhong/fq2/lng/fq2/liweizhong_5_S7_L003_R1_001.fastq.gz
[Tue Jun 30 02:10:44 2020] Start to extract candidate reads from read files.
Read file is empty.
system /data/lishasha/TRUST4/TRUST4/fastq-extractor -u /data/lishasha/TRUST4/liweizhong/fq2/lng/fq2/liweizhong_5_S7_L003_R2_001.fastq.gz -t 1 -f /data/lishasha/TRUST4/TRUST4/hg38_bcrtcr.fa -o TRUST_liweizhong_5_S7_L003_R2_001_toassemble --barcodeStart 0 --barcodeEnd 15 --barcode /data/lishasha/TRUST4/liweizhong/fq2/lng/fq2/liweizhong_5_S7_L003_R1_001.fastq.gz failed: 256 at /data/lishasha/TRUST4/TRUST4/run-trust4 line 37.

lishashali commented 4 years ago

R1 contains just barcode+UMI sequences, and the command is: run-trust4 -u XXX_5_S7_L003_R2_001.fastq.gz -f hg38_bcrtcr.fa --ref human_IMGT+C.fa --barcodeRange 0 15 + --barcode XXX_5_S7_L003_R1_001.fastq.gz

mourisl commented 4 years ago

Could you please share the first few reads in liweizhong_5_S7_L003_R2_001.fastq.gz and liweizhong_5_S7_L003_R1_001.fastq.gz respectively?

lishashali commented 4 years ago

R1 R2 The first one is R1, the second one is R2.

mourisl commented 4 years ago

The command and the reads look fine to me. This error usually happens when the path is not accessible. It could also be that the downloaded binary files are incompatible with your system. Can you try the singularity image in the release package?

Here is some more information about singularity if you want to mount your own data folder: #13
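One quick way to rule out the path problem is to check, outside TRUST4, that the file exists, is readable, and actually yields FASTQ records. The sketch below uses only the Python standard library; the helper name `check_fastq` and the demo file are made up for illustration, and it assumes plain four-line FASTQ records.

```python
import gzip
import os
import tempfile


def check_fastq(path, n_records=2):
    """Return the first n_records FASTQ records, or raise if unreadable or empty."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    if not os.access(path, os.R_OK):
        raise PermissionError(path)
    opener = gzip.open if path.endswith(".gz") else open
    records = []
    with opener(path, "rt") as fh:
        for _ in range(n_records):
            # One record = header, sequence, separator, quality
            rec = [fh.readline().rstrip("\n") for _ in range(4)]
            if not rec[0]:
                break
            records.append(rec)
    if not records:
        raise ValueError(f"{path} has no FASTQ records")
    return records


# Demo on a throwaway gzipped file with one made-up record
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo_R1.fastq.gz")
    with gzip.open(demo, "wt") as fh:
        fh.write("@read1\nATTGGTGGTGTCGCTGAAACCCGGGT\n+\n" + "E" * 26 + "\n")
    print(check_fastq(demo)[0][1])
```

If this raises on the same path that was passed to run-trust4, the problem is with the file or its permissions rather than with the TRUST4 options.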

lishashali commented 4 years ago

OK, I will try it. Thank you.

mourisl commented 4 years ago

We have released the source code of TRUST4. Can you compile TRUST4 from source code and give it a try? Thank you.

eegk commented 3 years ago

Dear developers,

I have the 10x R1, R2 and I1 files with the barcodes. Would it be possible to read the barcodes directly from the I1 fastq file?

mourisl commented 3 years ago

Yes, you can directly run TRUST4 with the option "--barcode XX_I1.fastq --barcodeRange 0 15 +", assuming the first 16 nt of each sequence are the barcode. Since the barcode is not in the read file, you don't need to preprocess any files in this situation.

eegk commented 3 years ago

Dear developers,

[scRNAseq]

I successfully recovered the clonotypes from my samples, and they do match the TCR-seq CDR3s, but the cell-specific barcodes don't match, and I sometimes get more than one cell barcode from TRUST4 for a single barcode in the scRNA-seq data. Any idea what the cause could be?

For example, the barcode from cellranger is ATTGGTGGTGTCGCTG and TRUST4 is giving back AGCGAAAG_16027 and TTTCTGTC_7167

Thank you. Edgar

mourisl commented 3 years ago

It seems the barcode file may truncate some of the sequences. What was the command you used, and did you see short barcode sequences like "AGCGAAAG" or "TTTCTGTC" in the I1 file? Judging from the length, those barcodes seem to be the sample index in 10X data rather than the cell barcode.
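A quick way to check this is to tabulate the sequence lengths in the file passed to --barcode. The following Python sketch counts the lengths of the sequence lines in FASTQ text; the function name `sequence_lengths` is made up, it assumes plain four-line records, and the two records are the I1 records quoted later in this thread.

```python
from collections import Counter


def sequence_lengths(lines):
    """Count lengths of the sequence lines (2nd line of each 4-line FASTQ record)."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:
            counts[len(line.strip())] += 1
    return counts


# Two I1 records as posted in this thread
i1_lines = [
    "@NS500672:615:HHWM5BGXB:3:11401:24464:1018 1:N:0:TCTNAAAG",
    "TCTNAAAG",
    "+",
    "AAA#AEEE",
    "@NS500672:615:HHWM5BGXB:3:11401:14791:1019 1:N:0:TCTNAAAG",
    "TCTNAAAG",
    "+",
    "AAA#AAEE",
]
print(sequence_lengths(i1_lines))
```

Sequences of 8 nt cannot hold a 16-nt cell barcode, which is consistent with the file carrying the sample index. For a real file, the same function can be given an open file handle, for example one from `gzip.open(path, "rt")`.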

eegk commented 3 years ago

I used the following command: run-trust4 --barcode file1.I.fastq.gz -1 file1.R1.fastq.gz -2 file1.R2.fastq.gz -f /*/human_IMGT+C.fa -o file1.out

Maybe the headers are informative:

file I
@NS500672:615:HHWM5BGXB:3:11401:24464:1018 1:N:0:TCTNAAAG
TCTNAAAG
+
AAA#AEEE
@NS500672:615:HHWM5BGXB:3:11401:14791:1019 1:N:0:TCTNAAAG
TCTNAAAG
+
AAA#AAEE
@NS500672:615:HHWM5BGXB:3:11401:23247:1020 1:N:0:TCTTAAAG
TCTTAAAG
+

file R1
@NS500672:615:HHWM5BGXB:3:11401:24464:1018 1:N:0:TCTNAAAG
NCTTCCACAGGCAGTAAAAGAGCGCGTTTCTTATATGGGAAACAGAATGGCTTTTTGGCTGAGAAGGCTGGGTCTACATTTCAGGCCACATTTGGGGAGACGAATGGAGTCATTCCTGGGAGGTGTTTTGCTGATTTTGTGGCTTCAAGT
+
AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEEEAAAAAEEA/A<<
@NS500672:615:HHWM5BGXB:3:11401:14791:1019 1:N:0:TCTNAAAG
NTTGGCTCAATAGCAATCCATGTTACTTTCTTATATGGGGGATAAAAATGTGATAAACTTAGTATTGTTTTGAATTTTGTTTTTAATACCCAGGACTAGATTAGAATAGAATTCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
+
AAAAEEEEEEEEEEEE/EEEEEEEEEEEEAAEEEEEEEEEEE6EEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEA6EAEEEAEEEE<<6EAAEEEAEEEE<EAEEEEEEEEAEA<EEEAA6EEE/<<AAAA/EEEAAEAA<<AAEAA/
@NS500672:615:HHWM5BGXB:3:11401:23247:1020 1:N:0:TCTTAAAG
NGATGTATCGTAGGAGTAGGTAGAGCTTTCTTATATGGGAAGACCCTAAACTACCAGTGGATAAAATCTTACCCCCACCATCTCCCTGGCCCAAGAGCTCCATCTTTGATGCTGATGAAGAAAAGTCCAAGCTTCTGACAAGGCTTCTAA
+

mourisl commented 3 years ago

It seems the I1 file is for the sample index, so the barcode information should be in the read fastq file. Based on the 10X document: https://assets.ctfassets.net/an68im79xiti/1CnKSfa7taoQwIEe0WaA4m/8635b2c9ee86c022e731b6fb2e13fed2/CG000080_10x_Technical_Note_Base_Composition_SC3_v2_RevB.pdf the barcode should be in the last portion of read1, and you may need to preprocess the reads first, depending on the library you are using. However, I manually checked the sequences above against the 10X barcode whitelist and could not find them, suggesting that either the barcode is in a different part of the read or there were many sequencing errors in that region. I'm currently working on TRUST4 so that it can better handle fastq inputs from 10X Genomics data.

Currently, the more robust way to process 10X data is to run 10X Genomics cellranger first to generate the bam file, which automatically handles all the issues with barcodes.
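The whitelist check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `whitelist_hit_rate` is made up, the whitelist here is a one-entry in-memory set rather than the real 10X whitelist file, and the first demo read is constructed around the cellranger barcode quoted earlier in the thread.

```python
def whitelist_hit_rate(read_seqs, whitelist, start=0, length=16):
    """Fraction of reads whose bases [start, start + length) are in the whitelist.

    Reads too short to contain the window are skipped.
    """
    hits = total = 0
    for seq in read_seqs:
        if len(seq) < start + length:
            continue
        total += 1
        if seq[start:start + length] in whitelist:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical one-entry whitelist (a real one would be loaded from a file)
whitelist = {"ATTGGTGGTGTCGCTG"}
reads = [
    "ATTGGTGGTGTCGCTG" + "ACGTACGTAC" + "TTTCTTATATGGG",  # constructed read
    "NCTTCCACAGGCAGTAAAAGAGCGCG",  # start of the first R1 read posted above
    "ACGT",  # too short, skipped
]
rate = whitelist_hit_rate(reads, whitelist)
print(rate)
```

Calling the function with several values of `start` shows whether the barcode sits at a different offset; a low hit rate at every offset points to a different library layout or to sequencing errors in that region.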

eegk commented 3 years ago

That makes a lot of sense. Thank you so much.

eegk commented 3 years ago

Hi again,

I re-ran cellranger and attempted to use TRUST4 directly on the bam file; however, I got this error:
TRUST4 begins.
SYSTEM CALL: bam-extractor -b possorted_genome_bam.bam -t 1 -f human_IMGT+C.fa -o file.outs.possorted_genome_bam.out_toassemble --barcode CB
Start to extract candidate reads from bam file.
Unknown genome name: GGTACAACTGGAACGAC
failed: 256 at run-trust4 line 44.

I have no idea why it did not recognize the genome. Any suggestions?

mourisl commented 3 years ago

For bam input, you need to use the options "-f hg38_bcrtcr.fa --ref human_IMGT+C.fa". The hg38_bcrtcr.fa file contains the coordinate information for the BCR/TCR genes.

eegk commented 3 years ago

ah, my bad. Thank you.