liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

questions on running TRUST4 #313

Open fonseca-dfm opened 1 day ago

fonseca-dfm commented 1 day ago

Hi there, thank you for this amazing tool. My team has been using it a lot (see issue #265), but when I tried to reproduce the analysis I came up with a few questions. I used the 10x Genomics 5' GEX and VDJ kits for alpha-beta and gamma-delta TCRs. I am running the script like this (in this case for the VDJ library of the gd T cells):

run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 path/to/YWI006-VDJ_S1_L001_R1_001.fastq.gz -2 path/to/YWI006-VDJ_S1_L001_R2_001.fastq.gz --barcode path/to/YWI006-VDJ_S1_L001_R1_001.fastq.gz --readFormat bc:0:15,r1:16:-1 -u path_to_10x_fastqs/YWI006-VDJ_S1_L001_R1_001.fastq.gz --od path/to/trust4/output/

I have a few questions regarding the outputs of TRUST4.

1) My output folder contains some output FASTA files. How do I get rid of them?

2) Do the barcodes match the ones from CellRanger when I process my GEX data? I want to match my TRUST4 VDJ annotation with the gene expression. One issue is that when we compare the barcodes from CellRanger VDJ and TRUST4, they don't overlap. Do you correct the barcodes against the 10x whitelist?

3) Which scripts from the "Scripts and other post-analysis for TRUST4" folder do you recommend running to complete the analysis?

4) If I run a script like barcoderep-filter.py, is the original data kept, or do the tsv/airr files get overwritten?

Again, many thanks. We found TRUST4 to be the most reliable tool for annotating gd TCRs.

Best, Daniel

mourisl commented 1 day ago

run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 path/to/YWI006-VDJ_S1_L001_R1_001.fastq.gz -2 path/to/YWI006-VDJ_S1_L001_R2_001.fastq.gz --barcode path/to/YWI006-VDJ_S1_L001_R1_001.fastq.gz --readFormat bc:0:15,r1:16:-1 -u path_to_10x_fastqs/YWI006-VDJ_S1_L001_R1_001.fastq.gz --od path/to/trust4/output/

For the command line, I think TRUST4 cannot handle mixed single-end and paired-end data, so the "-u" for the VDJ part will be ignored when "-1/-2" are given.
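To make the "--readFormat bc:0:15,r1:16:-1" value in the command above easier to reason about, here is a minimal sketch of how such a specification maps onto a read. It assumes the coordinates are 0-based and inclusive, with -1 meaning "to the end of the read", which is consistent with bc:0:15 yielding a 16 bp barcode. The helper names and the toy read are made up for illustration and are not part of TRUST4.

```python
def parse_read_format(spec):
    """Parse a string like 'bc:0:15,r1:16:-1' into {name: (start, end)}."""
    fields = {}
    for item in spec.split(","):
        name, start, end = item.split(":")
        fields[name] = (int(start), int(end))
    return fields


def slice_field(seq, start, end):
    """Return seq[start..end], 0-based inclusive; end == -1 means end of read.

    Assumption: this mirrors how the coordinates are interpreted by the tool.
    """
    if end == -1:
        return seq[start:]
    return seq[start:end + 1]


# Toy read: a 16 bp "barcode" followed by arbitrary sequence (hypothetical data).
toy_read = "ACGTACGTACGTACGT" + "TTTTTTTTTT" + "GGGCCCAAATTT"

fmt = parse_read_format("bc:0:15,r1:16:-1")
barcode = slice_field(toy_read, *fmt["bc"])
remainder = slice_field(toy_read, *fmt["r1"])
```

With this format string, everything after the first 16 bases is treated as read sequence.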

My output folder contains some output FASTA files. How do I get rid of them?

You can use the option "--clean INT" so that TRUST4 removes those files after it finishes running.

Do the barcodes match the ones from CellRanger when I process my GEX data? I want to match my TRUST4 VDJ annotation with the gene expression. One issue is that when we compare the barcodes from CellRanger VDJ and TRUST4, they don't overlap. Do you correct the barcodes against the 10x whitelist?

You can provide the option "--barcodeWhitelist whitelist_file" for barcode correction. The whitelist file is somewhere in the CellRanger folder. Nevertheless, you should still observe a decent amount of barcode overlap even without it. Did you find no overlap at all?
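When checking the overlap, one thing worth ruling out is a purely cosmetic mismatch. The sketch below compares two barcode lists after normalizing them. It assumes that the CellRanger barcodes carry a GEM-well suffix such as "-1" while the TRUST4 barcodes may not, that the barcode is the first column of each file, and that header lines start with "#". The function names and any file paths you pass in are hypothetical.

```python
import gzip


def read_first_column(path, delimiter="\t"):
    """Read the first column of a (possibly gzipped) text table.

    Assumption: empty lines and lines starting with '#' are not data.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    values = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            values.append(line.split(delimiter)[0])
    return values


def normalize_barcode(barcode):
    """Drop a trailing suffix like '-1' and upper-case the sequence."""
    return barcode.strip().split("-")[0].upper()


def overlap_summary(cellranger_barcodes, trust4_barcodes):
    """Count shared and tool-specific barcodes after normalization."""
    cr = {normalize_barcode(b) for b in cellranger_barcodes if b.strip()}
    t4 = {normalize_barcode(b) for b in trust4_barcodes if b.strip()}
    shared = cr & t4
    return {
        "cellranger": len(cr),
        "trust4": len(t4),
        "shared": len(shared),
        "trust4_only": len(t4 - cr),
        "cellranger_only": len(cr - t4),
        "fraction_of_trust4_shared": len(shared) / len(t4) if t4 else 0.0,
    }
```

A call would look like overlap_summary(read_first_column(cellranger_file), read_first_column(trust4_report)), with both paths supplied by you. If the shared count is zero even after normalization, the barcode coordinates in "--readFormat" would be the next thing to check.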

Which scripts from the "Scripts and other post-analysis for TRUST4" folder do you recommend running to complete the analysis?

The trust-stats.py script provides a limited set of diversity statistics, such as Shannon entropy and clonality. Other packages, like scRepertoire, provide more analysis functions and visualizations.
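For reference, the two statistics mentioned above can be sketched from a list of clonotype counts. This is a generic illustration using the common definitions (Shannon entropy in nats, clonality as one minus normalized entropy). It is not taken from trust-stats.py, whose exact formulas and log base may differ.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (natural log) of a list of clonotype counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


def clonality(counts):
    """Clonality as 1 - H / ln(N), where N is the number of clonotypes.

    Convention chosen here: a repertoire with a single clonotype is
    treated as fully clonal (1.0), since the ratio is undefined for N = 1.
    """
    n_clonotypes = sum(1 for c in counts if c > 0)
    entropy = shannon_entropy(counts)
    if n_clonotypes == 1:
        return 1.0
    return 1.0 - entropy / math.log(n_clonotypes)
```

An evenly distributed repertoire gives a clonality near 0, and a repertoire dominated by one clonotype gives a value approaching 1.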

If I run a script like barcoderep-filter.py, is the original data kept, or do the tsv/airr files get overwritten?

For the TCR data, I don't think you need to run barcoderep-filter.py. It is mainly meant to deal with excessive expression from plasma B cells, which may cause IGH/IGK/IGL transcripts to leak into other cells' droplets. It generates a new barcode report file, and you can rerun trust-airr on the new barcode report file to get a new airr file.

Hope this helps.