liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
282 stars 49 forks source link

Does any parameters on strandedness? #192

Open WangKang-Leo opened 1 year ago

WangKang-Leo commented 1 year ago

Hello, I have run the TRUST4 based on our bulk RNAseq cohort (PE150bp, ribo-zero), where fastq is input, but the results are not consistent with other immune metrics (GEP based). I want to optimize the analyses by adding some parameters, such as strandedness. Our data have significant batch effect, but it should be fine for TRUST4. Or do you have any ideas?

Thanks in advance

mourisl commented 1 year ago

Could you please elaborate more on the issues? For example, for the inconsistency with GEP, do you mean the V gene or C gene expression from gene expression data is different from TRUST4's prediction? For strandedness, do you mean your RNA-seq is stranded RNA-seq?

WangKang-Leo commented 1 year ago

I did not look at V/C gene, but using some typical immune biomarkers (CD3,CD8, CD20) in mRNA or protein level, or GEP immune signature score for T cell, B cell.

Our RNA-seq is reverseStranded, and I don't know whether it has influence or not on TRUST4.

mourisl commented 1 year ago

The strand-specific RNA-seq probably does not affect TRUST4 much, it is rare to have a read whose reverse-complement matches better to VDJ genes than its true orientation.

For the GEP analysis, do you mean when other data show no or few T or B cells, but TRUST4 still assembles many TCRs and BCRs?

WangKang-Leo commented 1 year ago

Yes, not well correlated between GEP-based T/B cell and those from TRUST4.....

mourisl commented 1 year ago

How did you calculate the abundance from TRUST4? Could you please show me a few high abundant TRBs and IGHs from the airr output? I can check to see whether those are misassemblies. Thank you.

WangKang-Leo commented 1 year ago

I used code from RIMA (https://github.com/liulab-dfci/RIMA_pipeline), the results were shown as following: 1681928636939 Also, I attached the sample with highest T cell infiltration (0.000025). The cohort includes 200 samples. UF-3016-1226-0_airr_align.txt

`module load bioinfo-tools module load TRUST4/1.0.8 module load R/4.1.1 module load R_packages/4.1.1 cd /home/kangwang/TRUST4/result

for ((i=2; i<=200; i++)); do Sample=$(awk -v i=$i 'NR==i{print $1}' /home/kangwang/TRUST4/data/samplesheet_rna_Baseline.tsv) fastq1=$(awk -v i=$i 'NR==i{print $2}' /home/kangwang/TRUST4/data/samplesheet_rna_Baseline.tsv) fastq2=$(awk -v i=$i 'NR==i{print $3}' /home/kangwang/TRUST4/data/samplesheet_rna_Baseline.tsv) cd /home/kangwang/TRUST4/result/Baseline mkdir $Sample cd /home/kangwang/TRUST4/result/Baseline/$Sample run-trust4 -t 64 \ -f /home/kangwang/TRUST4/reference/TRUST4_ref/hg38_bcrtcr.fa \ --ref /home/kangwang/TRUST4/reference/TRUST4_ref/human_IMGT+C.fa \ -1 $fastq1 \ -2 $fastq2 \ -o $Sample sed '1,1s/#count/count/g' ${Sample}_report.tsv > ${Sample}_report.processed.tsv Rscript /home/kangwang/TRUST4/src/trust4_bcr_process.R --cdr3 ${Sample}_report.processed.tsv \ --sampleid $Sample \ --stat /home/kangwang/TRUST4/data/RNA_markdup.sorted.bam.stats/${Sample}.markdup.sorted.bam.stats.txt \ --outdir /home/kangwang/TRUST4/result/Baseline/$Sample/$Sample Rscript /home/kangwang/TRUST4/src/trust4_tcr_process.R --cdr3 ${Sample}_report.processed.tsv \ --sampleid $Sample \ --stat /home/kangwang/TRUST4/data/RNA_markdup.sorted.bam.stats/${Sample}.markdup.sorted.bam.stats.txt \ --outdir /home/kangwang/TRUST4/result/Baseline/$Sample/$Sample done`

mourisl commented 1 year ago

Could you please share the UF-3016-1226-0_airr.tsv file, which is the full AIRR output? Thank you.

WangKang-Leo commented 1 year ago

UF-3016-1226-0_airr.txt Thanks so much, here you are!

WangKang-Leo commented 1 year ago

Hello again, any news or findings from my AIRR output?

mourisl commented 1 year ago

Those assemblies look authentic to me. I guess this sample is sequenced pretty deep, so even though the T or B cell infiltration level is low, it still captures a lot. Also, the abundance is affected by receptor gene expression. For example, the top IGH/IGK CDR3s may come from plasma B cells, which may have super high expression but the actual cell number could be much fewer. Hope this helps.

WangKang-Leo commented 1 year ago

Thanks for your prompt feedback. Do you have any suggestions to adjust or filter the AIRR results? Because we hope to report coherent results from different tools.

mourisl commented 1 year ago

I usually filter the non-productive chains in the results. For BCRs, you may cluster the similar CDR3s as they might be come from the same lineage during somatic hypermutation (which is another definition of clonotype), but this only changes the number of distinct clonotypes and won't affect the total abundances.