liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

TCR/BCR diversity and evenness #36

Closed wgmao closed 3 years ago

wgmao commented 3 years ago

Thank you for developing this great software. I ran the TRUST4 pipeline on my own bulk RNA-seq data and performed some preliminary analysis. Based on the TCR/BCR frequencies, I calculated basic statistics: diversity (Shannon entropy) and evenness (Pielou's evenness). There was a strong correlation between evenness and some QC metrics of the RNA-seq data, namely the percentage of trimmed reads, the percentage of duplicated sequences in the trimmed FASTQ files, and the duplication rate. Do you have any suggestions for handling this correlation? Do you recommend deduplicating the FASTQ files before running TRUST4? Thank you.
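
For readers following along, the two statistics mentioned above can be sketched as follows. This is a minimal illustration, not the poster's actual script: the function names are hypothetical, the input is assumed to be a plain list of per-clonotype read counts, and the natural logarithm is an arbitrary choice (Pielou's evenness does not depend on the log base).

```python
import math

def shannon_entropy(counts):
    """Shannon entropy H = -sum(p_i * ln p_i) over clonotypes with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        return 0.0  # assumption: an empty repertoire is given entropy 0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(S), where S is the number of observed clonotypes."""
    n_clonotypes = sum(1 for c in counts if c > 0)
    if n_clonotypes <= 1:
        return 0.0  # assumption: J is undefined for S <= 1, so 0 is returned by convention
    return shannon_entropy(counts) / math.log(n_clonotypes)
```

A perfectly uniform repertoire gives J = 1, and a repertoire dominated by one clonotype gives J close to 0.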

mourisl commented 3 years ago

In RNA-seq data, it is normal for highly expressed mRNAs to have high read coverage and hence a high duplication rate. You can check whether those duplicated reads are duplicated on both ends of the read pair or on just one end. I'm not sure why evenness would be correlated with the percentage of trimmed reads. Is the correlation positive or negative?

I haven't run an experiment on deduplication, but I think the effect on the assembly would be small. One thing you can try is to apply a log or sqrt transform to the read count column before computing entropy or evenness.
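
The transform suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, the input is a plain list of read counts taken from the report, and `log1p` (that is, log(1 + count)) is used instead of a plain log so that clonotypes with a single read do not collapse to zero. That substitution is a choice made here, not something specified in the thread.

```python
import math

def evenness(weights):
    """Pielou's evenness J = H / ln(S) computed from non-negative weights."""
    positive = [w for w in weights if w > 0]
    if len(positive) <= 1:
        return 0.0  # assumption: J is undefined for S <= 1, so 0 is returned by convention
    total = sum(positive)
    entropy = -sum((w / total) * math.log(w / total) for w in positive)
    return entropy / math.log(len(positive))

def transformed_evenness(counts, transform="sqrt"):
    """Apply a sqrt or log1p transform to the read counts, then compute evenness."""
    if transform == "sqrt":
        weights = [math.sqrt(c) for c in counts]
    elif transform == "log":
        weights = [math.log1p(c) for c in counts]  # log(1 + c), an assumption
    else:
        raise ValueError("transform must be 'sqrt' or 'log'")
    return evenness(weights)
```

Both transforms compress large counts, so a few heavily duplicated clonotypes pull the evenness down less than they do with raw counts.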