Closed (wgmao closed 3 years ago)
In RNA-seq data, highly expressed mRNAs normally have high read coverage and therefore a high duplication rate. You can check whether the duplicated reads are duplicated on both mates of the pair or on just one end. I'm not sure why it correlates with the percentage of trimmed reads. Is the correlation positive or negative?
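As a rough illustration of the "both read pairs or just one end" check, here is a minimal sketch that compares duplication measured on read 1 alone, on read 2 alone, and on the full pair. It assumes the mate sequences have already been loaded into memory as `(read1, read2)` tuples; the function name `duplication_summary` and the toy sequences are hypothetical, and exact sequence identity is used as the duplicate criterion, which may differ from what your QC tool reports.

```python
from collections import Counter


def duplication_summary(pairs):
    """Return the duplicated fraction per mate and per full pair.

    `pairs` is an iterable of (read1_seq, read2_seq) tuples.
    A read counts as duplicated if an identical sequence was already seen,
    i.e. fraction = (total - distinct) / total.
    """
    pairs = list(pairs)

    def dup_fraction(counter):
        total = sum(counter.values())
        return (total - len(counter)) / total if total else 0.0

    return {
        "read1": dup_fraction(Counter(p[0] for p in pairs)),
        "read2": dup_fraction(Counter(p[1] for p in pairs)),
        "pair": dup_fraction(Counter(pairs)),  # both mates identical
    }


# Toy data (made up for illustration only).
example_pairs = [
    ("ACGT", "TTTT"),
    ("ACGT", "GGGG"),
    ("ACGT", "TTTT"),
    ("CCCC", "AAAA"),
]
summary = duplication_summary(example_pairs)
print(summary)
```

If the `pair` fraction is much lower than the single-mate fractions, most duplicates are identical on one end only, which is more consistent with high expression than with PCR duplicates.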
I haven't run a deduplication experiment, but I think the effect on the assembly would be small. One thing you can try is applying a log or square-root transform to the read count column before computing entropy or evenness.
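A minimal sketch of that suggestion, assuming the clonotype read counts have already been extracted into a plain list. The helper names are hypothetical, the example counts are made up, and `log1p` (that is, log(1 + count)) is my own choice so that a clonotype with a single read does not get a weight of zero; a plain log would also match the suggestion.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy (natural log) of non-negative weights."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def pielou_evenness(weights):
    """Pielou's evenness: entropy divided by log(number of non-zero items)."""
    n = sum(1 for w in weights if w > 0)
    if n < 2:
        return float("nan")  # undefined for fewer than two clonotypes
    return shannon_entropy(weights) / math.log(n)


def transform_counts(counts, method="log"):
    """Dampen large counts before computing diversity statistics."""
    if method == "log":
        return [math.log1p(c) for c in counts]  # assumption: log(1 + count)
    if method == "sqrt":
        return [math.sqrt(c) for c in counts]
    if method == "none":
        return list(counts)
    raise ValueError(f"unknown method: {method}")


# Made-up counts with one dominant clonotype.
counts = [1000, 10, 5, 1]
for method in ("none", "log", "sqrt"):
    weights = transform_counts(counts, method)
    print(method, round(pielou_evenness(weights), 3))
```

Both transforms shrink the influence of the few very abundant clonotypes, so the statistics should be less sensitive to duplication-driven inflation of the top counts.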
Thank you for developing this great software. I ran the TRUST4 pipeline on my own bulk RNA-seq data and performed some preliminary analysis. Based on the TCR/BCR frequencies, I calculated basic statistics such as diversity (Shannon entropy) and evenness (Pielou's evenness). There was a strong correlation between evenness and some QC metrics of the RNA-seq data, namely the percentage of trimmed reads, the percentage of duplicated sequences in the trimmed FASTQ files, and the duplication rate. Do you have any suggestions for handling this correlation? Do you recommend deduplicating the FASTQ files before running TRUST4? Thank you.
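For reference, the two statistics mentioned above are usually defined as follows. The symbols are my own notation: $p_i$ is the relative frequency of clonotype $i$ and $S$ is the number of distinct clonotypes. The log base and any normalization actually used in the analysis are not stated, so treat this as the textbook form only.

$$
H = -\sum_{i=1}^{S} p_i \ln p_i, \qquad J = \frac{H}{\ln S}
$$

$J$ ranges from 0 to 1 and reaches 1 when all clonotypes are equally frequent.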