liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
277 stars 49 forks source link

Clonotype counting with lower cell cell count #321

Open marcoco90 opened 1 week ago

marcoco90 commented 1 week ago

Hello team,

We recently processed some T-cells samples with the Takara SMART-Seq Human TCR (with UMIs).

We used the command "run-trust4 --barcodeLevel molecule -f $Genome/hg38_bcrtcr.fa --ref $Genome/human_IMGT+C.fa -1 $read1 -2 $read2 --barcode $read2 --readFormat bc:0:11,r2:19:-1 -o $samid -t 10".

We accounted and corrected for UMIs to remove PCR duplicates. However when looking at the number of TRB clonotypes, we obtained way more then the cell input number used for the experiment.

Sample Name Cell Count Reads Obtained Clonotype TRB 121242762 291 1,644,198 51,754 121488195 629 718,832 14,654 120483253 696 390,422 8,183 121109290 4,079 126,394 414 121155824 382 136,938 2,185 121552040 282 75,900 1,175 121545458 255 388,964 10,504 121605112 859 71,206 2 121605113 1,339 540,868 12,340

This number is directly taken from the script trust-stats.py

In your experience have you observed a higher number of clonotypes respect to the number of cells? In theory we should always obtain clonotypes <= cell number, so when counting unique CDR3 that should be less than the cell number(?). Even if I count the unique CDR3 (clonotypes) within each sample, the number is still higher then the number of cells as input

Is something that am I missing? I am trying to work backwards to the explanation making sure the number of clonotype obtained from the TRUST4 is correct and not overestimating.

Thanks in advance for the help.

Best, Marco

mourisl commented 1 week ago

Have you tried to filter the results? Some UMIs may only have one or two reads and are likely to be errors.

marcoco90 commented 1 week ago

@mourisl No I have not. Just filtered the out of frame CDR3. How do you filter using based on reads? Is there any output that has the number of reads associated to a specific CDR3 or umi?

mourisl commented 1 week ago

If you are using the barcode_report/airr file, there is an entry in the format represent the number of reads support the CDR3 in this barcode/UMI. You can filter based on that. If you are using the aggregated report file, you can regenerate the report file using the trust-simplerep.pl script with the same parameter as the running log but with an additional option "--filterBarcoderepReadCnt" to ignore UMIs with less than specified threshold's read support.