Why do trust_report.tsv files have so many fewer lines than trust_cdr3.out files?

liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data

MIT License

Why do trust_report.tsv files have so many fewer lines than trust_cdr3.out files? #186

Open efinlay opened 1 year ago

efinlay commented 1 year ago

Thank you for TRUST4. I installed it this week and ran it with some sample 10X Genomics files to get a feel for what it does. Why does the cdr3.out file have so many more lines than the report.tsv file? I've read answers here referring to a consensus compressing similar sequences, but when I compare the two files I see VDJ combinations in cdr3.out that are not present at all in report.tsv.

As an experiment I ran the simple_rep.pl script on a cdr3.out file of 1336 lines, and it generated a report of 87 lines. When I ran TRUST4 itself, it had produced a report.tsv of only 70 lines. What governs inclusion in the report? Thank you, Emma

mourisl commented 1 year ago

The cdr3.out file corresponds directly to the contigs. It contains partial CDR3s and contigs that may differ outside of the VDJ region. The report file only reports complete CDR3s and coalesces the entries with the same V, J, C genes and CDR3 sequences, so it has far fewer items. Since your data is scRNA-seq, the report file is further cleaned to summarize the representative chains from each cell. The cell-level CDR3 information is in the barcode_report.tsv and barcode_airr.tsv files.
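To illustrate the coalescing step described above, here is a minimal Python sketch. It is not TRUST4's own code. It assumes cdr3.out is tab-separated with the column order listed in the TRUST4 README (consensus_id, index, V, D, J, C, CDR1, CDR2, CDR3, CDR3_score, read_fragment_count, ...) and that a CDR3_score of 0 marks a partial CDR3. The function name `coalesce_cdr3` is hypothetical.

```python
from collections import defaultdict

# ASSUMED 0-based column positions in a tab-separated cdr3.out line:
# consensus_id, index, V, D, J, C, CDR1, CDR2, CDR3, CDR3_score, read_fragment_count, ...
V_COL, J_COL, C_COL, CDR3_COL, SCORE_COL, COUNT_COL = 2, 4, 5, 8, 9, 10


def coalesce_cdr3(lines):
    """Sum abundance over complete CDR3s that share V, J, C genes and CDR3 sequence."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= COUNT_COL:
            continue  # skip malformed or truncated lines
        if float(fields[SCORE_COL]) == 0:
            continue  # ASSUMPTION: a score of 0 marks a partial CDR3
        key = (fields[V_COL], fields[J_COL], fields[C_COL], fields[CDR3_COL])
        totals[key] += float(fields[COUNT_COL])
    return dict(totals)
```

Passing an open cdr3.out file handle to `coalesce_cdr3` and taking `len()` of the result gives a rough idea of how many lines survive the merge. It will not reproduce report.tsv exactly, because the additional per-cell cleaning applied to scRNA-seq data is not modeled here.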

efinlay commented 1 year ago

That makes sense, thank you for answering so quickly.

When multiple lines in the cdr3.out file share the same V/D/J identifiers but only one appears in the report, how is the consensus id given in the report chosen? For example, there are 3 consensus ids with the TRAV21 - TRAJ37 combination in cdr3.out but only one in the report:

grep TRAV21 TRUST_RA_T103A1_S1_L001_R2_001_cdr3.out | grep "TRAJ37" | awk '{print $1}'
TACAGGTAGCTAGAGC_415005
AGTCATGGTTTACCTT_415006
TCACAAGTCCATTTGT_474960

grep TRAV21 TRUST_RA_T103A1_S1_L001_R2_001_report.tsv | grep "TRAJ37" | awk '{print $9}'
TACAGGTAGCTAGAGC_415005

mourisl commented 1 year ago

It is based on the abundance value in the cdr3.out file.
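One way to picture an abundance-based choice is the sketch below. It is an illustration only, and it rests on assumptions not stated in this thread: that "abundance" means the read_fragment_count column of cdr3.out, and that the consensus id with the highest value is the one kept. The function name `pick_representative`, the consensus ids, and the abundance values used with it are all hypothetical.

```python
def pick_representative(rows):
    """For each (V, J, CDR3) group, return the consensus id with the highest abundance.

    rows: iterable of (consensus_id, v_gene, j_gene, cdr3, abundance) tuples.
    ASSUMPTION: "highest abundance wins"; on a tie the first row seen is kept.
    """
    best = {}
    for cid, v_gene, j_gene, cdr3, abundance in rows:
        key = (v_gene, j_gene, cdr3)
        if key not in best or abundance > best[key][1]:
            best[key] = (cid, abundance)
    return {key: cid for key, (cid, _) in best.items()}


# Hypothetical rows, not taken from any real cdr3.out file.
example_rows = [
    ("cid_A", "TRAV21", "TRAJ37", "TGTGCT", 12.0),
    ("cid_B", "TRAV21", "TRAJ37", "TGTGCT", 3.5),
    ("cid_C", "TRAV12", "TRAJ10", "TGTGTG", 1.0),
]
representatives = pick_representative(example_rows)
```

With these made-up rows, `cid_A` is kept for the TRAV21/TRAJ37 group because its abundance is larger than that of `cid_B`.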