liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
270 stars 46 forks source link

little productive clonetypes #59

Closed zhouxinseeu closed 3 years ago

zhouxinseeu commented 3 years ago

Hi, I analyzed a sample with trust, and used trust-barcoderep-to-10X.pl to convert barcode_report to 10X format (10X_report). According to the converted data, I counted the clonetypes of productive, but the number is very small. Among 5661 cells, only 27 cells have paired productive clonetypes. Is my statistics based on 10X_report correct?

mourisl commented 3 years ago

The numbers do seem too low. Is it on T cell (xxx_t.csv) or B cell (xxx_b.csv)? What is the number of productive clonotypes if only considering single-chain? Is your data 5' 10x data or 3' 10x data? Thank you.

zhouxinseeu commented 3 years ago

It is on T cell. The number of productive clonotypes if only considering single-chain is the same as paired. It is not 10X data and it is 3' data. Can you tell me how trust judges productive?

mourisl commented 3 years ago

The productive CDR3 means the nucleotide length must be multiple of three and there is no stop codon. For 3' data, there are very few reads covering the CDR3 region, so the sensitivity of calling CDR3 will be very low.

zhouxinseeu commented 3 years ago

Thank you. I just read your trust-barcoderep-to-10X.pl code and found that this line seems to be a problem(https://github.com/liulab-dfci/TRUST4/blob/f673564a44c60b664b305e17a22d998887e23f18/trust-barcoderep-to-10X.pl#L78). I guess it might be because chainCols is written as cols, so after converting to 10X, the number of productive becomes very small. I don't know if my idea is correct, can you confirm it for me?

mourisl commented 3 years ago

Thank you for catching this error! I just fixed this issue, and it should work now.

zhouxinseeu commented 3 years ago

Hi, I have another confusion. for the contig_id in the barcode report, I think there should be at least a corresponding read in assign.out, but I just checked a few contig_ids and found that it does not have a corresponding read in assign.out. This is strange.

mourisl commented 3 years ago

By default, the single-cell workflow does not align the reads as in the bulk data. The main purpose of the read alignment/assignment after assembly is to recover the CDR3s encoded in the consensus assemblies, and in single-cell setting, there is no need for this step. Therefore, the assign.out file should be empty when given barcode. One place for kind of the read assignment information is in the _assembled.fa file, it should have the barcode information in each header.

zhouxinseeu commented 3 years ago

Thank you, but the assembled.fa file does not show which contig the read is allocated to. Because I currently want to count the reads allocated to TRA or TRB, it seems that it cannot be achieved through assembled.fa. By the way, I compared the results of trust and cellranger. In terms of full length, trust does not perform as well as cellranger. And some cdr3 in trust are judged as productive, but the chain it is in is not full length. Will it be optimized for this problem in the future?

mourisl commented 3 years ago

Do you mean you applied cellranger on the 5' RNA-seq data set without VDJ amplication? From my tests before, the results from Cellranger was a subset of TRUST4's on the sole RNA-seq data. Full-length means the assembly contains all the framework region and CDR1,2,3s, so for those not full-length assemblies, they can still have complete and productive CDR3s.

If you mean the application of TRUST4 on the VDJ amplified kit, I only tested CDR3 performance before and it seems TRUST4 recovered more CDR3s. The full-length assemblies might lost due to the drastic trimming from the "--repseq" option, which I plan to optimize in future.

One way to count the number of reads allocated to TRA or TRB is through the _annot file. In the header, the second and third number is the length of the assembly and the read coverage: (read count * read_length)/500. So if you know the read length, you can convert the third number back to the number of reads used to assemble the contig/scaffold. Though you may need to divide the read count by two since it regards one mate-pair as two reads.

zhouxinseeu commented 3 years ago

Thank you so much! My current data are all TCR/BCR-seq data. I did not use the --repseq option when I tested it. However, I have compared the results of using this option with that of not using it. In terms of full length, the result of using --repseq is worse. Without this option, trust will take longer to process data. Anyway, thank you for your answers.