liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

Several questions for TRUST4, THANKS :) #318

Open NPTL1201 opened 2 weeks ago

NPTL1201 commented 2 weeks ago

Hi,

First of all, tremendous thanks for developing this wonderful tool. I have tried this tool with scRNA data recently and have dug into the previous issues, and I have summarized some questions below for which I would like your kind reply. Thanks! Sorry if these questions might not be well expressed and some might overlap.

1. Regarding Filtering and Obtaining the Valid Barcode:

1.1. I notice that a full-length VDJ is not required for a barcode to be valid/productive in TRUST4. That is, in the barcode_report.tsv file, entries with 0 values are preserved. I have read some of your comments saying that requiring a full-length VDJ is a relatively strict way to filter barcodes/contigs, so it might be okay to preserve them. Since the Cell Ranger pipeline requires a valid contig to be full-length, I wonder whether full-length contigs are required for an amplified TCR/BCR scSeq library, while for reconstruction from scRNA data non-full-length contigs are also OK to preserve? In other words, is the definition of a "productive contig" less strict in the TRUST4 pipeline than in the Cell Ranger one? Am I understanding this correctly?

1.2. Just to confirm, a CDR3_score of 0 means that the CDR3 is incomplete, and compared to the trust_cdr3.out file, the barcode_report.tsv has removed these contigs with incomplete CDR3. Can I interpret this as meaning that a full CDR3 is required, but a full VDJ is not required for a valid barcode in TRUST4?

1.3. Should the contig be further filtered based on CDR3_score? For example, should we only retain those with a strong CDR3_score (e.g., >80.00)?

1.4. Should the read_count be further filtered, and is there any recommendation on the cutoff? Is there any difference in the recommended cutoff between TCR and BCR in this regard? I saw your comment in other issues saying, "I think a cutoff value of 4 for BCR amplified library looks good (or even a bit conservative)."

1.5. Will the CDR3_germline_similarity provide any useful information for filtering contigs?

1.6. Currently, after running ./run-trust4, I run barcoderep-filter.py, then transfer to 10x format. Do you think this is okay for most analyses? Are there any other considerations for further filtering? Maybe using files besides barcode_report.tsv?

2. Regarding the Application of 3' scRNA-seq:

2.1. After checking some of your previous comments on other issues and trying to use 3' data to generate TCR data, I understand that the recovery rate of 3' scRNA-seq is truly low, making it hard to draw definite conclusions. However, rather than comparing TCR information between different groups, if we simply want to prove the existence of some clones/patterns in some samples—and I did successfully find these clonotypes—would it be appropriate to use TRUST4 for 3' scRNA-seq data in this way?

2.2. You have mentioned in other issues regarding the use of 3' data: "The CDR3 sequence is more robust to use, so I think you can use it for downstream analysis, along with the isotypes associated with those CDR3s." I assume using the CDR3 instead of the full VDJ is a more appropriate way to identify a TCR/BCR clone when using 3' data. Is that correct?

2.3. Do you have any other recommendations if I can only use 3' data to achieve what I mentioned (simply wanting to prove the existence of some clones/patterns in some samples)? Select samples with larger numbers of T/B cells? Are there any metrics to evaluate whether the reconstruction is relatively good or bad?

2.4. The following question is based on my observation when using 3' data but might apply in other situations. I saw some γδ T cells with γ and δ chains, but these cells seem to be αβ T cells based on scRNA clustering. Would it be possible that in some cells it is hard for 3' scRNA-seq to capture reads for the α and β chains, so that if minor γ and δ chains are found, they become the primary chains in those cells?

Thank you very much for your time and assistance!

mourisl commented 2 weeks ago

1.1. I notice that a full-length VDJ is not required for a barcode to be valid/productive in TRUST4. That is, in the barcode_report.tsv file, entries with 0 values are preserved. I have read some of your comments saying that requiring a full-length VDJ is a relatively strict way to filter barcodes/contigs, so it might be okay to preserve them. Since the Cell Ranger pipeline requires a valid contig to be full-length, I wonder whether full-length contigs are required for an amplified TCR/BCR scSeq library, while for reconstruction from scRNA data non-full-length contigs are also OK to preserve? In other words, is the definition of a "productive contig" less strict in the TRUST4 pipeline than in the Cell Ranger one? Am I understanding this correctly?

Yes, "productive" in TRUST4 just means that the CDR3 sequence has no stop codon and no frameshift mutation. I believe Cell Ranger considers all regions of the contig for these types of variation when defining productivity.
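For readers who want to see what a CDR3-only productivity check looks like, here is a minimal sketch. It is not TRUST4's code: the function name `cdr3_looks_productive` is made up for illustration, and it assumes the CDR3 nucleotide sequence starts in frame, so a length that is not a multiple of 3 is treated as a frameshift.

```python
# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cdr3_looks_productive(cdr3_nt: str) -> bool:
    """Return True if the CDR3 is in frame and contains no stop codon.

    Hypothetical helper: only the CDR3 is inspected, not the rest of the contig.
    """
    seq = cdr3_nt.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False  # length not a multiple of 3: treated as a frameshift
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


print(cdr3_looks_productive("TGTGCCAGCAGCTTC"))  # in frame, no stop codon
print(cdr3_looks_productive("TGTGCCTAGAGCTTC"))  # contains the stop codon TAG
```

A check over the whole contig, as described for Cell Ranger above, would apply the same two tests to every annotated region rather than to the CDR3 alone.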

1.2. Just to confirm, a CDR3_score of 0 means that the CDR3 is incomplete, and compared to the trust_cdr3.out file, the barcode_report.tsv has removed these contigs with incomplete CDR3. Can I interpret this as meaning that a full CDR3 is required, but a full VDJ is not required for a valid barcode in TRUST4?

Exactly. Incomplete CDR3s include many false receptors, such as unrecombined VDJ, so it is better to leave them out by default.

1.3. Should the contig be further filtered based on CDR3_score? For example, should we only retain those with a strong CDR3_score (e.g., >80.00)?

I don't think a further filter is needed. The CDR3 score is based on the motif score (YYC and W/FGxG), but some V and J genes don't follow the paradigm.

1.4. Should the read_count be further filtered, and is there any recommendation on the cutoff? Is there any difference in the recommended cutoff between TCR and BCR in this regard? I saw your comment in other issues saying, "I think a cutoff value of 4 for BCR amplified library looks good (or even a bit conservative)."

This depends entirely on the data type. For unamplified data, there are many cases where only one read covers the CDR3 region, so we should keep them.
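A read-count cutoff that depends on the library type could be sketched as follows. This is an illustration, not part of TRUST4. It assumes each chain column of barcode_report.tsv is a comma-separated string in the field order listed in `CHAIN_FIELDS`, and that `*` marks a missing chain; check both against your own file. The example chain string and the helper names are hypothetical.

```python
# Assumed order of the comma-separated values inside one chain column.
CHAIN_FIELDS = (
    "v_gene", "d_gene", "j_gene", "c_gene", "cdr3_nt", "cdr3_aa",
    "read_count", "consensus_id", "cdr3_germline_similarity", "full_length",
)


def parse_chain(field):
    """Split one chain column into a dict, or return None for a missing chain."""
    if field in ("", "*"):
        return None
    values = field.split(",")
    if len(values) != len(CHAIN_FIELDS):
        raise ValueError(f"expected {len(CHAIN_FIELDS)} values, got {len(values)}")
    chain = dict(zip(CHAIN_FIELDS, values))
    chain["read_count"] = float(chain["read_count"])
    return chain


def passes_read_cutoff(chain, min_reads):
    """Keep a chain only if it exists and has at least min_reads supporting reads."""
    return chain is not None and chain["read_count"] >= min_reads


# Hypothetical chain entry supported by a single read.
example = "TRBV7-9*01,*,TRBJ2-1*01,TRBC2,TGTGCCAGCAGCTTC,CASSF,1.00,contig_1,100.00,0"
chain = parse_chain(example)
print(passes_read_cutoff(chain, min_reads=1))  # unamplified data: single-read CDR3s are kept
print(passes_read_cutoff(chain, min_reads=4))  # the cutoff of 4 quoted for amplified BCR libraries
```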

1.5. Will the CDR3_germline_similarity provide any useful information for filtering contigs?

Could be. For TCR, we don't expect much sequence variation, so low similarity may suggest something is wrong. For BCR, low germline similarity can happen because of somatic hypermutation. Nevertheless, the alignment in the CDR3 region is not very reliable, so any filter based on CDR3_germline_similarity needs to be examined carefully.
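One way to act on this without discarding data is to flag TCR chains for manual review instead of filtering them. The sketch below is only an illustration: `flag_low_similarity` is a hypothetical helper, the threshold has no recommended value in this thread and must be chosen by the user, and the TCR/BCR split relies on gene names starting with `TR` or `IG`.

```python
def flag_low_similarity(v_gene, similarity, tcr_min_similarity):
    """Return True when a TCR chain falls below the user-chosen similarity threshold.

    BCR chains (IG* genes) are never flagged here, because somatic hypermutation
    can lower their germline similarity legitimately.
    """
    is_tcr = v_gene.upper().startswith("TR")
    return is_tcr and similarity < tcr_min_similarity


# The threshold of 80.0 is arbitrary and used only for this example.
print(flag_low_similarity("TRBV7-9*01", 70.0, tcr_min_similarity=80.0))
print(flag_low_similarity("IGHV3-23*01", 70.0, tcr_min_similarity=80.0))
```

Because the CDR3 alignment itself is not very reliable, flagged chains are better inspected than dropped automatically.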

1.6. Currently, after running ./run-trust4, I run barcoderep-filter.py, then transfer to 10x format. Do you think this is okay for most analyses? Are there any other considerations for further filtering? Maybe using files besides barcode_report.tsv?

I think more and more analysis software is compatible with the AIRR format now. You can consider using that file for downstream analysis.
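As a rough sketch of what downstream use of an AIRR file can look like, the snippet below reads a tab-separated AIRR rearrangement table with Python's standard `csv` module and collects the productive junctions per cell. The column names (`cell_id`, `junction_aa`, `productive`, with `T`/`F` booleans) follow the AIRR rearrangement schema, but whether your TRUST4 output carries all of them is an assumption to verify. The rows are invented for the example.

```python
import csv
import io
from collections import defaultdict


def junctions_by_cell(handle):
    """Map each cell_id to the junction_aa values of its productive rearrangements."""
    cells = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["productive"] == "T" and row["junction_aa"]:
            cells[row["cell_id"]].append(row["junction_aa"])
    return dict(cells)


# Invented example rows; a real run would open the AIRR TSV file instead.
rows = [
    ["sequence_id", "cell_id", "v_call", "j_call", "junction_aa", "productive"],
    ["c1", "AAAC-1", "TRBV7-9*01", "TRBJ2-1*01", "CASSF", "T"],
    ["c2", "AAAC-1", "TRAV1-2*01", "TRAJ33*01", "CAVMF", "T"],
    ["c3", "AAAG-1", "TRBV5-1*01", "TRBJ1-1*01", "CAS*F", "F"],
]
sample = io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")
print(junctions_by_cell(sample))
```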

2.1. After checking some of your previous comments on other issues and trying to use 3' data to generate TCR data, I understand that the recovery rate of 3' scRNA-seq is truly low, making it hard to draw definite conclusions. However, rather than comparing TCR information between different groups, if we simply want to prove the existence of some clones/patterns in some samples—and I did successfully find these clonotypes—would it be appropriate to use TRUST4 for 3' scRNA-seq data in this way?

Yes. While the sensitivity on 3' data can be low, the precision is quite good (on par with the sequencing error rate, I believe). So if a CDR3 is identified, you can conclude that it is present in your sample.

2.2. You have mentioned in other issues regarding the use of 3' data: "The CDR3 sequence is more robust to use, so I think you can use it for downstream analysis, along with the isotypes associated with those CDR3s." I assume using the CDR3 instead of the full VDJ is a more appropriate way to identify a TCR/BCR clone when using 3' data. Is that correct?

Yes. Even for 5' data, the CDR3 is good to use. It is hard to find full VDJ sequences in unamplified data.
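Using the CDR3 as the clone identifier can be sketched like this. It is a simplified illustration with invented barcodes and sequences: cells are grouped by CDR3 amino-acid sequence alone, and `group_by_cdr3` is a hypothetical helper. A real analysis would probably also keep the chain type or V/J genes alongside the CDR3.

```python
from collections import defaultdict


def group_by_cdr3(cells):
    """Group cell barcodes by CDR3 amino-acid sequence.

    cells: iterable of (barcode, cdr3_aa) pairs.
    Returns a dict mapping each CDR3 to a sorted list of barcodes.
    """
    clones = defaultdict(set)
    for barcode, cdr3_aa in cells:
        clones[cdr3_aa].add(barcode)
    return {cdr3: sorted(barcodes) for cdr3, barcodes in clones.items()}


# Invented example data.
cells = [("AAAC-1", "CASSF"), ("AAAG-1", "CASSF"), ("AACC-1", "CASSLGF")]
clones = group_by_cdr3(cells)

# Check whether a clone of interest was recovered in this sample.
print(clones.get("CASSF", []))
```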

2.3. Do you have any other recommendations if I can only use 3' data to achieve what I mentioned (simply wanting to prove the existence of some clones/patterns in some samples)? Select samples with larger numbers of T/B cells? Are there any metrics to evaluate whether the reconstruction is relatively good or bad?

Selecting samples with more immune cells will definitely help. Another factor is the sequencing depth per cell: the more reads you have, the higher the chance of covering the CDR3 region.

2.4. The following question is based on my observation when using 3' data but might apply in other situations. I saw some γδ T cells with γ and δ chains, but these cells seem to be αβ T cells based on scRNA clustering. Would it be possible that in some cells it is hard for 3' scRNA-seq to capture reads for the α and β chains, so that if minor γ and δ chains are found, they become the primary chains in those cells?

Yes, it is possible. That said, in some studies γδ T cells cluster in the CD8 T cell space along with αβ T cells. You can check the expression of the TRDC and TRGC genes to confirm.
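The expression check could look like the sketch below. It assumes you can pull per-cell gene counts out of your expression matrix as a plain dictionary, and it uses the human constant-region gene symbols TRDC, TRGC1 and TRGC2. The helper names and the "any nonzero count" rule are hypothetical and deliberately naive.

```python
# Human constant-region genes of the delta and gamma chains.
GD_CONSTANT_GENES = ("TRDC", "TRGC1", "TRGC2")


def gd_constant_counts(cell_counts):
    """Return the counts of the delta/gamma constant genes for one cell (0 if absent)."""
    return {gene: cell_counts.get(gene, 0) for gene in GD_CONSTANT_GENES}


def supports_gd_call(cell_counts):
    """Naive rule: any nonzero count of a delta/gamma constant gene counts as support."""
    return any(count > 0 for count in gd_constant_counts(cell_counts).values())


# Invented counts for two cells that were assigned gamma/delta chains.
cell_a = {"CD3E": 5, "TRDC": 2, "TRGC1": 1}
cell_b = {"CD3E": 4, "TRAC": 3, "TRBC2": 2}
print(supports_gd_call(cell_a))
print(supports_gd_call(cell_b))
```

A nonzero count is only weak evidence on its own, so comparing these genes between the candidate cells and the rest of the cluster is likely more informative than a per-cell yes/no.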

Please let me know if you have other questions.