liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
268 stars 46 forks source link

Enquiry about 10x Genomics data on TRUST4 #70

Closed cwxiix closed 3 years ago

cwxiix commented 3 years ago

Hello,

I have questions about the software outputs especially on 10x Genomics data.

I appreciate your help.

Sincerely, Chenxi

mourisl commented 3 years ago
  1. The weight is just the number of the reads supporting the nucleotide. Consensus means most reads support these nucleotides. consensus_id is the consensus sequence id. The format for other files are explained in detail in README.
  2. Yes, it the 10X vdj nextgem data.
  3. report.tsv shows how many cells each CDR3 show up. barcode_report.tsv shows the clonotypes(CDR3s) for each cell. One difference is that barcode_report.tsv selects at most two chains per cell based on abundance, and the remaining filtered/secondary CDR3s are still counted in report.tsv file.
  4. The number of lines in barcode_report.tsv is the number of cells detected.
  5. The comparison with against the vdj_nextgem_hspbmc3{b,t}_all_contig_annotations.csv file. Note that the analysis only considered the cells passed Seurat default QC.
  6. I used bam file from gex library.
cwxiix commented 3 years ago

Thank you for your reply,

I am still confused about the cell numbers. From the website used for downloading data, in V(D)J Enriched Libraries, the key cell metric for TCR Libraries, cell detected is 6648 and that for Ig libraries is 1309. However, the number of lines in TCR library barcode_report.tsv file is 38045 which is far more than website info. Did I make anything wrong? Do you think you can help me figure out a reason for this difference or they should not be same or similar?

Thanks again.

mourisl commented 3 years ago

When we do the comparison, we only considered the cells(barcodes) that passed Seurat scRNA-seq QC, which filtered cells based on the number of reads in the cell, number of genes in the cells, and etc. I think this is the common way for the analysis: people cluster/annotate the cell clusters based on Seurat (or other tools) results, and overlay the TCR/BCR information on that.

cwxiix commented 3 years ago

Thank you for your reply. Then I will work on Seurat first. I might still want to leave this issue open until I work everything out. Hope you won't mind this or I can close it and open a new one later when I have another new question.

cwxiix commented 3 years ago

Since you mentioned the difference between barcode_report.tsv file and report.tsv file with more CDR3 counts in the later file. Then does this mean I can not link barcode information with CDR3nt in both table? If I can connect two tables, then will CDR3nt or consensus id be the key? Also, for the meaning of abundance, may I consider it means the number of mapped reads in 10x Genomics data?

mourisl commented 3 years ago

You need to usse CDR3nt as the key two connect these two files. The extra CDR3s in report.tsv file are usually shown in the last two columns in barcode_report file as the secondary/noise/non-functional CDR3s. For the abundance, do you mean the abundance in the barcode_report.tsv file or report.tsv file? The abundance in barcode_report.tsv for each chain is the number of mapped reads or UMIs on the CDR3. The abundance in report.tsv file is the number of cells contains the cdr3.

cwxiix commented 3 years ago

I get it! Thank you again.

cwxiix commented 3 years ago

Also after I sum up the frequency, the sum is 6. I know from README, for frequency, the BCR and TCR chains are normalized respectively. I was thinking I should get frequency sum to be 1. Is this suitable for this normalized case? Also, what is the way of normalization? just subtract mean and divided by standard deviation?

cwxiix commented 3 years ago

Could you please explain more about "similarity" and "CDR3 scores"? I knew you mentioned that the score 1 means CDR3 with imputed nucleotides and other numbers are the motif signal. How can I understand more about motif signal and meaning of imputed nucleotides?

mourisl commented 3 years ago

"Similarity" is the alignment similarity between CDR3 and germline V, D, J sequences that overlaps with the CDR3 region. The amino acid motif of CDR3 is something like "YYC" on 5' and "F/WGxG" on 3' end. So the fraction of the 6 amino acids will be the score. Imputed sequence means a partial assembly may overlap with V or J gene, then we can use the germline sequence to fill the remaining portion for TCRs. Note that the score is divided 100 in the report.tsv file, so score 0.01 is for imputed CDR3.

cwxiix commented 3 years ago

I see, That is a lot clear to me. Do you think you can give me some hint to one of my earlier post?

Also after I sum up the frequency, the sum is 6. I know from README, for frequency, the BCR and TCR chains are normalized respectively. I was thinking I should get frequency sum to be 1. Is this suitable for this normalized case? Also, what is the way of normalization? just subtract mean and divided by standard deviation?

Thank you again

mourisl commented 3 years ago

The frequencies are the fraction of CDR3s in IGH, IGK+IGL, TRB, TRA, TRG, TRD respectively. So it sums up to 6.

cwxiix commented 3 years ago

I get it. It comletely makes sense to me now. Thank you again.

cwxiix commented 3 years ago

Hello,

May I know details about your analysis procedure? Please tell me if I think it in a right way or not. I also have questions about it:

  1. Download the bam file from 10x PBMC dataset link. And it is especially the bam file. Here did you also download the fastq files to compare the results? If so, what are the exact results of two files? How similar are they?
  2. You mentioned Seurat, so may I know how you processed the QC? Did you use the file processed by Cellranger( the feature/cell matrix HDF5 file)? If so, is that the one filtered or raw?
  1. Did you use any barcode whitelist file during your procedure for this specific 10x PBMC data? If so, is it 737K-august-2016.txt file from Cellranger software? How will with or without whitelist file affect results?
  2. Also, may I know how you reach the number of T-cells and B-cells? Did you count them in the barcode_report.tsv file in row number of each cell type?

The reason I ask this is because I didn't get exact same number of T-cells and B-cells so that I thought I did something wrong. I don't completely understand paper yet. After I run bam file data in TRUST4, I read.csv file _barcode_report.tsv file into R and the total number of lines is 8987 and if I table() the second column which is the cell type I got this `> table(tablebam_barcode_report[,2])

abT B gdT 5547 3331 109 ` The cell numbers are far more than what you mentioned in the paper,"TRUST4 made 5,091 T- and 1,318 B-cell calls (Fig. 2a and Supplementary Fig. 5a)." Thus ,these are the reason I ask all questions above.

I know there is an evaluation script of how you did the evaluation. I barely know python so do you think I should try to figure out your python code first so that I can make more sense? Could you please give me advice? I really appreciate your help.

cwxiix commented 3 years ago

Also, the barcodes in bam files outputs are barcode with "-1" at the end in each barcodes but those are missing in fastq file output barcode_report.tsv. Is this normal? or did I miss anything?

head(tablebam_barcode_report) X.barcode cell_type 1 AGAGTGGTCTATCCTA-1 abT 2 TCGAGGCCAAGTAGTA-1 abT 3 CCTTTCTGTTGGTTTG-1 abT 4 AAGTCTGCAATCTACG-1 abT 5 TCTATTGAGGACGAAA-1 abT 6 GCGCGATAGTTGCAGG-1 abT

head(tablegex_barcode_report) X.barcode cell_type 1 AGCTTGATCTAACCGA abT 2 AAACGGGCAATAACGA abT 3 CCTACACGTGCAATTT B 4 ACCAGTATCAAACCAC abT 5 GGATGTTTCGTTTAGG abT 6 CGAGCCACAGGTCTCG abT

mourisl commented 3 years ago
  1. In the paper, the main evaluation is based on the evaluation with BAM file. In supplementary, we also evaluated the results with FASTQ input with barcode correction based on 737K-august-2016.txt file. The results were highly similar.

  2. For Seurat, we used the default options to conduct clustering from the cellranger output, and I think it was raw_count matrix. We then extract the barcode in the final Seurat object to filter TRUST4's output. So there is no QC for TRUST4 itself. TRUST4 ran on the raw data, and we just considered valid barcodes from Seurat later.

  3. Answered above. The barcode correction improved the accuracy a little bit. In the paper, the main analysis was based on the cellranger bam, which contains the field for corrected barcode.

  4. I counted those numbers from the barcode_report.tsv file, and only counted the barcodes in Seurat result.

  5. The "-1" is added to the barcode from cellranger for something like cell-group (?). The information is not in the raw fastq file, so the barcode is just the barcode.

cwxiix commented 3 years ago

I get it. Thank you so much!

cwxiix commented 3 years ago

Hello,

Here still something I want to ask and make sure.

mourisl commented 3 years ago
  1. read_fragment is the number of reads supporting this CDR3 from the cell. If UMI is given, this number will be the number of UMIs.
  2. Since each cell can have multiple assembled sequences, the "_0", "_1" are used to index those sequences from the same cell. So if you remove the suffix, it is the barcode.
  3. If you are interested in the sequences outside of CDR3, you can extract those full-length assemblies from the annot.fa file based on the consensus_id for further analysis, such as receptor protein structure.
cwxiix commented 3 years ago

Thank you first, then if the read_fragment has decimals. Then is this normal?

mourisl commented 3 years ago

When TRUST4 tries to decode the CDR3 encoded in the consensus, a read partially overlapped with CDR3 region could be compatible with more than 1 CDR3 type. Therefore, TRUST4 applied EM algorithm to quantify the CDR3s from the same consensus and will generate decimal abundance.

cwxiix commented 3 years ago

Thanks a lot.

mourisl commented 3 years ago
  1. raw.out is the result of first-round assembly or raw contigs. final.out is the result of scaffolding, where it rearranges the contigs to satisfy the mate-pair information. If it is single-cell data, the most abundant clonotype in the cell should be assembled as a contig already, so the scaffolding usually does not improve the resulte. Therefore this step is skipped to save running time, and you will see the identical raw.out and final.out file.
  2. Average_coverage is actually the total coverage / 500, where 500 is roughly the size of the variable region. As a result, short assemblies will have lower average coverage. Coverage is the weight of that sequence.
  3. The 6 amino acid motif of CDR3 is "YYC" on 5' and "F/WGxG" on 3' end as answered before. It is used by IMGT .
cwxiix commented 3 years ago

I see. I was thinking there are 7 but since one of them can be anyone so it actually doesn't count. That make sense for me now. Thank you very much.

cwxiix commented 3 years ago

Another enquiry is:

Is there a clear connection among _annot.fa, _cdr3.out, _report.tsv, and barcode_report.tsv files both in scRNA-seq and bulk RNA-seq data?

Besides above confusions, May I get any advice on if I want to know:

Sorry for I can't figure all these on my own and I really appreciate your replies and help.

mourisl commented 3 years ago
  1. annot.fa=>(read alignment for abundance estimation, decode minor CDR3 sequences)=>cdr3.out=>report.tsv/barcode_report.tsv

  2. During assembly, some of the contigs got merged and their id will be skipped in the output.

  3. From annot.fa to cdr3.out, there is no filtration. Some (Many) assemblies in annot.fa has no CDR3 region, so they will not be output to cdr3.out .

  4. report.tsv merges the entries in cdr3.out with the same V, J, C genes and CDR3 sequences. Only one contig id will show up here, and it is the most abundant one for that entry.

  5. barcode_report.tsv is also based on the cdr3.out file. The barcode is in the contig name, so it is also presented in the first column in the cdr3.out.

As for the other information, these should be done with methods like STAR_SOLO. TRUST4 only focuses on the reads from TCRs and BCRs.

cwxiix commented 3 years ago

Thank you very much for your reply. This is much clear for me and I will do some research on STARsolo.