Enquiry about 10x Genomics data on TRUST4

cwxiix commented 3 years ago

Hello,

I have questions about the software outputs especially on 10x Genomics data.

First, will there be more specific explainations about the meaning of each column names in all output files? For example, I know raw.out and final.out are contigs and nucleotide weights. Then how I can understand this as a student in Statistics major or should that be common sense for Biology students? If in this case, I can do more research by myself. I am also not sure how to understand consensus assembly and consensus_id.
Second, from TRUST4 paper, in evaluation of performance of 10x Genomics data, "TRUST4 made 5,091 T- and 1,318 B-cell calls". Is this link the data you were using and how to count cell numbers?
Third, there are barcode_report.tsv and report.tsv file for 10x Genomics case. Then how to connect the information in this two table in another word, what is the key for these two tables so that I can use information for both barcode and CDR3 sequence?
Fourth, can I consider number of lines in barcode_report.tsv as number of cells detected? Or if I sum the first column, the number of counts, will sum be the number of cells since you mentioned that count in 10x case are number of barcodes?
Fifth, for comparing TRUST result with CellRanger result from 10x Genomics, may I get a clear link or file which file are you using for doing comparison?
Sixth, in the paper for PBMC 10x Genomics data, are you using bam files of VDJ libraries or using bam file from gex library?

I appreciate your help.

Sincerely, Chenxi

mourisl commented 3 years ago

The weight is just the number of the reads supporting the nucleotide. Consensus means most reads support these nucleotides. consensus_id is the consensus sequence id. The format for other files are explained in detail in README.
Yes, it the 10X vdj nextgem data.
report.tsv shows how many cells each CDR3 show up. barcode_report.tsv shows the clonotypes(CDR3s) for each cell. One difference is that barcode_report.tsv selects at most two chains per cell based on abundance, and the remaining filtered/secondary CDR3s are still counted in report.tsv file.
The number of lines in barcode_report.tsv is the number of cells detected.
The comparison with against the vdj_nextgem_hspbmc3{b,t}_all_contig_annotations.csv file. Note that the analysis only considered the cells passed Seurat default QC.
I used bam file from gex library.

cwxiix commented 3 years ago

Thank you for your reply,

I am still confused about the cell numbers. From the website used for downloading data, in V(D)J Enriched Libraries, the key cell metric for TCR Libraries, cell detected is 6648 and that for Ig libraries is 1309. However, the number of lines in TCR library barcode_report.tsv file is 38045 which is far more than website info. Did I make anything wrong? Do you think you can help me figure out a reason for this difference or they should not be same or similar?

Thanks again.

mourisl commented 3 years ago

When we do the comparison, we only considered the cells(barcodes) that passed Seurat scRNA-seq QC, which filtered cells based on the number of reads in the cell, number of genes in the cells, and etc. I think this is the common way for the analysis: people cluster/annotate the cell clusters based on Seurat (or other tools) results, and overlay the TCR/BCR information on that.

cwxiix commented 3 years ago

Thank you for your reply. Then I will work on Seurat first. I might still want to leave this issue open until I work everything out. Hope you won't mind this or I can close it and open a new one later when I have another new question.

cwxiix commented 3 years ago

Since you mentioned the difference between barcode_report.tsv file and report.tsv file with more CDR3 counts in the later file. Then does this mean I can not link barcode information with CDR3nt in both table? If I can connect two tables, then will CDR3nt or consensus id be the key? Also, for the meaning of abundance, may I consider it means the number of mapped reads in 10x Genomics data?

mourisl commented 3 years ago

You need to usse CDR3nt as the key two connect these two files. The extra CDR3s in report.tsv file are usually shown in the last two columns in barcode_report file as the secondary/noise/non-functional CDR3s. For the abundance, do you mean the abundance in the barcode_report.tsv file or report.tsv file? The abundance in barcode_report.tsv for each chain is the number of mapped reads or UMIs on the CDR3. The abundance in report.tsv file is the number of cells contains the cdr3.

cwxiix commented 3 years ago

I get it! Thank you again.

cwxiix commented 3 years ago

Also after I sum up the frequency, the sum is 6. I know from README, for frequency, the BCR and TCR chains are normalized respectively. I was thinking I should get frequency sum to be 1. Is this suitable for this normalized case? Also, what is the way of normalization? just subtract mean and divided by standard deviation?

cwxiix commented 3 years ago

Could you please explain more about "similarity" and "CDR3 scores"? I knew you mentioned that the score 1 means CDR3 with imputed nucleotides and other numbers are the motif signal. How can I understand more about motif signal and meaning of imputed nucleotides?

mourisl commented 3 years ago

"Similarity" is the alignment similarity between CDR3 and germline V, D, J sequences that overlaps with the CDR3 region. The amino acid motif of CDR3 is something like "YYC" on 5' and "F/WGxG" on 3' end. So the fraction of the 6 amino acids will be the score. Imputed sequence means a partial assembly may overlap with V or J gene, then we can use the germline sequence to fill the remaining portion for TCRs. Note that the score is divided 100 in the report.tsv file, so score 0.01 is for imputed CDR3.

cwxiix commented 3 years ago

I see, That is a lot clear to me. Do you think you can give me some hint to one of my earlier post?

Also after I sum up the frequency, the sum is 6. I know from README, for frequency, the BCR and TCR chains are normalized respectively. I was thinking I should get frequency sum to be 1. Is this suitable for this normalized case? Also, what is the way of normalization? just subtract mean and divided by standard deviation?

Thank you again

mourisl commented 3 years ago

The frequencies are the fraction of CDR3s in IGH, IGK+IGL, TRB, TRA, TRG, TRD respectively. So it sums up to 6.

cwxiix commented 3 years ago

I get it. It comletely makes sense to me now. Thank you again.

cwxiix commented 3 years ago

Hello,

May I know details about your analysis procedure? Please tell me if I think it in a right way or not. I also have questions about it:

Download the bam file from 10x PBMC dataset link. And it is especially the bam file. Here did you also download the fastq files to compare the results? If so, what are the exact results of two files? How similar are they?
You mentioned Seurat, so may I know how you processed the QC? Did you use the file processed by Cellranger( the feature/cell matrix HDF5 file)? If so, is that the one filtered or raw?

If process the fastqs or bam files with QC before TRUST4, then how did you achieve this?
If you apply QC with output file of CellRanger in Seurat, then how did you apply the QC output to the outputs of TRUST4 generated by original fastq or bam files?

Did you use any barcode whitelist file during your procedure for this specific 10x PBMC data? If so, is it 737K-august-2016.txt file from Cellranger software? How will with or without whitelist file affect results?
Also, may I know how you reach the number of T-cells and B-cells? Did you count them in the barcode_report.tsv file in row number of each cell type?

The reason I ask this is because I didn't get exact same number of T-cells and B-cells so that I thought I did something wrong. I don't completely understand paper yet. After I run bam file data in TRUST4, I read.csv file _barcode_report.tsv file into R and the total number of lines is 8987 and if I table() the second column which is the cell type I got this `> table(tablebam_barcode_report[,2])

abT B gdT 5547 3331 109 ` The cell numbers are far more than what you mentioned in the paper,"TRUST4 made 5,091 T- and 1,318 B-cell calls (Fig. 2a and Supplementary Fig. 5a)." Thus ,these are the reason I ask all questions above.

I know there is an evaluation script of how you did the evaluation. I barely know python so do you think I should try to figure out your python code first so that I can make more sense? Could you please give me advice? I really appreciate your help.

cwxiix commented 3 years ago

Also, the barcodes in bam files outputs are barcode with "-1" at the end in each barcodes but those are missing in fastq file output barcode_report.tsv. Is this normal? or did I miss anything?

head(tablebam_barcode_report) X.barcode cell_type 1 AGAGTGGTCTATCCTA-1 abT 2 TCGAGGCCAAGTAGTA-1 abT 3 CCTTTCTGTTGGTTTG-1 abT 4 AAGTCTGCAATCTACG-1 abT 5 TCTATTGAGGACGAAA-1 abT 6 GCGCGATAGTTGCAGG-1 abT

head(tablegex_barcode_report) X.barcode cell_type 1 AGCTTGATCTAACCGA abT 2 AAACGGGCAATAACGA abT 3 CCTACACGTGCAATTT B 4 ACCAGTATCAAACCAC abT 5 GGATGTTTCGTTTAGG abT 6 CGAGCCACAGGTCTCG abT

mourisl commented 3 years ago

In the paper, the main evaluation is based on the evaluation with BAM file. In supplementary, we also evaluated the results with FASTQ input with barcode correction based on 737K-august-2016.txt file. The results were highly similar.
For Seurat, we used the default options to conduct clustering from the cellranger output, and I think it was raw_count matrix. We then extract the barcode in the final Seurat object to filter TRUST4's output. So there is no QC for TRUST4 itself. TRUST4 ran on the raw data, and we just considered valid barcodes from Seurat later.
Answered above. The barcode correction improved the accuracy a little bit. In the paper, the main analysis was based on the cellranger bam, which contains the field for corrected barcode.
I counted those numbers from the barcode_report.tsv file, and only counted the barcodes in Seurat result.
The "-1" is added to the barcode from cellranger for something like cell-group (?). The information is not in the raw fastq file, so the barcode is just the barcode.

cwxiix commented 3 years ago

I get it. Thank you so much!

cwxiix commented 3 years ago

Hello,

Here still something I want to ask and make sure.

In cdr3.out file, How can I understand the column read_fragment_count?
For consensus id, in 10x Genomics data case, the consensus id are composed of barcode and '_0' and other numbers. Then will consensus_id the same thing with barcode or actually they are different index?
Is the column full_length_assembly means 0 if it is not full length and 1 if it is full length assemble? How can I utilize this information?

mourisl commented 3 years ago

read_fragment is the number of reads supporting this CDR3 from the cell. If UMI is given, this number will be the number of UMIs.
Since each cell can have multiple assembled sequences, the "_0", "_1" are used to index those sequences from the same cell. So if you remove the suffix, it is the barcode.
If you are interested in the sequences outside of CDR3, you can extract those full-length assemblies from the annot.fa file based on the consensus_id for further analysis, such as receptor protein structure.

cwxiix commented 3 years ago

Thank you first, then if the read_fragment has decimals. Then is this normal?

mourisl commented 3 years ago

When TRUST4 tries to decode the CDR3 encoded in the consensus, a read partially overlapped with CDR3 region could be compatible with more than 1 CDR3 type. Therefore, TRUST4 applied EM algorithm to quantify the CDR3s from the same consensus and will generate decimal abundance.

cwxiix commented 3 years ago

Thanks a lot.

Also may I know the connection and differences of _raw.out file and _final.out file? It seems that they have same records of consensus id, then maybe the differences in the sequences or weights?
In _annot.fa file, may I understand the calculation of average_coverage is the by average the weights of that sequence?
In much earlier reply, you told me the fraction of 6 amino acids is the score. Why 6? and which 6 amino acids are the motif signal strength?

mourisl commented 3 years ago

raw.out is the result of first-round assembly or raw contigs. final.out is the result of scaffolding, where it rearranges the contigs to satisfy the mate-pair information. If it is single-cell data, the most abundant clonotype in the cell should be assembled as a contig already, so the scaffolding usually does not improve the resulte. Therefore this step is skipped to save running time, and you will see the identical raw.out and final.out file.
Average_coverage is actually the total coverage / 500, where 500 is roughly the size of the variable region. As a result, short assemblies will have lower average coverage. Coverage is the weight of that sequence.
The 6 amino acid motif of CDR3 is "YYC" on 5' and "F/WGxG" on 3' end as answered before. It is used by IMGT .

cwxiix commented 3 years ago

I see. I was thinking there are 7 but since one of them can be anyone so it actually doesn't count. That make sense for me now. Thank you very much.

cwxiix commented 3 years ago

Another enquiry is:

Is there a clear connection among _annot.fa, _cdr3.out, _report.tsv, and barcode_report.tsv files both in scRNA-seq and bulk RNA-seq data？

For example, for the 10x Genomics data I am exploring at, there are 145635 records there in the annot.fa file but the consensus id are given with *****_155197 in the last line with "tail‘ code. Here, I was thinking the consensus id were given consecutively. May I know why total number of records are not consistent with the ids? Why some of them are filtered or not appear in annot.fa file?
From _annot.fa file to cdr3.out file, what are the ways for filtering the records. Is that the CDR3_score or something else? In cdr3.out file, there are 29799 number of lines. Thus, I want to know what made 145635 records be filtered down to 29799 records. or did I make any mistakes during these calculation that ,the total number of rows in _cdr3.out should match number of records in _annot.fa?
I noticed you mentioned _report.tsv is a summary of cdr3.out. Then how was it summarized? There are again smaller number of consensus_id as 11021 ids here. Is that only the one with CDR3 sequences from _cdr3.out will be kept in _report.tsv?
About barcode_report.tsv, again what is the connection between this file with others and how is barcode added to each sequences? By matching back the sequence to the reads?

Besides above confusions, May I get any advice on if I want to know:

How many reads are assigned to each cell?
How many genes are covered by each cell? Is this something I can get from output of TRUST4 or I have to figure out on bam file level with other software or from 10x Genomics data websites?

Sorry for I can't figure all these on my own and I really appreciate your replies and help.

mourisl commented 3 years ago

annot.fa=>(read alignment for abundance estimation, decode minor CDR3 sequences)=>cdr3.out=>report.tsv/barcode_report.tsv
During assembly, some of the contigs got merged and their id will be skipped in the output.
From annot.fa to cdr3.out, there is no filtration. Some (Many) assemblies in annot.fa has no CDR3 region, so they will not be output to cdr3.out .
report.tsv merges the entries in cdr3.out with the same V, J, C genes and CDR3 sequences. Only one contig id will show up here, and it is the most abundant one for that entry.
barcode_report.tsv is also based on the cdr3.out file. The barcode is in the contig name, so it is also presented in the first column in the cdr3.out.

As for the other information, these should be done with methods like STAR_SOLO. TRUST4 only focuses on the reads from TCRs and BCRs.

cwxiix commented 3 years ago

Thank you very much for your reply. This is much clear for me and I will do some research on STARsolo.

liulab-dfci / TRUST4

Enquiry about 10x Genomics data on TRUST4 #70