Closed cwxiix closed 3 years ago
Thank you for your reply,
I am still confused about the cell numbers. On the website where I downloaded the data, under V(D)J Enriched Libraries, the key cell metric "cells detected" is 6648 for the TCR library and 1309 for the Ig library. However, the TCR library's barcode_report.tsv file has 38045 lines, which is far more than the website reports. Did I do something wrong? Can you help me figure out the reason for this difference, or are these numbers not expected to be the same or similar?
Thanks again.
When we did the comparison, we only considered the cells (barcodes) that passed Seurat scRNA-seq QC, which filters cells based on the number of reads in the cell, the number of genes in the cell, etc. I think this is the common way to do the analysis: people cluster and annotate the cells based on Seurat (or other tools) results, and overlay the TCR/BCR information on that.
Thank you for your reply. Then I will work on Seurat first. I might still want to leave this issue open until I work everything out. Hope you won't mind this or I can close it and open a new one later when I have another new question.
You mentioned that the report.tsv file has more CDR3 entries than the barcode_report.tsv file. Does this mean I cannot link the barcode information with CDR3nt across the two tables? If I can connect the two tables, should CDR3nt or the consensus id be the key? Also, for the meaning of abundance, may I take it to be the number of mapped reads in 10x Genomics data?
You need to use CDR3nt as the key to connect these two files. The extra CDR3s in the report.tsv file are usually shown in the last two columns of the barcode_report.tsv file as the secondary/noise/non-functional CDR3s. For the abundance, do you mean the abundance in the barcode_report.tsv file or in the report.tsv file? The abundance in barcode_report.tsv for each chain is the number of mapped reads or UMIs on the CDR3. The abundance in the report.tsv file is the number of cells that contain the CDR3.
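A minimal sketch of linking the two tables on CDR3nt, assuming the relevant fields have already been parsed out of each file. The record layout, the field names (`cdr3nt`, `chain1_cdr3nt`, `cells`) and the sequences below are hypothetical and only illustrate the join; the real files carry more columns.

```python
# Hypothetical, simplified records; the real files have more columns.
# In report.tsv the abundance is the number of cells containing the CDR3.
report_rows = [
    {"cdr3nt": "TGTGCCAGCAGTTTC", "cells": 2},
    {"cdr3nt": "TGTGCTGTGAGAGAT", "cells": 1},
]
barcode_rows = [
    {"barcode": "AGAGTGGTCTATCCTA-1", "chain1_cdr3nt": "TGTGCCAGCAGTTTC"},
    {"barcode": "TCGAGGCCAAGTAGTA-1", "chain1_cdr3nt": "TGTGCCAGCAGTTTC"},
    {"barcode": "CCTTTCTGTTGGTTTG-1", "chain1_cdr3nt": "TGTGCTGTGAGAGAT"},
]


def barcodes_by_cdr3(rows):
    """Index barcodes by the CDR3 nucleotide sequence of their primary chain."""
    index = {}
    for row in rows:
        index.setdefault(row["chain1_cdr3nt"], []).append(row["barcode"])
    return index


cdr3_index = barcodes_by_cdr3(barcode_rows)

# Attach the list of barcodes to every report entry, using CDR3nt as the key.
linked = [
    {**row, "barcodes": cdr3_index.get(row["cdr3nt"], [])}
    for row in report_rows
]

for entry in linked:
    print(entry["cdr3nt"], entry["cells"], entry["barcodes"])
```

Report entries with no matching primary chain get an empty barcode list here; per the reply above, those would typically be found among the secondary chains instead.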
I get it! Thank you again.
Also after I sum up the frequency, the sum is 6. I know from README, for frequency, the BCR and TCR chains are normalized respectively. I was thinking I should get frequency sum to be 1. Is this suitable for this normalized case? Also, what is the way of normalization? just subtract mean and divided by standard deviation?
Could you please explain more about "similarity" and "CDR3 scores"? I knew you mentioned that the score 1 means CDR3 with imputed nucleotides and other numbers are the motif signal. How can I understand more about motif signal and meaning of imputed nucleotides?
"Similarity" is the alignment similarity between the CDR3 and the germline V, D, J sequences that overlap with the CDR3 region. The amino acid motif of a CDR3 is something like "YYC" on the 5' end and "F/WGxG" on the 3' end, so the fraction of these 6 amino acids that match will be the score. Imputed sequence means a partial assembly may overlap with the V or J gene, and then we can use the germline sequence to fill in the remaining portion for TCRs. Note that the score is divided by 100 in the report.tsv file, so a score of 0.01 indicates an imputed CDR3.
I see, that is a lot clearer to me. Do you think you could give me a hint about one of my earlier posts?
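To make the "fraction of the 6 amino acids" concrete, here is a toy sketch. It is not TRUST4's implementation: the function name `motif_fraction`, the way the flanking residues are passed in, and the returned 0 to 1 scale are all assumptions made for illustration.

```python
def motif_fraction(upstream3, downstream4):
    """Fraction of the 6 constrained motif positions that match.

    upstream3:   three amino acids expected to read "YYC" (5' side)
    downstream4: four amino acids expected to read F or W, G, any, G (3' side)
    The "x" position accepts any residue, so it is not counted.
    """
    if len(upstream3) != 3 or len(downstream4) != 4:
        raise ValueError("expected 3 upstream and 4 downstream residues")
    matches = sum(a == b for a, b in zip(upstream3, "YYC"))
    matches += downstream4[0] in ("F", "W")
    matches += downstream4[1] == "G"
    matches += downstream4[3] == "G"
    return matches / 6


print(motif_fraction("YYC", "FGQG"))  # all six constrained positions match
print(motif_fraction("YFC", "WGAG"))  # one mismatch on the 5' side
```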
Also after I sum up the frequency, the sum is 6. I know from README, for frequency, the BCR and TCR chains are normalized respectively. I was thinking I should get frequency sum to be 1. Is this suitable for this normalized case? Also, what is the way of normalization? just subtract mean and divided by standard deviation?
Thank you again
The frequencies are the fraction of CDR3s in IGH, IGK+IGL, TRB, TRA, TRG, TRD respectively. So it sums up to 6.
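A small sketch of why the total comes out as 6, assuming each entry's chain group can be read from the first three letters of its V gene name. The gene names and counts below are made up, and the grouping rule is an illustration rather than TRUST4's own code.

```python
from collections import defaultdict


def chain_group(v_gene):
    """Map a V gene name to one of the six normalization groups."""
    prefix = v_gene[:3]
    return "IGK+IGL" if prefix in ("IGK", "IGL") else prefix


def per_group_frequencies(rows):
    """Divide each count by the total count of its own chain group."""
    totals = defaultdict(float)
    for v_gene, count in rows:
        totals[chain_group(v_gene)] += count
    return [(v_gene, count / totals[chain_group(v_gene)]) for v_gene, count in rows]


# Hypothetical (V gene, count) pairs covering all six groups.
rows = [
    ("IGHV1-2", 3), ("IGHV3-23", 1),
    ("IGKV1-5", 2), ("IGLV2-14", 2),
    ("TRBV7-9", 5), ("TRAV12-1", 4),
    ("TRGV9", 1), ("TRDV2", 1),
]

frequencies = per_group_frequencies(rows)
total = sum(freq for _, freq in frequencies)
print(total)  # each of the six groups sums to 1, so the overall sum is 6
```

If a sample has no CDR3s for some group (for example no TRD), the overall sum would be correspondingly smaller than 6.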
I get it. It completely makes sense to me now. Thank you again.
Hello,
May I know details about your analysis procedure? Please tell me if I think it in a right way or not. I also have questions about it:
The reason I ask is that I didn't get exactly the same number of T cells and B cells, so I thought I did something wrong. I don't completely understand the paper yet. After running TRUST4 on the BAM file, I read the _barcode_report.tsv file into R with read.csv, and the total number of lines is 8987. If I table() the second column, which is the cell type, `table(tablebam_barcode_report[,2])` gives abT = 5547, B = 3331, gdT = 109.
These cell numbers are far more than what you mentioned in the paper: "TRUST4 made 5,091 T- and 1,318 B-cell calls (Fig. 2a and Supplementary Fig. 5a)." This is the reason I ask all the questions above.
I know there is an evaluation script of how you did the evaluation. I barely know python so do you think I should try to figure out your python code first so that I can make more sense? Could you please give me advice? I really appreciate your help.
Also, the barcodes in the BAM-based output have "-1" at the end of each barcode, but that suffix is missing in the barcode_report.tsv produced from the FASTQ files. Is this normal, or did I miss anything?

    > head(tablebam_barcode_report)
               X.barcode cell_type
    1 AGAGTGGTCTATCCTA-1       abT
    2 TCGAGGCCAAGTAGTA-1       abT
    3 CCTTTCTGTTGGTTTG-1       abT
    4 AAGTCTGCAATCTACG-1       abT
    5 TCTATTGAGGACGAAA-1       abT
    6 GCGCGATAGTTGCAGG-1       abT

    > head(tablegex_barcode_report)
             X.barcode cell_type
    1 AGCTTGATCTAACCGA       abT
    2 AAACGGGCAATAACGA       abT
    3 CCTACACGTGCAATTT         B
    4 ACCAGTATCAAACCAC       abT
    5 GGATGTTTCGTTTAGG       abT
    6 CGAGCCACAGGTCTCG       abT
In the paper, the main evaluation is based on the evaluation with BAM file. In supplementary, we also evaluated the results with FASTQ input with barcode correction based on 737K-august-2016.txt file. The results were highly similar.
For Seurat, we used the default options to conduct clustering from the cellranger output, and I think it was the raw count matrix. We then extracted the barcodes in the final Seurat object to filter TRUST4's output. So there is no QC for TRUST4 itself. TRUST4 ran on the raw data, and we just considered the valid barcodes from Seurat afterwards.
Answered above. The barcode correction improved the accuracy a little bit. In the paper, the main analysis was based on the cellranger bam, which contains the field for corrected barcode.
I counted those numbers from the barcode_report.tsv file, and only counted the barcodes in Seurat result.
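The counting step described above can be sketched as follows, assuming the Seurat barcodes have been exported to a plain list. The barcodes, the `count_calls` helper and the variable names are hypothetical; this is not the evaluation script from the paper.

```python
from collections import Counter

# Hypothetical (barcode, cell_type) pairs from the first two columns
# of barcode_report.tsv.
trust4_calls = [
    ("AGAGTGGTCTATCCTA-1", "abT"),
    ("TCGAGGCCAAGTAGTA-1", "abT"),
    ("CCTACACGTGCAATTT-1", "B"),
    ("GGATGTTTCGTTTAGG-1", "gdT"),
]

# Hypothetical set of barcodes kept in the final Seurat object.
seurat_barcodes = {"AGAGTGGTCTATCCTA-1", "CCTACACGTGCAATTT-1"}


def count_calls(calls, valid_barcodes):
    """Count cell-type calls, keeping only barcodes that passed QC."""
    return Counter(
        cell_type for barcode, cell_type in calls if barcode in valid_barcodes
    )


filtered_counts = count_calls(trust4_calls, seurat_barcodes)
print(dict(filtered_counts))
```

Counting without the filter reproduces the larger raw numbers, which is why the unfiltered `table()` output exceeds the figures quoted from the paper.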
The "-1" is added to the barcode from cellranger for something like cell-group (?). The information is not in the raw fastq file, so the barcode is just the barcode.
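If the BAM-based and FASTQ-based reports need to be compared, one option is to drop the suffix before matching. A minimal sketch, assuming the suffix is always a trailing hyphen followed by digits; the helper name `strip_barcode_suffix` is made up for this example.

```python
def strip_barcode_suffix(barcode):
    """Drop a trailing "-<digits>" suffix such as "-1", if present."""
    head, sep, tail = barcode.rpartition("-")
    return head if sep and tail.isdigit() else barcode


# Barcodes taken from the two head() outputs above, plus one shared example.
bam_barcodes = {"AGAGTGGTCTATCCTA-1", "TCGAGGCCAAGTAGTA-1", "AGCTTGATCTAACCGA-1"}
fastq_barcodes = {"AGCTTGATCTAACCGA", "AAACGGGCAATAACGA"}

shared = {strip_barcode_suffix(b) for b in bam_barcodes} & fastq_barcodes
print(sorted(shared))
```

This only makes sense when all barcodes come from a single library; with several libraries merged, the suffix may be what distinguishes them.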
I get it. Thank you so much!
Hello,
There is still something I want to ask about to make sure.
Thank you in advance. The read_fragment values have decimals. Is this normal?
When TRUST4 tries to decode the CDR3s encoded in a consensus, a read that partially overlaps the CDR3 region can be compatible with more than one CDR3 type. Therefore, TRUST4 applies an EM algorithm to quantify the CDR3s from the same consensus, which produces decimal abundances.
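A toy EM iteration shows how fractional abundances arise when some reads are ambiguous. This is a generic sketch, not TRUST4's code: the function `em_abundance`, the input format (one tuple of compatible CDR3 indices per read) and the fixed iteration count are assumptions.

```python
def em_abundance(read_compat, n_types, iterations=50):
    """Estimate per-CDR3 read counts when reads may fit several CDR3s.

    read_compat: one tuple per read listing the CDR3 indices it is
                 compatible with (must be non-empty).
    Returns expected (possibly fractional) read counts per CDR3.
    """
    weights = [1.0 / n_types] * n_types
    counts = [0.0] * n_types
    for _ in range(iterations):
        counts = [0.0] * n_types
        # E-step: split each read among its compatible CDR3s by current weight.
        for compat in read_compat:
            total = sum(weights[i] for i in compat)
            for i in compat:
                counts[i] += weights[i] / total
        # M-step: update the weights from the expected counts.
        weights = [c / len(read_compat) for c in counts]
    return counts


# Three reads unique to CDR3 0, one unique to CDR3 1, two compatible with both.
reads = [(0,), (0,), (0,), (1,), (0, 1), (0, 1)]
abundance = em_abundance(reads, n_types=2)
print([round(a, 3) for a in abundance])
```

The two ambiguous reads are shared in proportion to the evidence from the unique reads, so neither CDR3 ends up with a whole-number count.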
Thanks a lot.
I see. I was thinking there were 7, but since one of them can be any amino acid, it doesn't actually count. That makes sense to me now. Thank you very much.
Another enquiry is:
Is there a clear connection among _annot.fa, _cdr3.out, _report.tsv, and barcode_report.tsv files both in scRNA-seq and bulk RNA-seq data?
Besides above confusions, May I get any advice on if I want to know:
Sorry that I can't figure all of this out on my own. I really appreciate your replies and help.
annot.fa=>(read alignment for abundance estimation, decode minor CDR3 sequences)=>cdr3.out=>report.tsv/barcode_report.tsv
During assembly, some of the contigs got merged and their id will be skipped in the output.
From annot.fa to cdr3.out, there is no filtration. Some (many) assemblies in annot.fa have no CDR3 region, so they are not output to cdr3.out.
report.tsv merges the entries in cdr3.out with the same V, J, C genes and CDR3 sequences. Only one contig id will show up here, and it is the most abundant one for that entry.
barcode_report.tsv is also based on the cdr3.out file. The barcode is in the contig name, so it is also presented in the first column in the cdr3.out.
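The merge from cdr3.out to report.tsv described above can be sketched like this, assuming the entries are already parsed into dictionaries. The field names, gene names and contig ids are hypothetical, and summing the abundances of merged entries is an assumption made for illustration; only the grouping key and the "keep the most abundant contig id" rule come from the explanation above.

```python
def merge_entries(entries):
    """Merge entries sharing the same V, J, C genes and CDR3 sequence."""
    merged = {}
    for entry in entries:
        key = (entry["v"], entry["j"], entry["c"], entry["cdr3"])
        slot = merged.setdefault(
            key, {"count": 0.0, "contig": entry["contig"], "best": entry["count"]}
        )
        slot["count"] += entry["count"]
        # Keep the id of the most abundant contig for this merged entry.
        if entry["count"] > slot["best"]:
            slot["best"] = entry["count"]
            slot["contig"] = entry["contig"]
    return merged


# Hypothetical parsed cdr3.out entries.
entries = [
    {"contig": "contigA", "v": "TRBV7-9", "j": "TRBJ2-1", "c": "TRBC2",
     "cdr3": "TGTGCCAGCAGTTTC", "count": 2.0},
    {"contig": "contigB", "v": "TRBV7-9", "j": "TRBJ2-1", "c": "TRBC2",
     "cdr3": "TGTGCCAGCAGTTTC", "count": 5.5},
    {"contig": "contigC", "v": "TRAV12-1", "j": "TRAJ33", "c": "TRAC",
     "cdr3": "TGTGCTGTGAGAGAT", "count": 1.0},
]

merged = merge_entries(entries)
for key, slot in merged.items():
    print(key, slot["count"], slot["contig"])
```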
As for the other information, that should be obtained with methods like STARsolo. TRUST4 only focuses on the reads from TCRs and BCRs.
Thank you very much for your reply. This is much clearer to me now, and I will do some research on STARsolo.
Hello,
I have questions about the software outputs especially on 10x Genomics data.
I appreciate your help.
Sincerely, Chenxi