Closed lucy-tian closed 1 year ago
So for question 3, I do not need to multiply `abundance_1` by 2 in the homozygous case, right?
For question 4 -
run-t1k -b {input_bam} -f _rna_seq.fa -c _rna_coord.fa --abnormalUnmapFlag -t 24 --preset hla --alleleDigitUnits 2 --alleleDelimiter :
Hi,
The files look fine to me. Is your data single-cell or bulk? For the ambiguous alleles, what are the estimated abundance and quality score? If everything looks normal, you may take the first entry in the list as the allele.
The data is bulk.
For the ambiguous alleles, the situation is quite common and does not seem to be associated with the quality score (the cases span scores from 1 to 45). I will need to examine the full trend once I expand the pipeline to all my samples. What trend of quality score and abundance would you consider "normal"?
Also, since I already know the "true genotype" from my WGS genotyping run (for example, `HLA-C*03:03:01`), in cases of ambiguity, can I simply assign the abundance based on my genotypes, disregarding the information in `allele_1` and `allele_2` from the RNA-seq run, unless these ambiguous alleles fail to represent the true genotype? The WGS genotype quality scores show a normal distribution with a mean around 30.
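The check described above (does the ambiguous RNA-seq allele set still contain the WGS genotype?) can be sketched as follows. This is a minimal illustration, not part of T1K: the helper names `to_fields` and `genotype_in_ambiguous_set` are hypothetical, and it assumes the ambiguous alleles are given as one comma-separated string and that comparing at two fields (for example `HLA-C*03:03`) is the right resolution, matching `--alleleDigitUnits 2`.

```python
def to_fields(allele: str, n_fields: int = 2) -> str:
    """Truncate an allele name such as 'HLA-C*03:03:01' to its first n fields."""
    gene, _, fields = allele.partition("*")
    return gene + "*" + ":".join(fields.split(":")[:n_fields])


def genotype_in_ambiguous_set(wgs_allele: str, rna_alleles: str, n_fields: int = 2) -> bool:
    """Return True if the WGS allele appears in the comma-separated RNA-seq allele set.

    Assumption: both sides are compared after truncation to n_fields.
    """
    candidates = {
        to_fields(a.strip(), n_fields)
        for a in rna_alleles.split(",")
        if a.strip()
    }
    return to_fields(wgs_allele, n_fields) in candidates


# Allele names copied from the example in this thread (list shortened).
rna_call = "HLA-C*03:03,HLA-C*03:126,HLA-C*03:357,HLA-C*03:421N"
print(genotype_in_ambiguous_set("HLA-C*03:03:01", rna_call))  # True
print(genotype_in_ambiguous_set("HLA-C*04:01:01", rna_call))  # False
```

If the check returns `False`, the RNA-seq call does not represent the WGS genotype, and assigning the abundance to the WGS allele would not be justified by this comparison alone.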
HLA-C usually has high expression, so the quality score and the abundance should be high in general. If you have the raw FASTQ files, could you please try T1K on one or two samples to see whether the results are consistent with the results from the BAM file? This is just to make sure the BAM file contains the right read information.
Unfortunately, I only have the BAM file.
Could you please show me the output from T1K where you have many ambiguous alleles in HLA-C?
T1K_sample1_genotype.txt
Attached is the T1K output for one of my samples. You can see many ambiguous alleles for each of the genes.
I think the abundances seem too low for RNA-seq data. Could you please show me the first few alignments of your BAM file?
alignment_5.txt
Attached are the first 5 alignments for chr6. Thank you!
Actually, I found the problem: I had supplied the wrong coordinate file. I'm attaching the new result, and it all looks good now!
T1K-sample1-genotype.txt
Yes, this looks great!
Hi Li,
Thank you for your prompt responses regarding my previous inquiries! Hope they did not take too much of your time.
I've been using T1K for genotyping with WGS data, as well as for expression quantification with RNA-seq from the same set of individuals, and I have some related questions that I would like to double-check with you.

1. Multiple alleles in the `allele_1` or `allele_2` columns: when T1K outputs a set of alleles instead of a single allele in the allele columns, does that mean the estimated abundance is the same for all the alleles in the set, so that T1K cannot differentiate the true allele?
2. In my genotype and RNA-seq expression results, there are cases where WGS genotyping gives a heterozygous call (say `allele_1` is `DRB1*03:01` and `allele_2` is `DRB1*04:01`), but expression quantification using RNA-seq of the same individual outputs only `DRB1*03:01` for `allele_1` and does not detect `allele_2`. Can I interpret this as `DRB1*04:01` not being expressed at the time of RNA sequencing?
3. When quantifying gene expression using RNA-seq, is the total abundance of the gene the sum of `abundance_1` and `abundance_2`? In homozygous cases, should I double `abundance_1`?
4. There is generally much more ambiguity in the `allele_[1/2]` columns when using RNA-seq as input and trying to estimate abundance at the 4-digit level. For example, the genotype `HLA-C*03:03:01` from WGS came out as `HLA-C*03:03,HLA-C*03:126,HLA-C*03:357,HLA-C*03:370,HLA-C*03:372,HLA-C*03:418,HLA-C*03:421N,HLA-C*03:422,HLA-C*03:427,HLA-C*03:460,HLA-C*03:471,HLA-C*03:481,HLA-C*03:495,HLA-C*03:524,HLA-C*03:528,HLA-C*03:616,HLA-C*03:622` when using RNA-seq. How would you recommend I approach situations like this? Since I know the true genotype is `HLA-C*03:03:01`, can I just interpret the ambiguity in RNA-seq as a consequence of input data quality and attribute the abundance to expression of `HLA-C*03:03:01`?

Sorry for posting so many questions, but I would really appreciate your feedback, since you know the tool far better than I do. I would also be happy to explain the situations in more detail if you think a Zoom meeting would be more efficient.
Again, thank you for all the patience!