mourisl / T1K

T1K is a versatile methods to genotype highly polymorphic genes (e.g. KIR, HLA) with bulk or single-cell RNA-seq, WGS or WES data.
MIT License
52 stars 8 forks source link

multiple alleles and rna expression quantification #21

Closed lucy-tian closed 1 year ago

lucy-tian commented 1 year ago

Hi Li,

Thank you for your prompt responses regarding my previous inquiries! Hope they did not take too much of your time.

I've been using T1K for genotyping using WGS data, as well as expression quantification using RNA-seq of the same set of individuals, and I have some related questions that I would like to double-check for your opinion.

  1. Multiple alleles for allele_1 or allele_2 columns: in the case where T1K is outputting a set of alleles instead of single allele for the allele columns, is it suggesting that the abundance estimated is the same for all the alleles in the set such that T1K is not able to differentiate the true allele?

  2. For my genotype and RNA-seq expression results, there are cases where given a heterozygous situation (let's say allele_1 is DRB1*03:01 and allele_2 is DRB1*04:01) from WGS genotyping, expression quantification using RNA-seq of the same individual only output DRB1*03:01 for allele_1 and NO detection of allele_2. Can I interpret this situation as DRB1*04:01 is not expressed at the time of RNA-sequencing?

  3. When quantifying gene expression using RNA-seq, is the total abundance of the gene the sum of abundance_1 and abundance_2? Under homozygous cases, should I double the abundance for abundance_1?

  4. There is generally much more ambiguity in the allele_[1/2] columns when using RNA-seq as input and trying to estimate abundance at the 4 digits level. An example would be genotype HLA-C*03:03:01 from WGS had HLA-C*03:03,HLA-C*03:126,HLA-C*03:357,HLA-C*03:370,HLA-C*03:372,HLA-C*03:418,HLA-C*03:421N,HLA-C*03:422,HLA-C*03:427,HLA-C*03:460,HLA-C*03:471,HLA-C*03:481,HLA-C*03:495,HLA-C*03:524,HLA-C*03:528,HLA-C*03:616,HLA-C*03:622 when using RNA-seq. How would you recommend I approach situations like this? Since I know the true genotype is HLA-C*03:03:01, I can just interpret the ambiguity in RNA-seq as a consequence of input data quality, and the abundance level should be attributed to expression of HLA-C*03:03:01?

Sorry for posting this many questions, but I would really appreciate your feedback which comes from a more professional perspective regarding the nature of the tool. I would also be willing to explain the situations in more detail if you think a zoom meeting is more efficient.

Again, thank you for all the patience!

mourisl commented 1 year ago
  1. This is because the same set of reads is compatible with these alleles. Therefore, these alleles could not be distinguished from the data.
  2. Yes, it's possible that the DRB1*04:01 is not expressed or has relatively low expression. T1K filtered the allele if its abundance is less than 15% (--frac) of the dominant allele.
  3. Yes, abundance_1+abundance_2 will be the gene's abundance. For homozygous case, the abundance_1 is usually about twice as large as alleles from the heterozygous case.
  4. What was your running command of T1K?
lucy-tian commented 1 year ago

So for question3, i do not need to multiply abundance_1 by 2 in homozygous case, right?

For question 4 - run-t1k -b {input_bam} -f _rna_seq.fa -c _rna_coord.fa --abnormalUnmapFlag -t 24 --preset hla --alleleDigitUnits 2 --alleleDelimiter :

mourisl commented 1 year ago
  1. Right, you don't need to multiply that by 2.
  2. Could you please check whether the bam file contains unmapped reads? And could you please also use "samtools view -H" to show me the chrome ids in your data? Thank you.
lucy-tian commented 1 year ago

Hi,

  1. Yes, the bam files contain unmapped reads.
  2. output from header: @HD VN:1.4 SO:coordinate @SQ SN:chr1 LN:248956422 @SQ SN:chr2 LN:242193529 @SQ SN:chr3 LN:198295559 @SQ SN:chr4 LN:190214555 @SQ SN:chr5 LN:181538259 @SQ SN:chr6 LN:170805979 @SQ SN:chr7 LN:159345973 @SQ SN:chr8 LN:145138636 @SQ SN:chr9 LN:138394717 @SQ SN:chr10 LN:133797422 @SQ SN:chr11 LN:135086622 @SQ SN:chr12 LN:133275309 @SQ SN:chr13 LN:114364328 @SQ SN:chr14 LN:107043718 @SQ SN:chr15 LN:101991189 @SQ SN:chr16 LN:90338345 @SQ SN:chr17 LN:83257441 @SQ SN:chr18 LN:80373285 @SQ SN:chr19 LN:58617616 @SQ SN:chr20 LN:64444167 @SQ SN:chr21 LN:46709983 @SQ SN:chr22 LN:50818468 @SQ SN:chrX LN:156040895 @SQ SN:chrY LN:57227415 @SQ SN:chrM LN:16569 @SQ SN:GL000008.2 LN:209709 @SQ SN:GL000009.2 LN:201709 @SQ SN:GL000194.1 LN:191469 @SQ SN:GL000195.1 LN:182896 @SQ SN:GL000205.2 LN:185591 @SQ SN:GL000208.1 LN:92689 @SQ SN:GL000213.1 LN:164239 @SQ SN:GL000214.1 LN:137718 @SQ SN:GL000216.2 LN:176608 @SQ SN:GL000218.1 LN:161147 @SQ SN:GL000219.1 LN:179198 @SQ SN:GL000220.1 LN:161802 @SQ SN:GL000221.1 LN:155397 @SQ SN:GL000224.1 LN:179693 @SQ SN:GL000225.1 LN:211173 @SQ SN:GL000226.1 LN:15008 @SQ SN:KI270302.1 LN:2274 @SQ SN:KI270303.1 LN:1942 @SQ SN:KI270304.1 LN:2165 @SQ SN:KI270305.1 LN:1472 @SQ SN:KI270310.1 LN:1201 @SQ SN:KI270311.1 LN:12399 @SQ SN:KI270312.1 LN:998 @SQ SN:KI270315.1 LN:2276 @SQ SN:KI270316.1 LN:1444 @SQ SN:KI270317.1 LN:37690 @SQ SN:KI270320.1 LN:4416 @SQ SN:KI270322.1 LN:21476 @SQ SN:KI270329.1 LN:1040 @SQ SN:KI270330.1 LN:1652 @SQ SN:KI270333.1 LN:2699 @SQ SN:KI270334.1 LN:1368 @SQ SN:KI270335.1 LN:1048 @SQ SN:KI270336.1 LN:1026 @SQ SN:KI270337.1 LN:1121 @SQ SN:KI270338.1 LN:1428 @SQ SN:KI270340.1 LN:1428 @SQ SN:KI270362.1 LN:3530 @SQ SN:KI270363.1 LN:1803 @SQ SN:KI270364.1 LN:2855 @SQ SN:KI270366.1 LN:8320 @SQ SN:KI270371.1 LN:2805 @SQ SN:KI270372.1 LN:1650 @SQ SN:KI270373.1 LN:1451 @SQ SN:KI270374.1 LN:2656 @SQ SN:KI270375.1 LN:2378 @SQ SN:KI270376.1 LN:1136 @SQ SN:KI270378.1 LN:1048 @SQ SN:KI270379.1 LN:1045 @SQ SN:KI270381.1 LN:1930 @SQ SN:KI270382.1 LN:4215 @SQ SN:KI270383.1 LN:1750 @SQ SN:KI270384.1 LN:1658 @SQ SN:KI270385.1 LN:990 @SQ SN:KI270386.1 LN:1788 @SQ SN:KI270387.1 LN:1537 @SQ SN:KI270388.1 LN:1216 @SQ SN:KI270389.1 LN:1298 @SQ SN:KI270390.1 LN:2387 @SQ SN:KI270391.1 LN:1484 @SQ SN:KI270392.1 LN:971 @SQ SN:KI270393.1 LN:1308 @SQ SN:KI270394.1 LN:970 @SQ SN:KI270395.1 LN:1143 @SQ SN:KI270396.1 LN:1880 @SQ SN:KI270411.1 LN:2646 @SQ SN:KI270412.1 LN:1179 @SQ SN:KI270414.1 LN:2489 @SQ SN:KI270417.1 LN:2043 @SQ SN:KI270418.1 LN:2145 @SQ SN:KI270419.1 LN:1029 @SQ SN:KI270420.1 LN:2321 @SQ SN:KI270422.1 LN:1445 @SQ SN:KI270423.1 LN:981 @SQ SN:KI270424.1 LN:2140 @SQ SN:KI270425.1 LN:1884 @SQ SN:KI270429.1 LN:1361 @SQ SN:KI270435.1 LN:92983 @SQ SN:KI270438.1 LN:112505 @SQ SN:KI270442.1 LN:392061 @SQ SN:KI270448.1 LN:7992 @SQ SN:KI270465.1 LN:1774 @SQ SN:KI270466.1 LN:1233 @SQ SN:KI270467.1 LN:3920 @SQ SN:KI270468.1 LN:4055 @SQ SN:KI270507.1 LN:5353 @SQ SN:KI270508.1 LN:1951 @SQ SN:KI270509.1 LN:2318 @SQ SN:KI270510.1 LN:2415 @SQ SN:KI270511.1 LN:8127 @SQ SN:KI270512.1 LN:22689 @SQ SN:KI270515.1 LN:6361 @SQ SN:KI270516.1 LN:1300 @SQ SN:KI270517.1 LN:3253 @SQ SN:KI270518.1 LN:2186 @SQ SN:KI270519.1 LN:138126 @SQ SN:KI270521.1 LN:7642 @SQ SN:KI270522.1 LN:5674 @SQ SN:KI270528.1 LN:2983 @SQ SN:KI270529.1 LN:1899 @SQ SN:KI270530.1 LN:2168 @SQ SN:KI270538.1 LN:91309 @SQ SN:KI270539.1 LN:993 @SQ SN:KI270544.1 LN:1202 @SQ SN:KI270548.1 LN:1599 @SQ SN:KI270579.1 LN:31033 @SQ SN:KI270580.1 LN:1553 @SQ SN:KI270581.1 LN:7046 @SQ SN:KI270582.1 LN:6504 @SQ SN:KI270583.1 LN:1400 @SQ SN:KI270584.1 LN:4513 @SQ SN:KI270587.1 LN:2969 @SQ SN:KI270588.1 LN:6158 @SQ SN:KI270589.1 LN:44474 @SQ SN:KI270590.1 LN:4685 @SQ SN:KI270591.1 LN:5796 @SQ SN:KI270593.1 LN:3041 @SQ SN:KI270706.1 LN:175055 @SQ SN:KI270707.1 LN:32032 @SQ SN:KI270708.1 LN:127682 @SQ SN:KI270709.1 LN:66860 @SQ SN:KI270710.1 LN:40176 @SQ SN:KI270711.1 LN:42210 @SQ SN:KI270712.1 LN:176043 @SQ SN:KI270713.1 LN:40745 @SQ SN:KI270714.1 LN:41717 @SQ SN:KI270715.1 LN:161471 @SQ SN:KI270716.1 LN:153799 @SQ SN:KI270717.1 LN:40062 @SQ SN:KI270718.1 LN:38054 @SQ SN:KI270719.1 LN:176845 @SQ SN:KI270720.1 LN:39050 @SQ SN:KI270721.1 LN:100316 @SQ SN:KI270722.1 LN:194050 @SQ SN:KI270723.1 LN:38115 @SQ SN:KI270724.1 LN:39555 @SQ SN:KI270725.1 LN:172810 @SQ SN:KI270726.1 LN:43739 @SQ SN:KI270727.1 LN:448248 @SQ SN:KI270728.1 LN:1872759 @SQ SN:KI270729.1 LN:280839 @SQ SN:KI270730.1 LN:112551 @SQ SN:KI270731.1 LN:150754 @SQ SN:KI270732.1 LN:41543 @SQ SN:KI270733.1 LN:179772 @SQ SN:KI270734.1 LN:165050 @SQ SN:KI270735.1 LN:42811 @SQ SN:KI270736.1 LN:181920 @SQ SN:KI270737.1 LN:103838 @SQ SN:KI270738.1 LN:99375 @SQ SN:KI270739.1 LN:73985 @SQ SN:KI270740.1 LN:37240 @SQ SN:KI270741.1 LN:157432 @SQ SN:KI270742.1 LN:186739 @SQ SN:KI270743.1 LN:210658 @SQ SN:KI270744.1 LN:168472 @SQ SN:KI270745.1 LN:41891 @SQ SN:KI270746.1 LN:66486 @SQ SN:KI270747.1 LN:198735 @SQ SN:KI270748.1 LN:93321 @SQ SN:KI270749.1 LN:158759 @SQ SN:KI270750.1 LN:148850 @SQ SN:KI270751.1 LN:150742 @SQ SN:KI270752.1 LN:27745 @SQ SN:KI270753.1 LN:62944 @SQ SN:KI270754.1 LN:40191 @SQ SN:KI270755.1 LN:36723 @SQ SN:KI270756.1 LN:79590 @SQ SN:KI270757.1 LN:71251 @PG ID:STAR PN:STAR VN:STAR_2.6.1d CL:STAR --runMode alignReads --runThreadN 16 --genomeDir star_index --readFilesIn /cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L001.R1.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L002.R1.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L003.R1.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L004.R1.fastq.gz /cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L001.R2.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L002.R2.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L003.R2.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L004.R2.fastq.gz --readFilesCommand zcat --outFileNamePrefix output_dir/BF-1001-SVM0_5T1. --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outSAMunmapped Within --outSAMattrRGline ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.1 PL:ILLUMINA SM:BF-1001-SVM0_5T1 , ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.2 PL:ILLUMINA SM:BF-1001-SVM0_5T1 , ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.3 PL:ILLUMINA SM:BF-1001-SVM0_5T1 , ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.4 PL:ILLUMINA SM:BF-1001-SVM0_5T1 --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --chimSegmentMin 15 --chimJunctionOverhangMin 15 --chimOutType Junctions --twopassMode Basic @PG ID:samtools PN:samtools PP:STAR VN:1.17 CL:samtools view -H BF-1001-SVM0_5T1.star.bam @RG ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.1 PL:ILLUMINA SM:BF-1001-SVM0_5T1 @RG ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.2 PL:ILLUMINA SM:BF-1001-SVM0_5T1 @RG ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.3 PL:ILLUMINA SM:BF-1001-SVM0_5T1 @RG ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.4 PL:ILLUMINA SM:BF-1001-SVM0_5T1 @CO user command line: STAR --genomeDir star_index --runMode alignReads --twopassMode Basic --outFileNamePrefix output_dir/BF-1001-SVM0_5T1. --readFilesCommand zcat --readFilesIn /cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L001.R1.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L002.R1.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L003.R1.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L004.R1.fastq.gz /cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L001.R2.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L002.R2.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L003.R2.fastq.gz,/cromwell_root/amp-pd-data-rna-stage/transcriptomes/fastq/unblindedPhase3/BF-1001-SVM0_5T1.V02.L004.R2.fastq.gz --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --chimOutType Junctions --chimSegmentMin 15 --chimJunctionOverhangMin 15 --runThreadN 16 --outSAMstrandField intronMotif --outSAMunmapped Within --outSAMattrRGline ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.1 PL:ILLUMINA SM:BF-1001-SVM0_5T1 , ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.2 PL:ILLUMINA SM:BF-1001-SVM0_5T1 , ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.3 PL:ILLUMINA SM:BF-1001-SVM0_5T1 , ID:HFKF7DSXX.NCCGAACA+NGAGGTTC.4 PL:ILLUMINA SM:BF-1001-SVM0_5T1
mourisl commented 1 year ago

The files looks fine to me. Is your data single-cell or bulk? For the ambiguous alleles, what is the estimated abundance and quality score? If everything looks normal, you may take the first entry in the list as the allele.

lucy-tian commented 1 year ago

The data is bulk.

For the ambiguous alleles, the situation happens quite commonly and does not seem to have an association with the quality score (cases expand across scores ranging from 1 to 45). I will need to examine the full trend once I expand the pipeline to all my samples. What trend of quality score and abundance would you deem as "normal"?

Also, since I already know the "true genotype" from my WGS genotyping run, (for example, HLA-C*03:03:01), in cases of ambiguity, I can simply assign abundance based on my genotypes, disregarding the information within allele1 and allele2 from the RNA-seq run, unless these ambiguous alleles fail to represent the true genotype? The WGS genotype quality scores show a normal distribution with a mean around 30.

mourisl commented 1 year ago

HLA-C usually has high expression, so the quality score and the abundance should be high in general. If you have the raw fastq file, could you please try T1K on one or two samples to see whether the results are consistent with results from using BAM file? This is just to make sure the BAM file contains the right read information.

lucy-tian commented 1 year ago

Unfortunately, I only have the BAM file.

mourisl commented 1 year ago

Could you please show me the output from T1K where you have many ambiguous alleles in HLA-C?

lucy-tian commented 1 year ago

T1K_sample1_genotype.txt Please see attached the T1K output for one of my sample. You can see many ambiguous alleles for each of the genes.

mourisl commented 1 year ago

I think the abundances seems too few for RNA-seq data. Could you please show me the first a few alignments of your bam file?

lucy-tian commented 1 year ago

alignment_5.txt Attached is the first 5 alignments for chr6. Thank you!

lucy-tian commented 1 year ago

Actually, I found the problem. There was a mistake with inputting the wrong coordinate file. I'm attaching the new result, and it seems all good! T1K-sample1-genotype.txt

mourisl commented 1 year ago

Yes, this looks great!