For the same sample, HLA-C was mistyped at 2M reads but correct at 1M and 3M reads

liu930724 commented 1 week ago

Hi, while analyzing another HLA reference sample (HLA-C*05:01:01), we observed that the HLA-C typing result was incorrect at 2M reads (HLA-C*08:02:01). But when we switched to other read counts in the 1-3M range, even just to 2.1M, all of the results were correct.

T1K v1.1.7-r225 was used, and the different read counts were obtained through the --reads_to_process parameter of fastp.

running command:

run-t1k -1 21_1.trimmed.fq -2 21_2.trimmed.fq -t 30 --preset hla -f T1K_ref_dna_seq.fa --cov 30 -o HLA-1101-FA01_21
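
The read-count sweep described above could be scripted roughly as follows. This is only a sketch, not the exact pipeline used here: the input file names, output prefixes, and the helpers `fastp_cmd`, `t1k_cmd`, and `sweep` are hypothetical. The fastp options `-i/-I/-o/-O` are assumed from fastp's documented interface, and the run-t1k options are copied from the command above. As far as I understand, `--reads_to_process` stops after the first N reads (pairs) rather than drawing a random sample, which is worth keeping in mind when comparing depths.

```python
import shlex


def fastp_cmd(raw1, raw2, n_reads, prefix):
    """Build a fastp command that processes only n_reads read pairs."""
    return [
        "fastp",
        "-i", raw1, "-I", raw2,
        "-o", f"{prefix}_1.trimmed.fq", "-O", f"{prefix}_2.trimmed.fq",
        "--reads_to_process", str(n_reads),
    ]


def t1k_cmd(prefix, ref="T1K_ref_dna_seq.fa", threads=30, cov=30):
    """Build the run-t1k command with the same options as in this report."""
    return [
        "run-t1k",
        "-1", f"{prefix}_1.trimmed.fq", "-2", f"{prefix}_2.trimmed.fq",
        "-t", str(threads), "--preset", "hla",
        "-f", ref, "--cov", str(cov), "-o", prefix,
    ]


def sweep(raw1, raw2, read_counts, sample="HLA-1101-FA01"):
    """Return the fastp and run-t1k commands for each read count, in order."""
    cmds = []
    for n in read_counts:
        prefix = f"{sample}_{n}"  # hypothetical naming scheme
        cmds.append(fastp_cmd(raw1, raw2, n, prefix))
        cmds.append(t1k_cmd(prefix))
    return cmds


if __name__ == "__main__":
    # Only prints the commands; pass each list to subprocess.run to execute.
    counts = [1_000_000, 2_000_000, 2_100_000, 3_000_000]
    for cmd in sweep("raw_1.fq.gz", "raw_2.fq.gz", counts):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check that each depth gets its own output prefix before anything is run.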

log of 2M reads:

[Wed Nov 13 13:36:52 2024] run-t1k v1.1.7-r225 begins.
[Wed Nov 13 13:36:52 2024] SYSTEM CALL: /r5/u/tianliu/2.pipeline/T1K-master/fastq-extractor -t 30 -f /mnt/data65/tianliu2/project/1.HLA/test/t1k/hlaidx/T1K_ref_dna_seq.fa -o /mnt/data65/tianliu2/project/1.HLA/t1k_241104/5.t1k/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21_candidate  -1 /mnt/data65/tianliu2/project/1.HLA/t1k_241104/1.trimmed/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21_1.trimmed.fq -2 /mnt/data65/tianliu2/project/1.HLA/t1k_241104/1.trimmed/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21_2.trimmed.fq
[Wed Nov 13 13:36:54 2024] Start to extract candidate reads from read files.
[Wed Nov 13 13:38:51 2024] Finish extracting reads.
[Wed Nov 13 13:38:51 2024] SYSTEM CALL: /r5/u/tianliu/2.pipeline/T1K-master/genotyper  --cov 30 -s 0.97 -o /mnt/data65/tianliu2/project/1.HLA/t1k_241104/5.t1k/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21 -t 30 -f /mnt/data65/tianliu2/project/1.HLA/test/t1k/hlaidx/T1K_ref_dna_seq.fa -1 /mnt/data65/tianliu2/project/1.HLA/t1k_241104/5.t1k/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21_candidate_1.fq -2 /mnt/data65/tianliu2/project/1.HLA/t1k_241104/5.t1k/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21_candidate_2.fq
[Wed Nov 13 13:38:55 2024] Found 858361 read fragments. Start read assignment.
[Wed Nov 13 14:05:46 2024] Finish read end assignments.
[Wed Nov 13 14:06:21 2024] Finish read fragment assignments. 393416 read fragments can be assigned (average 563.99 alleles/read).
[Wed Nov 13 14:07:04 2024] Finish allele quantification in 102 EM iterations.
[Wed Nov 13 14:08:20 2024] Genotyping finishes.
[Wed Nov 13 14:08:23 2024] SYSTEM CALL: /r5/u/tianliu/2.pipeline/T1K-master/analyzer  -s 0.97 -o /mnt/data65/tianliu2/project/1.HLA/t1k_241104/5.t1k/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21 -t 30 -f /mnt/data65/tianliu2/project/1.HLA/test/t1k/hlaidx/T1K_ref_dna_seq.fa -a /mnt/data65/tianliu2/project/1.HLA/t1k_241104/5.t1k/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21_allele.tsv -1 /mnt/data65/tianliu2/project/1.HLA/t1k_241104/5.t1k/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21_aligned_1.fa -2 /mnt/data65/tianliu2/project/1.HLA/t1k_241104/5.t1k/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21/20241101170317_RA01230401003_5P240928009US292653DX_PE150-HLA-1101-FA01_21_aligned_2.fa
[Wed Nov 13 14:08:23 2024] Found 462282 read fragments. Start read assignment.
[Wed Nov 13 14:08:25 2024] Finish read end assignments.
[Wed Nov 13 14:08:25 2024] Finish read fragment assignments. 415149 read fragments can be assigned (average 1.62 alleles/read).
[Wed Nov 13 14:08:25 2024] Finish allele quantification in 4 EM iterations.
[Wed Nov 13 14:08:30 2024] Post analysis finishes.
[Wed Nov 13 14:08:31 2024] Finish.
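
The two read-assignment steps in this log can be summarized by pairing each "Found ... read fragments" line with the following "... can be assigned" line. A small sketch, assuming the log wording stays as shown above (the function name `assignment_stats` is only illustrative, and the sample lines are copied from the log):

```python
import re

# Matches either the "Found N read fragments." line or the
# "N read fragments can be assigned (average X alleles/read)" line.
ASSIGN_RE = re.compile(
    r"Found (\d+) read fragments\.|"
    r"(\d+) read fragments can be assigned \(average ([\d.]+) alleles/read\)"
)

LOG = """\
[Wed Nov 13 13:38:55 2024] Found 858361 read fragments. Start read assignment.
[Wed Nov 13 14:06:21 2024] Finish read fragment assignments. 393416 read fragments can be assigned (average 563.99 alleles/read).
[Wed Nov 13 14:08:23 2024] Found 462282 read fragments. Start read assignment.
[Wed Nov 13 14:08:25 2024] Finish read fragment assignments. 415149 read fragments can be assigned (average 1.62 alleles/read).
"""


def assignment_stats(log_text):
    """Return one dict per found/assigned pair, in log order."""
    stats, found = [], None
    for line in log_text.splitlines():
        m = ASSIGN_RE.search(line)
        if not m:
            continue
        if m.group(1):
            found = int(m.group(1))
        elif found is not None:
            assigned = int(m.group(2))
            stats.append({
                "found": found,
                "assigned": assigned,
                "fraction": assigned / found,
                "alleles_per_read": float(m.group(3)),
            })
            found = None
    return stats


if __name__ == "__main__":
    for s in assignment_stats(LOG):
        print(f"{s['assigned']}/{s['found']} assigned "
              f"({s['fraction']:.1%}), {s['alleles_per_read']} alleles/read")
```

For this run, that gives roughly 46% of fragments assigned in the genotyper step and roughly 90% in the analyzer step.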

This seems different from the previous issue about an HLA-C typing error, so I opened a new issue. Thank you in advance for your time and help.

mourisl commented 1 week ago

It seems the abundances of both the true alleles and the wrong alleles are not very high, so there could be some tricky issues. As before, could you please share the candidate reads so that I can look into it? Thank you!

Meanwhile, the current GitHub version (v1.1.7-r225) also fixes an issue in t1k-build regarding the exonization in an HLA-C allele. Could you please recreate T1K's reference using t1k-build from the hla.dat file? It may fix this issue.

liu930724 commented 1 week ago

After recreating the T1K reference, the HLA-C result with 2M reads is now correct. This may well have been the issue; we will test with more samples in the future. Thank you!

mourisl commented 1 week ago

Is the abundance estimation comparable between C*01:02 and C*05:01 in the new run? If they still differ a lot, I think there are still some hidden issues.

liu930724 commented 1 week ago

The abundances of C*01:02 and C*05:01 differ, but the difference follows the same trend across the different read counts. The abundance of C*05:01 is also much higher than that of C*08:02.

We amplified the full length of the HLA genes, which may explain the large difference in abundance between alleles. With this in mind, the difference is acceptable. Thank you.
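
One way to check whether the abundance difference between two alleles follows the same trend across read counts is to compare their ratio at each depth. The sketch below assumes the per-allele abundances have already been read out of each run's T1K output; the helper names are hypothetical, and the numbers in the usage example are placeholders, not values from these runs.

```python
def abundance_ratio(abundances, allele_a, allele_b):
    """Ratio of allele_a to allele_b abundance within one run."""
    return abundances[allele_a] / abundances[allele_b]


def ratio_spread(runs, allele_a, allele_b):
    """runs maps read count -> {allele: abundance}.

    Returns the per-run ratios and max(ratio) / min(ratio); a spread
    close to 1 means the two alleles scale together across depths.
    """
    ratios = {
        n: abundance_ratio(ab, allele_a, allele_b) for n, ab in runs.items()
    }
    spread = max(ratios.values()) / min(ratios.values())
    return ratios, spread


if __name__ == "__main__":
    # PLACEHOLDER abundances for illustration only, not from the actual runs.
    runs = {
        1_000_000: {"HLA-C*01:02": 100.0, "HLA-C*05:01": 400.0},
        2_000_000: {"HLA-C*01:02": 200.0, "HLA-C*05:01": 800.0},
    }
    ratios, spread = ratio_spread(runs, "HLA-C*01:02", "HLA-C*05:01")
    print(ratios, spread)
```

A stable ratio would be consistent with an amplification bias between the alleles rather than a depth-dependent typing problem, although it does not prove it.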

mourisl / T1K

For the same sample, HLA-C was mistyped at 2M reads but correct at 1M and 3M reads #40