mourisl / T1K

T1K is a versatile method to genotype highly polymorphic genes (e.g. KIR, HLA) with bulk or single-cell RNA-seq, WGS or WES data.
MIT License

Ratio is the log-ratio of the likelihood between the most likely copy number and the second likely copy number. I'm still trying to optimize t1k-copynumber.py, so please interpret its result with caution. #12

Open fernandogs97BR opened 1 year ago

fernandogs97BR commented 1 year ago
          Ratio is the log-ratio of the likelihood between the most likely copy number and the second likely copy number. I'm still trying to optimize t1k-copynumber.py, so please interpret its result with caution.

We don't remove the duplicated reads. The duplicated reads will contribute to the allele abundance estimation (or other type of allele score in other HLA genotypers), therefore it is expected that the deduplication will affect the genotyping results. Hope this helps.

Originally posted by @mourisl in https://github.com/mourisl/T1K/issues/11#issuecomment-1552364903

fernandogs97BR commented 1 year ago

Could you further explain how the calculation of the ratio in t1k-copynumber works? I can see that this value would be very helpful for me to discriminate between false positives and true positives in KIR2DL5. P.S. Congratulations on publishing the pipeline in Genome Biology!

fernandogs97BR commented 1 year ago

By the way, is it possible to extract the KIR2DL5 A/B reference reads that match my sequenced reads?

mourisl commented 1 year ago

Could you further explain how the calculation of the ratio in t1k-copynumber works? I can see that this value would be very helpful for me to discriminate between false positives and true positives in KIR2DL5. P.S. Congratulations on publishing the pipeline in Genome Biology!

Thank you. The copy number script applies a square-root transform to the abundance values (FPK) and then fits a normal distribution to model the abundance of single-copy alleles. Since the sum of independent normal variables is also normal, we can use the parameters fitted for single-copy alleles to derive the distributions for two copies, three copies, and so on up to ten copies. For an allele's abundance, we then calculate 10 likelihood values, one from each copy number distribution. The log-likelihood ratio is computed from the best and the second-best likelihood values.
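As a rough illustration of the calculation described above, here is a minimal Python sketch. It is not the code in t1k-copynumber.py. The function names, the inputs `mu` and `sigma` (the single-copy mean and standard deviation in square-root space, assumed to be already fitted), and the assumption that a k-copy allele follows a normal distribution with mean k*mu and variance k*sigma^2 are all hypothetical choices made for this example.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def copy_number_ratio(abundance, mu, sigma, max_copy=10):
    """Return (most likely copy number, log-likelihood ratio best vs. second best).

    Assumption for this sketch: in square-root space, a k-copy allele is the
    sum of k independent single-copy draws, i.e. Normal(k*mu, k*sigma**2).
    """
    x = math.sqrt(abundance)  # square-root transform of the abundance
    loglik = {
        k: normal_logpdf(x, k * mu, k * sigma ** 2)
        for k in range(1, max_copy + 1)
    }
    # Rank copy numbers from most to least likely.
    ranked = sorted(loglik, key=loglik.get, reverse=True)
    best, second = ranked[0], ranked[1]
    return best, loglik[best] - loglik[second]


if __name__ == "__main__":
    # Hypothetical numbers: single-copy alleles centred at sqrt(abundance) = 10.
    copies, ratio = copy_number_ratio(abundance=400.0, mu=10.0, sigma=2.0)
    print(copies, round(ratio, 3))
```

Under these assumptions, a ratio near zero means the two best copy numbers explain the abundance almost equally well, so the call is uncertain. A large ratio means the best copy number is clearly favoured.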

By the way, is it possible to extract the KIR2DL5 A/B reference reads that match my sequenced reads?

Do you mean you want to know which reads are assigned to 2DL5?

fernandogs97BR commented 1 year ago

Thank you very much for resolving the first question! Regarding the second one, yes, I would like to know which reference sequences my reads align to for KIR2DL5, and what these reads are. I believe the issue I'm having with false positives for KIR2DL5 is the generation of nonspecific reads in my sequencing. Therefore, I want to compare the regions of the reference sequences to which reads from truly positive and negative samples for KIR2DL5 align, and be able to modify the reference based on this.

mourisl commented 1 year ago

I just added the option "--outputReadAssignment" to the GitHub repo, which outputs the allele assignments to the {prefix}_assign.tsv file. Each row is one assignment, in the format: read_id allele_id allele_start allele_end. Will this help?
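To pull the KIR2DL5 assignments out of that file, something like the following Python sketch may be a starting point. It assumes the columns are tab-separated in the order given above and that allele IDs begin with the gene name (for example a hypothetical "KIR2DL5A*001"). The function name `reads_by_allele` and the file name in the usage note are made up for illustration.

```python
import csv
from collections import defaultdict


def reads_by_allele(lines, gene_prefix="KIR2DL5"):
    """Group read assignments by allele for alleles whose ID starts with gene_prefix.

    `lines` is any iterable of text lines (e.g. an open file handle) with the
    assumed columns: read_id, allele_id, allele_start, allele_end.
    """
    hits = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        read_id, allele_id, start, end = row[:4]
        if not allele_id.startswith(gene_prefix):
            continue
        try:
            hits[allele_id].append((read_id, int(start), int(end)))
        except ValueError:
            continue  # skip rows with non-numeric coordinates, e.g. a header
    return dict(hits)


if __name__ == "__main__":
    # Tiny in-memory example with made-up read and allele IDs.
    demo = [
        "read1\tKIR2DL5A*001\t10\t110\n",
        "read2\tKIR2DL1*003\t5\t105\n",
        "read3\tKIR2DL5B*002\t200\t300\n",
    ]
    for allele, reads in reads_by_allele(demo).items():
        print(allele, reads)
```

For a real run, pass an open file instead of the demo list, e.g. `with open("sample_assign.tsv") as fh: reads_by_allele(fh)`, where "sample" stands in for your own prefix. Comparing the covered allele intervals between truly positive and negative samples should then show where the nonspecific reads land.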

fernandogs97BR commented 1 year ago

Thank you very much, I will try this option now! I will keep you informed.