rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
74 stars 19 forks source link

Output haplotypes #10

Closed aziouez25 closed 5 years ago

aziouez25 commented 5 years ago

Hello Robert,

Is there a way to output the estimated recombination rate positions?

Thanks a

rwdavies commented 5 years ago

Hey,

Recombination rate is estimated between the positions you supply STITCH. This output is not "supported" currently but is available in a grab-bag of internal parameters and results saved in RData/EM.all..RData.

You want "sigmaCurrent" and can line it up using "pos" as follows

> head(pos)
    CHR      POS REF ALT
1 chr19 10000105   T   G
2 chr19 10000345   T   C
3 chr19 10000612   C   T
4 chr19 10000968   G   T
5 chr19 10001946   A   G
6 chr19 10002175   T   C
> head(sigmaCurrent)
[1] 0.9998975 0.9998979 0.9995245 0.9990623 0.9997978 0.9999408

so here, 0.9998975 is between 10000105 and 10000345. These values are stored internally in sigmaCurrent as exp(-nGen rate{cM/Mb} / 100 / 1000000 distance{between SNPs in bp}) therefore if you want an estimate of the rate in cM/Mb between SNPs, then do something like

log(sigmaCurrent) / -nGen 100 1000000 / diff(pos[, 2])

Note that values are capped internally during updating by minRate and maxRate which have default values but you can choose different values for those

Best, Robbie

aziouez25 commented 5 years ago

Great! Thanks for your quick reply. My problem is that I want to tag a multi-allelic indel (5 alleles) in the UKTwins with haplotypes instead of variants because 2 alleles are rare. I took the 2000 closest SNVs to that indel regardless of their MAF and ran STITCH generating 30 haplotypes (--K=30 and default parameters). I removed the SNVs involving with the same frequency in all haplotypes or frequency <0.05 or >0.95 in any haplotypes (last eHapsCurrent_t) keeping 30 haplotypes of 557 SNVs and ran a decision tree for tagging. Obviously it did not work so I wanted to play with recombination sites and other heuristics like --K, MAF, number of SNVs..

I know STITCH is not dedicated to this kind of tasks, but any advise would be more than welcome!

Thanks again a

rwdavies commented 5 years ago

I would keep all SNVs especially if closer together as more SNPs helps assignment of individuals to haplotypes. Can you include higher coverage samples with known alleles to help the imputation and later decision making process?

What’s the coverage? 4X, right? Could you ever run something like Platypus here if the two indels are really close together? Platypus should be able to genotype given a VCF and I wonder if you can write in the 5 alleles as a single entry. GATK HC might also be an option but I know platypus has a good method for this problem. You would need to ensure you get genotype likelihoods out of platypus, I can’t remember if that’s a default or not, I remember it being more designed for higher coverage

On Tue, Sep 18, 2018 at 12:14 PM aziouez25 notifications@github.com wrote:

Great! Thanks for your quick reply. My problem is that I want to tag a multi-allelic indel (5 alleles) in the UKTwins with haplotypes instead of variants because 2 alleles are rare. I took the 2000 closest SNVs to that indel regardless of their MAF and ran STITCH generating 30 haplotypes (--K=30 and default parameters). I removed the SNVs involving with the same frequency in all haplotypes or frequency <0.05 or >0.95 in any haplotypes (last eHapsCurrent_t) keeping 30 haplotypes of 557 SNVs and ran a decision tree for tagging. Obviously it did not work so I wanted to play with recombination sites and other heuristics like --K, MAF, number of SNVs..

I know STITCH is not dedicated to this kind of tasks, but any advise would be more than welcome!

Thanks again a

— You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub https://github.com/rwdavies/STITCH/issues/10#issuecomment-422336570, or mute the thread https://github.com/notifications/unsubscribe-auth/AH6N-c_P8s0kdZFfXWKFE9KAMBPemmaSks5ucMdmgaJpZM4Wtaoz .

rwdavies commented 5 years ago

Oh and also if you do go the platypus route more nearby variants will also likely help for reasons similar to STITCH

On Wed, Sep 19, 2018 at 10:22 AM Robbie Davies < robertwilliamdavies@gmail.com> wrote:

I would keep all SNVs especially if closer together as more SNPs helps assignment of individuals to haplotypes. Can you include higher coverage samples with known alleles to help the imputation and later decision making process?

What’s the coverage? 4X, right? Could you ever run something like Platypus here if the two indels are really close together? Platypus should be able to genotype given a VCF and I wonder if you can write in the 5 alleles as a single entry. GATK HC might also be an option but I know platypus has a good method for this problem. You would need to ensure you get genotype likelihoods out of platypus, I can’t remember if that’s a default or not, I remember it being more designed for higher coverage

On Tue, Sep 18, 2018 at 12:14 PM aziouez25 notifications@github.com wrote:

Great! Thanks for your quick reply. My problem is that I want to tag a multi-allelic indel (5 alleles) in the UKTwins with haplotypes instead of variants because 2 alleles are rare. I took the 2000 closest SNVs to that indel regardless of their MAF and ran STITCH generating 30 haplotypes (--K=30 and default parameters). I removed the SNVs involving with the same frequency in all haplotypes or frequency <0.05 or >0.95 in any haplotypes (last eHapsCurrent_t) keeping 30 haplotypes of 557 SNVs and ran a decision tree for tagging. Obviously it did not work so I wanted to play with recombination sites and other heuristics like --K, MAF, number of SNVs..

I know STITCH is not dedicated to this kind of tasks, but any advise would be more than welcome!

Thanks again a

— You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub https://github.com/rwdavies/STITCH/issues/10#issuecomment-422336570, or mute the thread https://github.com/notifications/unsubscribe-auth/AH6N-c_P8s0kdZFfXWKFE9KAMBPemmaSks5ucMdmgaJpZM4Wtaoz .

aziouez25 commented 5 years ago

Hi Roby, Many thanks for your reply. May be my message was not very clear. Sorry for that. The coverage is pretty high: >30X. The detection of the alleles was done by Pacbio on different samples. Calling the alleles on UKtwins was not the problem. I wrote a fasta file with 5 fake chromosomes; one fake chromosome per allele. I run bwa and called with GATK HC. HWE was respected then, but I will rerun Platypus to check again. Concerning STITCH, I want to use it for Haplotype construction only. I am not interested in the imputation. My idea was to tag the 5 alleles with haplotypes instead of SNPs for future use on SNP arrays. So, I first extracted the 2,000 closest SNPs to my indel but that was too much. I then extracted the 50 closest SNPs and ran STITCH with --K=50. The problem is that I have many "duplicated" haplotypes and also haplotypes with "unknown" genotypes (eHapsCurrent_t[i,]>0.1 & e_hapsCurrent_t[i,] <0.9).

My question is how to exclude these "unknown" genotypes from the last e_hapsCurrent_t?

Thanks A