Open swvanderlaan opened 7 years ago
Agreed 100% -- this is why I put the data up on GitHub for when I had time and someone else had some interest in completing this project :-). Looks like both have arrived.
--t
On Thu, Jul 27, 2017 at 3:35 AM, Sander W. van der Laan < notifications@github.com> wrote:
Hi,
For lack of a better place to put this request, here goes.
I got the information on these SNPs from ENSEMBL, like so:
cat("\n* Loading ENSEMBL information on the 65 hm450K variants...")
This method is based on the links below.
Ref: https://bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/biomaRt.html
Ref: https://support.bioconductor.org/p/19727/
Ref: https://nsaunders.wordpress.com/2015/04/28/some-basics-of-biomart/
Ref: https://davetang.org/muse/2012/04/27/learning-to-use-biomart/
library("biomaRt")
snp.grch37 = useMart(biomart = "ENSEMBL_MART_SNP", dataset = "hsapiens_snp", host = "grch37.ensembl.org", path = "/biomart/martservice")
filters = listFilters(snp.grch37)
attributes = listAttributes(snp.grch37)
Note: you would need an object with some 'called' genotypes of the methylation chip - like in your example.
hm450k.snps.list <- rownames(methAlleles)
hm450k.snps.ensembl89.b37 = getBM(
  attributes = c("refsnp_id", "chr_name", "chrom_start", "chrom_end", "allele", "allele_1", "minor_allele", "minor_allele_freq", "minor_allele_count"),
  filters = c("snp_filter"),
  values = hm450k.snps.list,
  mart = snp.grch37)
Then the list of variants looks like this:
show(hm450k.snps.ensembl89.b37)
refsnp_id chr_name chrom_start chrom_end allele allele_1 minor_allele minor_allele_freq minor_allele_count
1 rs3936238 1 4131726 4131726 A/G A G 0.409345 2050
2 rs877309 1 11489678 11489678 A/G G G 0.359824 1802
3 rs213028 1 21652177 21652177 C/T T T 0.452476 2266
4 rs11249206 1 25277982 25277982 C/T T C 0.488419 2446
5 rs654498 1 82173048 82173048 C/T T C 0.429113 2149
6 rs3818562 1 110300441 110300441 G/A A A 0.473243 2370
7 rs715359 1 177431168 177431168 T/C T C 0.377596 1891
8 rs2804694 1 181331833 181331833 G/A A A 0.425519 2131
9 rs6426327 1 246089811 246089811 G/A G G 0.473043 2369
10 rs10796216 10 14691889 14691889 A/G G A 0.446486 2236
11 rs10882854 10 98639097 98639097 T/C C C 0.451278 2260
12 rs11034952 11 38788615 38788615 T/C T T 0.321086 1608
13 rs1945975 11 105299724 105299724 C/T C T 0.442492 2216
14 rs10846239 12 16016108 16016108 T/C T T 0.443490 2221
15 rs2468330 12 43198722 43198722 G/A/C G G 0.493810 2473
16 rs1495031 12 63498136 63498136 C/T C T 0.434505 2176
17 rs10774834 12 110108063 110108063 C/T T C 0.434305 2175
18 rs951295 15 45999823 45999823 A/G G G 0.499800 2503
19 rs2959823 15 76413529 76413529 G/A A A 0.490016 2454
20 rs1510189 16 56100524 56100524 T/C C C 0.447085 2239
21 rs1941955 18 35162838 35162838 T/C C T 0.379992 1903
22 rs966367 2 12148220 12148220 C/T C T 0.492412 2466
23 rs4331560 2 49084673 49084673 A/G G G 0.428514 2146
24 rs1510480 2 60825441 60825441 G/A G A 0.492612 2467
25 rs6546473 2 69260357 69260357 A/G G A 0.471246 2360
26 rs264581 2 159971363 159971363 A/G A G 0.413139 2069
27 rs2235751 20 1969934 1969934 A/G G A 0.448882 2248
28 rs845016 21 33998284 33998284 C/T T C 0.430911 2158
29 rs2032088 21 38477330 38477330 G/A A G 0.407348 2040
30 rs1467387 22 25931372 25931372 T/C T C 0.438498 2196
31 rs133860 22 26144760 26144760 C/T T T 0.477436 2391
32 rs739259 22 27356579 27356579 A/G A G 0.457668 2292
33 rs2857639 22 30055674 30055674 A/G A G 0.299920 1502
34 rs2208123 22 48214812 48214812 A/G A A 0.442891 2218
35 rs939290 3 14658866 14658866 T/C/G C T 0.405351 2030
36 rs9839873 3 86662155 86662155 T/C T T 0.326677 1636
37 rs1520670 3 98972926 98972926 A/G G A 0.452476 2266
38 rs10936224 3 160902531 160902531 G/A G G 0.457668 2292
39 rs10033147 4 14357130 14357130 A/G A G 0.367812 1842
40 rs2125573 4 128607832 128607832 T/C C T 0.402157 2014
41 rs7660805 4 131044050 131044050 G/A G A 0.435903 2183
42 rs10155413 4 138037282 138037282 T/C C C 0.430911 2158
43 rs9292570 5 35000312 35000312 T/C/G T C 0.494609 2477
44 rs348937 5 112825677 112825677 C/T T C 0.494609 2477
45 rs1019916 5 146648359 146648359 G/A G G 0.418530 2096
46 rs7746156 6 47130819 47130819 T/C C T 0.428115 2144
47 rs9363764 6 68232042 68232042 G/A A A 0.494010 2474
48 rs10457834 6 149042800 149042800 T/C T T 0.358027 1793
49 rs6982811 8 36547313 36547313 T/C C C 0.288738 1446
50 rs1484127 8 51725654 51725654 G/A A A 0.469050 2349
51 rs6471533 8 96470300 96470300 G/A G G 0.469649 2352
52 rs472920 8 96848744 96848744 A/G A A 0.483027 2419
53 rs6991394 8 121782402 121782402 C/T T C 0.465655 2332
54 rs2385226 8 126681996 126681996 C/T C T 0.484824 2428
55 rs4742386 9 7711758 7711758 A/G G A 0.399361 2000
56 rs1040870 9 85415537 85415537 C/T C T 0.392372 1965
57 rs1414097 9 121359692 121359692 T/C C C 0.430312 2155
58 rs2521373 X 9476990 9476990 A/G G A 0.389669 1471
59 rs798149 X 15885431 15885431 A/G G A 0.421457 1591
60 rs5926356 X 28247331 28247331 A/G A A 0.461457 1742
61 rs5936512 X 69066821 69066821 A/G G G 0.499073 1884
62 rs5987737 X 114710721 114710721 T/C T C 0.459073 1733
63 rs5931272 X 137077857 137077857 A/G A A 0.432848 1634
64 rs1416770 X 145219165 145219165 T/C T T 0.446887 1687
65 rs6626309 X 147204373 147204373 C/T T C 0.406887 1536
Looking at that and combining it with the information you provide:
- there are three SNPs that are tri-allelic (rs2468330, rs939290, rs9292570) - I throw these out; these are marked with **.
- there is one SNP, rs13369115, that is now merged into 'rs10155413'. See:
- there are 18 SNPs (rs10936224, rs10033147, rs2125573, rs1019916, rs7746156, rs9363764, rs6982811, rs6471533, rs472920, rs6991394, rs2385226, rs4742386, rs1040870, rs2521373, rs5926356, rs5987737, rs1416770, rs6626309) common to both the 450K and EPIC-chips, where the base corresponding to the M and U alleles could not be determined; these are marked with *.
So this leaves us with 65 - 3 tri-allelic - 18 undetermined alleles = 44 SNPs for fingerprinting.
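The filtering arithmetic above can be sketched in R roughly as follows. This is only an illustration on a five-row toy subset of the table, not a live biomaRt query; the names `snps`, `undetermined`, and `usable` are made up here, and the tri-allelic check simply counts the alleles listed in the `allele` string.

```r
# Toy subset: five rows copied from the biomaRt output in this post.
snps <- data.frame(
  refsnp_id = c("rs3936238", "rs2468330", "rs939290", "rs10936224", "rs7660805"),
  allele    = c("A/G", "G/A/C", "T/C/G", "G/A", "G/A"),
  stringsAsFactors = FALSE
)

# The 18 SNPs whose M/U allele assignment could not be determined (list from this post).
undetermined <- c(
  "rs10936224", "rs10033147", "rs2125573", "rs1019916", "rs7746156", "rs9363764",
  "rs6982811", "rs6471533", "rs472920", "rs6991394", "rs2385226", "rs4742386",
  "rs1040870", "rs2521373", "rs5926356", "rs5987737", "rs1416770", "rs6626309"
)

# A variant is treated as tri-allelic when its allele string lists more than two alleles.
n_alleles       <- lengths(strsplit(snps$allele, "/", fixed = TRUE))
is_triallelic   <- n_alleles > 2
is_undetermined <- snps$refsnp_id %in% undetermined

# Keep only bi-allelic SNPs with a known M/U assignment.
usable <- snps[!is_triallelic & !is_undetermined, ]
print(usable$refsnp_id)
```

Applied to the full 65-row data frame instead of the toy subset, the same filter should leave the 44 SNPs counted above.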
What we do not consider here is MAF: it is probably good to check the frequencies of the SNPs in your data against those in the reference (in this case ENSEMBL).
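A rough sketch of such a frequency check, under loud assumptions: the genotype matrix `geno` is hypothetical toy data, coded as copies (0/1/2) of the allele that ENSEMBL reports as minor, and the 0.2 cutoff is arbitrary. Dividing by 2 assumes diploid calls, so the chrX SNPs would need separate handling in males.

```r
# Hypothetical genotype calls: SNPs in rows, samples in columns,
# coded as copies (0/1/2) of the ENSEMBL minor allele; NA would mean no call.
geno <- rbind(
  rs3936238 = c(0, 1, 1, 2),
  rs877309  = c(2, 2, 2, 1)
)
colnames(geno) <- paste0("sample", 1:4)

# Reference frequencies copied from the minor_allele_freq column in this post.
ref_maf <- c(rs3936238 = 0.409345, rs877309 = 0.359824)

# Observed frequency of that allele: mean copies per sample divided by 2 (diploid).
obs_freq <- rowMeans(geno, na.rm = TRUE) / 2

# Flag SNPs whose observed frequency is far from the reference
# (the 0.2 cutoff is arbitrary and only for illustration).
delta   <- abs(obs_freq - ref_maf[names(obs_freq)])
flagged <- names(delta)[delta > 0.2]

print(data.frame(obs_freq, ref_maf = ref_maf[names(obs_freq)], delta))
print(flagged)
```

With real cohorts the comparison would need many more samples than this toy matrix, and a population-matched reference where possible.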
It would be great if there were some function that enables:
- fingerprinting of the EPIC/450K samples using these 44 SNPs
- fingerprinting of combined EPIC/450K and GWAS datasets using these 44 SNPs, if a GWAS dataset is indeed available for (a subset of) the samples.
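To make the request concrete, here is a minimal sketch of what such a fingerprinting function might look like. It is not an existing function in any package: `genotype_concordance` and the toy matrices are hypothetical, and it assumes both datasets hold genotype calls coded 0/1/2 for the same allele, with rs IDs as row names.

```r
# Pairwise genotype concordance between two call sets (hypothetical helper).
# a, b: matrices with SNPs in rows (rs IDs as rownames) and samples in columns.
genotype_concordance <- function(a, b) {
  shared <- intersect(rownames(a), rownames(b))
  a <- a[shared, , drop = FALSE]
  b <- b[shared, , drop = FALSE]
  conc <- matrix(NA_real_, ncol(a), ncol(b),
                 dimnames = list(colnames(a), colnames(b)))
  for (i in seq_len(ncol(a))) {
    for (j in seq_len(ncol(b))) {
      # Fraction of shared SNPs with identical calls; missing calls are ignored.
      conc[i, j] <- mean(a[, i] == b[, j], na.rm = TRUE)
    }
  }
  conc
}

# Toy calls for four of the SNPs listed above (values are invented).
ids   <- c("rs3936238", "rs877309", "rs213028", "rs11249206")
array <- cbind(s1 = c(0, 1, 2, 1), s2 = c(2, 0, 0, 1))
gwas  <- cbind(g1 = c(0, 1, 2, 1), g2 = c(2, 0, 1, 1))
rownames(array) <- ids
rownames(gwas)  <- ids

conc <- genotype_concordance(array, gwas)
# Best-matching GWAS sample for each array sample.
best <- apply(conc, 1, function(x) names(x)[which.max(x)])
print(conc)
print(best)
```

The same function could be called with the array calls on both sides to find duplicated or swapped samples within the EPIC/450K data alone.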
What do you think?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ttriche/infiniumSnps/issues/1, or mute the thread https://github.com/notifications/unsubscribe-auth/AAARImNA1hSvxiIiOz0zhB7rd87KQvDdks5sSGgHgaJpZM4OlGOL .
nb. Triallelic isn't necessarily a huge problem -- we can always represent the "non-reference" alleles with their IUPAC degenerate code, e.g. Y or R. So, not a showstopper for those SNPs.
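As a sketch of the IUPAC idea, the lookup below collapses the non-reference alleles of an allele string into a single code. The helper `alt_iupac` is hypothetical, and it assumes the first allele in the ENSEMBL string is the reference, which should be verified before relying on it.

```r
# Standard IUPAC nucleotide codes, keyed by the sorted set of bases they stand for.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           AG = "R", CT = "Y", CG = "S", AT = "W", GT = "K", AC = "M",
           CGT = "B", AGT = "D", ACT = "H", ACG = "V")

# Hypothetical helper: code for the non-reference alleles of a string like "G/A/C".
# ASSUMPTION: the first allele listed is the reference allele.
alt_iupac <- function(allele) {
  bases <- strsplit(allele, "/", fixed = TRUE)[[1]]
  alts  <- sort(unique(bases[-1]))
  unname(iupac[paste(alts, collapse = "")])
}

# The three tri-allelic SNPs from the table: rs2468330, rs939290, rs9292570.
codes <- sapply(c("G/A/C", "T/C/G", "T/C/G"), alt_iupac)
print(codes)
```

Under that assumption, G/A/C collapses to reference G plus M (A or C), and T/C/G to reference T plus S (C or G).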
MAF shouldn't matter much for the SNPs I was able to validate in the NA12878 pedigree -- those are as close to absolute as any genome I know of -- but it certainly will in a given population. It would be handy for weighting. Most of the rs probes included by Illumina are high-MAF across most human populations, but there will be exceptions. There is also a large pile of SNP-containing and ethnicity-informative probes among the "methylation" CpG probes, as described in https://academic.oup.com/nar/article/45/4/e22/2290930/Comprehensive-characterization-annotation-and
A (naive) approach to a multi-ethnic cohort on the rs probes alone reliably recovered ethnicity, so you must be correct about the varying MAFs. A larger, more sophisticated approach along with X/Y CN and DNAme ratios should recover far more information, especially in karyotypically normal specimens (I am used to working on specimens where chrX and chrY fall off with alarming regularity).
--t