an idea - use of ISOGG's R tree to create reference tables -- that we can use to prioritize clades and rules, as well as check results

jazdrv commented 6 years ago

Re: Iain's comment in issue #23 (as it concerns the faulty promotion of positive ambiguous calls by our rule engine):

"The best solution may be to allow the user to provide an input set of priority mutations, with the implication that this is kept to a minimum (e.g. M269, L151, P312, U106) to provide a framework for fixing bad calls. These would be dealt with first, before moving on to the other "perfect" variants."

Would it be a good idea to use something like https://isogg.org/tree/ISOGG_HapgrpR.html to create a couple of tables?

(1) a recursive table to store the relationships between the SNPs cited there.
(2) a clade table to show which are the priority clades, i.e. which are given preference according to ISOGG's list. Perhaps a guideline we can follow too.

These reference tables could be recreated whenever ISOGG's list changes, much like what we're doing with the hg38 csv and our tables for that file.

And when we have them, we could:

(1) have a priority mechanism for our clades based on ISOGG's data, and stay consistent with ISOGG in the process. (useful for issue #23)
(2) have a way to prioritize our unk rules processing for those variants with unknown data. We could do the furthest upstream ISOGG variants first and the furthest downstream ones last. (useful for issue #21)
(3) have a way to double-check our own determined haplotree against ISOGG's and look for any inconsistencies. (useful for issue #20)

I see a couple of things that would need to be worked out:

At least for me, there are some characters I don't understand in ISOGG's naming rules. For example: what does "^^" mean? What does "/" mean when it separates SNP names? How could we correlate .1/.2 names with our own definitions?
What do we do if we see SNP names in that list that aren't in the hg38 reference?
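
To make the two-table idea a bit more concrete, here is a rough sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the table names (`isogg_tree`, `isogg_clades`), the function names, and the hand-typed sample edges, which skip intermediate SNPs and are not parsed from the ISOGG page. It shows a recursive query that orders variants upstream-first, a check for names missing from a reference set, and a parent-by-parent comparison against our own tree.

```python
import sqlite3

# Hypothetical, simplified sample edges (parent, child); intermediate SNPs
# are skipped. A real loader would parse the ISOGG page instead.
EDGES = [
    (None, "M269"),
    ("M269", "L151"),
    ("L151", "P312"),
    ("L151", "U106"),
]
# The example priority set quoted from issue #23.
PRIORITY = {"M269", "L151", "P312", "U106"}


def build_db(edges, priority):
    """Create the two reference tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE isogg_tree (snp TEXT PRIMARY KEY, parent TEXT);
        CREATE TABLE isogg_clades (snp TEXT PRIMARY KEY, is_priority INTEGER);
    """)
    conn.executemany(
        "INSERT INTO isogg_tree (parent, snp) VALUES (?, ?)", edges)
    conn.executemany(
        "INSERT INTO isogg_clades (snp, is_priority) VALUES (?, ?)",
        [(snp, int(snp in priority)) for _, snp in edges])
    return conn


def upstream_first(conn):
    """Return (snp, depth) rows, furthest upstream first."""
    return conn.execute("""
        WITH RECURSIVE walk(snp, depth) AS (
            SELECT snp, 0 FROM isogg_tree WHERE parent IS NULL
            UNION ALL
            SELECT t.snp, w.depth + 1
            FROM isogg_tree t JOIN walk w ON t.parent = w.snp
        )
        SELECT snp, depth FROM walk ORDER BY depth, snp
    """).fetchall()


def missing_from_reference(conn, reference_names):
    """List ISOGG names that do not appear in our reference set."""
    rows = conn.execute("SELECT snp FROM isogg_tree ORDER BY snp")
    return [snp for (snp,) in rows if snp not in reference_names]


def parent_conflicts(conn, our_parents):
    """Return (snp, isogg_parent, our_parent) where the two trees disagree."""
    rows = conn.execute("SELECT snp, parent FROM isogg_tree ORDER BY snp")
    return [(snp, parent, our_parents[snp])
            for snp, parent in rows
            if snp in our_parents and our_parents[snp] != parent]


conn = build_db(EDGES, PRIORITY)
print(upstream_first(conn))
```

Since the tables are rebuilt from scratch each time, recreating them when ISOGG's list changes would just mean re-running the loader. How to handle names like the .1/.2 variants or the "^^" marker is left open here, since that depends on what those notations turn out to mean.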

jazdrv commented 6 years ago

Talking with Iain, it sounds like there might still be merit in this issue.

Another idea is to use: http://www.jb.man.ac.uk/~mcdonald/genetics/build37.html

We need Jef's blessing before we move forward with anything like this.

jeftreece commented 6 years ago

I'll speak generally, since there are a lot of details still to be worked out.

There are quite a few potential sources for trees. These may include the Big Tree and the various haplogroup projects, FTDNA's, Iain's, Hap-R, and others.

One downside of pulling in one of these sources to build our tree is that we end up proving each other's work. In other words, if there is an error somewhere, we just propagate that error rather than potentially making a new discovery that could come from independently solving the problem.

I think this is a minor consideration for us at this point. So I'm in favor of leveraging any reputable source of information we want to use, assuming it's freely available to use. We need to check copyright, see whether the license permits the use we intend (delivering a GPL-licensed product), and find out how they want the work cited.

It would be nice if we had tree storage worked out because we probably should store such things in some standardized tree data structure - one that various open source tools can operate on.
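
As one possible illustration of a standardized format, and purely as an assumption since no format has been chosen in this thread, Newick is a plain-text tree notation that open source tools such as Biopython's Bio.Phylo can read. A minimal sketch of writing a parent-to-children mapping out as Newick (the `to_newick` helper and the sample tree are hypothetical):

```python
def to_newick(children, root):
    """Serialize a {parent: [child, ...]} mapping to a Newick string.

    Internal nodes keep their labels. Names containing Newick special
    characters (parentheses, commas, colons, semicolons) would need
    quoting, which this sketch does not attempt.
    """
    def render(node):
        kids = children.get(node, [])
        if not kids:
            return node
        return "(" + ",".join(render(k) for k in kids) + ")" + node

    return render(root) + ";"


# Hypothetical, simplified sample tree for illustration only.
tree = {"M269": ["L151"], "L151": ["P312", "U106"]}
print(to_newick(tree, "M269"))  # prints ((P312,U106)L151)M269;
```

Newick only carries names and branch lengths, so per-SNP details would still have to live somewhere else, such as the database tables.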

Implicit in the tree storage question is where this data lives. I guess we could consider it part of the data layer. Perhaps it's better as part of the analysis layer? Not sure. I'll have to think about that more.

jazdrv / dnaTools

an idea - use of ISOGG's R tree to create reference tables -- that we can use to prioritize clades and rules, as well as check results #29