Is the XML file mal-formatted?

mocksu commented 4 years ago

It looks like the XML file is not formatted correctly. For instance,

Given the following, "H2a2a2" is at the same depth level as "H2a2a", while the former should be a sub-clade of the latter:

152C 6716G

263G ....
"H2a2a1", the first haplogroup documented in the file, is actually the root haplogroup of all other haplogroups, while in biology this is not true.
If you use an XML parser (e.g. Python's xml.dom.minidom), you can tell that the only haplogroup child of the document is "H2a2a1", which does not seem right.
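
The check described above can be sketched with xml.dom.minidom. This is only an illustration on a toy document: the `phylotree` and `haplogroup` element names and the `name` attribute are assumptions about the file layout, and the helper functions are hypothetical names introduced here.

```python
from xml.dom import minidom

# Toy document. Element and attribute names are assumed for illustration,
# not copied from phylotree17.xml.
SAMPLE = """<phylotree>
  <haplogroup name="H2a2a1">
    <haplogroup name="H2a2a"/>
    <haplogroup name="H2a2a2"/>
  </haplogroup>
</phylotree>"""


def child_haplogroups(node):
    """Return the direct <haplogroup> element children of a DOM node."""
    return [
        child
        for child in node.childNodes
        if child.nodeType == child.ELEMENT_NODE and child.tagName == "haplogroup"
    ]


def haplogroup_depths(node, depth=0):
    """Map each haplogroup name to its nesting depth below `node`."""
    depths = {}
    for child in child_haplogroups(node):
        depths[child.getAttribute("name")] = depth
        depths.update(haplogroup_depths(child, depth + 1))
    return depths


root = minidom.parseString(SAMPLE).documentElement
top_level = [h.getAttribute("name") for h in child_haplogroups(root)]
depths = haplogroup_depths(root)
print(top_level)
print(depths)
```

In this toy tree the only top-level haplogroup is "H2a2a1", and "H2a2a" and "H2a2a2" share the same depth, which is the situation described in this comment.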

Your help would be highly appreciated!

Thanks so much!

haansi commented 4 years ago

Dear mocksu,

Thanks for checking - you refer to this file: https://raw.githubusercontent.com/seppinho/haplogrep-cmd/master/data/phylotree/phylotree17.xml

The point is that HaploGrep classifies haplogroups relative to the rCRS by default, and the rCRS corresponds to haplogroup H2a2a1 (accessionNr: NC_012920). Because the rCRS is treated as the root, all other haplogroups sit below it, and both H2a2a (child) and H2a2a2 (neighbor) are handled on the same level. This might be a bit confusing: all subsequent branches can only be traversed via H2a2a, while H2a2a2 is rather a "sibling". Navigation of the tree always starts at H2a2a1, so both H2a2a2 and H2a2a are now handled as child nodes.

[Screenshot from 2019-09-10 11-38-27]

We also provide the RSRS-oriented version of Phylotree: https://raw.githubusercontent.com/seppinho/haplogrep-cmd/master/data/phylotree/phylotree17_rsrs.xml which is (from the evolutionary point of view) the correct version. There is an ongoing debate on which version to use; see this blog post: http://haplogrep.uibk.ac.at/blog/rcrs-vs-rsrs-vs-hg19/

mocksu commented 4 years ago

Thanks so much for your quick reply. Really appreciate it. It helped A LOT!

A quick question regarding the mutation annotation: I understand "1234G" means there is a Ref => G mutation at the position 1234. But what does "1234.XC" mean? What does "1234G!" mean?

Although using rCRS makes a lot of sense, do you think you could still make the XML adopt the evolutionary phylotree while using rCRS as the reference sequence? I guess you might only need to change the ref/alt alleles and positions for some of the polymorphisms in the file https://raw.githubusercontent.com/seppinho/haplogrep-cmd/master/data/phylotree/phylotree17_rsrs.xml and keep the haplogroup topology the same? Some algorithms, like the one used in yHaplo (23andMe), rely on the evolutionary tree topology to work.

haansi commented 4 years ago

1234.XC means there are a number of C insertions. There are "hotspots" for insertions (e.g. at 5899, 960, 573) where the number of Cs (a homopolymeric C-stretch) makes it hard to determine the correct number of insertions.

12345G! represents a so-called back mutation; see phylotree.org:

Mutations that are reversions to an ancestral state (back mutations) are indicated with an exclamation mark (!), two exclamation marks for a double back mutation (!!), etc., e.g. "A15301G!".
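
The notation discussed here ("1234G", "1234.XC", "1234G!", "A15301G!") could be parsed roughly as follows. This is a minimal sketch: the grammar is inferred only from the examples in this thread, the numeric insertion marker (as in ".1") is an added assumption, and `Poly` and `parse_poly` are hypothetical names, not part of HaploGrep. Deletions and other notations are not handled.

```python
import re
from typing import NamedTuple, Optional

# Assumed grammar, inferred from the examples in this thread only:
#   [ref base] position [.X or .<n>] base(s) [one "!" per back mutation]
POLY_RE = re.compile(
    r"^(?P<ref>[ACGT])?"       # optional reference base, as in "A15301G!"
    r"(?P<pos>\d+)"            # position
    r"(?:\.(?P<ins>X|\d+))?"   # ".X" = unspecified number of insertions
    r"(?P<alt>[ACGT]+)"        # observed or inserted base(s)
    r"(?P<back>!*)$"           # "!" = back mutation, "!!" = double, etc.
)


class Poly(NamedTuple):
    position: int
    ref: Optional[str]
    alt: str
    insertion: Optional[str]   # None, "X", or a numeric marker (assumed)
    back_mutations: int


def parse_poly(text: str) -> Poly:
    """Parse one polymorphism string under the assumed grammar above."""
    match = POLY_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised polymorphism: {text!r}")
    return Poly(
        position=int(match["pos"]),
        ref=match["ref"],
        alt=match["alt"],
        insertion=match["ins"],
        back_mutations=len(match["back"]),
    )


for example in ("1234G", "1234.XC", "1234G!", "A15301G!"):
    print(example, parse_poly(example))
```

Counting the exclamation marks rather than storing a boolean keeps double back mutations ("!!") distinguishable from single ones.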

Regarding the adaptation of the XML: we have to stick either to rCRS (for both reference and tree root) or to RSRS (also for both tree root and reference). If you use FASTA sequences, there shouldn't be any difference in haplogroup classification. Both trees, with their corresponding references, yield the same haplogroup, but with different resulting mitochondrial profiles. The differences are to be expected at the sites listed on the Phylotree site.

Mixing RSRS as tree root and rCRS as reference would have some side effects on the classification, especially for back mutations and for variants that differ between rCRS and RSRS (see the list mentioned above).

mocksu commented 4 years ago

Thanks so much for the explanation of the mutation annotation. I thought "!" might mean insertion, but never thought it could be a back mutation.

Regarding the tree structure of Phylotree, I agree that if one treats all mutations as equally important, then the topology of the tree does not really matter. But what if one wants to treat the mutations differently, for instance, weigh the mutations closer to the top of the evolutionary tree more than those closer to the leaves? This is actually what yHaplo from 23andMe does.

haansi commented 4 years ago

I see your point. Our model differs here: we calculate the weight of every mutation based on its occurrence in different branches of the tree. A mutation that occurs on only one branch gets a higher weight than a mutation that defines haplogroups in several different branches. You can find the values in the corresponding weights files, and the formula we use to calculate the weights is in the supplemental material of the paper: https://academic.oup.com/nar/article/44/W1/W58/2499296#supplementary-data. I am not sure whether changing the weights would improve the haplogroup classification; look at the mutations from RSRS to R0, for example:

[Screenshot from 2019-09-12 09-24-47]

Here's an excerpt from weights17.txt

mutation	occurrence
152C	197
16311C	138
146C	122
195C	113
16189C	110
16362C	81
...	...

As you can see, mutations from the most recent common ancestor already appear among the top hits of the most strongly fluctuating mutations. Hope this helps, best Hansi
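
To make the idea of occurrence-based weighting concrete, here is a small sketch that reads the excerpt above and ranks the mutations. It assumes the file is tab-separated with the header shown in the excerpt. The `toy_weight` function is a placeholder assumption and is NOT the HaploGrep formula (that one is given in the paper's supplement); it only mimics the stated property that rarer mutations get higher weights.

```python
import csv
import io

# The rows quoted in the excerpt above (tab-separated layout is assumed).
EXCERPT = (
    "mutation\toccurrence\n"
    "152C\t197\n"
    "16311C\t138\n"
    "146C\t122\n"
    "195C\t113\n"
    "16189C\t110\n"
    "16362C\t81\n"
)


def read_occurrences(text):
    """Parse 'mutation<TAB>occurrence' rows into a dict of int counts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["mutation"]: int(row["occurrence"]) for row in reader}


def toy_weight(occurrence):
    """Placeholder weight, NOT the published formula: fewer occurrences
    across branches means a higher weight."""
    return 1.0 / occurrence


occurrences = read_occurrences(EXCERPT)
# Sort from least informative (lowest toy weight) to most informative.
ranked = sorted(occurrences, key=lambda m: toy_weight(occurrences[m]))
for mutation in ranked:
    print(mutation, occurrences[mutation], round(toy_weight(occurrences[mutation]), 4))
```

Under any weight that decreases with occurrence, 152C ends up as the least informative mutation in this excerpt, since it recurs on the most branches.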

mocksu commented 4 years ago

I read the reference papers about haplogrep and haplogrep2. We used haplogrep2 to infer MT haplogroups, but the results were not satisfactory. For instance, for African Americans, haplogrep2 only inferred < 10% of the samples as "L", while we were expecting 90+%. My guess is that haplogrep2 works well if the MT marker set is close to that of the 1000 Genomes? Our markers are different.

nuin commented 4 years ago

@mocksu I think the issue here is not haplogrep but Phylotree, where haplogrep gets its haplogroups. There's no explanation of how Phylotree generates the relationships or how it defines branching mutations and other differences, not to mention that Phylotree is based on 20k genomes that may or may not be correctly sequenced or annotated, to say the least.

We use haplogrep to find new mutations on sequenced genomes, so we never correlate ethnicity with haplogroup ID, so we don't see that part of the problem.

mocksu commented 4 years ago

Thanks so much for your reply!

haansi commented 4 years ago

@nuin, thanks for the comment :+1:. I agree with you that haplogrep is only as good as the underlying "database", which is Phylotree. Mannis once presented his approach to upgrading Phylotree (which he does in his spare time!!! So highest respect to him for this effort). For a haplogroup to be added to Phylotree, he takes at least 2-3 sequences from different sources/labs in GenBank/HapMap, to minimize the risk of sequencing errors from single labs.

@mocksu regarding your dataset

For instance, for African Americans, haplogrep2 only inferred < 10% of the samples as "L", while we were expecting 90+%.

This sounds to me as if you used a microarray, which was probably based on the Yoruba sequence (present until hg18/19). I doubt that data derived from a sequencing approach would yield this result. Feel free to contact me via mail: hansi.weissensteiner@i-med.ac.at

nuin commented 4 years ago

@hansi Coming from a phylogenetics background, there are some decisions in Phylotree that don't make sense to me. I agree with the praise for the creators and maintainers, but at some point we need a more scientific approach to this "problem", and maybe better documentation and a better file format while we are at it.

MT genome capture in 1000 Genomes/HapMap is far from ideal, as there might be contamination from nuclear genes that have similar sequences. We do extra-long-range PCRs to be sure we capture only MT before we subject it to NGS.

haansi commented 4 years ago

@nuin thanks for this comment! I highly agree with you, that a transparent systematic approach is needed, in order to update the mt-phylogeny!

Regarding 1000G data: @mocksu forgot to mention that 1000G is mostly not included in Phylotree, except some samples overlapping from HapMap

We analyzed the 1000G data, which was mapped to both nDNA and mtDNA, and the data basically looks very good. There are some contamination issues, but these are expected from the study protocol (soon on bioRxiv). In principle, extra-long-range PCR should avoid issues with NUMTs, but according to our experiments and the literature it can, in some rare cases, lead to problems, e.g.: https://www.frontiersin.org/articles/10.3389/fgene.2019.00518/full

mocksu commented 4 years ago

@haansi Yes, we use a microarray dataset. For WGS, haplogrep2 worked fine for us.

seppinho / haplogrep-cmd

Is the XML file mal-formatted? #29