Open hyanwong opened 5 months ago
Raw data is available for 1000G at https://www.biorxiv.org/content/10.1101/2024.04.18.590093v1
Richard Durbin also has a new preprint out about TE insertions in real data: https://www.biorxiv.org/content/10.1101/2024.04.05.588311v1.full
Inferring a GIG from real data is probably going to be the most difficult part of the entire GIG project. This issue is to collect ideas and references.
For a start, I've just come across the paper/software below, which references various approaches for constructing simple trees from k-mers. It strikes me that we might have to use a k-mer approach for GIG inference, as this is the only way we will be robust to different coordinate systems, so I wonder whether there is anything we can reuse from these ideas. A web search for "alignment-free phylogeny" will probably go a long way here:
https://pubmed.ncbi.nlm.nih.gov/38547397/
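To make the k-mer idea concrete, here is a minimal sketch of an alignment-free distance between sequences, which could then be fed to any standard tree-building method. It is not taken from the linked paper: the Jaccard distance on k-mer sets is just one common choice, and the function names, the value of `k`, and the toy sequences are all made up for illustration. The point is only that an insertion shifts every downstream coordinate but leaves most k-mers intact.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=4):
    """Jaccard distance, 1 - |A & B| / |A | B|, between two k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k: treat them as identical.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def distance_matrix(seqs, k=4):
    """Pairwise k-mer distances for a dict of name -> sequence."""
    names = sorted(seqs)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(seqs[x], seqs[y], k)
        dist[(x, y)] = dist[(y, x)] = d
    return dist


# Toy sequences, invented for illustration only.
toy = {
    "ref": "ACGTTGCAAGGCTTAACCGT",
    # Same as "ref" but with a 3 bp insertion, so downstream coordinates shift.
    "ins": "ACGTTGCAAGGGGGCTTAACCGT",
    "unrelated": "TTTTTTTTTTTTTTTTTTTT",
}

if __name__ == "__main__":
    for pair, d in sorted(distance_matrix(toy).items()):
        print(pair, round(d, 3))
```

No alignment or shared coordinate system is needed to compute these distances, which is the property we care about. Whether a set-based distance like this retains enough information to recover a GIG, rather than just a single tree, is an open question.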
Also, PanMAN gives a nice example of running an algorithm to produce an ancestry with structural variation, for limited recombinant ancestries such as that of SARS-CoV-2:
https://www.biorxiv.org/content/10.1101/2024.07.02.601807v1.full.pdf+html