Open duncanMR opened 1 year ago
I think this is something we'll want to include as a "debugging tool", which helps people track down dodgy samples. We won't run it automatically, but we can give some guidance in a tutorial about how you would use it to diagnose problems.
So, we don't need to worry about it being particularly efficient.
Nice!!
Given the overwhelming complexity of big inferred tree sequences, I find it difficult to visualise their fine-grained structure without simplifying them. One approach I've developed with @jeromekelleher is to plot parts of the edges table. We pick a sample node, then for each tree in the ts we check what the parent of the sample node is, and what the parent of that parent is. We record the parent and grandparent nodes in an array, along with their times. Here is the code I wrote to do that:
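The original code is not reproduced in this text. As a rough illustration of the idea only, here is a minimal sketch that works from a plain list of edges rather than iterating over trees. The `Edge` tuple, the `ancestry_intervals` name and the `node_time` mapping are all hypothetical stand-ins; with tskit you could presumably build the same inputs from `ts.edges()` and the node times. It assumes that the edges for any one child do not overlap, as in a valid tree sequence.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    # Hypothetical stand-in for one row of an edges table.
    left: float
    right: float
    parent: int
    child: int


def ancestry_intervals(edges, node_time, sample):
    """Return rows (left, right, parent, grandparent, parent_time, grandparent_time).

    A grandparent of -1 (with time None) marks stretches where the parent
    has no edge above it, i.e. it is a root there.
    """
    rows = []
    # Each edge whose child is the sample gives one parent interval.
    for e in sorted(x for x in edges if x.child == sample):
        pos = e.left
        # Edges above the parent that overlap this interval give the grandparents.
        above = sorted(
            g for g in edges
            if g.child == e.parent and g.left < e.right and g.right > e.left
        )
        for g in above:
            lo, hi = max(g.left, e.left), min(g.right, e.right)
            if lo > pos:  # gap: parent has no parent of its own here
                rows.append((pos, lo, e.parent, -1, node_time[e.parent], None))
            rows.append(
                (lo, hi, e.parent, g.parent, node_time[e.parent], node_time[g.parent])
            )
            pos = hi
        if pos < e.right:  # trailing gap
            rows.append((pos, e.right, e.parent, -1, node_time[e.parent], None))
    return rows


# Toy example: sample 0 switches parent at 50, and its second parent
# switches grandparent at 70.
toy_edges = [
    Edge(0.0, 50.0, 3, 0),
    Edge(50.0, 100.0, 4, 0),
    Edge(0.0, 100.0, 5, 3),
    Edge(0.0, 70.0, 5, 4),
    Edge(70.0, 100.0, 6, 4),
]
toy_times = {0: 0.0, 3: 1.0, 4: 1.0, 5: 2.0, 6: 3.0}
toy_rows = ancestry_intervals(toy_edges, toy_times, sample=0)
```

This scans the whole edge list for every parent interval, so it is slow on large inputs, which seems acceptable for a debugging tool that is only run on a handful of suspect samples.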
The function that turns the numpy array into a dataframe of intervals is as follows, along with the mutation function:
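The original functions are likewise missing from this text. As a hedged sketch of the interval step only (not the mutation function), the per-tree records can be collapsed so that consecutive trees with the same parent and grandparent become one interval. The `merge_adjacent` name and the `(left, right, parent, grandparent)` row layout are assumptions made for illustration, and plain tuples are used here instead of a numpy array or dataframe.

```python
def merge_adjacent(rows):
    """Collapse consecutive (left, right, parent, grandparent) rows.

    Rows are assumed to be sorted by position. Two rows are merged when
    they touch (right == next left) and share parent and grandparent.
    """
    merged = []
    for left, right, parent, grandparent in rows:
        if (
            merged
            and merged[-1][1] == left
            and merged[-1][2:] == (parent, grandparent)
        ):
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], right, parent, grandparent)
        else:
            merged.append((left, right, parent, grandparent))
    return merged


# One row per tree; five trees collapse into three intervals.
per_tree = [
    (0, 10, 3, 5),
    (10, 20, 3, 5),
    (20, 30, 4, 5),
    (30, 40, 4, 5),
    (40, 50, 3, 5),
]
intervals = merge_adjacent(per_tree)
```

The resulting list of tuples could then be passed to a dataframe constructor for plotting.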
Here is an example of a plot, using the same data as #23:
In this case, the sample is part of a trio, so there are only two immediate ancestors (one for each of the parent's recombinant chromosomes). The plots from a sample of a real trio (taken from the 100,000 Genomes Project) are much more fragmented:
I'm not sure if the algorithm is efficient enough to be included in the QC notebook. I do think the plots could be improved: for example, I'd rather use Mbp units on the x axis, and the text summary strings and titles might be too much information. Any feedback would be appreciated!