szhan / tsimpute

Genome-wide genotype imputation using tree sequences.
MIT License

Visualise LS HMM copying paths #64

Closed: szhan closed this issue 1 year ago

szhan commented 1 year ago

Add some plotting routines to help diagnose potential issues with sample paths. This is the working version.

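import matplotlib.pyplot as plt
import numpy as np


def plot_sample_path(path, site_pos, tracks=None, window=None):
    """Plot an LS HMM copying path against genomic position.

    tracks: optional list of (positions, color) pairs, each drawn as a row
    of tick marks below the path.
    window: optional (start, end) pair used as the x-axis limits.
    """
    fig, ax = plt.subplots(1, 1, figsize=(20, 5))
    ax.plot(site_pos, path)
    # Draw each track as a row of tick marks at its own negative y offset.
    if tracks is not None:
        for i, track in enumerate(tracks):
            positions, color = track[0], track[1]
            ax.plot(
                positions,
                np.repeat(-(i + 1) * 1_000, len(positions)),
                marker="|",
                color=color,
                linestyle="",
            )
    # Optionally zoom in on a genomic window.
    if window is not None:
        assert len(window) == 2
        ax.set_xlim(window[0], window[1])
    ax.set_ylabel("Index of sample")
    ax.set_xlabel("Genomic position")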

Example output: [screenshot, 2023-06-05]

szhan commented 1 year ago

Tracks of information can be added. For example, sites of disagreement between lshmm and BEAGLE and chip-like sites.

[Screenshot, 2023-06-05]
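
As a rough sketch of how such a track could be built: the helper below returns a `(positions, color)` pair, which is the shape that `plot_sample_path` reads from each entry of `tracks`. The name `make_disagreement_track` and the toy data are made up for illustration and are not part of tsimpute.

```python
def make_disagreement_track(site_pos, genotypes_a, genotypes_b, color="red"):
    """Return a (positions, color) track marking sites where two calls differ.

    Hypothetical helper: the inputs are assumed to be equal-length sequences
    with one entry per site.
    """
    if not (len(site_pos) == len(genotypes_a) == len(genotypes_b)):
        raise ValueError("All inputs must have one entry per site.")
    positions = [
        pos for pos, a, b in zip(site_pos, genotypes_a, genotypes_b) if a != b
    ]
    return positions, color


# Toy data (made up): four sites, with one disagreement at position 400.
site_pos = [100, 250, 400, 900]
lshmm_calls = [0, 1, 1, 0]
beagle_calls = [0, 1, 0, 0]
track = make_disagreement_track(site_pos, lshmm_calls, beagle_calls)
print(track)
```

A list of such pairs, e.g. `[track]`, could then be passed as the `tracks` argument.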

szhan commented 1 year ago

It may also be useful to visually compare sample paths. For example, an HMM path for the same sample obtained under different precision values when running lshmm.

[Screenshot, 2023-06-05]
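
Alongside the visual comparison, two paths could also be compared numerically by collapsing each one into its copying segments. This is only a sketch; `path_segments` is a hypothetical helper and the two toy paths stand in for runs at different precision values.

```python
def path_segments(path, site_pos):
    """Run-length encode a copying path.

    Returns a list of (first_position, last_position, node_id) tuples, one per
    maximal run of consecutive sites copied from the same node.
    """
    if len(path) != len(site_pos):
        raise ValueError("path and site_pos must have the same length.")
    segments = []
    start = 0
    for i in range(1, len(path) + 1):
        # Close the current run at the end of the path or when the node changes.
        if i == len(path) or path[i] != path[start]:
            segments.append((site_pos[start], site_pos[i - 1], path[start]))
            start = i
    return segments


# Toy paths (made up) for the same sample under two settings.
site_pos = [10, 20, 30, 40, 50, 60]
path_a = [5, 5, 7, 7, 7, 5]
path_b = [5, 5, 5, 7, 7, 5]
segments_a = path_segments(path_a, site_pos)
segments_b = path_segments(path_b, site_pos)
print(segments_a)
print(segments_b)
```

The number of switches in a path is then `len(segments) - 1`, and shifted breakpoints show up as differing segment boundaries.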

szhan commented 1 year ago

Make the plots interactive using bokeh.

szhan commented 1 year ago

Initially, I ran into a websocket error. Setting the environment variable as follows solved the problem for me.

import os
os.environ["BOKEH_ALLOW_WS_ORIGIN"] = '0aaf0agotd3etfja916liv2etcl4ul9j3fk8kav1m1a16m18da6b'

This fix comes from the thread at https://github.com/bokeh/bokeh/issues/8096#issuecomment-406815954.

szhan commented 1 year ago

Managed to show and interact with a sample path. Next steps are to show (1) sites where discrepancies between imputed genotypes and true genotypes occur and (2) locations of chip-like markers.

[Screenshot, 2023-06-24]

szhan commented 1 year ago

Another developmental version.

https://github.com/szhan/tsimpute/assets/5580375/d5dd8f0b-89c8-42ea-8788-d83b6edcf9cc

szhan commented 1 year ago

Extend it to use information in a reference panel tree sequence, e.g., relative node ages and sample status.

szhan commented 1 year ago

This differentiates sample nodes (black squares) and non-sample nodes (grey circles) in the copying path. I should modify them to be consistent with the tree displays in tskit.

[Screenshot, 2023-06-26]

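
One possible way to separate the two kinds of nodes before plotting is sketched below. It assumes the node flags have been pulled out of the reference panel tree sequence (with tskit, `ts.tables.nodes.flags`); `split_path_by_sample_status` is a hypothetical name, and the constant mirrors the value of `tskit.NODE_IS_SAMPLE` so that the sketch runs without tskit installed.

```python
NODE_IS_SAMPLE = 1  # Same value as tskit.NODE_IS_SAMPLE.


def split_path_by_sample_status(path, site_pos, node_flags):
    """Split a copying path into sample and non-sample points.

    node_flags[u] is assumed to hold the flags of node u. Returns two lists
    of (position, node_id) pairs, which can be drawn with different markers.
    """
    sample_points = []
    non_sample_points = []
    for pos, node in zip(site_pos, path):
        if node_flags[node] & NODE_IS_SAMPLE:
            sample_points.append((pos, node))
        else:
            non_sample_points.append((pos, node))
    return sample_points, non_sample_points


# Toy data (made up): nodes 0 and 1 are samples, nodes 2 and 3 are not.
node_flags = [1, 1, 0, 0]
site_pos = [10, 20, 30, 40]
path = [0, 2, 2, 1]
sample_points, non_sample_points = split_path_by_sample_status(
    path, site_pos, node_flags
)
print(sample_points, non_sample_points)
```
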
szhan commented 1 year ago

I'm not sure whether it is a good idea to order parent nodes by time rather than by id, so I'm thinking of adding node times to the tooltips.
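
For reference, ordering by time could look roughly like the sketch below, which maps each node id to its rank by node time so that the rank can be used as the y coordinate while the id and time go in the tooltip. `rank_nodes_by_time` is a hypothetical helper; with tskit the times would come from something like `ts.tables.nodes.time`.

```python
def rank_nodes_by_time(node_times):
    """Return a list where entry u is the rank of node u when sorted by time.

    Ties in time are broken by node id, so the ordering is deterministic.
    """
    order = sorted(range(len(node_times)), key=lambda u: (node_times[u], u))
    ranks = [0] * len(node_times)
    for rank, node in enumerate(order):
        ranks[node] = rank
    return ranks


# Toy node times (made up): node 1 is youngest, node 0 is oldest.
node_times = [2.0, 0.0, 1.0]
ranks = rank_nodes_by_time(node_times)
path = [0, 0, 2, 1]
# y coordinates by time rank, with (id, time) kept for the tooltips.
y_values = [ranks[u] for u in path]
tooltip_info = [(u, node_times[u]) for u in path]
print(y_values, tooltip_info)
```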

szhan commented 1 year ago

I think the following additional tracks could be useful for examining copying paths:

The simplest way to visualise all these tracks is to add them as separate plots below the main plot, as is done above. But is there a more elegant way to do this?

szhan commented 1 year ago

Another dev version. The tracks can be muted via an interactive legend.

[Screenshot, 2023-06-26]

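
A minimal sketch of this kind of plot, assuming the legend behaviour corresponds to Bokeh's `click_policy = "mute"` and a reasonably recent Bokeh version. The function name `make_path_figure` and the `(label, positions, color)` track format are assumptions made for illustration, not the code used for the screenshot.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure


def make_path_figure(site_pos, path, tracks):
    """Build a Bokeh figure of a copying path with tracks that can be muted.

    tracks is a hypothetical list of (label, positions, color) tuples.
    """
    p = figure(
        width=900,
        height=300,
        x_axis_label="Genomic position",
        y_axis_label="Index of sample",
        tools="xpan,xwheel_zoom,reset",
    )
    source = ColumnDataSource(data={"pos": list(site_pos), "node": list(path)})
    line = p.step(
        x="pos", y="node", source=source, mode="after", legend_label="Copying path"
    )
    # Show position and node id when hovering over the path.
    p.add_tools(
        HoverTool(renderers=[line], tooltips=[("position", "@pos"), ("node", "@node")])
    )
    # Each track is a row of markers below the path, faded when muted.
    for i, (label, positions, color) in enumerate(tracks):
        p.scatter(
            list(positions),
            [-(i + 1)] * len(positions),
            marker="square",
            color=color,
            muted_alpha=0.1,
            legend_label=label,
        )
    # Clicking a legend entry mutes the corresponding glyphs.
    p.legend.click_policy = "mute"
    return p


p = make_path_figure([10, 20, 30], [5, 5, 7], [("Chip-like sites", [10, 30], "blue")])
```

Calling `bokeh.plotting.show(p)` would then render the figure.
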
szhan commented 1 year ago

I'm thinking of adding a companion plot showing all the samples with respect to their properties, such as the number of wrongly imputed alleles and the number of switches in their copying paths. This plot would interact with the above plot, allowing users to select a sample in the companion plot and display its path in the above plot.
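
The per-sample properties for such a companion plot could be computed along these lines. This is a sketch only: `summarise_samples` is a hypothetical helper and the inputs are assumed to be per-sample sequences with one entry per site.

```python
def summarise_samples(paths, imputed, truth):
    """Return one summary dict per sample.

    paths[i]:   copying path (parent node id per site) of sample i.
    imputed[i]: imputed allele per site of sample i.
    truth[i]:   true allele per site of sample i.
    """
    summaries = []
    for i, (path, imp, tru) in enumerate(zip(paths, imputed, truth)):
        # A switch is a change of parent node between adjacent sites.
        num_switches = sum(1 for a, b in zip(path, path[1:]) if a != b)
        num_wrong = sum(1 for a, b in zip(imp, tru) if a != b)
        summaries.append(
            {"sample": i, "num_switches": num_switches, "num_wrong": num_wrong}
        )
    return summaries


# Toy data (made up) for two samples over four sites.
paths = [[3, 3, 4, 4], [2, 5, 2, 5]]
imputed = [[0, 1, 1, 0], [1, 1, 0, 0]]
truth = [[0, 1, 0, 0], [1, 1, 0, 0]]
summaries = summarise_samples(paths, imputed, truth)
print(summaries)
```

Each dict corresponds to one point in the companion plot, and the `sample` field identifies which path to display when that point is selected.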

szhan commented 1 year ago

Another idea is overlaying a sample path on top of the forward probability matrix, which is represented as a heatmap. This may not be useful when the number of nodes in the tree sequence is large, because parent nodes with similar likelihood values are not necessarily clustered together by node id. See a prototype below.

[Screenshot, 2023-06-27]
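
To make the idea concrete, here is a simplified sketch of a normalised forward matrix for a haploid Li and Stephens model with a single recombination probability and a single mismatch probability. It is not the lshmm implementation, it treats the panel as plain haplotypes rather than tree sequence nodes, and the function name and parameter values are assumptions.

```python
def forward_matrix(panel, query, recomb=0.1, mismatch=0.01):
    """Normalised forward probabilities for a simplified haploid LS HMM.

    panel[j][l] is the allele of reference haplotype j at site l, and
    query[l] is the query allele. Returns one row per site and one column
    per haplotype, with each row summing to 1 (assumes mismatch > 0).
    """
    n = len(panel)
    matrix = []
    prev = None
    for l in range(len(query)):
        row = []
        for j in range(n):
            emission = 1 - mismatch if panel[j][l] == query[l] else mismatch
            if prev is None:
                prior = 1 / n  # Uniform prior over haplotypes at the first site.
            else:
                # Stay on haplotype j, or recombine to a uniformly chosen one.
                prior = (1 - recomb) * prev[j] + recomb / n
            row.append(emission * prior)
        total = sum(row)
        prev = [x / total for x in row]
        matrix.append(prev)
    return matrix


# Toy panel (made up): the query is identical to haplotype 0.
panel = [[0, 0, 1], [1, 1, 0]]
query = [0, 0, 1]
F = forward_matrix(panel, query)
# Most likely haplotype per site, i.e. the column a path overlay would follow.
best = [row.index(max(row)) for row in F]
print(best)
```

The matrix could then be drawn as a heatmap and the sample path overlaid as the cell `(l, path[l])` at each site `l`.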

szhan commented 1 year ago

Accidentally closed this issue.