tskit-dev / tskit

Population-scale genomics
MIT License

windowed GNN C implementation #1237

Open awohns opened 3 years ago

awohns commented 3 years ago

#683 provides a Python implementation of the windowed GNN statistic. The next steps are to modify the C implementation of genealogical_nearest_neighbours to support windows and time_windows, and to update the documentation.

I've reproduced the documentation I wrote for a windowed_genealogical_nearest_neighbours function below. Bits of this can just be spliced into the original genealogical_nearest_neighbours documentation.

A related issue is to expand the stats API to support time windows, probably as they're defined in GNN (which would remove the need for some of the documentation below), but that's a bigger undertaking.

def windowed_genealogical_nearest_neighbours(
        self,
        focal,
        reference_sets,
        windows=None,
        time_windows=None,
        span_normalise=True,
        time_normalise=True,
    ):
        """
        Returns genealogical nearest neighbour proportions partitioned
        into span-based `windows` and time-based `time_windows`.
        If neither `windows` nor `time_windows` are specified, the output is
        identical to that of :meth:`TreeSequence.genealogical_nearest_neighbours`.
        Passing an argument to `windows` or to `time_windows` increases the number of
        dimensions of the output array by 1 in each case. If `windows` is not None,
        GNN proportions are computed in windows defined by user-provided coordinates,
        as defined in the :ref:`stats API <sec_stats_windows>`. If `time_windows` is
        not None, GNN proportions are computed in user-provided time windows.
        For each focal node in each tree, the relevant time window is found based on the
        age of the most recent common ancestral node :math:`a` between the focal node
        and any other node present in the reference sets, as defined in
        :meth:`TreeSequence.genealogical_nearest_neighbours`.
        See the :ref:`statistics interface <sec_stats_interface>` section for details on
        :ref:`windows <sec_stats_windows>` and
        :ref:`span normalise <sec_stats_span_normalise>`. While `time_windows` are
        analogous to `windows`, note that `time_windows` do not need to extend to the
        age of the oldest root node. The `time_normalise` parameter is analogous to
        `span_normalise`, with normalisation constants determined by the span of
        sequence assigned to each time window.
        .. warning:: The interface for this method is preliminary and may be subject to
            backwards incompatible changes in the near future. The long-term stable
            API for this method will be consistent with other :ref:`sec_stats`.
        :param list focal: A list of :math:`n` nodes whose GNNs should be calculated.
        :param list reference_sets: A list of :math:`m` lists of node IDs.
        :param list windows: An increasing list of breakpoints between the :math:`r`
            windows to compute the statistic in.
        :param list time_windows: An increasing list of time breakpoints between the
            :math:`s` time windows to compute the statistic in.
        :param bool span_normalise: Whether to divide the result by the span of the
            window (defaults to True).
        :param bool time_normalise: Whether to divide the result by the span of the
            time window (defaults to True).
        :return: If neither `windows` nor `time_windows` are specified, the output is a
            2d array of :math:`n` by :math:`m`, where :math:`n` is the number of focal
            nodes whose GNNs are being calculated and :math:`m` is the number of
            reference sets. If only `windows` is used, the output is a 3d array of
            :math:`r` by :math:`n` by :math:`m`, where :math:`r` is the number of
            `windows`. If only `time_windows` is used, the output is a 3d array of
            :math:`s` by :math:`n` by :math:`m`, where :math:`s` is the number of
            `time_windows`. If both `windows` and `time_windows` are used, the output
            is a 4d array of :math:`r` by :math:`s` by :math:`n` by :math:`m`.
        :rtype: numpy.ndarray
        """

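To make the shape rules and the time-window lookup in the docstring concrete, here is a minimal sketch in plain Python. Both helpers (`gnn_output_shape` and `time_window_index`) are hypothetical names used only for illustration and are not part of tskit. The half-open `[left, right)` convention for time windows is an assumption, as is returning None for ages that fall outside every window.

```python
from bisect import bisect_right


def gnn_output_shape(n, m, windows=None, time_windows=None):
    """Hypothetical helper: expected shape of the windowed GNN output.

    `windows` and `time_windows` are increasing lists of breakpoints, so
    k breakpoints define k - 1 windows. Each argument that is given adds
    one leading dimension, with span windows ahead of time windows.
    """
    shape = []
    if windows is not None:
        shape.append(len(windows) - 1)
    if time_windows is not None:
        shape.append(len(time_windows) - 1)
    return tuple(shape) + (n, m)


def time_window_index(age, time_windows):
    """Hypothetical helper: index of the time window containing `age`.

    Assumes half-open windows [left, right). Returns None when `age` lies
    outside every window, which can happen because time windows do not
    need to extend to the age of the oldest root.
    """
    if age < time_windows[0] or age >= time_windows[-1]:
        return None
    return bisect_right(time_windows, age) - 1
```

For example, with 3 focal nodes, 2 reference sets, breakpoints `[0, 50, 100]` for `windows` and `[0, 10, 100, 1000]` for `time_windows`, the sketch gives a 4d shape of 2 by 3 by 3 by 2. An MRCA age of 5000 would map to no time window under these assumptions.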
benjeffery commented 3 years ago

Bumping this to the next release, let me know if it is imminent!