tskit-dev / tskit

Population-scale genomics
MIT License
151 stars 71 forks source link

Function to delete all mutations at a site but leave the site in the table #1034

Open hyanwong opened 3 years ago

hyanwong commented 3 years ago

With @awohns , we find that we need a function like ts.delete sites but which only deletes the mutations at the site, and leaves the site table untouched. This is basically just a slimmed down version of delete_sites, that looks something like the code below.

For the moment we will include the function below in tsdate, but it would be nice to get this functionality into tskit, I think. Perhaps it could be done with an extra parameter to the existing delete_sites function, or perhaps for didactic reasons we should deprecate delete_sites and introduce delete_mutations(site_ids=XXX, delete_site=True) which does the same as the current delete_sites function? If the latter, I guess in the longer term we could add functionality to delete a specific list of mutations, which need not be all of them at a site, but that's a bit complex and can be kicked down the road.

    def delete_site_mutations(self, site_ids, record_provenance=True):
        """
        Remove the mutations at the specified sites entirely from the mutations table in
        this collection.

        :param list[int] site_ids: A list of site IDs specifying the sites whose
            mutations will be removed.
        :param bool record_provenance: If ``True``, add details of this operation
            to the provenance table in this TableCollection. (Default: ``True``).
        """
        keep_sites = np.ones(len(self.sites), dtype=bool)
        site_ids = util.safe_np_int_cast(site_ids, np.int32)
        if np.any(site_ids < 0) or np.any(site_ids >= len(self.sites)):
            raise ValueError("Site ID out of bounds")
        keep_sites[site_ids] = 0
        keep_mutations = keep_sites[self.mutations.site]
        new_ds, new_ds_offset = keep_with_offset(
            keep_mutations,
            self.mutations.derived_state,
            self.mutations.derived_state_offset,
        )
        new_md, new_md_offset = keep_with_offset(
            keep_mutations, self.mutations.metadata, self.mutations.metadata_offset
        )
        # Mutation numbers will change, so the parent references need altering
        mutation_map = np.cumsum(keep_mutations, dtype=self.mutations.parent.dtype) - 1
        # Map parent == -1 to -1, and check this has worked (assumes tskit.NULL == -1)
        mutation_map = np.append(mutation_map, -1).astype(self.mutations.parent.dtype)
        assert mutation_map[tskit.NULL] == tskit.NULL
        self.mutations.set_columns(
            site=self.mutations.site[keep_mutations],
            node=self.mutations.node[keep_mutations],
            time=self.mutations.time[keep_mutations],
            derived_state=new_ds,
            derived_state_offset=new_ds_offset,
            parent=mutation_map[self.mutations.parent[keep_mutations]],
            metadata=new_md,
            metadata_offset=new_md_offset,
        )
        if record_provenance:
            # TODO replace with a version of https://github.com/tskit-dev/tskit/pull/243
            parameters = {"command": "delete_site_mutations", "TODO": "add parameters"}
            self.provenances.add_row(
                record=json.dumps(provenance.get_provenance_dict(parameters))
            )
petrelharp commented 3 years ago

How about just ts.delete_mutations( ), for greater symmetry with existing methods? Then you could do

tables.delete_mutations(np.isin(tables.mutations.site, site_ids))

This would be more flexible as well.

jeromekelleher commented 3 years ago

ps @hyanwong - I edited your comment above so that the code is rendered with Python highlighting - you just need to put "python" at the end of the three backticks.

hyanwong commented 3 years ago

How about just ts.delete_mutations( ), for greater symmetry with existing methods? Then you could do

tables.delete_mutations(np.isin(tables.mutations.site, site_ids))

That would be great, but we'd have to figure out how to deal with deleting only some mutations at a site. This might actually be impossible (e.g. because we end up having two sequential mutations that change to the same state). I guess we can zap the mutation parents and re-make them, and then if we fail when remaking, so be it - we just warn the user in the docs that it isn't always possible?

petrelharp commented 3 years ago

I guess we can zap the mutation parents and re-make them, and then if we fail when remaking, so be it - we just warn the user in the docs that it isn't always possible?

Oh, right. Yes, I like this suggestion, and we could even ask the user to open an issue if it's something they actually need to do.

jeromekelleher commented 3 years ago

Sounds like a plan. We should have a quick check that the mutation parents have actually been set first, though, as we could fail needlessly on tree sequence requirements on a set of tables that has never actually computed the parents in the first place. Easy enough to check not(np.all(tables.mutations.parent == -1))

jeromekelleher commented 1 year ago

I've just been working on an algorithm that did some shuffling of mutations around the tree by adding some new rows to the table and deleting a bunch of other rows. Deleting mutations is far too hard at the moment.

I tried using variants of @hyanwong's code above couldn't get it to work, and ended up assigning times to all mutations so that they would sort in the correct order and I could then call compute_mutation_parents. Note the assumption that mutations are usually added in topological order is broken by the pattern of adding and deleting rows like this.

I think a general delete_rows(numpy row spec) method for the tables would be quite useful, especially if it could be done in place (although this would probably be tricky). In the case of the mutation table, there is a well-defined before and after mapping for each mutation, so that should be OK. The best thing to do would be to raise an error if someone tries to delete a mutation that has one or more references to it.

Other tables should be easy enough.