hyanwong / giglib

MIT License

Account for chromosome IDs when tracing intervals #103

Closed (hyanwong closed 4 months ago)

hyanwong commented 6 months ago

A relatively large task is to update the sample-resolving and MRCA-finding algorithms to account for different chromosomes (see #11 for the approach). At the moment we assume that an interval below a node, e.g. 0..100, can be intersected with an interval above the node, e.g. 50..200. This is only true if both intervals are on the same chromosome. We therefore need to keep a collection of chromosome intervals (Portion objects) on the intervals stack, rather than a single Portion object per node. We should probably store this as a dictionary (keyed by numerical chromosome ID) rather than a list, because we can't guarantee that the chromosomes for a given node will be numbered 0..N.
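As a rough sketch of the dictionary idea (not giglib's actual code): plain `(left, right)` tuples stand in for Portion objects here, and the names `intersect_lists` and `intersect_by_chromosome` are made up for illustration. The point is only that intervals are intersected chromosome by chromosome, so an interval on one chromosome can never overlap one on another.

```python
from typing import Dict, List, Tuple

# Assumption: half-open [left, right) tuples stand in for Portion objects.
Interval = Tuple[int, int]
# Keyed by numerical chromosome ID, which need not run from 0..N.
ChromIntervals = Dict[int, List[Interval]]


def intersect_lists(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of non-overlapping half-open intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        left = max(a[i][0], b[j][0])
        right = min(a[i][1], b[j][1])
        if left < right:  # keep only non-empty overlaps
            out.append((left, right))
        # Advance whichever interval finishes first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_by_chromosome(below: ChromIntervals, above: ChromIntervals) -> ChromIntervals:
    """Intersect intervals only where both sides share a chromosome ID."""
    result = {}
    for chrom in below.keys() & above.keys():
        overlap = intersect_lists(below[chrom], above[chrom])
        if overlap:
            result[chrom] = overlap
    return result
```

With this layout, `{0: [(0, 100)]}` intersected with `{0: [(50, 200)]}` keeps the region from 50 to 100, whereas the same coordinates on chromosomes 0 and 3 give an empty result.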

However, for the time being, we can raise an error if we have any chromosome numbers other than the default.
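The interim guard could look something like the following. This is a sketch only: `check_single_chromosome` and `DEFAULT_CHROMOSOME` are hypothetical names, and the default value of 0 is taken from the later comments in this thread rather than from the giglib source.

```python
from typing import Iterable

# Assumption: 0 is the default chromosome ID (as settled on later in this thread).
DEFAULT_CHROMOSOME = 0


def check_single_chromosome(chromosomes: Iterable[int], default: int = DEFAULT_CHROMOSOME) -> None:
    """Raise if any chromosome ID differs from the default (hypothetical helper)."""
    unsupported = sorted({c for c in chromosomes if c != default})
    if unsupported:
        raise NotImplementedError(
            f"Tracing intervals across chromosomes is not yet supported; "
            f"found non-default chromosome IDs {unsupported}"
        )
```

Calling this at the start of the sample-resolving and MRCA-finding routines would make the limitation explicit instead of silently intersecting intervals from different chromosomes.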

Additionally, I think it would be neater to default to a chromosome of 0 (not -1). After all, even if we don't specify a chromosome, we are assuming there is one. I suppose -1 (or some other negative number) could be reserved for circular chromosomes.

hyanwong commented 6 months ago

Default now set to 0 (I guess most species number from 1 anyway, so 0 is fine for indicating it's not been explicitly set).

We should check that specifying chromosome=1 (without defining a chromosome 0) works: no reason it shouldn't.

However, the sample-resolving and related algorithms will currently break if there are different chromosomes.

hyanwong commented 6 months ago

It should be possible to set the chromosome numbers to arbitrary values for different individuals, and the recombination-breakpoint-finding routines should "just work". This would be a good unit-test.

Details for chromosome metadata should, I think, be stored in the node metadata, when we implement it. There is no point having a chromosome table, as the identity of chromosomes can differ between individuals.

hyanwong commented 6 months ago

We will need to change the format of the MRCAdict to allow different intervals to exist on different chromosomes.
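One possible shape for this, purely as an illustration: a nested mapping from MRCA node ID to chromosome ID to a list of intervals. The layout and the helper names `add_mrca_interval` and `mrca_intervals` are assumptions for the sake of the example, not the existing MRCAdict format or whatever was eventually implemented.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [left, right), standing in for a Portion object
# Hypothetical layout: MRCA node ID -> chromosome ID -> intervals on that chromosome
ChromMRCAs = Dict[int, Dict[int, List[Interval]]]


def add_mrca_interval(mrcas: ChromMRCAs, mrca_node: int, chromosome: int,
                      left: int, right: int) -> None:
    """Record that [left, right) on `chromosome` coalesces in `mrca_node`."""
    mrcas.setdefault(mrca_node, {}).setdefault(chromosome, []).append((left, right))


def mrca_intervals(mrcas: ChromMRCAs, mrca_node: int, chromosome: int) -> List[Interval]:
    """Return a copy of the intervals for one node and chromosome (empty if none)."""
    return list(mrcas.get(mrca_node, {}).get(chromosome, []))
```

Keying the inner dictionary by chromosome ID keeps identical coordinates on different chromosomes separate, and it does not require the IDs to be contiguous.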

hyanwong commented 4 months ago

Done in #112