tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0

Recombination map (read_hapmap) behaviour if first position is not 0 #1255

Closed: hyanwong closed this issue 3 years ago

hyanwong commented 3 years ago

@awohns and I have a problem because the recombination map for human genome build 38, lifted over from build 37, contains the following first few lines for chromosome 21:

chrom   pos     recomb_rate     pos_cm
chr21   10326676        0       0.584144
chr21   12968320        0.6288  0.584144
chr21   12970435        0.6113  0.585474

And we have a good number of variants (about 5000) before position 10326676. According to the docs, "If the first starting position is not equal to zero, then a zero-recombination region is inserted at the start of the chromosome." That means the first 5000 sites are in a zero-recombination region.

BUT - we know that they aren't, because the pos_cm column starts at 0.584144. So I think a zero-recombination region should only be inserted if the fourth column (pos_cm) starts at 0. Otherwise we should insert a region with the appropriate rate, so that the cumulative pos_cm value at the first listed position comes out correct.
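For the chr21 excerpt above, for instance, the inserted region would need a rate of roughly 0.584144 cM / 10.33 Mb ≈ 0.057 cM/Mb, so that the cumulative map reaches 0.584144 cM at position 10326676 (back-of-the-envelope numbers, just for illustration).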

I think @petrelharp commented about recombination maps in another issue, so perhaps he has an opinion here. I would rather like to change the current behaviour, as it would mean less bespoke code to deal with this situation in our upcoming paper, and less need to explain exactly what is going on.

jeromekelleher commented 3 years ago

This is just the file being wrong, though, isn't it? I.e., the recomb_rate and pos_cm columns disagree with each other.

hyanwong commented 3 years ago

I don't think it's quite a "disagreement" @jeromekelleher. The two could be in perfect agreement, but simply with the first "rate" thought of as missing. Previously we have assumed that the first position is at pos_cm=0, but I guess it needn't be, as in this case.

This is present in maps "out in the wild", so either we work around it before calling read_hapmap, by pre-processing the file to add an extra row at the start, or we deal with it when actually reading in a hapmap file. Since this isn't strictly an inconsistency, and others may encounter it too, my preference would be for the latter. It shouldn't change previous behaviour much, because nearly all the standard hapmap files have a first line that states pos_cm=0.

jeromekelleher commented 3 years ago

Right, I see, this is basically a missing line at the start of the file. We should probably try to fix it so I guess.

hyanwong commented 3 years ago

> We should probably try to fix it so I guess.

Do you mean fix it in the hapmap file, or add some extra logic to read_hapmap? I'm happy to do the latter, as I'm in the middle of messing with the intervals.py file so that we can use a copy of it in tsinfer for the recombination_rate param.

jeromekelleher commented 3 years ago

Yes, read_hapmap should be aware of this common problem and at least offer an option to fix it. I guess this means we'll need to read in the pos_cm column which we haven't been doing before. Perhaps we should check that this is consistent with the rate column as well?

Possibly this is the underlying issue that meant we ended up adding the map_start parameter and we could fix this by using the pos_cm column rather than the rate one.

jeromekelleher commented 3 years ago

Given we're doing a 1.0, we could also change to reading the centimorgans column instead of the rate one, if this is the one that actually contains the correct data.

hyanwong commented 3 years ago

That's not a bad idea, because I suspect using the rate column suffers from compounding precision errors, so that at the moment, the cumulative rate may not equal the pos_cm value by the end of the genome.

One slight fly in the ointment is that, at the moment, tsinfer requires us to replace all the rates of 0 with epsilon, as we can't do matching if there is a zero recombination rate between variants. I guess we should simply fix that in tsinfer, but I don't know if I have time to do that at the moment, so having a workaround would be useful.

jeromekelleher commented 3 years ago

> One slight fly in the ointment is that, at the moment, tsinfer requires us to replace all the rates of 0 with epsilon, as we can't do matching if there is a zero recombination rate between variants. I guess we should simply fix that in tsinfer, but I don't know if I have time to do that at the moment, so having a workaround would be useful.

I think that's just something we need to work around in our own code by modifying the map arrays. It's not something we need to complicate msprime with.

hyanwong commented 3 years ago

> I think that's just something we need to work around in our own code by modifying the map arrays. It's not something we need to complicate msprime with.

Yes, agreed. I wondered how to do that after the fact, but I see we can access the position and rate values of a rate map, so we can simply create a new one from the one produced by read_hapmap by doing:

import numpy as np
import msprime

rm = read_hapmap(...)
# Replace zero rates with a small epsilon, here 1/100 of the smallest nonzero rate
min_nonzero = np.min(rm.rate[rm.rate > 0])
epsilon = min_nonzero / 100
non_zero_rates = np.where(rm.rate == 0.0, epsilon, rm.rate)
new_rm = msprime.RateMap(position=rm.position, rate=non_zero_rates, map_start=rm.map_start)

jeromekelleher commented 3 years ago

Yep, or you can just overwrite the value(s) in the array like

rm.rate[rm.rate == 0] = epsilon

The RateMap is just a pair of numpy arrays.

hyanwong commented 3 years ago

> Yep, or you can just overwrite the value(s) in the array

If we directly overwrite, then we won't update the cumulative values, so there will be a mismatch between rm.rate and rm.cumulative, which could cause ugly behaviour later.

I guess we could either freeze rm.rate, or (better) use a setter so that when it is overwritten, we recalculate rm.cumulative. I'm not sure how we would do that for overwriting only a few items in a numpy array, though.

jeromekelleher commented 3 years ago

Yes, I didn't think of that; I forgot about the cumulative array. We should probably freeze them somehow all right (can numpy arrays be marked read-only?)

hyanwong commented 3 years ago

I think you can make them unwritable, which is fine for us, if we also make self.rate read-only using a getter. I guess we should do this for position and cumulative too, and force a new RateMap to be made if any changes are required?
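
For concreteness, numpy does support this via the writeable flag; a minimal sketch of the idea (the class and attribute names here are just illustrative, not the actual msprime code):

import numpy as np

class FrozenRateMap:
    def __init__(self, position, rate):
        # position has one more entry than rate, as in msprime's RateMap
        self._position = np.array(position, dtype=float)
        self._rate = np.array(rate, dtype=float)
        # Cumulative genetic distance at each position, derived once from the rates
        self._cumulative = np.insert(
            np.cumsum(np.diff(self._position) * self._rate), 0, 0)
        for a in (self._position, self._rate, self._cumulative):
            a.flags.writeable = False  # in-place writes now raise ValueError

    @property
    def rate(self):
        # Read-only access: changing rates means building a new map,
        # so rate and cumulative can never get out of sync
        return self._rate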

hyanwong commented 3 years ago

I'm refactoring this now. I think we should do the following:

read_hapmap should use the pos_cm value, as @jeromekelleher suggests, and if the first position is > 1, then we should insert a position at 1, which is at genetic map position 0 cM. We can then calculate the rates. The read_hapmap function should also allow us to specify a sequence_length: if the last physical position in the file is less than the sequence length, we append an extra position at the end of the RateMap with the same rate as the previous entry. We then stick all these positions & rates into a RateMap.
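
Something like this (just a sketch, not a final implementation; the function name and arguments are made up, and I've left the rates in cM per bp rather than converting to per-generation probabilities):

import numpy as np

def rate_map_arrays_from_cm(pos, pos_cm, sequence_length=None):
    # pos: physical positions from the file; pos_cm: cumulative genetic map (cM)
    pos = np.asarray(pos, dtype=float)
    cm = np.asarray(pos_cm, dtype=float)
    if pos[0] > 1:
        # Map does not start at the chromosome start: add position 1 at 0 cM,
        # so the leading interval carries the rate implied by the first cM value
        pos = np.insert(pos, 0, 1)
        cm = np.insert(cm, 0, 0)
    rate = np.diff(cm) / np.diff(pos)  # cM per bp over each interval
    if sequence_length is not None and sequence_length > pos[-1]:
        # Extend to the full sequence length, repeating the final rate
        pos = np.append(pos, sequence_length)
        rate = np.append(rate, rate[-1])
    return pos, rate  # len(pos) == len(rate) + 1, as a RateMap expects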

The RateMap class should allow the first position to be 1 (or other non-zero value), and internally insert a rate of zero from 0 to the first position. When we report e.g. the mean rate over the chromosome, we only report the values over the input positions (e.g. for a hapmap file, we ignore the rate of zero between positions 0 and 1).
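
And for the mean-rate reporting, roughly this (again, just an illustrative sketch with made-up names):

import numpy as np

def mean_rate_over_mapped_region(position, rate, first_mapped_pos):
    # Span-weighted mean of the rates, ignoring the padded interval before
    # the first position that actually appeared in the input map
    spans = np.diff(position)
    keep = position[:-1] >= first_mapped_pos
    return np.sum(rate[keep] * spans[keep]) / np.sum(spans[keep])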

This sounds a little fiddly, but I think it is the most reasonable interpretation of the genetic maps we might want to use. The other way to do this is to require a RateMap to start at 0, and have some sort of parameter on initialization, like we do at the moment, that says to ignore the first position+rate when reporting means etc. Comments?

petrelharp commented 3 years ago

What a mess! Thanks for sorting this out.

> the recombination map for human genome build 38, lifted over from build 37, contains the following first few lines for chromosome 21

The original recombination map file didn't have this problem (right?) - so, the problem was introduced while lifting over. I think that, to make sure the solution here is right, you should look at the original file and try to see what the right answer should be. In particular, I think the first position on these things is never 1, so you're actually making a bigger region of lower recombination rate than we really should have. (But then, I'm not sure what the right answer is besides "do the lift over better"?)

> When we report e.g. the mean rate over the chromosome, we only report the values over the input positions (e.g. for a hapmap file, we ignore the rate of zero between positions 0 and 1).

What do you mean here, about "when we report" - like, this is in the output of which methods? It kinda sounds like you're wanting to implement a mask attribute, saying "ignore these bits of the chromosome" (which would be really useful, but a whole different thing...).

jradrion commented 3 years ago

The "original" (uploaded to AWS) human HapMap files for GRCh37 have nonzero starting positions. My liftOver files for hg38 also have nonzero start positions, and they do not include the Map(cM) column. I'm a little unclear where the file in the first post came from?

awohns commented 3 years ago

The file was downloaded from http://csg.sph.umich.edu/locuszoom/download/recomb-hg38.tar.gz

There's a README in the download. The LocusZoom people used the Beagle liftover of the build GRCh37 genetic map to GRCh38 (which has the genetic map positions and can be found here), and then calculated a recombination rate from genetic map positions.

hyanwong commented 3 years ago

The "original" (uploaded to AWS) human HapMap files for GRCh37 have nonzero starting positions. My liftOver files for hg38 also have nonzero start positions, and they do not include the Map(cM) column. I'm a little unclear where the file in the first post came from?

Oh, that's interesting @jradrion. I had assumed that the Map(cM) column was the canonical one, from which the rates had been back-calculated. I presumed that it was relatively easy to calculate large genetic distances between different sites by conventional means, and then refine them to get more local estimates of recombination rates. But perhaps I'm wrong here (in which case, how does one ensure that the local rates add up to the "standard" linkage map distances of old?).

I also thought that using the Map(cM) value would mean that we wouldn't accumulate rounding error in the cumulative total (i.e. the final genetic distance) by the end of the chromosome, although I suspect any accumulated error like this will be minor compared to the precision of the estimates.

Edit - amusingly I see that in the readme in https://ftp.hapmap.org/hapmap/recombination/2008-03_rel22_B36/ it says "Contact mcvean@stats.ox.ac.uk for queries". So we could ask Gil :)

jeromekelleher commented 3 years ago

Perhaps we should look at the source for LDHat? I'm assuming that's where most of these maps came from originally.

hyanwong commented 3 years ago

> What a mess! Thanks for sorting this out.

No problem

> the recombination map for human genome build 38, lifted over from build 37, contains the following first few lines for chromosome 21

> The original recombination map file didn't have this problem (right?) - so, the problem was introduced while lifting over.

Yes, but it is not (IMO) unreasonable to start the first row at a position of e.g. 0.05 cM. It is a meaningful thing, which has no alternative explanation.

> When we report e.g. the mean rate over the chromosome, we only report the values over the input positions (e.g. for a hapmap file, we ignore the rate of zero between positions 0 and 1).

> What do you mean here, about "when we report" - like, this is in the output of which methods?

Sorry, I mean when we call RateMap.mean_rate. But it's a very minor point, TBH.

jradrion commented 3 years ago

@hyanwong @jeromekelleher

> Perhaps we should look at the source for LDHat? I'm assuming that's where most of these maps came from originally.

Yes, the HapMap maps were originally generated using LDHat.

> I had assumed that the Map(cM) column was the canonical one, from which the rates had been back-calculated.

I think the Map(cM) column must have been calculated from the rate column, not the other way around, but I'm not 100% sure. Asking Gil seems like the best way to get a definitive answer.

jeromekelleher commented 3 years ago

"Contact mcvean@stats.ox.ac.uk for queries". So we could ask Gil :)

@hyanwong - I think you've volunteered yourself for this one!

hyanwong commented 3 years ago

I have just asked Gil, and, although the derivation was complicated (e.g. the instantaneous rate was calculated first, then the cumulative values rescaled to fit known cM values), he agrees with me that the Map(cM) column should be the one to treat as the primary data source. He wasn't terribly surprised that the start position wasn't at 0 cM in some maps. I forgot to ask whether a zero-recombination section is sensible at the end of the chromosome (between the last position in the file and the total sequence_length).

hyanwong commented 3 years ago

Also, I discussed this with @grahamgower and we have a sensible route forward for stdpopsim, which involves using the legacy msprime.RecombinationMap.read_hapmap code to read maps where the cM column is missing (as it is in some of the current stdpopsim maps), while requiring the new RateMap.read_hapmap() function to use the cM column, which should allow for a first line with a non-zero cM position.
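
So, roughly, the stdpopsim-side logic would look something like this (a sketch only; the has_cm_column check is stdpopsim's to decide, and the exact signatures may differ):

import msprime

def load_genetic_map(path, has_cm_column):
    if has_cm_column:
        # New-style reader: treats the cM column as the primary data,
        # so a non-zero starting cM value is handled
        return msprime.RateMap.read_hapmap(path)
    # Legacy reader: uses the rate column only (for maps lacking the cM column)
    return msprime.RecombinationMap.read_hapmap(path)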

jeromekelleher commented 3 years ago

Can this be closed @hyanwong?

hyanwong commented 3 years ago

Yep - thanks for reminding me.