popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0
121 stars 86 forks source link

Recombination map metadata - which reference build? #422

Open hyanwong opened 4 years ago

hyanwong commented 4 years ago

I can access the recombination map for a species (e.g. HomSap) using e.g.

species = stdpopsim.get_species("HomSap")
contig = species.get_contig("chr22")
map = contig.recombination_map

But this map is presumably tied to a particular human genome reference build (I see that for HomSap there are two possibilities: https://stdpopsim.readthedocs.io/en/latest/catalog.html#sec_catalog_HomSap_genetic_maps). Might it be possible to add metadata to the recombination map so that it is at least possible to find out which build it is from?

grahamgower commented 4 years ago

By default you'll get a uniform recombination map, with rate corresponding to the chromosome in question. To get one of the recombination maps you've linked to, one needs to specifically request it with the genetic_map argument to get_contig().

But I admit, your point still remains. The recombination rate for a uniform map came from somewhere, and there is currently no way to inspect this. (FYI for HomSap the mean recombination rates for each chromosome were calculated from the GRCh37 map).

hyanwong commented 4 years ago

Ah, I didn't realise that the default map was uniform. I think that probably backs up my point that it would be good to be able to inspect the maps for provenance of some description. Actually, I see that information is present at a higher level (not in the msprime.simulations.RecombinationMap object, but in the stdpopsim.genetic_maps.GeneticMap object which contains the maps). However, I had to work out how to get a genetic map by trial and error: it's not obvious. Perhaps this could be documented somewhere?

>>> map = stdpopsim.get_species("HomSap").get_genetic_map(id="HapMapII_GRCh37")
>>> map.download()
>>> print(map.long_description)
        This genetic map is from the Phase II Hapmap project
        and based on 3.1 million genotyped SNPs
        from 270 individuals across four populations (YRI, CEU, CHB and JPT).
        Genome wide recombination rates were estimated using LDHat.
        This version of the HapMap genetic map was lifted over to GRCh37
        (and adjusted in regions where the genome assembly had rearranged)
        for use in the 1000 Genomes project. Please see the README file on
        the 1000 Genomes download site for details of these adjustments.
>>> map.get_chromosome_map("chr20")  # get one of the chromosomes - how do I list the possibilities?
hyanwong commented 4 years ago

Also, I can't work out how to programmatically list the genetic maps available for a species. It looks like stdpopsim.get_species("HomSap").genetic_maps is an array, whereas I would expect it to be a Python dict, whose keys are the genetic map IDs. Failing that, something like stdpopsim.get_species("HomSap").list_genetic_map_ids() would be useful.

grahamgower commented 4 years ago

The Species.genetic_maps appears to be missing a docstring, I've started a new issue for that. There is also a CLI option to list the available maps, eg. stdpopsim HomSap --help-genetic-maps . Probably a dictionary also makes sense for this as you've suggested, but we aren't overflowing with large numbers of genetic maps for any species (I think HomSap is the only species with more than one).

grahamgower commented 4 years ago

Oh, and here's an id list.

>>> import stdpopsim
>>> humans = stdpopsim.get_species("HomSap")
>>> [gm.id for gm in humans.genetic_maps]
['HapMapII_GRCh37', 'DeCodeSexAveraged_GRCh36']
jeromekelleher commented 4 years ago

See also #385 - we want to improve the linkage of species definitions with upstream genome builds.

grahamgower commented 4 years ago

And #198.

hyanwong commented 4 years ago

Oh, and here's an id list.

>>> import stdpopsim
>>> humans = stdpopsim.get_species("HomSap")
>>> [gm.id for gm in humans.genetic_maps]
['HapMapII_GRCh37', 'DeCodeSexAveraged_GRCh36']

Ah, it's in the id property! I hadn't thought of that. Sorry.