popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0
125 stars 87 forks source link

How does one verify population of simulated sample ? #1438

Closed sidAvad closed 4 months ago

sidAvad commented 1 year ago

So I ran a coalescent simulation of the OutOfAfrica model using the following

python -m stdpopsim --msprime-model dtwf --msprime-change-model 10 hudson \ HomSap -g HapMapII_GRCh38 -c $chrom -o ${SOURCE_DIR}/OutOfAfrica_4J17_${chrom}.ts -d OutOfAfrica_4J17 YRI:10000 CEU:10000 Followed by tskit vcf ${SOURCE_DIR}/OutOfAfrica_4J17_${chrom}.ts > ${SOURCE_DIR}/OutOfAfrica_4J17_${chrom}.vcf

The problem is that the vcf output has sample headings like tsk_1 , tsk_2 etc. I'm assuming that the first 10,000 ids are from the YRI population and the next 10,000 ids are from the CEU population , but is there any way to verify this ? Some way to ensure that the sample ids match population for instance ?

nspope commented 1 year ago

You can look at the population ids/names in the tree sequence, using tskit (python) --

import tskit
ts = tskit.load("foo.trees")
pop_id_to_name = {pop.id : pop.metadata["name"] for pop in ts.populations()}
print(pop_id_to_name)
# {0: 'YRI', 1: 'CEU', 2: 'CHB', 3: 'JPT'}

Then, you can map the population names onto to individuals,

indiv_pop = [pop_id_to_name[ind.population] for ind in ts.individuals()]
print(indiv_pop)
# ['YRI', 'YRI', 'YRI', ...]

Finally, you can rename the individuals appropriately when writing out to vcf,

indiv_name = [pop + str(i) for i, pop in enumerate(indiv_pop)]
print(indiv_name)
# ['YRI0', 'YRI1', 'YRI2', ...]
ts.write_vcf(open("my.vcf", "w"), individual_names=indiv_name)
sidAvad commented 1 year ago

Thank you very much !

nspope commented 4 months ago

Closing