ivan-krukov / aligning-genealogies

The genealogy-coalescent alignment project
3 stars 0 forks source link

Matching IDs in genealogy and the tree sequence #11

Closed ivan-krukov closed 4 years ago

ivan-krukov commented 4 years ago

msprime uses the minimal integers to label the nodes. We use the IDs provided in the genealogy to establish node identity. To start the alignment, we need to match our IDs with msprime IDs

shz9 commented 4 years ago

Something else related to this point: The tree sequences currently have haplotypes, so if we have 4 probands in the pedigree, the tree sequence will have 8 leaves. Not sure how to proceed with this...

LukeAndersonTrocme commented 4 years ago

Not sure if this is helpful, but when working with the TS of Genizon data, I just had a dictionary of node ID pointing to sample IDs (where two node IDs point to the same sample IDs) For slicing a tree or accessing samples from the tree, I flipped the dictionary to have sample IDs pointing to node IDs.

shz9 commented 4 years ago

How did you generate the dictionary, @LukeAndersonTrocme?

LukeAndersonTrocme commented 4 years ago
#set dictionary for each tree
node_keys = {}

#loop through each chrom
for c in chroms:
    #set dict for this chrom
    node_key = {}
    id_key = defaultdict(list)

    #loop through each node
    for n in chroms[c].nodes():
        #make sure it's a sample
        if n.is_sample():
            #get individual index
            ind_index = n.individual
            #get individual ID (VCF NAME)
            ind_meta = chroms[c].individual(ind_index).metadata
            ind_id = json.loads(ind_meta.decode('utf-8'))['individual_id']
            #add individual ID to node index
            node_key[n.id] = ind_id             
    node_keys[c] = node_key

    #FLIP THE DICT for each ID, add node index
    for key, value in node_key.items():
        id_key[value].append(key)        
    id_keys[c] = id_key
LukeAndersonTrocme commented 4 years ago

It's a bit of a hack but it works. Let me know if you need more context to this, for the metadata I had to add this myself, wasn't too bad, just modified the convert.py file from somewhere in the tskit docs.

shz9 commented 4 years ago

Thanks to what Luke pointed out about metadata, I was able to solve this problem in the following way:

ts_nodes_to_ped_map = {}
for n in sim.nodes():
    if n.individual != -1:
        ind_info = sim.individual(n.individual)
        ts_nodes_to_ped_map[n.id] = int(ind_info.metadata.decode())

I updated the Pedigree class to return this mapping along with the simulations themselves (here). I also updated the method to allow users to optionally convert from TreeSequence to Traversal objects. Could be useful for plotting and other functionality.

Not sure this is the most appropriate place for those additions, but it should work for now.