popgenmethods / momi2

Infer demographic history with the Moran model
GNU General Public License v3.0
47 stars 11 forks source link

Convert dict to momi.sfs object #63

Open trevorcousins opened 1 year ago

trevorcousins commented 1 year ago

If I simulate some data, I can load the SFS with sfs = momi.Sfs.load(), and then convert this to dict with sfs_dict= sfs.to_dict().

I have the SFS for some real data stored in a python dict, the format of which is exactly like the sfs_dict object above. Is it possible to turn this dict into a momi.sfs object? The momi.sfs object is required for important things such as inference.

There is the function momi.site_freq_spectrum(), which is possibly the solution. This has arguments sampled_popsand freqs_by_locus. The argument sampled_popsis clear; but I'm not sure what freqs_by_locus is?

jackkamm commented 1 year ago

You can create the Sfs object like so:

site_freq_spectrum(population_names, [sfs_dict])

Notice sfs_dict is wrapped in a list here.

This is because site_freq_spectrum expects a list of dicts as the argument. Each dict in the list corresponds to the SFS for a separate locus (e.g. a chromosome). In case you just have a single SFS for the whole genome, it is fine to pass in a list of length 1.

trevorcousins commented 1 year ago

That works great, thanks!

How would I then go about resampling, for bootstrapping, if there are no blocks.

I have the data in sfs_dict sampled into blocks as a python dict sfs_blocks_dict. Trying to convert this to a momi object with site_freq_spectrum(population_names, [sfs_blocks_dict]) of course does not work. I tried using the resample() function with sfs_dict: sfs_dict = function_to_load_my_dictionary sfs = momi.site_freq_spectrum(population_names, [sfs_dict]) resampled_sfs = sfs.resample() resampled_sfs_dict = resampled_sfs.to_dict() but resampled_sfs_dict does not seem to be any different from sfs_dict, observing with:

for key in sfs_dict.keys():
    print(sfs_dict[key])
    print(resampled_sfs_dict[key])

However, curiously, if I load the momi sfs object from file (generated from a simulation) and do the same thing: sfs = momi.Sfs.load(file_to_saved_momi_sfs) sfs_dict = sfs.to_dict() resampled_sfs = sfs.resample() resampled_sfs_dict = resampled_sfs.to_dict() Then the following values are different, as I would expect:

for key in sfs_dict.keys():
    print(sfs_dict[key])
    print(resampled_sfs_dict[key])

Why might this be? To reiterate, I just want a blocked version of my sfs_dict for bootstrapping, perhaps there is a better way.

trevorcousins commented 1 year ago

This is quite long winded but writes the sfs for each block into a new dict, each of which could be used for bootstrapping:

def list_to_tuple(z): # for some config (which is a list), convert to tuple
    zz = tuple([tuple(i) for i in z])
    return zz

num_blocks = sfs_dict['(locus,config_id,count)'][-1][0] # number of given blocks
LCC = np.array(sfs_dict['(locus,config_id,count)']) # the value of the "locus,config_id,count"
configs = sfs_dict['configs'] # list of configs
for i in range(0,num_blocks): # for every block
    newdict = {} # create new dict per block
    indices = [j for j in range(0,len(LCC)) if LCC[j][0]==i] # indices in LCC where our desired block is
    zLCC = LCC[indices,:] # get subset of LCC, for our desired block only
    for j in range(0,len(zLCC)): # save these to new dict
#         print(zLCC[j,1])
        newdict[list_to_tuple(configs[int(zLCC[j,1])])]=zLCC[j,2]
#     newdict[]

This feels particularly cumbersome, and I'm not sure this is appropriate for resampling anyway. Is there a simple explanation as to how, given a blocked SFS file, momi would infer the parameters ? Use all the data across the blocks, then use some kind of jackknife in the resampling for a bootstrap?

jackkamm commented 1 year ago

You might be interested in the SnpAlleleCounts object, which stores a list of the SNP allele counts along with chromosome & position. It has the extract_sfs() method which returns a blocked SFS which can be used for resampling. The SnpAlleleCounts can be created from a VCF via SnpAlleleCounts.read_vcf(), or directly created with the function snp_allele_counts(). Would this work for you?

jackkamm commented 1 year ago

I think I may have misread, it sounds like you already have your SFS split into blocks.

I am not clear the format of your sfs_blocks_dict. But, if it is a list of dicts (with SFS for each block), then you can pass it directly to site_freq_spectrum(population_names, sfs_blocks_dict), no need to wrap it. If it's in some other format then you will have to convert it to a list of dicts first.

If you want to extract the blocked SFS from the Sfs object, there's a couple options:

trevorcousins commented 1 year ago

Thanks for these! It helped to solved my problem. I was missing the fact that the data needs to be in blocked format for resampling to be work at all. Let me describe what happened in case anyone else is reading.

I have my data in two formats: both a regular dict sfs_dict (no blocks, keys are tuples of (num_ancestral,num_derived) for each population and values are the counts) and a blocked sfs_dict_blocks (in blocks, keys are ['sampled_pops', 'folded', 'length', 'configs', '(locus,config_id,count)']. Summing sfs_dict_blocks across blocks is equivalent to sfs_dict.

I want to 1) run inference, 2) resample the data, 3) run inference on resampled data, 4) repeat steps 2 and 3 100s of times. If load the sfs_dict (no blocks) into momi, then resample.

I was concerned because, after converting sfs_dict to a momi object, resampling did not seem to change any of the values. Clearly a necessary condition for resampling is for the data to be in blocks. Thus, as you said, I needed to convert my sfs_dict_blocks into a list of dicts, which I did with this:


def list_to_tuple(z): # for some config (which is a list), convert to tuple
    zz = tuple([tuple(i) for i in z])
    return zz

def convert_sfs_blocks_to_list(sfs_dict_blocks):
    blocked_sfs_dict = [] # initiate empty list, which will store an sfs_dict for each block
    num_blocks = sfs_dict['(locus,config_id,count)'][-1][0] # number of given blocks
    LCC = np.array(sfs_dict['(locus,config_id,count)']) # the value of the "locus,config_id,count"
    configs = sfs_dict['configs'] # list of configs
    for i in range(0,num_blocks): # for every block
        newdict = {} # create new dict per block
        indices = [j for j in range(0,len(LCC)) if LCC[j][0]==i] # indices in LCC where our desired block is
        zLCC = LCC[indices,:] # get subset of LCC, for our desired block only
        for j in range(0,len(zLCC)): # save these to new dict
             newdict[list_to_tuple(configs[int(zLCC[j,1])])]=zLCC[j,2]
        blocked_sfs_dict.append(newdict)
    return blocked_sfs_dict

after which resampling works as expected.

Thanks for the help.