pstjohn / emll

some code for linlog model simulation
GNU General Public License v2.0

priors for overexpressed enzymes #3

Open djinnome opened 5 years ago

djinnome commented 5 years ago

To ensure that measured overexpressed enzymes have a log-Laplace distribution and other measured enzymes have a log-normal distribution, we replace

e_measured = pm.Normal('log_e_measured', mu=en.values, sd=0.2,
                           shape=(n_exp, len(e_inds)))

with this:

for i, strain in enumerate(enzyme_measurements.index):
    for j, enzyme in enumerate(enzyme_measurements.columns):
        if enzyme_measurements.loc[strain,enzyme] == 1:
            e_measured[i,j] = pm.Laplace('log_e_overexpressed_{}_{}'.format(strain, enzyme),
                                        mu=1, b=0.1)
        else:
            e_measured[i,j] = pm.Normal('log_e_not_overexpressed_{}_{}'.format(strain, enzyme),
                                        mu=0, sd=0.2)

Here enzyme_measurements is a strain × enzyme dataframe in which each element is 1 if the enzyme is overexpressed in that strain and 0.5 otherwise.

Does this comport with our discussion?

djinnome commented 5 years ago

Hmm... I just realized that I was mixing unnormalized and normalized data together.

Since the e_measured tensor is relative to the reference and the enzyme_measurements dataframe is not, the code proposed above is incorrect.

There are actually 4 cases to consider:

  1. If non-reference enzyme is overexpressed and reference enzyme is overexpressed, then the distribution should be Laplace with mean 0
  2. If non-reference enzyme is overexpressed and reference enzyme is not overexpressed, then the distribution should be Laplace with mean 1
  3. If non-reference enzyme is not overexpressed and reference enzyme is overexpressed, then the distribution should be Normal with mean -1
  4. If non-reference enzyme is not overexpressed and reference enzyme is not overexpressed, then the distribution should be Normal with mean 0.
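
Since log_e_measured is a log-ratio relative to the reference, the four means follow from subtracting the reference's log-expression; a minimal sketch (hypothetical helper name, taking log-expression ≈ 1 for an overexpressed enzyme and ≈ 0 otherwise):

```python
def prior_mean(strain_overexpressed, reference_overexpressed):
    """Mean of the log-ratio prior log(e_strain) - log(e_reference),
    with log-expression ~1 when overexpressed and ~0 otherwise."""
    return int(strain_overexpressed) - int(reference_overexpressed)

# The four cases above:
print(prior_mean(True, True))    # 0  -> Laplace, mean 0
print(prior_mean(True, False))   # 1  -> Laplace, mean 1
print(prior_mean(False, True))   # -1 -> Normal, mean -1
print(prior_mean(False, False))  # 0  -> Normal, mean 0
```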

Therefore, the code should be:

for i, strain in enumerate(enzyme_measurements.index):
    for j, enzyme in enumerate(enzyme_measurements.columns):
        if enzyme_measurements.loc[strain,enzyme] == 1:
            if enzyme_measurements.loc[reference_strain, enzyme] == 1:
                e_measured[i,j] = pm.Laplace('log_e_{}_{}'.format(strain, enzyme),
                                        mu=0, b=0.1)
            else:
                e_measured[i,j] = pm.Laplace('log_e_{}_{}'.format(strain, enzyme),
                                        mu=1, b=0.1)

        else:
            if enzyme_measurements.loc[reference_strain, enzyme] == 1:
                e_measured[i,j] = pm.Normal('log_e_{}_{}'.format(strain, enzyme),
                                        mu=-1, sd=0.2)
            else:
                e_measured[i,j] = pm.Normal('log_e_{}_{}'.format(strain, enzyme),
                                        mu=0, sd=0.2)

djinnome commented 5 years ago

OK, I guess Theano doesn't support item assignment.

So this version at least compiles:

with pm.Model() as pymc_model:
    e_measured = theano.shared( np.zeros( (n_exp, len( reactions ) ) ) )
    for i, strain in enumerate(enzyme_measurements.index):
        for j, enzyme in enumerate(enzyme_measurements.columns):
            if enzyme_measurements.loc[strain,enzyme] == 1: 
                if enzyme_measurements.loc[reference_state, enzyme] == 1:
                    mu = 0
                else:
                    mu = 1
                e_measured = T.set_subtensor( e_measured[i,j], 
                                              pm.Laplace( 'log_e_{}_{}'.format( strain, enzyme ),
                                                           mu = mu, b=0.1 ) )

            else:
                if enzyme_measurements.loc[reference_state, enzyme] == 1:
                    mu = -1
                else:
                    mu = 0
                e_measured = T.set_subtensor( e_measured[i,j], 
                                              pm.Normal( 'log_e_{}_{}'.format( strain, enzyme ),
                                                          mu = mu, sd=0.2) )
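
For the record, T.set_subtensor does not modify its argument; it returns a new symbolic tensor with the entry replaced, which is why each call has to rebind e_measured. A minimal NumPy analogue of that behavior (my own sketch, not Theano's implementation):

```python
import numpy as np

def set_subtensor(tensor, index, value):
    """Pure-function analogue of T.set_subtensor: return a copy of `tensor`
    with `tensor[index]` replaced; the input is left untouched."""
    out = tensor.copy()
    out[index] = value
    return out

a = np.zeros((2, 2))
b = set_subtensor(a, (0, 1), 5.0)
# a is unchanged; only the returned array carries the update,
# which is why each T.set_subtensor call above rebinds e_measured.
```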

pstjohn commented 5 years ago

Are they overexpressed relative to the reference state? Or does the reference state contain the overexpressions?

For the contador example, I used a uniform distribution to capture the uncertainty in enzyme overexpression of the reference:

with pm.Model() as pymc_model:

    Ex_t = pm.Deterministic('Ex', initialize_elasticity(
        ll.N,  b=0.05, sd=1, alpha=None,
        m_compartments=m_compartments,
        r_compartments=r_compartments
    ))
    Ey_t = T.zeros_like(ll.Ey, dtype='float32')

    # We don't know the expression (here, underexpression vs. reference) of
    # each protein as a result of the plasmid integration. So, we use a
    # symbolic variable for each protein in the enzyme expression matrix.
    e_hat = T.ones((n_exp, nr))
    e_hat_entries = pm.Uniform('e_hat_entries', lower=0, upper=1, shape=(5,))

    for i, rxns in enumerate(ox_order):
        for rxn in rxns:
            e_hat = T.set_subtensor(
                e_hat[i, model.reactions.index(rxn)],
                e_hat_entries[rxn_index[rxn]])

    # Store the fitted e_hat in the trace object
    e_hat = pm.Deterministic('e_hat', e_hat)      

with pymc_model:
    trace_prior = pm.sample_prior_predictive()

with pymc_model:

    # Calculate steady-state concentrations and fluxes from elasticities
    xn, vn = ll.steady_state_theano(Ex_t, Ey_t, e_hat, np.ones((n_exp, ll.ny)))

    # Error distribution for observed steady-state lysine out
    lys_mean = pm.Deterministic('lys_mean', vn[:, lys_index])
    ox_result = pm.Normal('ox_result', mu=lys_mean, sd=0.01, observed=ox_results)

That worked well because I knew enzyme activity was somewhere between 0 and 1. Here I don't allow the non-changed enzymes to vary, which is probably a good first attempt. Otherwise you might have too many degrees of freedom for the amount of data you're providing.

pstjohn commented 5 years ago

If they were indeed overexpressed relative to the reference, you could also try a uniform just for the overexpressed enzymes: Uniform(1, 10), for instance, if you thought a 10x overexpression was a realistic upper bound.

djinnome commented 5 years ago

Hi Peter,

Thanks for this! Which priors would you recommend I use for the other two cases:

  1. When both non-reference and reference enzymes are overexpressed
  2. When both non-reference and reference enzymes are not overexpressed.

Sincerely,

Jeremy

pstjohn commented 5 years ago

It would probably make sense to begin by assuming the overexpressions had the same effect in all the strains.

The reference e_hat are all 1 by definition.

So for a given enzyme/strain,

  1. if it was overexpressed in the reference but not in the current strain, I would use Uniform(0, 1)
  2. if it was overexpressed in the current strain but not in the reference, I would use Uniform(1, 10)
  3. if it was overexpressed in both the reference and the current strain, or in neither, just use 1.

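
Those three cases can be sketched as a small helper (a hypothetical function name; the default upper bound follows the Uniform(1, 10) suggestion above):

```python
def suggested_prior(strain_over, reference_over, upper=10.0):
    """Map the three cases above to a prior on e_hat (enzyme level relative
    to the reference). Returns ('uniform', lower, upper) or ('fixed', 1.0)."""
    if reference_over and not strain_over:
        return ('uniform', 0.0, 1.0)    # case 1: below the reference level
    if strain_over and not reference_over:
        return ('uniform', 1.0, upper)  # case 2: above the reference level
    return ('fixed', 1.0)               # case 3: same as the reference
```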
djinnome commented 5 years ago

Hi Peter,

Just a clarification. In run_putida_inference.ipynb you generate priors for log_en_t and then exponentiate when you call ll.steady_state_theano(). If I do not log-transform, I get a very small -ELBO: [plot: yarrowia_tinymodel_elbo]. If I do log-transform, then shouldn't the log-transformed uniform priors be either a negative exponential (if the non-reference enzyme is not overexpressed and the reference enzyme is) or a positive exponential (if the non-reference enzyme is overexpressed and the reference enzyme is not)? When I try this, I get the following -ELBO: [plot: yarrowia_tinymodel_log_en_t_elbo]
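
For what it's worth, the log-space counterpart of a Uniform(0, 1) prior is exact: if e ~ Uniform(0, 1), then -log(e) is distributed Exponential(1). The lam=2 used in the code below is a tighter, assumed choice rather than this exact equivalent. A quick simulation checking the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1 - random() lies in (0, 1], so the log is always finite
u = 1.0 - rng.random(200_000)

# If e ~ Uniform(0, 1), then -log(e) ~ Exponential(1) exactly,
# so a negated Exponential prior on log_e mirrors a Uniform(0, 1) prior on e.
neg_log_u = -np.log(u)
print(neg_log_u.mean())  # close to 1.0, the Exponential(1) mean
```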

Sincerely,

Jeremy

djinnome commented 5 years ago

To be clear, this is the code that uses the non-log-transformed enzyme priors:

with pm.Model() as pymc_model:

    # Initialize elasticities
    Ex_t = pm.Deterministic(
        'Ex', initialize_elasticity(N.values, 'ex', b=0.05, sd=1, alpha=None,
                                    m_compartments=m_compartments,
                                    r_compartments=r_compartments))

    Ey_t = pm.Deterministic('Ey', initialize_elasticity(-Ey.T.values, 'ey', b=0.05, sd=1, alpha=None))

    e_measured = theano.shared( np.ones( enzyme_measurements.shape ) )
    for i, strain in enumerate(enzyme_measurements.index):
        for j, enzyme in enumerate(enzyme_measurements.columns):
            if enzyme_measurements.loc[reference_state, enzyme] == 1 and enzyme_measurements.loc[strain,enzyme] != 1:
                e_measured = T.set_subtensor( e_measured[i,j], 
                                              pm.Uniform( 'e_{}_{}'.format( strain, enzyme ),
                                                          lower = 0, upper=1) )
            elif enzyme_measurements.loc[reference_state,enzyme] != 1 and enzyme_measurements.loc[strain, enzyme] == 1:
                e_measured = T.set_subtensor( e_measured[i,j], 
                                              pm.Uniform( 'e_{}_{}'.format( strain, enzyme ),
                                                          lower = 1, upper=10) )

    e_unmeasured = pm.Laplace('e_unmeasured', mu=1, b=0.1,
                              shape=(n_exp, len(e_laplace_inds)))
    en_t = T.concatenate(
        [e_measured, e_unmeasured,
         T.ones((n_exp, len(e_zero_inds)))], axis=1)[:, e_indexer]

    pm.Deterministic('en_t', en_t)

    yn_t = T.as_tensor_variable(yn.values)

    chi_ss, vn_ss = ll.steady_state_theano(Ex_t, Ey_t, en_t, yn_t)
    pm.Deterministic('chi_ss', chi_ss)
    pm.Deterministic('vn_ss', vn_ss)

    vn_subset = T.clip(vn_ss[:, v_inds], 0, 2)

    chi_clip = T.clip(chi_ss[:, x_inds], -3, 3)  

    #chi_obs = pm.Normal('chi_obs', mu=chi_clip, sd=0.2,
    #                    observed=xn.values)
    vn_obs = pm.Normal('vn_obs', mu=vn_subset, sd=0.025,
                           observed=vn.values)

and this is the code that uses the log-transformed enzyme priors:

with pm.Model() as pymc_model:

    # Initialize elasticities
    Ex_t = pm.Deterministic(
        'Ex', initialize_elasticity(N.values, 'ex', b=0.05, sd=1, alpha=None,
                                    m_compartments=m_compartments,
                                    r_compartments=r_compartments))

    Ey_t = pm.Deterministic('Ey', initialize_elasticity(-Ey.T.values, 'ey', b=0.05, sd=1, alpha=None))

    e_measured = theano.shared( np.ones( enzyme_measurements.shape ) )
    for i, strain in enumerate(enzyme_measurements.index):
        for j, enzyme in enumerate(enzyme_measurements.columns):
            if enzyme_measurements.loc[reference_state, enzyme] == 1 and enzyme_measurements.loc[strain,enzyme] != 1:
                e_measured = T.set_subtensor( e_measured[i,j], 
                                              -pm.Exponential( 'log_e_{}_{}'.format( strain, enzyme ),
                                                          lam=2) )
            elif enzyme_measurements.loc[reference_state,enzyme] != 1 and enzyme_measurements.loc[strain, enzyme] == 1:
                e_measured = T.set_subtensor( e_measured[i,j], 
                                              pm.Exponential( 'log_e_{}_{}'.format( strain, enzyme ),
                                                          lam=2) )

    e_unmeasured = pm.Laplace('log_e_unmeasured', mu=0, b=0.1,
                              shape=(n_exp, len(e_laplace_inds)))
    log_en_t = T.concatenate(
        [e_measured, e_unmeasured,
         T.zeros((n_exp, len(e_zero_inds)))], axis=1)[:, e_indexer]

    pm.Deterministic('log_en_t', log_en_t)

    yn_t = T.as_tensor_variable(yn.values)

    # log_en_t is symbolic, so use the Theano op T.exp rather than np.exp
    chi_ss, vn_ss = ll.steady_state_theano(Ex_t, Ey_t, T.exp(log_en_t), yn_t)
    pm.Deterministic('chi_ss', chi_ss)
    pm.Deterministic('vn_ss', vn_ss)

    vn_subset = T.clip(vn_ss[:, v_inds], 0, 2)

    chi_clip = T.clip(chi_ss[:, x_inds], -3, 3)  

    #chi_obs = pm.Normal('chi_obs', mu=chi_clip, sd=0.2,
    #                    observed=xn.values)
    vn_obs = pm.Normal('vn_obs', mu=vn_subset, sd=0.025,
                           observed=vn.values)
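
In both versions, en_t / log_en_t is assembled by concatenating the measured, Laplace-prior, and fixed blocks, then reordering columns with e_indexer. A plain-NumPy sketch of that indexing pattern (toy shapes and a made-up indexer):

```python
import numpy as np

n_exp = 2
e_measured = np.full((n_exp, 3), 2.0)    # measured block (toy values)
e_unmeasured = np.full((n_exp, 2), 0.5)  # block given Laplace priors
e_fixed = np.ones((n_exp, 1))            # enzymes pinned at the reference level

# e_indexer maps concatenated-block column order back to model reaction order
e_indexer = np.array([0, 3, 1, 5, 4, 2])
en_t = np.concatenate([e_measured, e_unmeasured, e_fixed], axis=1)[:, e_indexer]
print(en_t[0].tolist())  # [2.0, 0.5, 2.0, 1.0, 0.5, 2.0]
```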

pstjohn commented 5 years ago

You certainly don't have to log-transform the enzymes, so I would stick with the first version. Just for context, I've used the log-scale approach when I have proteomics data: the protein data often looks normally distributed in log-scale, but I would tend to blow out some of those lower concentrations if I tried to fit relative protein expression on an absolute scale.

I would also say that if you don't have a lot of experimental data to fit (i.e., you're not using multiomics measurements), you might try omitting the Laplace distributions on unmeasured enzymes and just leaving them at 1:

    e_unmeasured = pm.Laplace('e_unmeasured', mu=1, b=0.1,
                              shape=(n_exp, len(e_laplace_inds)))

That's what I do in the contador example:

    # We don't know the expression (here, underexpression vs. reference) of
    # each protein as a result of the plasmid integration. So, we use a
    # symbolic variable for each protein in the enzyme expression matrix.
    e_hat = T.ones((n_exp, nr))
    e_hat_entries = pm.Uniform('e_hat_entries', lower=0, upper=1, shape=(5,))

    for i, rxns in enumerate(ox_order):
        for rxn in rxns:
            e_hat = T.set_subtensor(
                e_hat[i, model.reactions.index(rxn)],
                e_hat_entries[rxn_index[rxn]])

    ...

    # Calculate steady-state concentrations and fluxes from elasticities
    xn, vn = ll.steady_state_theano(Ex_t, Ey_t, e_hat, np.ones((n_exp, ll.ny)))
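
The loop above ties every modified (experiment, reaction) cell to one of a few shared free parameters, so a given overexpression gets the same fitted value in every strain that carries it. A plain-NumPy analogue with toy names:

```python
import numpy as np

n_exp, nr = 3, 4
reactions = ['rxnA', 'rxnB', 'rxnC', 'rxnD']  # toy reaction names
ox_order = [['rxnA'], ['rxnA', 'rxnB'], []]   # reactions modified per experiment
rxn_index = {'rxnA': 0, 'rxnB': 1}            # reaction -> shared parameter slot

# Two shared free parameters stand in for e_hat_entries above
e_hat_entries = np.array([0.2, 0.4])

e_hat = np.ones((n_exp, nr))
for i, rxns in enumerate(ox_order):
    for rxn in rxns:
        e_hat[i, reactions.index(rxn)] = e_hat_entries[rxn_index[rxn]]
# Row 2 stays all ones; rxnA gets the same fitted value in rows 0 and 1.
```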