pnnl-predictive-phenomics / syn_bmca

Code for implementing BMCA for Synechococcus elongatus

Handle missing data for input into Bayesian MCA #3

Open djinnome opened 9 months ago

djinnome commented 9 months ago

All data sucks. Bayesian MCA naively assumes that the matrix of metabolite, flux, and enzyme observations has the same dimensions as the rows or columns of the stoichiometric matrix.

There is a tensor trick to reindex a tensor so that a smaller tensor can represent the observed data and its rows map to the rows of the full stoichiometric matrix, but only ChatGPT knows how this works.
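One way such a reindexing could work is sketched below with NumPy rather than PyTensor. This is not necessarily the trick referred to above. It assumes the observations are kept in a compact array and that an integer index records where each observed row sits in the full model ordering. All names (`model_ids`, `observed_ids`, `predicted_full`) are hypothetical.

```python
import numpy as np

# Hypothetical model with 5 metabolites (rows of the stoichiometric matrix).
model_ids = ["m0", "m1", "m2", "m3", "m4"]

# Only three of them were measured, each in 2 conditions.
observed_ids = ["m1", "m3", "m4"]
observed = np.array([[0.1, 0.2],
                     [0.3, 0.4],
                     [0.5, 0.6]])  # shape (n_observed, n_conditions)

# Integer positions of the observed rows within the full model ordering.
position = {mid: i for i, mid in enumerate(model_ids)}
obs_index = np.array([position[mid] for mid in observed_ids])

# Stand-in for a full-size model prediction, shape (n_model, n_conditions).
predicted_full = np.arange(10, dtype=float).reshape(5, 2)

# Index the prediction down to the observed rows so the shapes match,
# instead of padding the data up to the full model size.
predicted_observed = predicted_full[obs_index]
residual = observed - predicted_observed
```

The same integer-array indexing pattern should carry over to tensor libraries that support NumPy-style advanced indexing, so the likelihood would only ever see the rows that were actually measured.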

djinnome commented 9 months ago

The current enzyme activity function generates Inf values for enzymes that have no expression data. The solution to this issue will align the interface to handle missing rows, missing values, and missing columns.

augeorge commented 8 months ago

I will help with this

augeorge commented 7 months ago

Initial tests to build and pass:

inputs: dataframe: ncond x nvariables

outputs: pytensor: ncond x nvariables

test that tensor_equal(example_tensor, make_observables(example_df)). Can use data from Hackett or Wu et al.

  1. ex1_df: unmeasured everywhere. ex1_tensor: Laplace everywhere w/ shape=ex1_df.shape
  2. ex2_df: measured everywhere - uninformed. ex2_tensor: Normal(mu=0, sigma=1, shape=ex2_df.shape)
  3. ex2_df: measured everywhere - informed. ex2_tensor: Normal(mu=ex2_df.values, sigma=1, shape=ex2_df.shape).
  4. ex3_df= some variables missing - uninformed. ex3_tensor = some columns are Laplace, some columns are (Normal(mu=0, shape=nconditions))
  5. ex3_df= some variables missing - informed. ex3_tensor = some columns are Laplace, some columns are (Normal(mu=ex3_df[var], shape=nconditions))
  6. ex4_df = some conditions missing. ex4_tensor = some rows are Normal(0,1) some columns are Normal
  7. ex5_df = ragged array ex5_tensor = pymc_ragged_array
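
The column-wise cases in the list above (1 through 5) could be prototyped without any PyMC dependency by first deciding which distribution each column should get. The sketch below is a stand-in for that decision step only, not the real `make_observables`. The function name `plan_observables` is hypothetical, and it assumes a column with no finite values counts as unmeasured.

```python
import numpy as np
import pandas as pd

def plan_observables(df, informed=True):
    """Return {column: (distribution_name, mu)} for an ncond x nvariables frame.

    Assumption: a column with no finite values counts as unmeasured.
    """
    plan = {}
    for name in df.columns:
        values = df[name].to_numpy(dtype=float)
        if not np.isfinite(values).any():
            # Unmeasured variable: Laplace, no data-derived location.
            plan[name] = ("Laplace", None)
        elif informed:
            # Measured and informed: Normal centered on the data.
            plan[name] = ("Normal", values)
        else:
            # Measured but uninformed: Normal centered on zero.
            plan[name] = ("Normal", np.zeros(len(values)))
    return plan

# Case 5: some variables missing, informed.
ex3_df = pd.DataFrame({"A": [0.5, 1.5, 2.5], "B": [np.nan, np.nan, np.nan]})
plan = plan_observables(ex3_df, informed=True)
uninformed_plan = plan_observables(ex3_df, informed=False)
```

Columns that are only partly measured (cases 6 and 7) are not handled here; they would pass through as Normal with non-finite entries in `mu` and need their own treatment.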

djinnome commented 6 months ago

@ShantMahserejian can we get a dataframe for each data type where the columns are the conditions and the rows are the model ids for that data type? Each cell would be a float if it was measured, Inf if it was not measured, and NaN if no measurement can be mapped to the model id (for example, reactions that are not enzyme-catalyzed).
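
As a rough illustration of how that encoding could be consumed, the sketch below splits a frame into three boolean masks. The frame, its reaction ids, and its condition names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical enzyme data: rows are model ids, columns are conditions.
df = pd.DataFrame(
    {"cond1": [1.2, np.inf, np.nan], "cond2": [0.8, np.inf, np.nan]},
    index=["rxn_a", "rxn_b", "rxn_c"],
)

values = df.to_numpy(dtype=float)
measured = np.isfinite(values)    # float: a real measurement
unmeasured = np.isinf(values)     # Inf: mappable, but not measured
unmappable = np.isnan(values)     # NaN: nothing can map to this model id
```

Every cell falls into exactly one mask, so the three masks could drive separate handling for each kind of entry downstream.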