microbiome / miaSim

microbiome data simulation package
https://microbiome.github.io/miaSim/
Artistic License 2.0

Parameters as the field of SE object #25

Closed YagmurSimsekk closed 2 years ago

YagmurSimsekk commented 3 years ago

Hi,

We have been thinking about adding parameters as fields of the SE object. Since some parameters are generated by distribution functions (mostly the uniform distribution, for instance in the simulateGLV function), adding them as fields would be one way of making them explicit, so that users can see what they are working with. But we are not sure how well this suits the SE object structure, or whether it falls outside the concept of the SE object class.

Any opinion on this is appreciated.

antagomir commented 3 years ago

The SE structure does not have a placeholder for model parameters afaik.

Depends on how critical this really is. If you can generate the same output by setting a random seed and re-running the script that generated the data, this might be enough in many cases.

When it is not enough, one could also consider separating the two steps, so that the stochastic parameters are generated outside the original function. This could be more transparent. If this is not feasible, we can ask how important it is to know the exact values if the process is meant to be stochastic in the first place.
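
For illustration, a hypothetical sketch of this separation (the argument names are made up, not the actual simulateGLV signature): the stochastic parameters are drawn explicitly by the user, under a fixed seed, and then passed to the simulator as ordinary arguments.

set.seed(42)
n_species <- 4
# Draw the interaction matrix explicitly, outside the simulator,
# so the exact values are visible and reproducible.
A <- matrix(runif(n_species^2, min = -0.5, max = 0.5), nrow = n_species)
# Then pass it in (hypothetical call):
# se <- simulateGLV(n_species = n_species, A = A)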

I would prefer finding a solution that respects standard workflows. This is important for transparency and interoperability, especially in the longer term. I am not sure if we have a solution that could readily add these as part of the SE object.

In principle, we can suggest new structures in the SE or TreeSE object but justifications should be quite strong before this is feasible I think.

YagmurSimsekk commented 2 years ago

New suggestion regarding the SE structure: the result of the simulations is focused on the SummarizedExperiment class data container. Parameters produced by distribution functions (rbinom, runif, ...) are currently only placeholders. For the benefit of the user, it is recommended that the result be a list object that also contains the parameters used in the function, once the placeholder parameters are replaced with actual values.

antagomir commented 2 years ago

I am attempting to summarize, please correct if necessary:

Problem: microbiome time series simulations involve various parameters, including stochastic parameters generated during the process. It can be necessary to keep a record of these, but the SE container does not naturally allow this.

Solution: implement two things separately:

  1. Functions that perform the simulations, and return a list with all possible details
  2. Functions that can convert such output into a separate SE object

If the output from (1) is always in a standard format (for instance, a list with a field "matrix" and then other fields for the parameters), it might be possible to implement (2) as a generic wrapper that works for several different simulators.
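
A minimal sketch of what such a generic wrapper could look like, assuming the standardized list from (1) has a field "matrix" (species x time points) and the remaining fields are parameters; the name listToSE is hypothetical:

library(SummarizedExperiment)

listToSE <- function(sim) {
  # Everything except the abundance matrix is treated as a parameter.
  params <- sim[setdiff(names(sim), "matrix")]
  SummarizedExperiment(assays   = list(counts = sim$matrix),
                       metadata = params)
}

# Usage, given any simulator from (1) that returns the standard list:
# se <- listToSE(simulateSomething(...))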

Would this work @danielriosgarza @YagmurSimsekk ?

I would be curious to hear if @FelixErnst has other suggestions.

Related to #23

danielriosgarza commented 2 years ago

Indeed, I think the best option is to implement (2) as a generic wrapper. The output from (1) should be in a standard format, with the matrix of composition over time having the same format for each model.

YagmurSimsekk commented 2 years ago

I agree with Daniel's point. This may make it easier for users to keep track of important values.

antagomir commented 2 years ago

Sounds feasible to me. Standardization is essential here, as well as providing clear examples in function roxygen pages and vignette.

FelixErnst commented 2 years ago

Why not use the metadata list? This is exactly the use case: you can set a value under a name and expect a value for that name. If it is NULL, you can throw an error and be done with it.
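
A minimal sketch of this pattern, assuming se is a SummarizedExperiment and metacommunity.p is one of the generated parameters:

library(SummarizedExperiment)

# Store the generated value under a descriptive name.
metadata(se)$metacommunity.p <- metacommunity.p

# Later, retrieve it by name and fail loudly if it is missing.
p <- metadata(se)$metacommunity.p
if (is.null(p)) stop("'metacommunity.p' not found in metadata(se)")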

FelixErnst commented 2 years ago

Can you prepare an example of which parameters and values we are talking about?

YagmurSimsekk commented 2 years ago

@FelixErnst Thank you for the suggestion! Here is an example with the simulateHubbellRates function, where we can see the parameters in the metadata, including the parameters that were initially NULL but were generated in the function. Also, the colData shows the community size for each species before and after the simulation. I am not sure if the colData field can be used for this, but it can be beneficial to see these numbers together in data frame format.

antagomir commented 2 years ago

In practice:

community.initial <- c(0, 5, 10)
se <- simulateHubbellRates(community.initial, migration.p = 0.01,
                           metacommunity.p = NULL, k.events = 1,
                           growth.rates = NULL, norm = FALSE, t.end = 1000)

This simulation has 1000 time points for a 3 species community.

dim(se)
[1] 1000    3

The sample data shows the initial and final states for this 3-species community.

colData(se)

DataFrame with 3 rows and 2 columns
    initial.community after.community
s_1                 0               0
s_2                 5              15
s_3                10               0

The assay data has corresponding information. Notable here is that the first time point is not the initial time point `colData(se)[,1]`; instead it is the first step after the initial time point, if I get it correctly. The last time point in the assay data seems to be the same as the last time point in `colData(se)[,2]`.

assay(se, "counts")[1,]
s_1 s_2 s_3
  0   6   9

assay(se, "counts")[1000,]
s_1 s_2 s_3
  0  15   0

To me this raises two questions:

1. Why do we need to include duplicated information on time points in colData when we can readily read these from the assay data itself? The interpretation is clearer when these are read from the assay data, and I do not see added value in duplicating the same information in colData.
2. Having colData starting from time point 0, and assay data starting from time point 1, is potentially confusing.

My suggestion is to:

- remove the initial & last point info from colData entirely, and
- include also the initial 0 time point in the assay data so that it is available from there; it is rather standard in time series simulation analyses to include the initial starting point in the time series.

The indexing of the rows (rownames) starts from 0. I would propose changing this to start from 1 because in R, indexing typically starts from 1 (unlike in some other programming languages). Making exceptions may lead to confusion(?)

head(assay(se, "counts"))
  s_1 s_2 s_3
0   0   6   9
1   0   3  12
2   0   0  15

The parameter metadata looks like this. Also from here I would remove the initial time point if it can already be included in the assay data as the first time point. Otherwise, the metadata list seems like a good solution to me. It would be helpful to document this with an example on the manpage (roxygen).

metadata(se)

$values
$values$community.initial
[1]  0  5 10

$values$migration.p
[1] 0.01

$values$metacommunity.p
          [,1]      [,2]     [,3]
[1,] 0.1942493 0.0399247 0.765826

$values$k.events
[1] 1

$values$growth.rates
[1] 1 1 1

$values$norm
[1] FALSE

$values$t.end
[1] 1000

$values$...

FelixErnst commented 2 years ago

The values look non-descriptive to me.

values should be something like hubbellRates to mark where those values belong.

But I think this is the result of using mget and the while loop, which I think is not good practice. You should know which parameter values you will generate before using them. In addition, most of these are the input arguments, so why not store them explicitly?
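
For instance, a hedged sketch of storing the arguments explicitly under a descriptive name (the parameter names are taken from the example above):

# Build the parameter list by hand instead of collecting it with mget().
metadata(se)$hubbellRates <- list(
  community.initial = community.initial,
  migration.p       = migration.p,
  metacommunity.p   = metacommunity.p,  # filled in if it was NULL on input
  k.events          = k.events,
  growth.rates      = growth.rates
)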

danielriosgarza commented 2 years ago

I agree that there is no point in storing the "community.initial" and "community.after" since this information is already in the matrix.

There was a mistake in writing the data to the out.matrix, which is hopefully now fixed.

By default, the first time-point of the simulation should contain the initial population.

If I understood correctly, the rownames are the timepoints corresponding to each composition. In this case it makes sense to start at zero if zero was the first point, e.g.:

se = simulateHubbellRates(community.initial = c(12, 13, 0), t.end = 470, t.store = 53)

head(assay(se, "counts"))

     s_1 s_2 s_3
0     12  13   0
8.9    0   0  25
17.8   0   0  25
26.7   0   0  25
35.6   0   2  23
44.5   0   0  25

antagomir commented 2 years ago

The indices should start from one (in R), but colData(se) could include a field that declares the time point for each sample. This is a good idea also because the intervals between time points can vary in real studies (quite often), and there can be a need for simulations where time points have been stored at irregular intervals (for benchmarking etc.). Then the integer indices would not work. I would therefore propose adding a field time in colData(se); this can get integer values, starting from zero, in the simplest case where the simulation has integer time steps.
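
A minimal sketch of the proposal, assuming se already exists; integer steps are used here, but any numeric vector (including irregular intervals) would work:

library(SummarizedExperiment)

# One time point per sample (column), starting from zero.
colData(se)$time <- seq(0, ncol(se) - 1)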

FelixErnst commented 2 years ago

I would try to use as much native R as possible, including the Date object.

antagomir commented 2 years ago

That would be recommended for real experiments. For reproducible simulations there is less need, but if the Date type is useful then why not. At the same time it is good to keep things simple; numeric variables for time steps are simpler than Date objects. But uniform treatment is also good, and it could help if this package implemented it in the same way as other packages that may use real data and real dates. I am just not sure how to reasonably add dates to artificial simulations that can be exactly replicated at any time.

FelixErnst commented 2 years ago

Apart from saving the Date, it allows for easy conversion and calculation of time differences and such. One could do this manually, but from experience I would rather avoid the headache that comes with time calculations.
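
A small base R example of what this buys you; the dates are made up:

d <- as.Date(c("2022-01-01", "2022-01-15", "2022-02-01"))
difftime(d, d[1], units = "days")
# Time differences in days
# [1]  0 14 31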

antagomir commented 2 years ago

Fine with me, also for simulations. If @YagmurSimsekk could have a look? Let's chat more when necessary.

YagmurSimsekk commented 2 years ago

I considered the feedback and changed the metadata field to contain only the parameters generated during the function. Also, I thought it would be better to add the time points to the metadata along with the other parameters; the colData field didn't seem a perfect fit for it (open to discussion, of course). I'm trying to add a generalized function to extract time intervals and add them onto a date/time. I assume this can also be used for real experiment data.

antagomir commented 2 years ago

To me time point information seems exactly the type of information that is given per sample (hence fit for colData), and including time points in sample metadata is rather standard in similar studies widely across different fields of bioinformatics. What would be the specific justification for metadata field being a better fit than colData?

There are already existing functions to get time differences for time formats (for instance difftime). In general this is something to be added by the user, I think, but in simulated data it can be provided as a convenience. In this case, colData could have for instance the fields "time" and "time_from_baseline" (or some better name).

YagmurSimsekk commented 2 years ago

It wasn't exactly the same as what I understood at the time; I just added the time points without the species. But now it makes sense that the time point information of each species can be stored in colData.

antagomir commented 2 years ago

Isn't it time point information of each sample (and not of each species)?

YagmurSimsekk commented 2 years ago

True, that should be sample. Does it look better now, after the last commit?

antagomir commented 2 years ago

I am not sure what you are aiming at here.

Pulling the latest version of the "authors" branch I simulate data:

se <- simulateHubbellRates(community_initial = c(0, 5, 10),
                           migration_p = 0.01, metacommunity_p = NULL,
                           k_events = 1, growth_rates = NULL,
                           norm = FALSE, t.end = 1000)

Now in this latest version, you have

The following dimension for the assay data

dim(assay(se, "counts"))
[1] 1000    3

If this is a 3-species community with 1000 time steps, then the assay data should be a 3 x 1000 matrix (and not 1000 x 3), i.e. a species x samples matrix.

Same applies to the column data.

dim(colData(se))
[1]    3 1000

This means that there are 1000 different fields in colData for 3 samples, instead of a single time point field for 1000 samples (time steps).

Then I realized that in fact colData duplicates the information that is in the assay data:

all(as.matrix(colData(se)) == t(assay(se, "counts")))
[1] TRUE

The idea is that the assay data contains the abundance of each species (rows) per each sample (time point, cols).

The colData should give the time point for each sample. If there are 1000 samples (or time points), then the colData should be a 1000x1 DataFrame with a single field called "time", or a 1000x2 DataFrame if you also want to include time from baseline. The colData would then provide time point information for each of the 1000 samples. This is not the same as the assay data, which gives the abundance of each species at each time point.

We are observing a 3-species community over time and at each time step the timepoint is shared by all community members. Therefore it can be provided as a single field in colData.
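
A toy example of this layout (random counts, hypothetical values), with species as rows, samples as columns, and the time point as a single colData field:

library(SummarizedExperiment)

counts <- matrix(rpois(3 * 5, lambda = 10), nrow = 3,
                 dimnames = list(paste0("s_", 1:3), NULL))
se2 <- SummarizedExperiment(assays  = list(counts = counts),
                            colData = DataFrame(time = 0:4))
dim(se2)           # 3 species x 5 samples
colData(se2)$time  # one shared time point per sample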

YagmurSimsekk commented 2 years ago

Thank you so much for the detailed explanation! I had mixed up the column data and row data for the assay, which also messed up the row numbers of the colData field; that's why I added the time points to metadata in the first place. Now when I run the same example I get:

dim(assay(se, "counts"))
[1]    3 1000

Here 3 is the number of species and 1000 is the number of time points, with [0] being the initial community size.

dim(colData(se))
[1] 1000    2

Here 1000 is the number of time points, with time and time_interval in the columns.

antagomir commented 2 years ago

This now seems to be solved as the parameters are stored in the metadata slot in the examples.

@YagmurSimsekk and others - kindly close the issues as they become ready. Or is there still something pending on this one? Correct me if I am mistaken, closing this now.