Closed YagmurSimsekk closed 2 years ago
The SE structure does not have a placeholder for model parameters afaik.
Depends on how critical this really is. If you can generate the same output by setting random seed, and running the script that generated the data, this might be enough in many cases.
When it is not enough, one could also consider separating the two functions, so that the stochastically generated parameters would be generated already outside of the original function. This could be more transparent. If this is not feasible, we can ask how important it is to know the exact values if the process is meant to be stochastic in the first place.
I would prefer finding a solution that respects standard workflows. This is important for transparency and interoperability, especially in the longer term. I am not sure if we have a solution that could readily add these as part of the SE object.
In principle, we can suggest new structures in the SE or TreeSE object but justifications should be quite strong before this is feasible I think.
New suggestion regarding the SE structure: the result of the simulations is focused on the SummarizedExperiment class as the data container. Parameters produced by distribution functions (rbinom, runif, ...) are currently only placeholders. For the benefit of the user, it is recommended that the result be a list object that also contains the parameters used in the function, once the placeholder parameters have been replaced with actual values.
I am attempting to summarize, please correct if necessary:
Problem: microbiome time series simulations involve various parameters, including stochastic parameters generated during the process. It can be necessary to keep a record of these, but the SE container does not naturally allow this.
Solution: implement two parts separately: (1) the simulation function itself, returning its output in a standard format, and (2) a step that wraps this output into the SE container.
If the output from (1) is always in a standard format (for instance, a list with a field "matrix" and then other fields for the parameters), it might be possible to implement (2) as a generic wrapper that works for several different simulators.
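For illustration, a minimal base-R sketch of what such a convention and wrapper could look like. The simulator name, field names, and the plain-list stand-in for the SE object are all hypothetical; in the real package the wrapper would build a SummarizedExperiment and put the extra fields into metadata(se):

```r
# Hypothetical simulator in the proposed standard format:
# a list with a "matrix" field plus any stochastically generated parameters.
toy_simulator <- function(n_species = 3, n_time = 10, seed = 42) {
  set.seed(seed)
  growth_rates <- runif(n_species)  # generated parameter worth recording
  counts <- matrix(rpois(n_species * n_time, 10),
                   nrow = n_species, ncol = n_time)
  list(matrix = counts, growth_rates = growth_rates)
}

# Generic wrapper: works for any simulator that follows the convention.
# A plain list stands in for the SE object in this sketch.
wrap_simulation <- function(sim_fun, ...) {
  out <- sim_fun(...)
  stopifnot(is.list(out), "matrix" %in% names(out))
  params <- out[setdiff(names(out), "matrix")]
  list(assay = out$matrix, metadata = params)
}

res <- wrap_simulation(toy_simulator, n_species = 3, n_time = 5)
```

Because the wrapper only relies on the "matrix" field being present, any simulator emitting this list shape could be wrapped without simulator-specific code.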
Would this work @danielriosgarza @YagmurSimsekk ?
I would be curious to hear if @FelixErnst has other suggestions.
Related to #23
Indeed, I think the best option is to implement (2) as a generic wrapper. The output from (1) should be in a standard format, with the matrix of composition over time having the same format for each model.
I agree with Daniel's point. This may be better for users to keep track of important values.
Sounds feasible to me. Standardization is essential here, as well as providing clear examples in function roxygen pages and vignette.
Why not use the metadata list? This is exactly the use case. You can set a value under a name and expect a value for a given name. If it is NULL you can throw an error and be done with it.
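As a sketch of this pattern, assuming the SummarizedExperiment package is available; the parameter names and the toy assay are purely illustrative:

```r
library(SummarizedExperiment)

# Toy assay standing in for simulator output.
counts <- matrix(rpois(30, 10), nrow = 3,
                 dimnames = list(paste0("s_", 1:3), NULL))
se <- SummarizedExperiment(assays = list(counts = counts))

# Record the (possibly generated) parameter values by name.
metadata(se)$migration.p <- 0.01
metadata(se)$growth.rates <- runif(3)

# Fail loudly when an expected parameter was never recorded.
get_param <- function(se, name) {
  value <- metadata(se)[[name]]
  if (is.null(value)) stop("parameter '", name, "' not found in metadata")
  value
}
get_param(se, "migration.p")
```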
Can you prepare an example of which parameters and values we are talking about?
@FelixErnst Thank you for the suggestion! Here is an example with the simulateHubbellRates function, where we can see the parameters in the metadata, including the parameters that were initially NULL but were generated in the function. Also, the colData shows the community size for each species before and after the simulation. I am not sure whether the colData field can be used for this, but it could be beneficial to see these numbers together in data frame format.
In practice:
community.initial <- c(0, 5, 10)
se <- simulateHubbellRates(community.initial, migration.p = 0.01,
                           metacommunity.p = NULL, k.events = 1,
                           growth.rates = NULL, norm = FALSE, t.end = 1000)
This simulation has 1000 time points for a 3 species community.
dim(se)
[1] 1000 3
The sample data shows the initial and last time state for this 3-species community.
colData(se)
DataFrame with 3 rows and 2 columns
  initial.community  after.community
The values look non-descriptive to me. values should be something like hubbelRates to mark where those values belong. But I think this is the result of using mget and the while loop, which I think is not good practice: you should know which parameter values you will generate before using them. In addition, most of these are the input arguments, so why not store them explicitly?
I agree that there is no point in storing the "community.initial" and "community.after" since this information is already in the matrix.
There was a mistake in writing the data to the out.matrix which is hopefully now fixed.
By default, the first time-point of the simulation should contain the initial population.
If I understood correctly, the rownames are the time points corresponding to each composition; in this case it makes sense to start at zero if zero was the first point, e.g.:
se <- simulateHubbellRates(community.initial = c(12, 13, 0), t.end = 470, t.store = 53)
head(assay(se, "counts"))
     s_1 s_2 s_3
0     12  13   0
8.9    0   0  25
17.8   0   0  25
26.7   0   0  25
35.6   0   2  23
44.5   0   0  25
The indices should start from one (in R), but colData(se) could include a field that declares the time point for each sample. This is a good idea also because the intervals between time points can (quite often) vary in real studies, and there can also be a need for simulations where time points are stored at irregular intervals (for benchmarking etc.). Then integer indices would not work. I would therefore propose to add a field "time" in colData(se); this can take integer values, starting from zero, in the simplest case where the simulation has integer time steps.
I would try to use as much of native R including the Date object.
That would be recommended for real experiments. For reproducible simulations there is less need, but if the Date type is useful then why not. At the same time it is good to keep it simple: numeric variables for time steps are simpler than Date objects. But uniform treatment is also good, and it could help if it was implemented in this package in the same way as in other packages that may use real data and real dates. I am just not sure how to reasonably add dates to artificial simulations that can be exactly replicated at any time.
Apart from saving the Date, it allows for easy conversion and calculation of time differences and the like. One could do this manually, but from experience I would rather avoid the headache that comes with time calculations.
Fine with me, also for simulations. If @YagmurSimsekk could have a look? Let's chat more when necessary.
I considered the feedback and changed the metadata field to contain only the parameters generated during the function. Also, I thought it would be better to add the time points to the metadata along with the other parameters; the colData field didn't seem a perfect fit for it (open to discussion, of course). I'm trying to add a generalized function to extract time intervals and add them onto a date/time. I assume this can also be used with real experimental data.
To me, time point information seems exactly the type of information that is given per sample (hence fit for colData), and including time points in sample metadata is rather standard in similar studies widely across different fields of bioinformatics. What would be the specific justification for the metadata field being a better fit than colData?
There are already existing functions to get time differences for time formats (for instance, difftime). In general this is something to be added by the user, I think, but in simulated data it can be provided as a convenience. In this case, colData could have, for instance, the fields "time" and "time_from_baseline" (or some better name).
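A base-R sketch of how such convenience fields could be computed; the time values and column names are illustrative, not from the package:

```r
# Illustrative sampling times, possibly irregular.
time <- c(0, 8.9, 17.8, 30.0, 44.5)

# Time from baseline and interval to the previous sample.
time_from_baseline <- time - time[1]
time_interval <- c(NA, diff(time))

# With real dates, difftime does the same job.
dates <- as.Date("2020-01-01") + c(0, 7, 14, 30)
as.numeric(difftime(dates[-1], dates[-length(dates)], units = "days"))
# 7 7 16
```

Both vectors have one entry per sample, so they slot directly into a per-sample table such as colData.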
It wasn't exactly the same as what I understood at the time; I had just added the time points without the species. But now it makes sense that the time point information of each species can be stored in colData.
Isn't it time point information of each sample (and not of each species)?
True, that should be sample. Does it look better now after last commit?
I am not sure what you are aiming at here.
Pulling the latest version of the "authors" branch I simulate data:
se <- simulateHubbellRates(community_initial = c(0, 5, 10),
                           migration_p = 0.01, metacommunity_p = NULL,
                           k_events = 1, growth_rates = NULL,
                           norm = FALSE, t.end = 1000)
Now, in this latest version, the assay data has the following dimensions:
dim(assay(se, "counts"))
[1] 1000 3
If this is a 3-species community with 1000 time steps, then the assay data should be a 3 x 1000 matrix (not 1000 x 3), i.e. a species x samples matrix.
Same applies to the column data.
dim(colData(se))
[1] 3 1000
This means that there are 1000 different fields in colData for 3 samples, instead of a single time point field for 1000 samples (time steps).
Then I realized that in fact colData duplicates the information that is in the assay data:
all(as.matrix(colData(se)) == t(assay(se, "counts")))
[1] TRUE
The idea is that the assay data contains the abundance of each species (rows) per each sample (time point, cols).
The colData should give the time point for each sample. If there are 1000 samples (or time points), then colData should be a 1000 x 1 DataFrame with a single field called "time", or a 1000 x 2 DataFrame if you would also like to include time from baseline. The colData would then provide the time point information for each of the 1000 samples. This is not the same as the assay data, which gives the abundance of each species at each time point.
We are observing a 3-species community over time and at each time step the timepoint is shared by all community members. Therefore it can be provided as a single field in colData.
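Assuming the SummarizedExperiment package is available, the intended layout could be sketched like this; the counts and time values are illustrative stand-ins for simulator output:

```r
library(SummarizedExperiment)

n_species <- 3
n_time <- 5
counts <- matrix(rpois(n_species * n_time, 10),
                 nrow = n_species,
                 dimnames = list(paste0("s_", 1:n_species), NULL))

# Species as rows, samples (time points) as columns ...
se <- SummarizedExperiment(
  assays = list(counts = counts),
  # ... and one shared time value per sample in colData.
  colData = DataFrame(time = seq(0, by = 8.9, length.out = n_time))
)

dim(assay(se, "counts"))  # species x samples
dim(colData(se))          # samples x fields
```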
Thank you so much for the detailed explanation! I had mixed up the column data and row data for the assay, which also messed up the row numbers of the colData field; that's why I added the time points to the metadata in the first place. Now when I run the same example I get
dim(assay(se, "counts"))
[1] 3 1000
---> 3 is the number of species and 1000 is the number of time points, with [0] being the initial community size
dim(colData(se))
[1] 1000 2
---> 1000 is the number of time points, with time and time_interval as the columns
This now seems to be solved as the parameters are stored in the metadata slot in the examples.
@YagmurSimsekk and others: kindly close the issues as they become ready. Or is there still something pending on this one? Correct me if I am mistaken; closing this now.
Hi,
We have been thinking about the topic of adding parameters as fields of the SE object. Since some parameters are generated by distribution functions (mostly the uniform distribution, for instance in the simulateGLV function), adding them as fields would be one way of exposing them explicitly, so that users can see what they are working with. But we are not sure how suitable this is for the SE object structure, or whether it falls outside the concept of the SE object class.
Any opinion on this is appreciated.