plantbreeding / BrAPI

Repository for version control of the BrAPI specifications
https://brapi.org
MIT License

Adjusted entry mean #468

Open NPZInno opened 3 years ago

NPZInno commented 3 years ago

We want to manage adjusted entry means using BrAPI. This is an aggregated phenotypic value, which is generated for each germplasm in a specific trial. It is computed by statistical methods that average multiple observations (generated in different studies).

This issue is one special case, of particular interest to us, of issue #467.

Thanks!

mverouden commented 3 years ago

My humble opinion, based on experience: I do not think that storing derived data in your database is generally a good idea. As soon as you start storing, e.g., means over multiple studies for a germplasm, the next question will be whether BLUEs and BLUPs can also be stored.

Normally you pull the data from the database into your software using BrAPI calls and perform the analysis there. At Biometris we have had many clients requesting this as well. In the end it turned out that the calculation time was so low that storing means/BLUEs/BLUPs makes no sense at all.

ch728 commented 3 years ago

Good point, it often doesn't take a lot of time, but what about larger breeding programs that run analyses on hundreds of trials in semi-automated pipelines? I think it would be useful to be able to store means/BLUEs/BLUPs, together with appropriate weights, for use in a downstream second-stage analysis. I am thinking along the lines of the pipeline/analysis discussed in this paper: https://acsess.onlinelibrary.wiley.com/doi/full/10.2135/cropsci2018.03.0182
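
For illustration, a rough sketch (not taken from the paper; the column names and the precision-based weighting are assumptions) of how stage-one means and weights might feed a second-stage mixed model in R:

```r
# Hypothetical second-stage analysis: 'stage1' is assumed to hold one
# adjusted mean per genotype x trial, plus its standard error from stage one.
library(lme4)

stage1$w <- 1 / stage1$se^2          # weight each mean by its precision

fit <- lmer(mean ~ genotype + (1 | trial),
            data = stage1, weights = w)
fixef(fit)                           # across-trial genotype estimates
```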

mverouden commented 3 years ago

I understand your point, but I do not think it is BrAPI's task to facilitate storage of intermediate results. This should be handled by the pipeline. This topic was already discussed extensively at the 2016 BrAPI Hackathon in Ithaca, where the outcome was as formulated above, based on the opinion of the many who were opposed to it at the time. I still support that statement and consider it bad practice to use the database as storage for temporary/intermediate results.

I have never spoken to a breeding company that stores these intermediate results in their database to facilitate the calculations within their pipeline, even those that run hundreds of studies (in BrAPI a program can contain many trials, and each trial can have multiple studies) in semi-automated pipelines.

We also apply the two-stage approach as described by Hans-Peter Piepho. The paper you cite uses ASReml with models as proposed by Cullis et al. Generally these models are quite exhaustive and computationally intensive. Have you tried the Biometris CRAN packages (statgenSTA, statgenGxE, etc.)? The models may be "suboptimal" compared to what Brian preaches, but they generally yield very similar, and often identical, results in much shorter calculation time.
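
For readers unfamiliar with those packages, a minimal sketch of a single-trial analysis with statgenSTA (data set, column names and arguments here are assumptions from memory; please check the package documentation for the exact interface):

```r
# Hypothetical single-trial analysis with statgenSTA; 'pheno' is assumed to
# hold genotype, trial, replicate and yield columns for an RCBD trial.
library(statgenSTA)

TD  <- createTD(data = pheno, genotype = "genotype", trial = "trial",
                repId = "rep")
sta <- fitTD(TD, design = "rcbd", traits = "yield", engine = "lme4")
BLUEs <- extractSTA(sta, what = "BLUEs")   # per-trial adjusted means
```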

nickmorales commented 3 years ago

I think it would be interesting for BrAPI to allow sharing of analytic models and analytic results.

The analytic models could define a model saved in the database and available for exchange, for example trained regression models from R saved as .Rds files, or trained convolutional neural network models from TensorFlow saved as .hdf5 files. The analytic model could record the model formula (e.g. mmer(Yield ~ 1, random = ~vs(id, Gu = A))) and other metadata about how the model was fit. Exchange of the fitted model files (.Rds, .hdf5, etc.) could be beneficial.
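
As a concrete illustration, a fitted model object could be serialized and exchanged as an .Rds file. This is only a sketch: the phenotype data frame 'pheno' and the relationship matrix 'A' are assumed to exist, and the file name is made up.

```r
# Hypothetical sketch: fit a genomic relationship model with sommer and
# serialize the fitted object so it could be exchanged alongside an
# analytic-model record.
library(sommer)

fit <- mmer(Yield ~ 1, random = ~ vs(id, Gu = A), data = pheno)
saveRDS(fit, "yield_model.Rds")        # file a server could store and serve
fit2 <- readRDS("yield_model.Rds")     # consumer restores the model object
```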

Then, when an analysis is run against such an analytic model, the results can be saved in the database using statistics-related observation variables, and the results can be related back to the analytic model.

ch728 commented 3 years ago

@nickmorales something along these lines is currently implemented in BreedBase, right? I agree that it would be cool to have something like this in BrAPI, but it seems like it would be a challenge to get to a standard that people agree on!

NPZInno commented 3 years ago

Dear all,

our issue is not about computing time at all; @mverouden is right about calculation time. Nevertheless, there is a need to store externally calculated data: other companies, external service providers, researchers and (most importantly) official testing offices provide semi-aggregated data. From the German official variety testing office, for example, we receive mean values per entry and location, which average multiple replications, but we have no access to the single-plot data. As another example, we also receive mean values per location, averaging all raw data. Furthermore, due to legal issues, raw data is sometimes not accessible from publications or repositories.

Therefore, as an employee of a practical, commercial breeding company, I disagree with @mverouden's opinion that storing intermediate or derived data in the database is a bad idea, at least in the setting discussed here. In the case described above (and in other situations as well), it is simply not possible to recalculate results that are of high importance for further downstream analysis.

mverouden commented 3 years ago

@NPZInno In your case I would recommend storing the data in a new study, with the variable being the aggregated mean. I was under the impression that you already had the raw (unprocessed) data available in the database. You can of course create a study with a variable that represents the aggregated mean (such variables also exist in the Crop Ontology). For this, POST /observations could be used.
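
A minimal sketch of what that could look like from R (the server URL, token and all DbIds are placeholders, and the field names are my reading of the BrAPI v2 Observation object; adapt to your server):

```r
# Hypothetical example of pushing an aggregated mean as an observation via
# BrAPI v2 POST /observations; endpoint, token and DbIds are made up.
library(httr)
library(jsonlite)

obs <- list(list(
  studyDbId               = "study-aggregated-means",
  germplasmDbId           = "germplasm-123",
  observationUnitDbId     = "obsunit-123",
  observationVariableDbId = "var-adjusted-entry-mean",
  value                   = "7.82",
  observationTimeStamp    = "2021-08-01T00:00:00Z"
))

resp <- POST("https://example.org/brapi/v2/observations",
             add_headers(Authorization = "Bearer <token>"),
             body = toJSON(obs, auto_unbox = TRUE),
             content_type_json())
status_code(resp)   # 200 expected on success
```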

My point would be not to store the means in cases where the original data is accessible.

@nickmorales Your idea sounds interesting. However, model objects in R usually get rather big because of replicated storage of data. The sommer::mmer() example given stores not only the fitted model but also the original data. I can give you plenty of examples from genomic selection and prediction using linear mixed models where you will not be able to store the result as an .Rds file due to the enormous size of the model output object.

Besides that, I agree with @ch728 that it will be rather difficult to find a standard that the community could agree upon.

mverouden commented 3 years ago

@lukasmueller @jeback1 What are your opinions on this? At the 2016 Hackathon at BTI, I distinctly remember the two of you being opposed to writing analysis results (such as means/BLUEs and BLUPs) into a database using BrAPI calls.