plantbreeding / BrAPI

Repository for version control of the BrAPI specifications
https://brapi.org
MIT License

Managing aggregated Phenotypic Data #467

Open NPZInno opened 3 years ago

NPZInno commented 3 years ago

BrAPI manages phenotypic data well at the level of single observations (e.g. single measurements on one plot). In the breeding process, aggregated phenotypic data is generated by means of a statistical analysis, most often an ANOVA (analysis of variance). The most prominent example is the so-called “adjusted entry mean”, which corresponds to one aggregated phenotypic value per germplasm (and trait) in a given trial.

Could you please help us to manage those aggregated phenotypes in BrAPI? Furthermore, there are other aggregations around:

As most of those aggregated phenotypes are generated externally using advanced statistical methods, we need to store the values, and it would also be nice to store metadata on the analysis (method, parameters, etc.).

Last but not least, one could apply different methods and/or parameter sets to calculate aggregated phenotypes for the same germplasm/trait combination. Therefore, there is a need to allow for multiple analyses.
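To make the request concrete, here is a minimal Python sketch of one way such records could be modelled: each aggregated value carries its own analysis metadata, and several analyses may coexist for the same trial/germplasm/trait combination. All class and field names (`Analysis`, `AggregatedPhenotype`, `AggregatedPhenotypeStore`, and their attributes) are hypothetical illustrations and are not part of the BrAPI specification.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """Metadata on one statistical analysis (hypothetical structure)."""
    analysis_id: str
    method: str                                   # e.g. "ANOVA"
    parameters: dict = field(default_factory=dict)


@dataclass
class AggregatedPhenotype:
    """One aggregated value per trial, germplasm and trait, tied to an analysis."""
    trial_id: str
    germplasm_id: str
    trait_id: str
    value: float
    analysis: Analysis


class AggregatedPhenotypeStore:
    """Keeps every analysis result for a given trial/germplasm/trait key."""

    def __init__(self):
        self._by_key = {}

    def add(self, record):
        key = (record.trial_id, record.germplasm_id, record.trait_id)
        # Append rather than overwrite, so multiple analyses can coexist.
        self._by_key.setdefault(key, []).append(record)

    def lookup(self, trial_id, germplasm_id, trait_id):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._by_key.get((trial_id, germplasm_id, trait_id), []))
```

The store appends instead of overwriting, which is the property asked for above: two different methods or parameter sets applied to the same germplasm/trait combination yield two retrievable records.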

Thanks a lot!

mverouden commented 3 years ago

As mentioned in #468, I do not think it is wise to store derived results in the database. Insights into methods also change over time, which would require a means to update the stored derived data.

Generally, the "advanced" statistical methods are not so advanced that they cannot easily be recalculated. Biometris has a very strong background in statistical genetics for plant breeding, and in the last few years we have released packages on CRAN in this field (statgenSTA for single-site analysis, statgenGxE for multi-environment analysis, statgenGWAS for genome-wide association studies, statgenHTP for high-throughput phenotyping data analysis, SpATS for spatial analysis of field trials, etc.).

cpommier commented 2 years ago

BrAPI can be used to exchange both measured data and computed/derived data, and both make sense. The risk would be to publish and maintain access only to derived data. That would be a mistake, since derived data has been computed for a specific scientific question. In other words, one measured dataset will generate many derived datasets.

But it is also important to be able to share and publish derived datasets and to reuse them in meta-analyses or genetic analyses such as GWAS. In that case the brapi.observationunit.level is not a physical object (plot, plant, block, etc.) but a virtual reference (whole study, genotype/germplasm). This is illustrated in the following dataset: https://doi.org/10.15454/IASSTN , which can be further explored here: https://urgi.versailles.inra.fr/ephesis/ephesis/viewer.do#dataResults/trialSetIds=42 . In the latter database (GnpIS), click on the Phenotypic data tab and you will see three levels: plant, plot and trial (which maps to brapi.study). BrAPI endpoint: https://urgi.versailles.inrae.fr/faidare/brapi/v1/trials/aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg%3D%3D and swagger: https://urgi.versailles.inrae.fr/faidare/swagger-ui.html#/Breeding%20API/getTrialUsingGET
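As a rough illustration of a derived value attached to a virtual level, the Python sketch below collapses plot-level observations into one record per germplasm and trait and labels the result with the level "study". The dictionary keys (`germplasm`, `trait`, `value`, `level`, `study`, `n`) and the function name are assumptions made for this example, not BrAPI field names, and a plain arithmetic mean stands in for a model-based adjusted mean.

```python
from collections import defaultdict
from statistics import mean


def aggregate_to_study_level(plot_observations, study_id):
    """Collapse plot-level observations into one derived record per
    germplasm and trait, attached to the virtual level "study"."""
    grouped = defaultdict(list)
    for obs in plot_observations:
        grouped[(obs["germplasm"], obs["trait"])].append(obs["value"])

    derived = []
    for (germplasm, trait), values in sorted(grouped.items()):
        derived.append({
            "study": study_id,
            "level": "study",        # virtual reference, not a plot or plant
            "germplasm": germplasm,
            "trait": trait,
            "value": mean(values),   # simple mean, NOT an adjusted entry mean
            "n": len(values),        # number of plot observations used
        })
    return derived
```

The measured plot records remain untouched; the derived records are a separate list, so several derived datasets can be produced from the same measured dataset.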

The most important issue here is probably provenance. We must store how the data has been computed: BLUEs, ANOVA, from which studies and levels, with which parameters, etc. I would therefore propose to set up a working group on traceability and provenance at the study and observationUnit level.
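A provenance record along these lines could be sketched as follows in Python. The function name, the keys (`method`, `sourceStudyIds`, `sourceLevels`, `parameters`, `fingerprint`) and the idea of a SHA-256 fingerprint over the canonical JSON are all assumptions for illustration; nothing here is defined by BrAPI.

```python
import hashlib
import json


def provenance_record(method, source_study_ids, source_levels, parameters):
    """Describe how a derived dataset was computed (hypothetical structure)."""
    record = {
        "method": method,                           # e.g. "BLUEs" or "ANOVA"
        "sourceStudyIds": sorted(source_study_ids),
        "sourceLevels": sorted(source_levels),      # e.g. ["plot"]
        "parameters": dict(parameters),
    }
    # Canonical JSON (sorted keys, fixed separators) so that the same
    # analysis description always yields the same fingerprint.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record
```

Two derived datasets computed from the same studies, with the same method and parameters, would then share a fingerprint, while any change in method or parameters produces a different one.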