solgenomics / sgn

The code behind the Sol Genomics Network, Cassavabase and other Breedbase websites
https://solgenomics.net
MIT License

Trial QC (aka workflow for bulk phenotype suppression) #3454

Open ch728 opened 3 years ago

ch728 commented 3 years ago

Before running the solGS pipeline, a mixed model analysis, or GWAS, a decision needs to be made about which trials and which plots should be used for a given trait. For example, before I run the solGS pipeline, I pull the data out of Breedbase and run a single-trial analysis for each trait-trial combination. I then remove trials where the genetic variance for a trait is not significantly different from zero, and I identify outlier plots using studentized residuals. Finally, I generate a list of plots that pass QC and use only these plots/trials for GS or other downstream analyses.
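As an illustration of the outlier step described above, here is a minimal sketch for one trait-trial combination. It is not the analysis actually used in this workflow: it assumes a simple model with genotype as the only fixed effect (so each fitted value is the genotype mean), uses internally studentized residuals, and applies a cut-off of 3.0. The function names `studentized_residuals` and `plots_passing_qc`, and the cut-off, are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def studentized_residuals(values, genotypes):
    """Internally studentized residuals for a one-way genotype-means model.

    Assumption: genotype is the only model term, so the fitted value of a
    plot is its genotype mean and its leverage is 1 / (plots per genotype).
    """
    groups = defaultdict(list)
    for value, genotype in zip(values, genotypes):
        groups[genotype].append(value)

    means = {g: sum(v) / len(v) for g, v in groups.items()}
    residuals = [v - means[g] for v, g in zip(values, genotypes)]

    dof = len(values) - len(groups)  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need replicated genotypes to estimate residual variance")
    sigma2 = sum(e * e for e in residuals) / dof

    result = []
    for e, g in zip(residuals, genotypes):
        scale = sigma2 * (1.0 - 1.0 / len(groups[g]))
        # Unreplicated genotypes (or zero residual variance) cannot be
        # assessed; report 0.0 so they are never flagged.
        result.append(e / sqrt(scale) if scale > 0 else 0.0)
    return result


def plots_passing_qc(plot_names, values, genotypes, threshold=3.0):
    """Return plot names whose |studentized residual| is within the cut-off."""
    residuals = studentized_residuals(values, genotypes)
    return [name for name, r in zip(plot_names, residuals) if abs(r) <= threshold]
```

Note that internally studentized residuals cannot exceed the square root of the residual degrees of freedom in absolute value, so a cut-off of 3.0 can only flag plots in trials with more than 9 residual degrees of freedom. A mixed model with replicate or block terms would need a proper modeling package instead of this shortcut.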

Breedbase already allows plots to be suppressed on the trial detail page. I am wondering if it would make sense to have a method for suppressing phenotypes in bulk via a dataset (plot names and traits). Taking that idea further, it might be useful to have a trial QC pipeline in Breedbase. The user would supply a dataset of trials and traits created in the Wizard, and the tool would then run in the background and generate a report, similar to the way the solGS pipeline works. The report would flag each trial as bad (red), warning (yellow), or good (green) based on basic QC thresholds. The user would have the option to view analysis details for each trial, including residual plots and outlier diagnostics for each trait. There would then need to be a mechanism for creating a dataset of plots to suppress for a specific trait. Finally, an option could be added to downstream tools that use phenotypic data, letting the user choose between the raw phenotypic data and data with suppressed plots omitted. This option already exists for phenotype downloads.
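To make the red/yellow/green idea concrete, here is a rough sketch of how a report row might be flagged. Nothing here reflects existing Breedbase code: the `TrialQC` record, its fields, the `flag_trial` function, and both thresholds are hypothetical placeholders that a real pipeline would need to define.

```python
from dataclasses import dataclass


@dataclass
class TrialQC:
    """Hypothetical QC summary for one trial-trait combination."""
    trial: str
    trait: str
    genotype_p_value: float   # p-value of the test for a genotype effect
    outlier_fraction: float   # share of plots flagged as outliers


def flag_trial(qc, p_max=0.05, outlier_warn=0.05):
    """Return 'red', 'yellow', or 'green' for one trial-trait summary.

    Assumed rules (illustrative only):
      red    - no significant genotype effect for the trait
      yellow - significant effect, but many plots flagged as outliers
      green  - significant effect and few outliers
    """
    if qc.genotype_p_value > p_max:
        return "red"
    if qc.outlier_fraction > outlier_warn:
        return "yellow"
    return "green"


def qc_report(summaries):
    """Map (trial, trait) to its flag for every summary in the dataset."""
    return {(s.trial, s.trait): flag_trial(s) for s in summaries}
```

Keeping the thresholds as parameters would let each breeding program tune what counts as a warning without changing the report logic.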

lukasmueller commented 3 years ago

This should be integrated in the dataset. In the dataset, you can actually go down to the plot level... maybe it would be great to "exclude" plots in the dataset...

ch728 commented 3 years ago

Yeah, having an "exclude suppressed plots" option in the dataset, and an option to set plots as suppressed in bulk, would be really useful, I think. Right now the user needs to point and click on the phenotypic heatmap under trial details to suppress individual plots. With both options, I could run the analysis and suppress plots across trials ahead of time. Users could then run downstream tools with suppressed plots excluded from the dataset and not have to worry about doing the outlier analysis themselves.
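A small sketch of what the two options could look like on the data side, assuming suppression is tracked per plot and trait. The record layout (dicts with `plot_name`, `trait`, `value`) and the function names are hypothetical and do not correspond to the Breedbase schema or API.

```python
def suppress_in_bulk(suppressed, plot_names, trait):
    """Return a new suppression set with every listed plot added for one trait."""
    return set(suppressed) | {(name, trait) for name in plot_names}


def exclude_suppressed(observations, suppressed):
    """Drop observations whose (plot_name, trait) pair has been suppressed.

    observations: iterable of dicts with 'plot_name', 'trait', and 'value'
    suppressed:   set of (plot_name, trait) tuples
    """
    return [
        obs for obs in observations
        if (obs["plot_name"], obs["trait"]) not in suppressed
    ]
```

Because suppression is keyed on both plot and trait, a plot that is an outlier for one trait still contributes its values for the other traits.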

isaak commented 3 years ago

In solGS, you can currently run a single-trial ANOVA and check whether genotype means differ significantly for a trait (by default, genotypes are fitted as fixed effects) before doing single-trial GS modeling. You can also check for outlier genotypes (based on genotype means), but you only see this after doing GS modeling with the trial.
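For readers unfamiliar with the test mentioned above, here is a minimal sketch of the F statistic from a one-way ANOVA with genotype as the only fixed effect. It is not the solGS implementation, which may include further design terms; the function name `genotype_f_statistic` is hypothetical.

```python
from collections import defaultdict


def genotype_f_statistic(values, genotypes):
    """One-way ANOVA F statistic for differences among genotype means.

    Returns (F, df_between, df_within). Assumes genotype is the only
    model term; replicate or block effects are ignored in this sketch.
    """
    groups = defaultdict(list)
    for value, genotype in zip(values, genotypes):
        groups[genotype].append(value)

    n, k = len(values), len(groups)
    df_between, df_within = k - 1, n - k
    if df_between < 1 or df_within < 1:
        raise ValueError("need at least two genotypes and some replication")

    grand_mean = sum(values) / n
    ss_between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
    )
    ss_within = sum(
        (x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v
    )
    if ss_within == 0:
        raise ValueError("zero within-genotype variance; F is undefined")

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

The p-value is the upper tail of the F distribution with `df_between` and `df_within` degrees of freedom, which a statistics library can supply.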

Visualizing phenotype data at the plot level, so that outlier observations/plots can be seen in the solGS pipeline, is in the works (issue #3393).

In the end, if you can filter out the outlier plots and create a list of plots from any number of trials, you can already use that list in solGS.

Perhaps we need to implement the outlier analysis as a stand-alone tool or as part of the wizard search, so it can be used in all the downstream analysis tools.