Open ChristianLieven opened 7 years ago
I think we just need to get started with some penalty function so we can have something concrete to discuss.
For tests that output a number of reactions, metabolites etc. we could use the fraction as the preliminary score.
For instance: The fraction of blocked reactions (in rich medium) should be low. The fraction of reactions without GPR should be low. The fraction of metabolites without mass and charge should be low. The fraction of metabolites with annotation should be high. etc.
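This is not memote's actual implementation, just a minimal sketch with cobrapy of how such fractions could be turned into 0–1 sub-scores (the file name is an assumption, and `find_blocked_reactions` uses whatever medium the model currently defines, so a rich medium would have to be set up first; the inversion of the "low is good" fractions is likewise just one possible choice):

```python
from cobra.io import read_sbml_model
from cobra.flux_analysis import find_blocked_reactions

# "e_coli_core.xml" is just an assumed local file name for the E. coli core model.
model = read_sbml_model("e_coli_core.xml")

n_rxns = len(model.reactions)
n_mets = len(model.metabolites)

# Fraction of blocked reactions (in the currently set medium) -- lower is better.
frac_blocked = len(find_blocked_reactions(model)) / n_rxns

# Fraction of reactions without a GPR -- lower is better.
frac_no_gpr = sum(1 for r in model.reactions if not r.gene_reaction_rule) / n_rxns

# Fraction of metabolites missing formula or charge -- lower is better.
frac_no_mass_charge = sum(
    1 for m in model.metabolites if not m.formula or m.charge is None
) / n_mets

# Fraction of annotated metabolites -- higher is better.
frac_annotated = sum(1 for m in model.metabolites if m.annotation) / n_mets

# Normalise everything to "higher is better" sub-scores between 0 and 1.
sub_scores = {
    "unblocked_reactions": 1.0 - frac_blocked,
    "reactions_with_gpr": 1.0 - frac_no_gpr,
    "metabolites_with_mass_and_charge": 1.0 - frac_no_mass_charge,
    "metabolites_with_annotation": frac_annotated,
}
```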
I think we should have an ideal, perfect model as a reference toy to begin with, probably with a full score of 100. Then we play with changes to reduce the score based on the results from the memote tests. Probably the E. coli core model would be a good one? What do you think?
I would say that soft tests do not really fail, but rather give a warning. Obviously you would want people to adhere to a global standard (and BiGG's standard). Some models I used from BiGG weren't actually annotated correctly to BiGG's own standards. I guess these types of tests either pass or fail (1 or 0); only two outcomes are possible.
I agree that hard tests should have some sort of gradient. The example @ChristianLieven gave should be useful (dividing the outcome by the total number of reactions available), providing a score between 0 and 1 (inverted depending on the question). But how would you specify the thresholds after which a test passes? Or should you even define those? Maybe by adding all the outcome values and calculating the percentage of accuracy using the total number of points that could be obtained within the current model.
But how would you specify the thresholds after which a test passes? Or should you even define those?
Yeah! I was also thinking of this too much in terms of unit testing. Really, what I think people want is just a continuous score. So I wouldn't define any thresholds at all. In fact, I think for now:
[...] adding all the outcome values and calculating the percentage of accuracy using the total number of points that could be obtained within the current model.
is the way to go indeed!
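A tiny sketch of that aggregation, assuming each test reports both the points it achieved and the points obtainable for the current model (the data structure and test names are made up for illustration):

```python
def overall_score(results):
    """Percentage of points achieved out of all points obtainable.

    `results` maps a test name to (points_achieved, points_obtainable);
    this structure is only an assumption for illustration.
    """
    achieved = sum(a for a, _ in results.values())
    obtainable = sum(o for _, o in results.values())
    return 100.0 * achieved / obtainable if obtainable else 0.0


print(overall_score({"gpr_presence": (80, 95), "annotation": (40, 100)}))  # ~61.5
```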
I might also like a "known failure" category (cf. examples in pytest and https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.testing.decorators.knownfailureif.html). They serve a different purpose than discovering accidental breakage: they let you document mis(sing) features, which is useful in summaries, for remembering to update docs when fixed, etc.
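The linked numpy decorator corresponds roughly to pytest's xfail marker; a hedged sketch of how a known failure could be recorded (the "missing feature" here is invented purely for illustration):

```python
import pytest


def known_missing_feature():
    # Placeholder for a feature that is documented as not implemented yet.
    raise NotImplementedError("thermodynamic annotations are not curated yet")


@pytest.mark.xfail(raises=NotImplementedError,
                   reason="documented missing feature, see discussion above")
def test_thermodynamic_annotations():
    # Reported as XFAIL in the summary instead of a hard failure; if it ever
    # starts passing, pytest flags it as XPASS, a reminder to update the docs.
    known_missing_feature()
```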
I think this is a relevant issue, because the final assessment of a model (and its comparison with others) will be measured and summarized based on it. Currently, the long list of independent tests passed or failed does not clarify the quality of the model.
Apart from a global weighted score, I would also provide a weighted score for each category (basic, biomass, consistency, annotation, syntax). This way, a user could evaluate a model depending on their interest or their objective in using the model, and compare models in different aspects (maybe one model is better in one category and another model in a different category).
Besides, the independent biomass tests should have a global or average score collecting the results for all biomass functions. I mean, one score for 'test_biomass_consistency', another score for 'test_biomass_precursors_default_production', etc. The same goes for all tests that are repeated for different sources, such as 'test_detect_energy_generating_cycles'.
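A rough sketch of what per-category scores with averaged repeated sub-results could look like; the category and test names follow the comment above, and all numbers are placeholders:

```python
from statistics import mean

# Each test yields a score in [0, 1]; repeated tests (e.g. one result per
# biomass function) are averaged into a single entry first.
biomass_consistency = mean([1.0, 0.8, 0.9])  # one value per biomass reaction

results = {
    "basic": {"gpr_presence": 0.85},
    "biomass": {"test_biomass_consistency": biomass_consistency,
                "test_biomass_precursors_default_production": 0.75},
    "consistency": {"mass_balance": 1.0},
    "annotation": {"metabolite_annotation": 0.4},
    "syntax": {"bigg_compliant_ids": 0.6},
}

# Per-category score: plain average of the tests in that category.
category_scores = {cat: mean(tests.values()) for cat, tests in results.items()}
print(category_scores)
```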
I like the concept of scoring, but fear that scores that are too general may lead to meaningless comparison between reconstructions, or misguided curation efforts (e.g. curating to achieve a high score, rather than curating to make the reconstruction as predictive/representative as possible for the intended purpose).
To illustrate the point, let's consider some of the "easy to score" tests that @ChristianLieven brought up (sorry to pick on them, I realize these were just quick examples):
The fraction of blocked reactions (in rich medium) should be low.
While unblocking these reactions might improve performance, and removing them may reduce the size of the reconstruction, penalizing their inclusion may harm the use of the reconstruction as a knowledgebase. For example, if a particular understudied organism has a pathway that is blocked because it contains a novel, unidentified reaction that connects it to the rest of the network, the reconstruction's score might be penalized for including this blocked pathway, even though identification/characterization of the blocked portions may represent important biological knowledge. If the goal is to improve such a reconstruction in an iterative fashion, I don't think that having such a metric contribute to the overall score for an organism will encourage that.
The fraction of reactions without GPR should be low.
Similar to the logic above, this might discourage authors from including reactions for which there is substantial experimental evidence, yet no known gene.
I think the best compromise is the suggestion that @beatrizgj made, i.e. there should be category-specific scores. I favor leaving an overall score out entirely, although I realize that makes it more difficult to communicate the overall quality of the reconstruction. Maybe presenting only the results of the 'Hard' tests as the overall score might work better (e.g. if there's a mass balance issue, I can't think of any way that penalizing the overall score would hurt the science).
I also particularly like the idea of known failures @jonovik . Framing some of the test results like that could guide/prioritize future curation efforts in a way that I think is more constructive than reporting a continuous score.
I get your points @gregmedlock. Favouring misguided curation efforts is a very real possibility when providing a score. Summarising the thoughts so far: (A) There are tested aspects where the score is not easily quantifiable because these aspects depend on the preferred use or underlying biology of a metabolic reconstruction ('soft tests'). Opposed to that, there is (B) a set of tests that can be quantified quite well as they purely depend on modelling paradigms (let's call them 'hard tests'), and perhaps, similarly opposed to that, are (C) the tests that run against provided experimental data, which are context-dependent 'hard tests'.
I would also like to point out that with memote's two fundamental workflows we're looking at separate problems:
For the Snapshot report, i.e. the benchmark for editors/reviewers, I think a single score on an immutable set of tests is essential to enable fast decision-making. So, the way I see it, we are really faced with several problems here:
After some discussion with @gregmedlock and Jason today, we've collected some thoughts on the presence of an "overall" score:
Overall Score: A score-centric report may further aggravate an undifferentiated interpretation of the results in question by putting a specific score into a user's head from the start. In response to that, I've opened issue #526.
We want to come up with a reasonable weighting of the categories and of individual tests within the categories. We are already quite sure that we can draw a broad distinction between soft and hard tests.
Soft = 'Syntax' and 'Annotation'. If a model scores badly here, its predictive capabilities could still be fine; it would only be rather difficult to share the model or for outsiders to use it in a different setup.
Hard = 'Basic', 'Consistency' and 'Biomass' [and 'Experimental']. A bad score in these categories often means a model may not be biologically meaningful or operational, and it may not be possible to rely on its predictions.
Let's discuss the details of a possible weighting scheme in here.
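As one possible starting point for that discussion, here is a hedged sketch in which the hard categories count three times as much as the soft ones; the weights and the category list are placeholders, not an agreed-upon scheme:

```python
# Placeholder weights: hard categories ('basic', 'consistency', 'biomass',
# 'experimental') count three times as much as soft ones ('syntax', 'annotation').
WEIGHTS = {
    "basic": 3.0,
    "consistency": 3.0,
    "biomass": 3.0,
    "experimental": 3.0,
    "syntax": 1.0,
    "annotation": 1.0,
}


def weighted_overall(category_scores):
    """Combine per-category scores in [0, 1] into one weighted score in [0, 1]."""
    present = {c: s for c, s in category_scores.items() if c in WEIGHTS}
    total_weight = sum(WEIGHTS[c] for c in present)
    return sum(WEIGHTS[c] * s for c, s in present.items()) / total_weight


# The 'experimental' category is simply skipped when no data were supplied.
print(weighted_overall({"basic": 0.9, "consistency": 1.0, "biomass": 0.8,
                        "syntax": 0.6, "annotation": 0.4}))  # ~0.83
```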