rcannood opened 3 years ago
To paste my reply here as well:
We should have the AnnData outputs from the methods that we can generate visualizations from together with the solution, right? That should be a more lightweight way to go about this I reckon.
What I mean is that metric components might also return additional output. For instance, the AUROC component for task 2 might also return additional data frames in order to generate ROC plots.
Hmm... I would probably have a standard eval pipeline which just outputs the values and doesn't save the additional components. Instead, could we not rerun those metrics with different parameters later, when we actually want to plot the additional visualizations? What do you think?
I wouldn't rerun metrics because some of them might be stochastic. For instance, a rank-based method like Spearman or AUROC might shuffle ties. Sure, you can set a seed and hope that all RNGs properly listen to it, but why not simply allow a metric component to return an h5ad?
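To illustrate the tie problem with a small hypothetical sketch: a rank-based metric that breaks ties randomly can give different values across runs unless the seed is pinned (the data and the jitter-based tie breaking below are my own toy example, not any actual metric component):

```python
import numpy as np
from scipy.stats import spearmanr

# Toy data (hypothetical): predictions with many tied values.
y_true = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
y_pred = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])

def spearman_random_ties(y_true, y_pred, rng):
    # Mimic a metric that breaks ties randomly: add a tiny jitter
    # before ranking, so tied entries end up in a random order.
    jitter = rng.uniform(-1e-9, 1e-9, size=y_pred.shape)
    return spearmanr(y_true, y_pred + jitter).correlation

r1 = spearman_random_ties(y_true, y_pred, np.random.default_rng(0))
r2 = spearman_random_ties(y_true, y_pred, np.random.default_rng(1))
# Without pinning the RNG state, r1 and r2 need not agree run to run.
```

Which is the argument for persisting whatever the metric computed, rather than recomputing it later.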
Alternatively, a component could output a tsv and, optionally, additional files if so desired.
I would go with the optional part... an anndata object seems like overkill, no?
I completely agree with you :)
However, there are quite a few metrics by now, and I would rather focus on getting everything working instead of refactoring the metric components.
Can we stick with the AnnData format for now (including @mumichae's contributions) until somebody (possibly me) has time to refactor all the metrics at once?
I've added it to the agenda for tomorrow's meeting, we can discuss it there ;)
Are you then expecting a separate anndata object per metric? With the current format, having a lot of metric components will lead to exactly that.
I'm currently expecting one anndata object per component, but a component can implement multiple metrics.
It's okay that there are a lot of output files, since they're just a few kB each.
Just to clarify, it's okay for me to switch to tsv. I left the TSV code as comments in Michaela's components.
> Just to clarify, it's okay for me to switch to tsv.
Yeah, I got that :). Just wanted to clarify the way forward before making the switch. I think we will keep 1 metric per component for now to keep it more in line with open problems viash.
Do you really want to discuss the metric output format in the meeting today? It might be more a matter of announcing it, no? I don't imagine there will be many opinions here. Basically: when we have time, move to tsv. It doesn't really matter, as long as something is output that we can work with.
> I think we will keep 1 metric per component for now to keep it more nicely in line with open problems viash.
What about multiple metrics if they are conceptually similar? E.g. pearson and spearman correlation, or the AUROC and AUPR metrics.
> Do you really want to discuss metric output format in the meeting today?
Just to make sure, isn't it tomorrow?
I also think that moving to tsv when we have the time will probably be OK for everyone, but somebody might have additional input.
For instance, somebody might disagree with the naming conventions that I chose. I've also been considering adding a 'stderr' or 'message' field to the output in case of an error. This could be added to the same AnnData / tsv format, but if the output will contain a blob of text, it feels more natural to use JSON instead. Example:
```json
[
  {
    "dataset_id": "totalvi_10x_malt_10k",
    "method_id": "babel",
    "metric_id": "pearson",
    "metric_value": 0,
    "metric_higherisbetter": true,
    "stderr": "Error: computation of pearson score failed for this and that reason."
  },
  {
    "dataset_id": "totalvi_10x_malt_10k",
    "method_id": "babel",
    "metric_id": "spearman",
    "metric_value": 0.44,
    "metric_higherisbetter": true,
    "stderr": ""
  }
]
```
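Records like these could also be flattened to one TSV row per metric; a minimal stdlib-only sketch (field names taken from the example above, everything else hypothetical):

```python
import csv
import io

# Hypothetical records matching the example schema above.
records = [
    {"dataset_id": "totalvi_10x_malt_10k", "method_id": "babel",
     "metric_id": "pearson", "metric_value": 0,
     "metric_higherisbetter": True,
     "stderr": "Error: computation of pearson score failed for this and that reason."},
    {"dataset_id": "totalvi_10x_malt_10k", "method_id": "babel",
     "metric_id": "spearman", "metric_value": 0.44,
     "metric_higherisbetter": True, "stderr": ""},
]

# Write a header row plus one tab-separated row per metric.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()), delimiter="\t")
writer.writeheader()
writer.writerows(records)
tsv_text = buf.getvalue()
```

The free-text stderr field is exactly the blob that sits awkwardly in a TSV cell, which is what makes JSON feel more natural here.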
> Just to make sure, isn't it tomorrow?
Haha, woop... yes ;).
> What about multiple metrics if they are conceptually similar? E.g. pearson and spearman correlation, or the AUROC and AUPR metrics.
I think it's fine to do this elsewhere. I'd just like to keep it this way for task 3 to ensure easy open problems compatibility.
Continuation of #1 by @LuckyMD:
Regarding the anndata/tsv: I was wondering about this as well. Right now, the common/extract_scores component generates one tsv from one or more anndata files. The benefit of this is that it's possible to add more data to your anndata file and generate visualisations from it later. However, having multiple tsv files and concatenating them is a much simpler process. Thoughts?
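For the "multiple tsv files and concatenating them" option, a minimal sketch (the file names and the pandas-based approach are my assumption, not the actual common/extract_scores implementation):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical setup: each metric component wrote its own small TSV.
tmp = Path(tempfile.mkdtemp())
(tmp / "pearson.tsv").write_text("metric_id\tmetric_value\npearson\t0.91\n")
(tmp / "auroc.tsv").write_text("metric_id\tmetric_value\nauroc\t0.75\n")

# Concatenating the per-component files into one scores table is a one-liner.
paths = sorted(tmp.glob("*.tsv"))
scores = pd.concat((pd.read_csv(p, sep="\t") for p in paths), ignore_index=True)
scores.to_csv(tmp / "scores.tsv", sep="\t", index=False)
```

This avoids a dedicated extraction step entirely, at the cost of losing the extra plotting data an anndata file could carry.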