openproblems-bio / neurips2021_multimodal_viash

MIT License

To tsv or not to tsv #3

Open rcannood opened 3 years ago

rcannood commented 3 years ago

Continuation of #1 by @LuckyMD :

Another aspect, I haven't touched yet is whether we can have the metrics output a tsv file instead of an AnnData object? We could then concatenate all individual metric tsv results at the end to generate one output file? This would make it easier to have the modular metric addition that we now have. What do you think @rcannood ?

Regarding the anndata/tsv: I was wondering about this as well. Right now, the common/extract_scores component generates one tsv from one or more anndata files. The benefit of this is that it's possible to add more data to your anndata file and generate visualisations from it later. However, having multiple tsv files and concatenating them is a much simpler process. Thoughts?
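The concatenation step being discussed is indeed simple once every metric component agrees on the columns. A minimal sketch (file names and column names here are illustrative, not the repo's actual interface):

```python
import io

import pandas as pd

# Hypothetical per-metric TSV outputs; in the real pipeline these would
# be files written by each metric component.
tsv_a = "dataset_id\tmetric_id\tmetric_value\nmalt_10k\tpearson\t0.81\n"
tsv_b = "dataset_id\tmetric_id\tmetric_value\nmalt_10k\tspearman\t0.44\n"

# Concatenating is a one-liner when all components share the same columns.
scores = pd.concat(
    [pd.read_csv(io.StringIO(t), sep="\t") for t in (tsv_a, tsv_b)],
    ignore_index=True,
)
scores.to_csv("combined_scores.tsv", sep="\t", index=False)
```

This replaces the `common/extract_scores` step with plain table concatenation, at the cost of losing the extra structure an AnnData file can carry.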

LuckyMD commented 3 years ago

To paste my reply here as well:

We should have the AnnData outputs from the methods that we can generate visualizations from together with the solution, right? That should be a more lightweight way to go about this I reckon.

rcannood commented 3 years ago

What I mean is that metric components might also return additional output. For instance, the AUROC component for task 2 might also return additional data frames in order to be able to generate ROC plots.

LuckyMD commented 3 years ago

Hmm... I would probably have a standard eval pipeline which just outputs the values and doesn't save the additional components. Instead, could we not run those metrics again with different parameters to output the additional visualizations at a later point, when you actually want to plot them? What do you think?

rcannood commented 3 years ago

I wouldn't rerun metrics because some of them might be stochastic. For instance, a rank-based method like Spearman or AUROC might shuffle ties. Sure, you can set a seed and hope that all RNGs properly listen to it, but why not simply allow a metric component to return an h5ad?

Alternatively, it would also be possible to output a tsv and optionally output additional files if so desired.
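The tsv-plus-optional-files idea could look roughly like the sketch below. The function name, paths, and the `.uns` slot used for extra data are all assumptions for illustration, not the repo's actual interface:

```python
import pandas as pd


def write_metric_output(scores, tsv_path, h5ad_path=None):
    # Scalar scores always go into a small TSV.
    pd.DataFrame(
        {"metric_id": list(scores), "metric_value": list(scores.values())}
    ).to_csv(tsv_path, sep="\t", index=False)

    # Extra artefacts (e.g. the points of a ROC curve) are only written
    # when the caller asks for them, so the default pipeline stays lean.
    if h5ad_path is not None:
        import anndata as ad  # only needed for the optional output

        extra = ad.AnnData()
        extra.uns["scores"] = dict(scores)
        extra.write_h5ad(h5ad_path)


# Default run: TSV only, no h5ad.
write_metric_output({"pearson": 0.81, "spearman": 0.44}, "metric_scores.tsv")
```

This keeps the stochasticity concern intact: whatever the metric computed on its one run is persisted, and plotting data is opt-in rather than recomputed.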

LuckyMD commented 3 years ago

I would go with the optional part... an anndata object seems like overkill, no?

rcannood commented 3 years ago

I completely agree with you :)

However, there are quite a few metrics by now, and I would rather focus on getting everything working instead of refactoring the metric components.

Can we stick with the AnnData format for now (including @mumichae's contributions) until somebody (possibly me) has time to refactor all the metrics at once?

rcannood commented 3 years ago

I've added it to the agenda for tomorrow's meeting, we can discuss it there ;)

LuckyMD commented 3 years ago

Are you then expecting a separate anndata object per metric? Having a lot of metric components will lead to that in the current format.

rcannood commented 3 years ago

I'm currently expecting one anndata object per component, but a component can implement multiple metrics.

Having a lot of output files is okay, since they're just a few kB each.

Just to clarify, it's okay for me to switch to tsv. I left the TSV code as comments in Michaela's components.

LuckyMD commented 3 years ago

Just to clarify, it's okay for me to switch to tsv.

Yeah, I got that :). I just wanted to clarify the way forward before making the switch. I think we will keep 1 metric per component for now to keep it more nicely in line with open problems viash.

Do you really want to discuss the metric output format in the meeting today? It might be more of a question of announcing it, no? I don't imagine there will be many opinions here. Basically, when we have time -> move to tsv. It doesn't really matter much, as long as something is output that we can work with.

rcannood commented 3 years ago

I think we will keep 1 metric per component for now to keep it more nicely in line with open problems viash.

What about multiple metrics if they are conceptually similar? E.g. pearson and spearman correlation, or the AUROC and AUPR metrics.
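The pairing being proposed could be as small as one component computing both correlations in a single pass (assuming scipy is the underlying implementation, which the thread doesn't specify; the data here is synthetic):

```python
import numpy as np
from scipy import stats

# Synthetic ground truth and prediction, just to illustrate the pairing.
rng = np.random.default_rng(0)
truth = rng.normal(size=100)
pred = truth + rng.normal(scale=0.5, size=100)

# One component, two conceptually similar metrics: both correlations
# reuse the same inputs, so bundling them avoids duplicated I/O.
scores = {
    "pearson": stats.pearsonr(truth, pred)[0],
    "spearman": stats.spearmanr(truth, pred)[0],
}
```

The same argument applies to AUROC/AUPR, which share one set of predicted scores and labels.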

Do you really want to discuss metric output format in the meeting today?

Just to make sure, isn't it tomorrow?

I also think that moving to tsv when we have the time will probably be OK for everyone, but somebody might have additional input.

For instance, somebody might disagree with the naming conventions that I chose. I've also been considering adding a 'stderr' or 'message' field to the output in case of an error. This could be added to the same AnnData / tsv format, but if the output will contain a blob of text, it feels more natural to use JSON instead. Example:

[
  {
    "dataset_id": "totalvi_10x_malt_10k",
    "method_id": "babel",
    "metric_id": "pearson",
    "metric_value": 0,
    "metric_higherisbetter": true,
    "stderr": "Error: computation of pearson score failed for this and that reason."
  },
  {
    "dataset_id": "totalvi_10x_malt_10k",
    "method_id": "babel",
    "metric_id": "spearman",
    "metric_value": 0.44,
    "metric_higherisbetter": true,
    "stderr": ""
  }
]
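A records list like the one above flattens to a table easily; the free-text concern is that a tab or newline inside the `stderr` blob would corrupt a TSV row, so it has to be escaped first. A sketch (the escaping rule is an assumption, not something the thread settled on):

```python
import json

import pandas as pd

records = json.loads("""[
  {"dataset_id": "totalvi_10x_malt_10k", "method_id": "babel",
   "metric_id": "pearson", "metric_value": 0,
   "metric_higherisbetter": true,
   "stderr": "Error: computation of pearson score failed for this and that reason."},
  {"dataset_id": "totalvi_10x_malt_10k", "method_id": "babel",
   "metric_id": "spearman", "metric_value": 0.44,
   "metric_higherisbetter": true, "stderr": ""}
]""")

df = pd.DataFrame.from_records(records)
# Tabs or newlines inside the stderr blob would break a TSV row,
# so replace them with spaces before writing.
df["stderr"] = df["stderr"].str.replace(r"[\t\n]", " ", regex=True)
df.to_csv("scores_with_stderr.tsv", sep="\t", index=False)
```

JSON sidesteps the escaping problem entirely, which is presumably why it feels more natural for the error-message field.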

LuckyMD commented 3 years ago

Just to make sure, isn't it tomorrow?

Haha, woop... yes ;).

What about multiple metrics if they are conceptually similar? E.g. pearson and spearman correlation, or the AUROC and AUPR metrics.

I think it's fine to do this elsewhere. I'd just like to keep it this way for task 3 to ensure that easy open problems compatibility remains.