openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

datasets metadata #855

Open KaiWaldrant opened 1 year ago

KaiWaldrant commented 1 year ago

The task-specific datasets have metadata (link) that is not included in the data loaders (link). In v2, we currently use an additional YAML file (link) to add metadata by default. Is it okay if we add the following mandatory fields to the util.loader decorator: dataset_name, dataset_organism, dataset_summary?

The dataset_id can be derived from the function name, while dataset_description can be derived from the function docstring.
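
A rough sketch of what this could look like, to make the proposal concrete. This is not the current util.loader implementation: the decorator signature, the `metadata` attribute, and the example loader `example_dataset` with its placeholder values are all assumptions made for illustration.

```python
import functools
import inspect


def loader(dataset_name, dataset_organism, dataset_summary):
    """Hypothetical loader decorator with three mandatory metadata fields.

    Omitting any of the three arguments raises a TypeError when the
    decorator is applied, which is what makes the fields mandatory.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)

        # Assumption: metadata is stored as a plain dict on the wrapped function.
        wrapper.metadata = {
            # Derived from the function name.
            "dataset_id": func.__name__,
            "dataset_name": dataset_name,
            "dataset_organism": dataset_organism,
            "dataset_summary": dataset_summary,
            # Derived from the (cleaned) function docstring.
            "dataset_description": inspect.getdoc(func) or "",
        }
        return wrapper

    return decorator


# Placeholder loader; the name and field values are illustrative only.
@loader(
    dataset_name="Example dataset",
    dataset_organism="danio_rerio",
    dataset_summary="Placeholder one-line summary.",
)
def example_dataset(test=False):
    """Placeholder long-form description taken from the docstring."""
    return {"test": test}
```

With this shape, `example_dataset.metadata` would carry all five fields without a separate YAML file.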

If you agree, I can make a PR for this.

scottgigante-immunai commented 1 year ago

The name and summary can be task-specific. See e.g. zebrafish_random vs. zebrafish_labs in label projection.

Organism is generic but currently not required everywhere; however, since it's always going to be available, I'm okay with moving it to the loader and making it mandatory there.
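
One possible way to express that split, as a sketch only: organism is mandatory on the generic loader, while name and summary are supplied per task dataset. The decorator names `loader` and `task_dataset`, the `metadata` attribute, the function `load_zebrafish`, and all field values here are hypothetical placeholders, not the existing openproblems API.

```python
import functools


def loader(dataset_organism):
    """Hypothetical generic loader decorator: organism is mandatory here."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)

        wrapper.metadata = {"dataset_organism": dataset_organism}
        return wrapper

    return decorator


def task_dataset(load_func, dataset_name, dataset_summary):
    """Hypothetical task-level decorator: name and summary are task-specific."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)

        # Inherit generic fields from the loader, then add task-specific ones.
        wrapper.metadata = {
            **load_func.metadata,
            "dataset_id": func.__name__,
            "dataset_name": dataset_name,
            "dataset_summary": dataset_summary,
        }
        return wrapper

    return decorator


@loader(dataset_organism="danio_rerio")
def load_zebrafish(test=False):
    # Placeholder for the real data loading logic.
    return {"test": test}


# Two task datasets built on the same loader, with different names and summaries.
@task_dataset(load_zebrafish, "Zebrafish (random split)", "Placeholder summary A.")
def zebrafish_random(test=False):
    return load_zebrafish(test=test)


@task_dataset(load_zebrafish, "Zebrafish (by lab)", "Placeholder summary B.")
def zebrafish_labs(test=False):
    return load_zebrafish(test=test)
```

Both task datasets then share the organism from the loader while keeping their own name and summary.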