jimmyjbling / VirtualDrugBuffet


GroupBy Curation #12

Open jimmyjbling opened 1 month ago

jimmyjbling commented 1 month ago

Sometimes we might want to apply different curation workflows to different subsets of the dataset, based on the value of a label or non-label column (e.g. protein target class).

Right now you'd need to run a different set of curation steps for each group manually. I think there should be a way to handle this with a GroupBy curation workflow that leverages a new parameter in the DataIO step for defining a "group" column. Both the Loader and Dataset classes would need adjusting to allow it.

Then you can specify a CurationWorkflow with "groupby=True", passing a dict that maps each group to its list of steps. Like

{"GPCR": [CurateValidate(), CurateInorganic()], "Ion Channel": [CurateValidate(), CurateNeutralize(), CurateInorganic()]}
jimmyjbling commented 1 month ago

This should probably be a new class rather than an option on the existing CurationWorkflow class.
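
A rough sketch of what a separate class could look like, just to make the idea concrete. Everything here is an assumption for illustration: the name `GroupByCurationWorkflow`, its `run` method, the `group_column` argument, and the toy step functions are hypothetical and not part of the current codebase. Rows are plain dicts rather than the real Dataset class, and groups with no configured steps are passed through unchanged, which is one possible choice among several.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List

Record = dict
Step = Callable[[List[Record]], List[Record]]


class GroupByCurationWorkflow:
    """Hypothetical sketch: run a different list of steps per group value."""

    def __init__(self, steps_by_group: Dict[Hashable, List[Step]],
                 group_column: str = "group"):
        self.steps_by_group = steps_by_group
        self.group_column = group_column

    def run(self, records: List[Record]) -> List[Record]:
        # Bucket rows by the value found in the group column.
        buckets = defaultdict(list)
        for row in records:
            buckets[row[self.group_column]].append(row)

        curated = []
        for group, rows in buckets.items():
            # Assumption: a group with no entry in the dict gets no steps.
            for step in self.steps_by_group.get(group, []):
                rows = step(rows)
            curated.extend(rows)
        # Note: output is ordered by group, not by original row order.
        return curated


# Toy stand-ins for real curation steps (hypothetical, for illustration only).
def drop_missing_smiles(rows):
    return [r for r in rows if r.get("smiles")]


def strip_smiles(rows):
    return [{**r, "smiles": r["smiles"].strip()} for r in rows]


workflow = GroupByCurationWorkflow(
    {
        "GPCR": [drop_missing_smiles],
        "Ion Channel": [drop_missing_smiles, strip_smiles],
    },
    group_column="target_class",
)

data = [
    {"smiles": "CCO", "target_class": "GPCR"},
    {"smiles": "", "target_class": "GPCR"},
    {"smiles": " CCN ", "target_class": "Ion Channel"},
    {"smiles": " c1ccccc1 ", "target_class": "Kinase"},
]
result = workflow.run(data)
```

Keeping this separate from CurationWorkflow means the existing class keeps taking a flat list of steps, and the grouped version only has to handle the split, dispatch and recombine logic.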