openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

Refactoring #125

Closed - michalk8 closed this issue 1 day ago

michalk8 commented 3 years ago

Hi @scottgigante , I've briefly spoken with @LuckyMD about some refactoring ideas for the code that might make adding/modifying stuff, and ultimately the testing, easier.

In general, there are 2 areas that could be improved:

  1. the abstraction (moving away from all the decorators to a more OOP approach) - this will better constrain the creation of tasks/methods and remove a bunch of code duplication (e.g. https://github.com/singlecellopenproblems/SingleCellOpenProblems/blob/master/openproblems/tasks/label_projection/methods/logistic_regression.py#L47)
  2. separating the metadata from the code: e.g. tasks/methods/etc. could just be defined in .yaml files, which makes them language agnostic - this makes the framework less dependent on Python (some methods might be in R/other languages); see the sketch after this list

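For point 2, a rough sketch of what such a metadata file could look like, loaded with PyYAML (the field names and values below are purely illustrative, not a proposed schema):

import yaml  # PyYAML

# Hypothetical method metadata; the fields are illustrative only.
METHOD_YAML = """
name: logistic_regression_log_cpm
task: label_projection
language: python
entry_point: openproblems.tasks.label_projection.methods.logistic_regression
preprocessing: log_cpm
code_url: https://github.com/scikit-learn/scikit-learn
"""

metadata = yaml.safe_load(METHOD_YAML)
print(metadata["name"], metadata["preprocessing"])
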
I can create a UML diagram of what I imagine the abstraction to look like and then post it here and further refine it so that everything is included.

dburkhardt commented 3 years ago

I like this idea in general. I don't see any issue with replacing the method decorators with .yaml config files, and that will make the methods more generalizable. We did want to have first-class Python support because eventually the goal is to have this platform drive innovation in ML, where Python is the dominant language.

I'm wondering if you could explain more what you mean by the first part. Can you give an example of what you're proposing and how it would save duplication?

michalk8 commented 3 years ago

E.g. if we imagine we have some abstract class Method with 1 abstract method _run_algo, 2 hook methods _preprocess and _postprocess, and a public run method that ties them together, one could do:

from abc import ABC, abstractmethod
from anndata import AnnData

class Method(ABC):
    def run(self, adata: AnnData) -> None:
        self._preprocess(adata)
        self._run_algo(adata)  # should be in try: ... except: ...
        self._postprocess(adata)

    @abstractmethod
    def _run_algo(self, adata: AnnData) -> None:
        ...

    def _preprocess(self, adata: AnnData) -> None:
        ...  # no-op by default

    def _postprocess(self, adata: AnnData) -> None:
        ...  # no-op by default

class LogReg(Method):
    def _run_algo(self, adata: AnnData) -> None:
        ...  # the implementation as in the linked file

class LogRegScran(LogReg):
    def _preprocess(self, adata: AnnData) -> None:
        ...  # run log_scran

class LogRegCPM(LogReg):
    def _preprocess(self, adata: AnnData) -> None:
        ...  # run log_cpm

It could also inherit the metadata and modify the relevant parts (in the linked example, all the fields are the same except for the method name). However, this is a very simplistic implementation (a better abstraction would be a method factory) - e.g. here the preprocessing step is tied to the method at "compile" time, but it need not be the case, esp. if one wants to evaluate different preprocessing for 1 method [in that case we'd inject some "Preprocessor" class at runtime]. Same for postprocessing - as far as my current understanding of the project goes, it could really be task specific and mostly validation/scoring.

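For illustration, a minimal sketch of the runtime-injection idea, reusing the Method base class sketched above (all other names here are hypothetical):

from typing import Callable
from anndata import AnnData

Preprocessor = Callable[[AnnData], None]  # any callable that modifies adata in place

def log_cpm(adata: AnnData) -> None:
    ...  # e.g. CPM normalization + log1p

def log_scran(adata: AnnData) -> None:
    ...  # e.g. scran pooling normalization + log1p

class LogRegComposed(Method):
    def __init__(self, preprocessor: Preprocessor):
        self._preprocessor = preprocessor

    def _preprocess(self, adata: AnnData) -> None:
        self._preprocessor(adata)  # preprocessing chosen at runtime, not at "compile" time

    def _run_algo(self, adata: AnnData) -> None:
        ...  # fit/predict logistic regression

# the same method can then be evaluated with different preprocessing:
# LogRegComposed(log_cpm).run(adata) or LogRegComposed(log_scran).run(adata)
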
All in all, I think with OOP you can better constrain everything so that mistakes are less likely to happen - if the _logistic_regression fn were to throw an error, how would I easily catch it so that I can propagate useful info to the user and continue with the evaluation? If I were to add a try/except here: https://github.com/singlecellopenproblems/SingleCellOpenProblems/blob/4baa50563e6c9a27ad6df661ec8259b934dd7daf/openproblems/tools/decorators.py#L45, it would also catch errors in the data normalization etc., so error handling would most likely end up inside _logistic_regression, if it is done at all (meaning the method developer has to worry about it, and if there are many methods, all of them have to/should follow this pattern => the same code is duplicated because the abstraction is wrong).

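For illustration, a minimal sketch of how the base class from above could centralize this error handling so individual methods never have to (the MethodError name is hypothetical):

import logging
from abc import ABC
from anndata import AnnData

logger = logging.getLogger(__name__)

class MethodError(RuntimeError):
    """Raised when a method fails; evaluation of the remaining methods can continue."""

class Method(ABC):  # same base class as above, now with error handling in run()
    def run(self, adata: AnnData) -> None:
        self._preprocess(adata)
        try:
            self._run_algo(adata)
        except Exception as exc:
            # propagate useful info to the user without touching any method code
            logger.exception("Method %s failed", type(self).__name__)
            raise MethodError(f"{type(self).__name__} failed: {exc}") from exc
        self._postprocess(adata)

    ...  # _run_algo/_preprocess/_postprocess as before
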
I think the best way to go about this is to gather as many tasks as possible, because they dictate what is required as input and what is produced as output; methods/datasets just need to follow those constraints - the more we know about them, the easier it will be to find (dis)similarities and abstract them away.

LuckyMD commented 3 years ago

Hey @michalk8, I sent you an email to your protonmail account. Hope that's the correct one.

michalk8 commented 3 years ago

> Hey @michalk8, I sent you an email to your protonmail account. Hope that's the correct one.

I've received the email, thanks @LuckyMD !

michalk8 commented 3 years ago

Sorry for being a bit late with this - still working on the diagrams, but tomorrow I will definitely post some alpha version so that we can start discussing.

michalk8 commented 3 years ago

I've drawn up a pre-alpha class diagram of the general abstraction, including one for the tests (with which I'm currently most dissatisfied). I've tried to keep the abstraction as simple as possible, though I already see some design flaws that I would change.

Note that many attributes/methods/method arguments are still missing, as well as a ConstraintParser which should take in some config file (e.g. .yaml) and create the Constraint objects. Also, the more I got into this, the more I think some form of a DB could/should be used (sqlite would do just fine) - this would also help with some sort of "versioning" of methods/data and with keeping logs long-term (to see the evolution of the computed metrics).

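For illustration, a minimal sketch of such a results log (the schema below is purely hypothetical):

import sqlite3

conn = sqlite3.connect("openproblems_results.sqlite")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS results (
        run_at         TEXT,  -- timestamp of the evaluation run
        task           TEXT,
        dataset        TEXT,
        method         TEXT,
        method_version TEXT,  -- allows tracking metric evolution per method version
        metric         TEXT,
        value          REAL
    )
    """
)
conn.commit()
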
Lastly, it would be super cool if we could really formalize the requirements as per 1. and 2. in https://en.wikipedia.org/wiki/Requirements_engineering#Activities - I realized this way too late in the process of creating this.

[Image: pre_alpha_diagram (pre-alpha class diagram)]

LuckyMD commented 3 years ago

Wow... this is elaborate. A lot to take in at first glance. I like the separation of the evaluation runs and the task definition. However, if I see correctly, the task and method classes are separate, right? I thought we were considering the method class (and the metric class, for that matter) as a subclass of task? That way it might be easier to define a task as having at least 1 method defined and 1 metric.

scottgigante commented 3 years ago

Two main themes from me.

  1. Agree on the way tests are currently run in containers being unsatisfying, though since it's internal (doesn't really affect method/task contributors) I see it as a somewhat lower priority
  2. The low-level details are a little beyond my ability to evaluate the structure, but I think the relationship between tasks, methods, datasets and metrics is the biggest question here. If each is a separate class, how do we define the relationship between a task and its functions?

michalk8 commented 3 years ago

> I thought we were considering the method class (and metric for that matter) as a subclass of task?
>
> If each is a separate class, how do we define the relationship between a task and its functions?

That's the other way to do it, though esp. for metrics, they can often be reused in multiple tasks, so they'd either have to subclass a bunch of task classes (this could explode in the future), or it would be better to just use composition (i.e. a task class has 1 method and 1 metric, and we just create instances with different methods/metrics).

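For illustration, a tiny sketch of the composition variant (class names here are hypothetical; Method is the base class sketched earlier, Metric is assumed to be some callable scoring class):

from dataclasses import dataclass

@dataclass
class TaskInstance:
    method: "Method"   # the Method base class sketched earlier
    metric: "Metric"   # a hypothetical callable Metric class

    def evaluate(self, adata) -> float:
        self.method.run(adata)
        return self.metric(adata)

# the same metric instance can be reused across many task instances without subclassing
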
The reason why I've separated the task, and why methods/metrics don't subclass it (or aren't part of it), is that I don't much like the above choices (the 1st, multiple inheritance, bears no fruit in this case; the 2nd fosters some form of repetition). In principle, I see a task as just metadata to categorize the methods into human-friendly categories, i.e. a task T = { m | m \in AllMethods and reqs(m) \subset reqs(T) and provs(T) \subset provs(m) }, so it can be purely defined by its requirements/provisions, just as the methods are. Or it can be seen this way: datasets/metrics are nodes in a graph and methods are the edges connecting them - a task in this case is just some classification of the edges, where each edge can belong to multiple categories.

What it would mean for the end user is that there would be no need to specify the task for which the method should be applied (although it should be possible to have an optional whitelist/blacklist) - the task(s) could be automatically determined from the reqs/provs of the method (either explicitly specified in some config file or extracted from the source code). The other way around, the user could just whitelist/blacklist specific tasks without specifying any reqs/provs (which is easier than spelling them out in a config file) and we could automatically determine whether the method satisfies the reqs/provs of the selected task(s).

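For illustration, a minimal sketch of the automatic matching described above (the reqs/provs strings are hypothetical placeholders, e.g. keys expected in adata):

def belongs_to_task(method_reqs: set, method_provs: set,
                    task_reqs: set, task_provs: set) -> bool:
    # a method belongs to a task if the method requires no more than the task
    # provides as input, and the method produces everything the task expects
    return method_reqs <= task_reqs and task_provs <= method_provs

# hypothetical reqs/provs
task_reqs = {"obs:labels", "obs:is_train"}
task_provs = {"obs:labels_pred"}
method_reqs = {"obs:labels", "obs:is_train"}
method_provs = {"obs:labels_pred", "obsm:probabilities"}

print(belongs_to_task(method_reqs, method_provs, task_reqs, task_provs))  # True
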
LuckyMD commented 3 years ago

I have the feeling that the organization you suggest makes it more difficult for a new user to submit a method, as the output of the method would then potentially have to be task agnostic. This is not how we have been thinking about the task and method interplay so far (and I don't think task-agnostic methods are generally possible). Currently, a task is defined by a particular output of a method. For example, a multi-modal data integration method could be used to infer a modality that was not used to integrate the data, or it could be used to label cells based on the integration. Although the core parts of the method would be the same, the outputs are different and would therefore belong to two separate tasks. Generating these outputs would have to be part of the method code, which would then make the method not task-agnostic anymore. I have the feeling that you are thinking about a task the way I would think about multiple metrics for the same task: if the method output is exactly the same, then different aspects of that output would be evaluated by different metrics, all within the framework of the same task.

Secondly, I also find it much easier to understand a task not as a set of edges between datasets and metrics, but as a base unit of open problems. This might be quite important as we want to grow the community as well, and new contributors would ideally be able to understand the framework easily.

As to your point about metrics being re-used... this is indeed something that a separate metrics class would benefit from. Typically, though, the metrics are quite small pieces of code that we could also re-use via util functions shared by the metric functions of different task class instances, no?

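For illustration, a minimal sketch of the shared-util idea (the module path and names are hypothetical):

# hypothetical shared helper, e.g. in openproblems/tools/metric_utils.py
import numpy as np

def accuracy(y_true, y_pred) -> float:
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# each task then wraps the shared helper in its own metric function
def label_projection_accuracy(adata) -> float:
    return accuracy(adata.obs["labels"], adata.obs["labels_pred"])
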
scottgigante commented 3 years ago

I agree with Malte that this might make it more difficult to contribute a new task. In theory there can be methods / metrics / datasets that can be reused across tasks, but since each task's API is different they almost certainly will need to be adapted, so the automatic matching with provisions and requirements, while elegant, might end up being overkill.

michalk8 commented 3 years ago

Thanks for the feedback from both of you. You're right that it's overkill / not so easy to understand in its current form. I will try to refactor/simplify the diagram in the next couple of days.

github-actions[bot] commented 1 day ago

This issue has been automatically closed because it has not had recent activity.