Provide a way to experiment on Tournesol algorithms without an entire backend environment

amatissart commented 1 year ago

Current situation

Mehestan implementation can be found in the folder backend/ml, which is a Django app and is part of the backend deployment.

Although "ml/inputs.py" provides an abstract class MlInput to define how to access the input data required by the algorithms, running Mehestan is still coupled to the Django apps, as the implementation relies on the Django models and the PostgreSQL database (e.g. to save scores and scaling parameters).

This makes iterating on the algorithm implementation quite cumbersome. For example, in #1332, I used a local VM and a custom script to schedule several runs with different parameters and export plots.

We'd like to provide a more flexible interface that lets developers and researchers replicate this kind of experiment. It would no longer depend on the Django context and would not require a local database to be running. Ideally, everything could run from Jupyter Notebooks, with a small set of Python dependencies.

Of course, the existing backend should rely on the same interface, to avoid code duplication as much as possible.

Possible architecture


"""
Abstract classes, to implement in various contexts
"""

from abc import ABC

import pandas as pd


class MlInput(ABC):
    def get_comparisons(self, criteria, user_id) -> pd.DataFrame:
        ...

    def ratings_properties(self) -> pd.DataFrame:
        ...

class MlOutput(ABC):
    def save_contributor_scalings(...):
        ...

    def get_contributor_scalings(...) -> pd.DataFrame:
        ...

    def save_contributor_scores(...):
        ...

    def get_contributor_scores(...) -> pd.DataFrame:
        ...

    def save_entity_scores(...):
        ...

    def get_entity_scores(...) -> pd.DataFrame:
        ...

"""
Interface used to run Tournesol algorithms and access results
"""

class TournesolResult:
    """
    Provides helpers to access well-formatted datasets based on Tournesol input and output
    """
    ml_input: MlInput
    ml_output: MlOutput

    def get_tournesol_scores_distribution(self):
        ...

class TournesolRun:
    ml_input: MlInput
    ml_output: MlOutput
    ml_parameters: MehestanParameters

    def __init__(self, input, output, parameters):
        ...

    def execute(self) -> TournesolResult:
        ...
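
For illustration, here is a minimal sketch of what database-free implementations of these abstract classes might look like. It is not taken from the Tournesol codebase: the method signatures, the `MlInputInMemory` class, and the row layout are assumptions, and plain tuples and dicts stand in for the `pd.DataFrame` return types so that the example has no dependencies.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple

# Plain tuples stand in for DataFrame rows to keep this dependency-free.
# Assumed row layout: (user_id, criteria, entity_a, entity_b, score)
Comparison = Tuple[int, str, str, str, float]


class MlInput(ABC):
    @abstractmethod
    def get_comparisons(
        self, criteria: Optional[str] = None, user_id: Optional[int] = None
    ) -> List[Comparison]:
        ...


class MlOutput(ABC):
    # Reduced to entity scores for brevity; signatures are assumptions.
    @abstractmethod
    def save_entity_scores(self, criteria: str, scores: Dict[str, float]) -> None:
        ...

    @abstractmethod
    def get_entity_scores(self, criteria: str) -> Dict[str, float]:
        ...


class MlInputInMemory(MlInput):
    """Hypothetical input backed by a list, e.g. parsed from an exported dataset."""

    def __init__(self, comparisons: List[Comparison]) -> None:
        self._comparisons = list(comparisons)

    def get_comparisons(self, criteria=None, user_id=None):
        # None means "no filter" on that field (an assumption of this sketch).
        return [
            row
            for row in self._comparisons
            if (criteria is None or row[1] == criteria)
            and (user_id is None or row[0] == user_id)
        ]


class MlOutputInMemory(MlOutput):
    """Keeps scores in a dict instead of writing them to PostgreSQL."""

    def __init__(self) -> None:
        self._entity_scores: Dict[str, Dict[str, float]] = {}

    def save_entity_scores(self, criteria, scores):
        # Later saves overwrite earlier scores for the same entity.
        self._entity_scores.setdefault(criteria, {}).update(scores)

    def get_entity_scores(self, criteria):
        # Return a copy so callers cannot mutate the stored results.
        return dict(self._entity_scores.get(criteria, {}))


ml_input = MlInputInMemory([
    (1, "reliability", "video_a", "video_b", 4.0),
    (2, "reliability", "video_a", "video_c", -2.0),
    (1, "importance", "video_b", "video_c", 1.0),
])
ml_output = MlOutputInMemory()
ml_output.save_entity_scores("reliability", {"video_a": 0.5})
print(ml_input.get_comparisons(criteria="reliability", user_id=1))
print(ml_output.get_entity_scores("reliability"))
```

With implementations along these lines, the algorithm code would only need the two abstract interfaces, and the Django-backed classes could live in the backend alone.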

Usage

result = TournesolRun(
    input=MlInputFromPublicDataset(),
    output=MlOutputInMemory(),
    parameters=MehestanParameters(alpha=0.5)
).execute()
result.get_tournesol_scores_distribution().plot()

Or, in the Django context:

poll = Poll.objects.get(name="videos")
TournesolRun(
    input=MlInputFromDb(poll),
    output=MlOutputInDb(poll)
).execute()
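
Once a run no longer depends on Django, scheduling several runs with different parameters (as was done by hand for #1332) could become a short loop. The sketch below is only an illustration under assumptions: `run_sweep` and `placeholder_run` are hypothetical names, `MehestanParameters` is reduced to the single `alpha` field shown above, and the placeholder is not Mehestan. In practice, the callable passed to `run_sweep` would wrap a `TournesolRun(...).execute()` call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MehestanParameters:
    # Only the parameter that appears in the usage example; others are omitted.
    alpha: float = 0.5


def run_sweep(
    alphas: List[float],
    run_once: Callable[[MehestanParameters], Dict[str, float]],
) -> Dict[float, Dict[str, float]]:
    """Run the algorithm once per alpha and collect the scores of each run."""
    return {alpha: run_once(MehestanParameters(alpha=alpha)) for alpha in alphas}


def placeholder_run(parameters: MehestanParameters) -> Dict[str, float]:
    # NOT Mehestan: a stand-in that shrinks fixed raw scores by 1 / (1 + alpha),
    # only so that the sweep has something to compare between runs.
    raw_scores = {"video_a": 2.0, "video_b": -1.0}
    return {
        entity: score / (1.0 + parameters.alpha)
        for entity, score in raw_scores.items()
    }


results = run_sweep([0.0, 1.0, 3.0], placeholder_run)
for alpha, scores in results.items():
    print(alpha, scores)
```

Each run would use its own in-memory output, so results from different parameter values never overwrite each other and can be compared or plotted side by side.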

amatissart commented 1 year ago

@glerzing Any opinion on this? I know you spent some time experimenting on the algorithms. Do you have specific use cases that we should take into account when designing this? And by any chance, is that something you would be interested in implementing in the coming weeks or months?

glerzing commented 1 year ago

I'm not sure what to propose, and I need to save time for other occupations, so I will not implement this.

glerzing commented 1 year ago

oops, accidentally clicked on "close with comment"

amatissart commented 1 year ago

> I'm not sure what to propose, and I need to save time for other occupations, so I will not implement this.

OK, no problem. We will talk with the team about how to prioritize this.

glerzing commented 1 year ago

Have we considered making a Mehestan or Tournesol Python library that can be installed with pip? This looks like quite a lot of work, so I'm really not sure it is a good idea, but it would make things easy to use in a notebook.
