signaux-faibles / predictsignauxfaibles

Repository for the Python code that produces the Signaux Faibles prediction lists.
MIT License

refacto: preprocessing #37

Closed vviers closed 3 years ago

vviers commented 3 years ago
import pandas as pd

from predictsignauxfaibles.preprocessors import PIPELINE

my_data = ...  # a pandas DataFrame

# this could be a single preprocessing step or a thousand
for preprocessor in PIPELINE:
    # check the step's declared input columns are present, apply it,
    # then check its declared output columns were created
    assert set(preprocessor.input).issubset(my_data.columns)
    my_data = preprocessor.function(my_data)
    assert set(preprocessor.output).issubset(my_data.columns)

WDYT?
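For context, here is a minimal self-contained sketch of what such a pipeline could look like. The `Preprocessor` fields and the example step (`make_ratio`, the `debt`/`headcount` columns) are hypothetical illustrations, not the repo's actual definitions:

```python
from collections import namedtuple

import pandas as pd

# Assumed shape of a pipeline entry: a namedtuple carrying the step's
# function and its declared input/output columns.
Preprocessor = namedtuple("Preprocessor", ["name", "function", "input", "output"])

def make_ratio(df):
    """Create a `ratio` column from `debt` and `headcount`."""
    df = df.copy()
    df["ratio"] = df["debt"] / df["headcount"]
    return df

PIPELINE = [
    Preprocessor("make_ratio", make_ratio, input=["debt", "headcount"], output=["ratio"]),
]

my_data = pd.DataFrame({"debt": [100.0, 50.0], "headcount": [10, 5]})

# same loop as above: validate inputs, apply the step, validate outputs
for preprocessor in PIPELINE:
    assert set(preprocessor.input).issubset(my_data.columns)
    my_data = preprocessor.function(my_data)
    assert set(preprocessor.output).issubset(my_data.columns)
```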

vviers commented 3 years ago

TODO: add other preprocessing steps (ACOSS mostly)

vviers commented 3 years ago

How to add a preprocessing function

slebastard commented 3 years ago

I think we should include the list of columns created by each preprocessing function in its documentation. Otherwise, in future debugging and reviews, we will have to go back and forth between the Preprocessor namedtuples and the function definitions.

slebastard commented 3 years ago

In the future, we may also want to have several PIPELINE objects: there may be several if we end up creating multiple intermediate datasets for production. That's also the appeal of having a function run_pipeline(data: pd.DataFrame, pipeline: List[namedtuple]) designed this way.

I'm undecided on whether passing PIPELINE to tests/unit/preprocessors_test.py and running test_full_pipeline_sucess() against PIPELINE may be constraining in the future. WDYT?

vviers commented 3 years ago

I think we should include the list of columns created by each preprocessing function in their documentation.

Yes, ~I think the naming convention of the functions ({source}_make_{output_column_name}) is pretty explicit as long as a single function doesn't create more than one output column~ I realise this naming convention only exists in my head 😄 . But including this in the docstrings is very cheap and useful :)
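To illustrate the docstring idea, a sketch of a preprocessor following the {source}_make_{output_column_name} naming pattern, with its input and output columns listed in the docstring. The function name and column names here are hypothetical, not taken from the repo:

```python
import pandas as pd

def acoss_make_dette_par_effectif(data: pd.DataFrame) -> pd.DataFrame:
    """Compute debt per employee from ACOSS data.

    Input columns: montant_dette, effectif
    Output columns: dette_par_effectif
    """
    data = data.copy()
    data["dette_par_effectif"] = data["montant_dette"] / data["effectif"]
    return data
```

Listing the columns this way keeps the Preprocessor namedtuple and the function definition reviewable side by side, even when one function creates several columns.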

vviers commented 3 years ago

In the future, we may also want to have several PIPELINE objects: there may be several if we end up creating multiple intermediate datasets for production. That's also the appeal of having a function run_pipeline(data: pd.DataFrame, pipeline: List[namedtuple]) designed this way.

I agree! How about we store pipelines in a separate module, predictsignauxfaibles.pipelines (which imports the functions from preprocessors)? That would look like this:

from predictsignauxfaibles.pipelines import DEFAULT_PIPELINE, ANOTHER_PIPELINE, run_pipeline

data = run_pipeline(data, DEFAULT_PIPELINE)
data = run_pipeline(data, ANOTHER_PIPELINE)
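A run_pipeline helper along those lines might look like the sketch below. It assumes the Preprocessor namedtuple fields discussed above (function/input/output); the error messages and the toy make_double step are illustrative:

```python
from collections import namedtuple
from typing import List

import pandas as pd

Preprocessor = namedtuple("Preprocessor", ["name", "function", "input", "output"])

def run_pipeline(data: pd.DataFrame, pipeline: List[Preprocessor]) -> pd.DataFrame:
    """Apply each preprocessing step in order, validating columns around each step."""
    for step in pipeline:
        missing = set(step.input) - set(data.columns)
        if missing:
            raise ValueError(f"step {step.name!r} is missing input columns: {missing}")
        data = step.function(data)
        not_created = set(step.output) - set(data.columns)
        if not_created:
            raise ValueError(f"step {step.name!r} did not create columns: {not_created}")
    return data

# toy pipeline to exercise the helper
def make_double(df):
    df = df.copy()
    df["double"] = df["value"] * 2
    return df

DEFAULT_PIPELINE = [Preprocessor("make_double", make_double, ["value"], ["double"])]
data = run_pipeline(pd.DataFrame({"value": [1, 2]}), DEFAULT_PIPELINE)
```

Raising an explicit ValueError (rather than a bare assert) makes a broken pipeline fail with the name of the offending step, which helps when several PIPELINE objects coexist.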

I'm undecided on whether passing PIPELINE to tests/unit/preprocessors_test.py and running test_full_pipeline_sucess() against PIPELINE may be constraining in the future. WDYT?

Why do you think this would be constraining?

slebastard commented 3 years ago

I think we should include the list of columns created by each preprocessing function in their documentation.

Yes, ~I think the naming convention of the functions ({source}_make_{output_column_name}) is pretty explicit as long as a single function doesn't create more than one output column~ I realise this naming convention only exists in my head 😄 . But including this in the docstrings is very cheap and useful :)

For sure, this is trivial when a single field gets created. I'm thinking more of complex preprocessing that we may have in the future, where creating multiple new fields in a single function will still be the most canonical/pythonic way to proceed.

slebastard commented 3 years ago

My only argument is in terms of maintenance cost:

  • Say you have three standard pipelines. Do you have a test for all three?

yes :) I made that quite easy to do via the ALL_PIPELINES list and the use of parametrized tests, cf. https://github.com/signaux-faibles/predictsignauxfaibles/pull/37/commits/d864eb8517d8c593584042fa312b077e9da9d09d#diff-99d865133abd3a60cc0c9432e688a51e0d973463f9861d6c589b4e18d7219211R22

  • Do you modify the test every time that one of the functions in a pipeline changes behaviour, perhaps creating new columns or removing unnecessary ones?

If the behavior of your function changes, so should your test, yes (ideally, you should even write the test first 😉 https://en.wikipedia.org/wiki/Test-driven_development)

slebastard commented 3 years ago

Yeah, the current implementation of tests/unit/pipelines_test.py is really nice, especially with test parametrization. I think we're good to go!