signaux-faibles / predictsignauxfaibles

Repository for the Python code that produces the Signaux Faibles prediction lists.
MIT License

refacto: preprocessing #37

Closed vviers closed 3 years ago

vviers commented 3 years ago
import pandas as pd

from predictsignauxfaibles.preprocessors import PIPELINE

my_data = ...  # a pandas DataFrame

# this could be a single preprocessing step or a thousand
for preprocessor in PIPELINE:
    # check the step's declared input columns are present, apply it,
    # then check its declared output columns were created
    assert set(preprocessor.input).issubset(my_data.columns)
    my_data = preprocessor.function(my_data)
    assert set(preprocessor.output).issubset(my_data.columns)

WDYT?
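For context, here is a minimal self-contained sketch of what such a pipeline could look like. The `Preprocessor` fields and the example step (`make_ratio`, the `debt`/`headcount` columns) are hypothetical illustrations, not the repo's actual definitions:

```python
from collections import namedtuple

import pandas as pd

# Assumed shape of a pipeline entry: a namedtuple carrying the step's
# function and its declared input/output columns.
Preprocessor = namedtuple("Preprocessor", ["name", "function", "input", "output"])

def make_ratio(df):
    """Create a `ratio` column from `debt` and `headcount`."""
    df = df.copy()
    df["ratio"] = df["debt"] / df["headcount"]
    return df

PIPELINE = [
    Preprocessor("make_ratio", make_ratio, input=["debt", "headcount"], output=["ratio"]),
]

my_data = pd.DataFrame({"debt": [100.0, 50.0], "headcount": [10, 5]})

# same loop as above: validate inputs, apply the step, validate outputs
for preprocessor in PIPELINE:
    assert set(preprocessor.input).issubset(my_data.columns)
    my_data = preprocessor.function(my_data)
    assert set(preprocessor.output).issubset(my_data.columns)
```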

vviers commented 3 years ago

TODO: add other preprocessing steps (ACOSS mostly)

vviers commented 3 years ago

How to add a preprocessing function

slebastard commented 3 years ago

I think we should include the list of columns created by each preprocessing function in its documentation. Otherwise, in future debugging and reviews, we will have to go back and forth between the Preprocessor namedtuples and the function definitions.

slebastard commented 3 years ago

In the future, we may also want to have several PIPELINE objects: there may be several if we end up creating multiple intermediate datasets for production. That's also the appeal of having a function run_pipeline(data: pd.DataFrame, pipeline: List[namedtuple]) designed this way.

I'm undecided on whether passing PIPELINE to tests/unit/preprocessors_test.py and running test_full_pipeline_sucess() against PIPELINE may be constraining in the future. WDYT?

vviers commented 3 years ago

I think we should include the list of columns created by each preprocessing function in their documentation.

Yes, ~I think the naming convention of the functions ({source}_make_{output_column_name}) is pretty explicit as long as a single function doesn't create more than one output column~ I realise this naming convention only exists in my head 😄 . But including this in the docstrings is very cheap and useful :)
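To illustrate the docstring idea, a sketch of a preprocessor following the {source}_make_{output_column_name} naming pattern, with its input and output columns listed in the docstring. The function name and column names here are hypothetical, not taken from the repo:

```python
import pandas as pd

def acoss_make_dette_par_effectif(data: pd.DataFrame) -> pd.DataFrame:
    """Compute debt per employee from ACOSS data.

    Input columns: montant_dette, effectif
    Output columns: dette_par_effectif
    """
    data = data.copy()
    data["dette_par_effectif"] = data["montant_dette"] / data["effectif"]
    return data
```

Listing the columns this way keeps the Preprocessor namedtuple and the function definition reviewable side by side, even when one function creates several columns.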

vviers commented 3 years ago

In the future, we may also want to have several PIPELINE objects: there may be several if we end up creating multiple intermediate datasets for production. That's also the appeal of having a function run_pipeline(data: pd.DataFrame, pipeline: List[namedtuple]) designed this way.

I agree! How about we store pipelines in a separate module, predictsignauxfaibles.pipelines (which imports the functions from preprocessors)? That would look like this:

from predictsignauxfaibles.pipelines import DEFAULT_PIPELINE, ANOTHER_PIPELINE, run_pipeline

data = run_pipeline(data, DEFAULT_PIPELINE)
data = run_pipeline(data, ANOTHER_PIPELINE)
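A run_pipeline helper along those lines might look like the sketch below. It assumes the Preprocessor namedtuple fields discussed above (function/input/output); the error messages and the toy make_double step are illustrative:

```python
from collections import namedtuple
from typing import List

import pandas as pd

Preprocessor = namedtuple("Preprocessor", ["name", "function", "input", "output"])

def run_pipeline(data: pd.DataFrame, pipeline: List[Preprocessor]) -> pd.DataFrame:
    """Apply each preprocessing step in order, validating columns around each step."""
    for step in pipeline:
        missing = set(step.input) - set(data.columns)
        if missing:
            raise ValueError(f"step {step.name!r} is missing input columns: {missing}")
        data = step.function(data)
        not_created = set(step.output) - set(data.columns)
        if not_created:
            raise ValueError(f"step {step.name!r} did not create columns: {not_created}")
    return data

# toy pipeline to exercise the helper
def make_double(df):
    df = df.copy()
    df["double"] = df["value"] * 2
    return df

DEFAULT_PIPELINE = [Preprocessor("make_double", make_double, ["value"], ["double"])]
data = run_pipeline(pd.DataFrame({"value": [1, 2]}), DEFAULT_PIPELINE)
```

Raising an explicit ValueError (rather than a bare assert) makes a broken pipeline fail with the name of the offending step, which helps when several PIPELINE objects coexist.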

I'm undecided on whether passing PIPELINE to tests/unit/preprocessors_test.py and running test_full_pipeline_sucess() against PIPELINE may be constraining in the future. WDYT?

Why do you think this would be constraining?

slebastard commented 3 years ago

I think we should include the list of columns created by each preprocessing function in their documentation.

Yes, ~I think the naming convention of the functions ({source}_make_{output_column_name}) is pretty explicit as long as a single function doesn't create more than one output column~ I realise this naming convention only exists in my head 😄 . But including this in the docstrings is very cheap and useful :)

For sure, this is trivial when a single field gets created. I'm thinking more of complex preprocessing that we may have in the future, where creating multiple new fields in a single function will still be the most canonical/pythonic way to proceed.

slebastard commented 3 years ago

My only argument is in terms of maintenance cost:

  • Say you have three standard pipelines. Do you have a test for all three?

yes :) I made that quite easy to do via the ALL_PIPELINES list and the use of parametrized tests, cf. https://github.com/signaux-faibles/predictsignauxfaibles/pull/37/commits/d864eb8517d8c593584042fa312b077e9da9d09d#diff-99d865133abd3a60cc0c9432e688a51e0d973463f9861d6c589b4e18d7219211R22

  • Do you modify the test every time that one of the functions in a pipeline changes behaviour, perhaps creating new columns or removing unnecessary ones?

If the behavior of your function changes, so should your test, yes (ideally, you should even write the test first 😉 https://en.wikipedia.org/wiki/Test-driven_development)

slebastard commented 3 years ago

Yeah, the current implementation of tests/unit/pipelines_test.py is really nice, especially with test parametrization. I think we're good to go!