Closed vviers closed 3 years ago
TODO: add other preprocessing steps (ACOSS mostly)
PIPELINE list by wrapping it in a Preprocessor object
test/fake_data/dataframes.py
test_full_pipeline_sucess so that it includes your function's tests (this allows us to check that preprocessors do not interact with each other in a weird way)

I think we should include the list of columns created by each preprocessing function in their documentation.
Otherwise, in future debugging and reviews, we will have to go back and forth between Preprocessor namedtuples and function definitions.
In the future, we may also want to have several PIPELINE objects for production (there may be several if we end up creating multiple intermediate datasets). That's also the appeal of having the function run_pipeline(data: pd.DataFrame, pipeline: List[namedtuple]) designed this way.
I'm undecided on whether passing PIPELINE to tests/unit/preprocessors_test.py and running test_full_pipeline_sucess() against PIPELINE may be constraining in the future. WDYT?
> I think we should include the list of columns created by each preprocessing function in their documentation.

Yes, ~I think the naming convention of the functions ({source}_make_{output_column_name}) is pretty explicit as long as a single function doesn't create more than one output column~ I realise this naming convention only exists in my head 😄. But including this in the docstrings is very cheap and useful :)

> In the future, we may also want to have several PIPELINE objects for production (there may be several if we end up creating multiple intermediate datasets). That's also the appeal of having the function run_pipeline(data: pd.DataFrame, pipeline: List[namedtuple]) designed this way.
I agree! How about we store pipelines in a separate module, predictsignauxfaibles.pipelines (which would import the functions from preprocessors)? That would look like this:
from predictsignauxfaibles.pipelines import DEFAULT_PIPELINE, ANOTHER_PIPELINE, run_pipeline
data = run_pipeline(data, DEFAULT_PIPELINE)
data = run_pipeline(data, ANOTHER_PIPELINE)
> I'm undecided on whether passing PIPELINE to tests/unit/preprocessors_test.py and running test_full_pipeline_sucess() against PIPELINE may be constraining in the future. WDYT?

Why do you think this would be constraining?
> > I think we should include the list of columns created by each preprocessing function in their documentation.
>
> Yes, ~I think the naming convention of the functions ({source}_make_{output_column_name}) is pretty explicit as long as a single function doesn't create more than one output column~ I realise this naming convention only exists in my head 😄. But including this in the docstrings is very cheap and useful :)

For sure, this is trivial when a single field gets created. I'm thinking more of complex preprocessing that we may have in the future, where creating multiple new fields in a single function will still be the most canonical/Pythonic way to proceed.
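To make the multi-column case concrete, here is a minimal sketch of a preprocessing function that creates two columns and lists them in its docstring. The function name, the column names, and the docstring layout are all hypothetical and chosen only for illustration; they are not taken from the repository.

```python
import pandas as pd


def demo_make_debt_ratios(data: pd.DataFrame) -> pd.DataFrame:
    """Compute two debt ratios (hypothetical example, not from the repo).

    Input columns:
        - debt
        - equity
        - assets

    Output columns:
        - debt_to_equity
        - debt_to_assets
    """
    data = data.copy()  # leave the caller's dataframe untouched
    data["debt_to_equity"] = data["debt"] / data["equity"]
    data["debt_to_assets"] = data["debt"] / data["assets"]
    return data


# Made-up data, only to show which columns the function adds.
df = pd.DataFrame(
    {"debt": [10.0, 30.0], "equity": [5.0, 10.0], "assets": [20.0, 60.0]}
)
out = demo_make_debt_ratios(df)
created = sorted(set(out.columns) - set(df.columns))
```

With the outputs listed in the docstring, a reviewer can compare `created` against the documentation without opening the pipeline definition.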
My only argument is in terms of maintenance cost:
> - Say you have three standard pipelines. Do you have a test for all three?

Yes :) I made that quite easy to do via the ALL_PIPELINES list and the use of parametrized tests, cf: https://github.com/signaux-faibles/predictsignauxfaibles/pull/37/commits/d864eb8517d8c593584042fa312b077e9da9d09d#diff-99d865133abd3a60cc0c9432e688a51e0d973463f9861d6c589b4e18d7219211R22

> - Do you modify the test every time that one of the functions in a pipeline changes behaviour, perhaps creating new columns or removing unnecessary ones?

If the behavior of your function changes, so should your test, yes (ideally, you should even write the test first 😉 https://en.wikipedia.org/wiki/Test-driven_development)
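For readers who have not opened the linked commit, a parametrized test over a list of pipelines could look roughly like the sketch below. It is not the test from the commit: only the name ALL_PIPELINES comes from this thread, while the preprocessing functions, the fake data helper, the simplified `run_pipeline`, and the test name are assumptions made for illustration.

```python
from collections import namedtuple

import pandas as pd
import pytest

Preprocessor = namedtuple("Preprocessor", ["function", "input", "output"])


def demo_make_total(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step creating `total`."""
    data = data.copy()
    data["total"] = data["debit"] + data["credit"]
    return data


def demo_make_is_large(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step creating `is_large` from `total`."""
    data = data.copy()
    data["is_large"] = data["total"] > 5
    return data


DEFAULT_PIPELINE = [Preprocessor(demo_make_total, ["debit", "credit"], ["total"])]
EXTENDED_PIPELINE = DEFAULT_PIPELINE + [
    Preprocessor(demo_make_is_large, ["total"], ["is_large"])
]
ALL_PIPELINES = [DEFAULT_PIPELINE, EXTENDED_PIPELINE]


def run_pipeline(data: pd.DataFrame, pipeline) -> pd.DataFrame:
    """Simplified stand-in: apply each step's function in order."""
    for step in pipeline:
        data = step.function(data)
    return data


def make_fake_data() -> pd.DataFrame:
    """Made-up input covering the columns every pipeline needs."""
    return pd.DataFrame({"debit": [1.0, 2.0], "credit": [3.0, 4.0]})


@pytest.mark.parametrize("pipeline", ALL_PIPELINES)
def test_pipeline_creates_declared_outputs(pipeline):
    result = run_pipeline(make_fake_data(), pipeline)
    for step in pipeline:
        for column in step.output or []:
            assert column in result.columns
```

Adding a new pipeline to ALL_PIPELINES is then enough for pytest to run the same check against it.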
Yeah, the current implementation of tests/unit/pipelines_test.py is really nice, especially with test parametrization.
I think we're good to go!
- preprocessors module
- Preprocessor (namedtuple) that stores 3 attributes:
  - function that only takes a dataframe as an input and returns a dataframe
  - input which is a list of needed input columns in order to run the function successfully
  - output which is either None or a list of newly created output columns
- PIPELINE object that stores a list of Preprocessors, the idea is to use it something like this:

WDYT?
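The design described above can be sketched as follows. This is an illustration under assumptions rather than the code from the pull request: the field names mirror the three attributes listed above and the `run_pipeline` signature follows the one quoted in this thread, but the example step `demo_make_total`, its columns, and the input/output checks inside `run_pipeline` are hypothetical additions.

```python
from collections import namedtuple
from typing import List

import pandas as pd

# Three attributes: a DataFrame -> DataFrame function, the input columns it
# needs, and the output columns it creates (or None).
Preprocessor = namedtuple("Preprocessor", ["function", "input", "output"])


def demo_make_total(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: creates `total` from `debit` and `credit`."""
    data = data.copy()
    data["total"] = data["debit"] + data["credit"]
    return data


def run_pipeline(data: pd.DataFrame, pipeline: List[Preprocessor]) -> pd.DataFrame:
    """Apply each preprocessor in order.

    The column checks are an assumption of this sketch, not something
    stated in the thread.
    """
    for step in pipeline:
        missing = set(step.input) - set(data.columns)
        if missing:
            raise ValueError(
                f"{step.function.__name__} is missing input columns: {sorted(missing)}"
            )
        data = step.function(data)
        if step.output is not None:
            not_created = set(step.output) - set(data.columns)
            if not_created:
                raise ValueError(
                    f"{step.function.__name__} did not create: {sorted(not_created)}"
                )
    return data


PIPELINE = [
    Preprocessor(function=demo_make_total, input=["debit", "credit"], output=["total"]),
]

df = pd.DataFrame({"debit": [1.0, 2.0], "credit": [3.0, 4.0]})
result = run_pipeline(df, PIPELINE)
```

Because the pipeline is passed in as an argument, the same `run_pipeline` can serve several pipeline lists without modification.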