mozilla / docker-etl

Collection of dockerized ETL jobs managed by data engineering.
Mozilla Public License 2.0
16 stars 14 forks source link

Refactor Config files #263

Closed jaredsnyder closed 1 month ago

jaredsnyder commented 1 month ago

This PR modifies the config files, especially for the FunnelForecast jobs, to bring them more closely into alignment. This is a first step toward adding the ability to fit different "sub-models" to different partitions of the data (metric_hub.segments set in the FunnelForecast config) to the base class.

DotMap was also removed because managing the conversion for dict to DotMap and back was beginning to be more trouble than it was worth as the config files got bigger and used in more complex ways.

Checklist for reviewer:

jaredsnyder commented 1 month ago

Question for @m-d-bowerman on this one: I'm trying to clean up/document all the start dates. As part of that I added the predict_historical_dates parameter so that we wouldn't have to overwite dates_to_predict in the __post_init__. I would like to use predict_historical_dates to modify the start_date attribute via _default_start_date rather than overwriting dates_to_predict. The only place in FunnelForecast I can see this causing problems is here:

https://github.com/mozilla/docker-etl/blob/fd1435b74b2dee6edbdb55faafb5fc670b34c138/jobs/kpi-forecasting/kpi_forecasting/models/funnel_forecast.py#L494

What is the purpose of this line? Also what do you think of the general approach?

m-d-bowerman commented 1 month ago

Question for @m-d-bowerman on this one: I'm trying to clean up/document all the start dates. As part of that I added the predict_historical_dates parameter so that we wouldn't have to overwite dates_to_predict in the __post_init__. I would like to use predict_historical_dates to modify the start_date attribute via _default_start_date rather than overwriting dates_to_predict. The only place in FunnelForecast I can see this causing problems is here:

https://github.com/mozilla/docker-etl/blob/fd1435b74b2dee6edbdb55faafb5fc670b34c138/jobs/kpi-forecasting/kpi_forecasting/models/funnel_forecast.py#L494

What is the purpose of this line? Also what do you think of the general approach?

In the FunnelForecast class, we also capture the intra-time series predictions, along with the Prophet components used to make that prediction, done here: https://github.com/mozilla/docker-etl/blob/2f4b0231eda133133daaa5134d20de40152ca6de/jobs/kpi-forecasting/kpi_forecasting/models/funnel_forecast.py#L400

That's the motivation behind overwriting dates_to_predict, while retaining the start_date param to trim down the forecast_df to only dates in the future. I'll review the changes you made to see if we can remove the repeated application of that condition; I agree, it's incredibly clunky.

jaredsnyder commented 1 month ago

Question for @m-d-bowerman on this one: I'm trying to clean up/document all the start dates. As part of that I added the predict_historical_dates parameter so that we wouldn't have to overwite dates_to_predict in the __post_init__. I would like to use predict_historical_dates to modify the start_date attribute via _default_start_date rather than overwriting dates_to_predict. The only place in FunnelForecast I can see this causing problems is here: https://github.com/mozilla/docker-etl/blob/fd1435b74b2dee6edbdb55faafb5fc670b34c138/jobs/kpi-forecasting/kpi_forecasting/models/funnel_forecast.py#L494

What is the purpose of this line? Also what do you think of the general approach?

In the FunnelForecast class, we also capture the intra-time series predictions, along with the Prophet components used to make that prediction, done here:

https://github.com/mozilla/docker-etl/blob/2f4b0231eda133133daaa5134d20de40152ca6de/jobs/kpi-forecasting/kpi_forecasting/models/funnel_forecast.py#L400

That's the motivation behind overwriting dates_to_predict, while retaining the start_date param to trim down the forecast_df to only dates in the future. I'll review the changes you made to see if we can remove the repeated application of that condition; I agree, it's incredibly clunky.

Given that it gets filtered to dates in the future using self.start_date how is this different than just having dates_to_predict start at self.start_date?

m-d-bowerman commented 1 month ago

Question for @m-d-bowerman on this one: I'm trying to clean up/document all the start dates. As part of that I added the predict_historical_dates parameter so that we wouldn't have to overwite dates_to_predict in the __post_init__. I would like to use predict_historical_dates to modify the start_date attribute via _default_start_date rather than overwriting dates_to_predict. The only place in FunnelForecast I can see this causing problems is here: https://github.com/mozilla/docker-etl/blob/fd1435b74b2dee6edbdb55faafb5fc670b34c138/jobs/kpi-forecasting/kpi_forecasting/models/funnel_forecast.py#L494

What is the purpose of this line? Also what do you think of the general approach?

In the FunnelForecast class, we also capture the intra-time series predictions, along with the Prophet components used to make that prediction, done here: https://github.com/mozilla/docker-etl/blob/2f4b0231eda133133daaa5134d20de40152ca6de/jobs/kpi-forecasting/kpi_forecasting/models/funnel_forecast.py#L400

That's the motivation behind overwriting dates_to_predict, while retaining the start_date param to trim down the forecast_df to only dates in the future. I'll review the changes you made to see if we can remove the repeated application of that condition; I agree, it's incredibly clunky.

Given that it gets filtered to dates in the future using self.start_date how is this different than just having dates_to_predict start at self.start_date?

The forecast dates get filtered that way, yeah, but the components table doesn't -- it'll have a row for every date in the observed data and the forecast dates.

jaredsnyder commented 1 month ago

I'm going to leave dealing with start_date for the next PR where I generalize the use of SegmentSettings objects. So this one is really for a full review

jaredsnyder commented 1 month ago

Created a notebook to validate that the outputs are the same. Ended up finding an error that was corrected in the latest commit, but now everything matches

https://colab.research.google.com/drive/1dLeLUz_99ln9PC1AG-izZILj9-zIJHmJ#scrollTo=oukMs3d3qiEJ