mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
MIT License

limit node cost calc. #37

Closed ghost closed 3 years ago

ghost commented 4 years ago

The node cost calculation here just takes the whole history, which I think is not a good idea. There should be a limit, e.g. the last 60 (successful) executions.

The ideal solution would probably be exponential smoothing, which weights older execution times less than newer ones (but I guess that would be a bit overkill).
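
To make the two options concrete, here is a minimal Python sketch comparing a plain average over the whole history, an average over the last N runs, and exponential smoothing. This is not mara-pipelines code: the function names and the `window` and `alpha` parameters are hypothetical, and the durations are assumed to be a non-empty list of seconds ordered from oldest to newest.

```python
from typing import Sequence


def mean_cost(durations: Sequence[float]) -> float:
    """Plain average over the whole history (every run weighted equally)."""
    return sum(durations) / len(durations)


def windowed_mean_cost(durations: Sequence[float], window: int = 60) -> float:
    """Average over only the last `window` runs (hypothetical limit)."""
    recent = durations[-window:]
    return sum(recent) / len(recent)


def smoothed_cost(durations: Sequence[float], alpha: float = 0.3) -> float:
    """Simple exponential smoothing: each newer run pulls the estimate
    towards itself by a factor `alpha`, so older runs fade out gradually."""
    estimate = durations[0]
    for duration in durations[1:]:
        estimate = alpha * duration + (1 - alpha) * estimate
    return estimate


# Illustrative data only: 90 runs at 100 s, then 10 runs at 20 s
# after an upstream view was replaced by a materialized table.
history = [100.0] * 90 + [20.0] * 10

print(mean_cost(history))               # dominated by the old 100 s runs
print(windowed_mean_cost(history, 10))  # only sees the new 20 s runs
print(smoothed_cost(history))           # moves most of the way towards 20 s
```

With data like this, the full-history average stays close to the old run time long after the change, while the windowed and smoothed estimates follow the new run time.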

martin-loetzsch commented 4 years ago

We thought about doing some weighting, but did not find a good argument for it (and so went with the simplest solution, averaging).

What problem do you have with taking the whole history? That is, in which situation does the scheduler make bad decisions because of it?

ghost commented 4 years ago

I often make adjustments to preceding SQL objects (e.g. replacing a complicated view calculation with a materialized table). Such changes lead to a dramatic change in execution time (e.g. saving 80% of the previous execution time). Taking the whole history to calculate the cost is then not reasonable, since the history in mara might be quite old.

For such cases it would make sense to be able to configure that only part of the history is used, e.g. only the history since 01.05.2020. I don't want to throw away the execution history, because I still want to be able to see whether the change was worth it or should be rolled back.
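
As a rough illustration of the cutoff idea, the sketch below averages only the runs that started on or after a configurable date and leaves the stored history untouched. Everything here is hypothetical (the function name, the `(start time, duration)` tuples, the sample values) and none of it is an existing mara-pipelines option. Reading 01.05.2020 as 1 May 2020 is also an assumption.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple

# One run = (start time, duration in seconds); a hypothetical shape,
# not the actual mara-pipelines run log schema.
Run = Tuple[datetime, float]


def mean_cost_since(runs: Iterable[Run], cutoff: datetime) -> Optional[float]:
    """Average duration of runs that started on or after `cutoff`.

    Returns None when no run qualifies, so the caller can fall back
    to some default cost instead of dividing by zero.
    """
    recent = [duration for started, duration in runs if started >= cutoff]
    if not recent:
        return None
    return sum(recent) / len(recent)


# Illustrative data only: slow runs before the change, fast runs after it.
runs = [
    (datetime(2020, 4, 28), 100.0),
    (datetime(2020, 4, 29), 110.0),
    (datetime(2020, 5, 2), 20.0),
    (datetime(2020, 5, 3), 22.0),
]

cutoff = datetime(2020, 5, 1)  # assumed meaning of "01.05.2020"
print(mean_cost_since(runs, cutoff))  # only the two runs after the cutoff count
```

The older runs stay in `runs`, so they remain available for comparing execution times before and after the change.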

martin-loetzsch commented 3 years ago

I think the cost calculation in mara pipelines is quite ok. If you have very volatile workloads, feel free to decrease run_log_retention_in_days (https://github.com/mara/mara-pipelines/blob/master/mara_pipelines/config.py#L57) from 30 days to something less.