feat: Introduced User-Defined outputs_path for Pipeline Execution

jupyter-naas / naas

Low-code Python library to safely use notebooks in production: schedule workflows, generate assets, trigger webhooks, send notifications, build pipelines, manage secrets (Cloud-only)

https://app.naas.ai/

GNU Affero General Public License v3.0


feat: Introduced User-Defined outputs_path for Pipeline Execution #402

Closed MinuraPunchihewa closed 1 year ago

MinuraPunchihewa commented 1 year ago

This Pull Request aims to allow users to define a custom path to store the results of their Pipeline executions. This is done by using the outputs_path parameter in the run() method of the Pipeline class.

Here is an example of how this works:

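from naas.pipeline.pipeline import Pipeline, DummyStep, End

pipeline = Pipeline()

# Three placeholder steps, chained in order below.
step1 = DummyStep("Notebook 1")
step2 = DummyStep("Notebook 2")
step3 = DummyStep("Notebook 3")

pipeline >> step1 >> step2 >> step3 >> End()
# Store the results of this execution under ./outputs instead of the default directory.
pipeline.run(outputs_path='outputs')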

As shown in the screenshot, the outputs directory is created as specified above, and the results of the pipeline executions are stored in separate sub-directories within it.

[Screenshot: contents of the outputs directory]

If a path is not specified, the results are stored in the pipeline_executions directory by default.

Note: This directory is also visible in the above screenshot because I have run the Pipeline twice: once without specifying outputs_path, and a second time passing the argument.
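The default-versus-custom behaviour described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `resolve_execution_dir`, its `base_dir` argument, and the timestamp-based sub-directory name are hypothetical; only the `pipeline_executions` default and the `outputs_path` parameter come from the description.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Default folder name taken from the PR description.
DEFAULT_OUTPUTS_DIR = "pipeline_executions"


def resolve_execution_dir(outputs_path: Optional[str] = None,
                          base_dir: Optional[Path] = None) -> Path:
    """Hypothetical helper (not part of naas): create the folder for one run."""
    # Relative paths are resolved against base_dir (assumed: the current directory).
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    # Fall back to the default folder when no outputs_path is given.
    # If outputs_path is absolute, pathlib ignores `base` and uses it as-is.
    root = base / (outputs_path or DEFAULT_OUTPUTS_DIR)
    # One sub-directory per execution; the timestamp naming is an assumption.
    execution_dir = root / datetime.now().strftime("%Y%m%d%H%M%S%f")
    execution_dir.mkdir(parents=True, exist_ok=True)
    return execution_dir
```

Calling `resolve_execution_dir()` would create a run folder under `pipeline_executions`, while `resolve_execution_dir("outputs")` would create it under `outputs`, which mirrors the two runs mentioned in the note above.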

jravenel commented 1 year ago

Hey @MinuraPunchihewa thanks!👌 From what I understand from your screenshots, if I specify outputs_path then the pipeline_executions folder is no longer created and the results go straight to the location specified. Is that correct?

MinuraPunchihewa commented 1 year ago

Hey @jravenel, yes, that is correct.

jravenel commented 1 year ago

Ok. I think @FlorentLvr would argue it will be best to keep the name of the folder pipeline_executions every time. Otherwise things can get messy, or give us the job to create a new folder, name it, etc when we don't really need that effort.

MinuraPunchihewa commented 1 year ago

Oh, I don't quite understand. I worked on this to resolve this issue: https://github.com/jupyter-naas/naas/issues/373

FlorentLvr commented 1 year ago

> Ok. I think @FlorentLvr would argue it will be best to keep the name of the folder pipeline_executions every time. Otherwise things can get messy, or give us the job to create a new folder, name it, etc when we don't really need that effort.

I don't understand what you mean @jravenel! This PR seems to resolve the issue I had 🙈

FlorentLvr commented 1 year ago

@MinuraPunchihewa, let's summarize to ensure everything functions as intended:

If no value is specified for "output_dir," the pipeline executions will be stored in a folder named "pipeline_executions," located at the position where the pipeline notebook is invoked. This maintains the same behavior as before.

If a specific path is provided for "output_dir," such as "/home/ftp/data-product/outputs/pipeline_executions," all executions will be stored in that specified location when the pipeline is executed.

Is my understanding accurate?

jravenel commented 1 year ago

The question I was raising is: do we want to

A/ put all the pipeline executions directly in whatever folder is given (which will create a lot of folders in that directory), or
B/ specify a folder (outputs) and keep an automatically generated pipeline_executions folder inside the target folder?

@MinuraPunchihewa @FlorentLvr
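To make the two options concrete, here is a small sketch of where execution folders would land under each one. It is purely illustrative: `executions_root` and its `nest_default_folder` flag are hypothetical names, not part of naas or of this PR.

```python
from pathlib import PurePosixPath


def executions_root(outputs_path: str, nest_default_folder: bool) -> PurePosixPath:
    """Hypothetical helper: return the folder that would hold the per-run folders."""
    root = PurePosixPath(outputs_path)
    # Option B: always keep a pipeline_executions folder inside the target folder.
    if nest_default_folder:
        return root / "pipeline_executions"
    # Option A: per-run folders go directly into the folder the user gave.
    return root


print(executions_root("outputs", nest_default_folder=False))  # option A
print(executions_root("outputs", nest_default_folder=True))   # option B
```

Under option A the run folders sit directly in `outputs`; under option B they sit in `outputs/pipeline_executions`, so the user never has to spell out the `pipeline_executions` part.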

MinuraPunchihewa commented 1 year ago

> @MinuraPunchihewa, Let's summarize to ensure everything functions as intended:
>
> If no value is specified for "output_dir," the pipeline executions will be stored in a folder named "pipeline_executions," located at the position where the pipeline notebook is invoked. This maintains the same behavior as before.
>
> If a specific path is provided for "output_dir," such as "/home/ftp/data-product/outputs/pipeline_executions," all executions will be stored in that specified location when the pipeline is executed. Is my understanding accurate?

@FlorentLvr Yes, you are right.

jravenel commented 1 year ago

And my point is: why do I need to specify /pipeline_executions in the output dir? Can't we have pipeline_executions created automatically?

FlorentLvr commented 1 year ago

> And my point is why do I need to specify /pipeline_executions in the output dir? Can't we have pipeline_executions created automatically?

@jravenel, you don't. It is the default value of the parameter in the function. @MinuraPunchihewa

FlorentLvr commented 1 year ago

> > @MinuraPunchihewa, Let's summarize to ensure everything functions as intended:
> >
> > If no value is specified for "output_dir," the pipeline executions will be stored in a folder named "pipeline_executions," located at the position where the pipeline notebook is invoked. This maintains the same behavior as before.
> >
> > If a specific path is provided for "output_dir," such as "/home/ftp/data-product/outputs/pipeline_executions," all executions will be stored in that specified location when the pipeline is executed. Is my understanding accurate?
>
> @FlorentLvr Yes, you are right.

@MinuraPunchihewa, sounds good to me! You can merge the PR once all checks are valid :)

MinuraPunchihewa commented 1 year ago

Hey @Dr0p42, @FlorentLvr, do you know why the checks are failing?

jravenel commented 1 year ago