jupyter-naas / naas

Low-code Python library to safely use notebooks in production: schedule workflows, generate assets, trigger webhooks, send notifications, build pipelines, manage secrets (Cloud-only)
https://app.naas.ai/
GNU Affero General Public License v3.0
282 stars, 25 forks

feat: Introduced User-Defined outputs_path for Pipeline Execution #402

Closed MinuraPunchihewa closed 1 year ago

MinuraPunchihewa commented 1 year ago

This Pull Request aims to allow users to define a custom path to store the results of their Pipeline executions. This is done by using the outputs_path parameter in the run() method of the Pipeline class.

Given below is an example of how this works,

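from naas.pipeline.pipeline import Pipeline, DummyStep, End

pipeline = Pipeline()

# Three placeholder steps standing in for notebooks
step1 = DummyStep("Notebook 1")
step2 = DummyStep("Notebook 2")
step3 = DummyStep("Notebook 3")

# Chain the steps in order and close the pipeline with End()
pipeline >> step1 >> step2 >> step3 >> End()

# Store the execution results in the "outputs" directory instead of the default one
pipeline.run(outputs_path='outputs')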

As shown in the screenshot, the outputs directory is created as specified, and the results of each pipeline execution are stored in separate sub-directories within it: [image]

The contents of the outputs directory: [image]

If no path is specified, the results are stored in the pipeline_executions directory by default.

Note: This directory is also visible in the above screenshot because I have run the Pipeline twice: once without specifying outputs_path, and a second time passing the argument.
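
The fallback described above can be sketched as follows. This is only an illustration of the behaviour described in this PR, not the code in naas: the helper name `prepare_execution_dir`, the `base_dir` argument, and the timestamp-based sub-directory name are all assumptions made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Default folder name, as stated in the PR description.
DEFAULT_OUTPUTS_PATH = "pipeline_executions"


def prepare_execution_dir(outputs_path: Optional[str] = None, base_dir: str = ".") -> Path:
    """Hypothetical helper: choose the root folder, then create one sub-directory per run."""
    # Fall back to the default folder when no outputs_path is given.
    root = Path(base_dir) / (outputs_path or DEFAULT_OUTPUTS_PATH)
    # Assumption: each run gets a timestamped sub-directory; the real naming scheme may differ.
    run_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")
    execution_dir = root / run_id
    execution_dir.mkdir(parents=True, exist_ok=True)
    return execution_dir
```

Calling `prepare_execution_dir()` would create a run folder under pipeline_executions, while `prepare_execution_dir("outputs")` would create it under outputs, which matches the two directories visible in the screenshot.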

jravenel commented 1 year ago

Hey @MinuraPunchihewa thanks!👌 From what I understand from your screenshots, if I specify outputs_path then the pipeline_executions folder is no longer created and the results go straight to the location specified. Is that correct?

MinuraPunchihewa commented 1 year ago

Hey @jravenel, Yes, that is correct.

jravenel commented 1 year ago

Ok. I think @FlorentLvr would argue it is best to keep the folder name pipeline_executions every time. Otherwise things can get messy, or we end up having to create a new folder, name it, etc. when we don't really need that effort.

MinuraPunchihewa commented 1 year ago

Oh, I don't quite understand. I worked on this to resolve this issue, https://github.com/jupyter-naas/naas/issues/373

FlorentLvr commented 1 year ago

> Ok. I think @FlorentLvr would argue it will be best to keep the name of the folder pipeline_executions every time. Otherwise things can get messy, or give us the job to create a new folder, name it, etc when we don't really need that effort.

I don't understand what you mean @jravenel! This PR seems to resolve the issue I had 🙈

FlorentLvr commented 1 year ago

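@MinuraPunchihewa, Let's summarize to ensure everything functions as intended:

  • If no value is specified for "output_dir," the pipeline executions will be stored in a folder named "pipeline_executions," located at the position where the pipeline notebook is invoked. This maintains the same behavior as before.
  • If a specific path is provided for "output_dir," such as "/home/ftp/data-product/outputs/pipeline_executions," all executions will be stored in that specified location when the pipeline is executed.

Is my understanding accurate?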

jravenel commented 1 year ago

The question I was raising is: do we want to

A/ store all the pipeline executions directly in whatever folder is given (which will create a lot of folders in that directory), or
B/ specify a folder (outputs) and keep a pipeline_executions folder, generated automatically, inside the target folder?

@MinuraPunchihewa @FlorentLvr
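
To make the two options concrete, here is a minimal sketch of where the execution folders would end up in each case. The function names are hypothetical and only illustrate the layouts being discussed; neither is taken from the naas code base.

```python
from pathlib import Path

# Folder name used by default, as mentioned in the PR description.
DEFAULT_FOLDER = "pipeline_executions"


def option_a_root(outputs_path: str) -> Path:
    """Option A: execution folders are created directly inside the given path."""
    return Path(outputs_path)


def option_b_root(outputs_path: str) -> Path:
    """Option B: a pipeline_executions folder is always added inside the given path."""
    return Path(outputs_path) / DEFAULT_FOLDER
```

With `outputs_path='outputs'`, option A would write execution folders straight into outputs, whereas option B would write them into outputs/pipeline_executions.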

MinuraPunchihewa commented 1 year ago

> @MinuraPunchihewa, Let's summarize to ensure everything functions as intended:
>
>   • If no value is specified for "output_dir," the pipeline executions will be stored in a folder named "pipeline_executions," located at the position where the pipeline notebook is invoked. This maintains the same behavior as before.
>   • If a specific path is provided for "output_dir," such as "/home/ftp/data-product/outputs/pipeline_executions," all executions will be stored in that specified location when the pipeline is executed.
>
> Is my understanding accurate?

@FlorentLvr Yes, you are right.

jravenel commented 1 year ago

My point is: why do I need to specify /pipeline_executions in the output dir? Can't we have pipeline_executions created automatically?

FlorentLvr commented 1 year ago

> And my point is why do I need to specify /pipeline_executions in the output dir? Can't we have pipeline_executions created automatically ?

@jravenel, you don't. It is the default value of the parameter in the function. @MinuraPunchihewa

FlorentLvr commented 1 year ago

> > @MinuraPunchihewa, Let's summarize to ensure everything functions as intended:
> >
> >   • If no value is specified for "output_dir," the pipeline executions will be stored in a folder named "pipeline_executions," located at the position where the pipeline notebook is invoked. This maintains the same behavior as before.
> >   • If a specific path is provided for "output_dir," such as "/home/ftp/data-product/outputs/pipeline_executions," all executions will be stored in that specified location when the pipeline is executed.
> >
> > Is my understanding accurate?
>
> @FlorentLvr Yes, you are right.

@MinuraPunchihewa, sounds good to me! You can merge the PR once all checks are valid :)

MinuraPunchihewa commented 1 year ago

Hey @Dr0p42, @FlorentLvr, Do you know why the checks are failing?

jravenel commented 1 year ago

> Hey @Dr0p42, @FlorentLvr, Do you know why the checks are failing?

@MinuraPunchihewa it's coming from the linter. Go to GitHub Actions and you'll find more information there.

[Screenshot 2023-06-20 at 16 38 16]

MinuraPunchihewa commented 1 year ago

Hey @jravenel, Yes, I saw that, but it's not clear to me what the issue is.

jravenel commented 1 year ago

Calling Mr @Dr0p42 for help here 🙏

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (0 Code Smells)

No Coverage information
0.0% Duplication

Dr0p42 commented 1 year ago

Ok @MinuraPunchihewa @jravenel this is finally fixed :)

jravenel commented 1 year ago

;) thanks @Dr0p42. We will be implementing it today or tomorrow.