ploomber / ploomber-engine

A toolbox 🧰 for Jupyter notebooks 📙: testing, experiment tracking, debugging, profiling, and more!
https://engine.ploomber.io
BSD 3-Clause "New" or "Revised" License

engine and papermill differences #61

Closed idomic closed 1 year ago

idomic commented 1 year ago

When running the same notebook with --log-output papermill shows all of the outputs and ploomber-engine doesn't.

This happens on our posthog reporting notebook. For instance, Cell 26 shows its output in papermill:

Screen Shot 2023-03-12 at 10 24 41 AM

But it doesn't in ploomber-engine:

Screen Shot 2023-03-12 at 10 24 26 AM

Another thing I noticed is that when the notebook runs in ploomber-engine, there's a second progress bar within the cell that interferes with the main bar, which might be confusing for users.

edublancas commented 1 year ago

I ran two sample notebooks (sample-notebooks.zip) to understand the issue a bit more, some comments:

printed output

I was unable to reproduce the issue: all print statements are displayed when passing --log-output. ploomber-engine only displays whatever is sent to stdout, so my guess is that papermill also displays stderr and/or the text result of each cell. We should run a more detailed analysis and then ensure that both produce the same output. We could also add a cell delimiter (papermill prints: Executing Cell X -----).

ploomber-engine print.ipynb /dev/null --log-output

dual progress bar

I could not reproduce this by creating a notebook that displays a progress bar using tqdm and executing it with --log-output, so we need to investigate more:

ploomber-engine progress.ipynb /dev/null --log-output

idomic commented 1 year ago

Yeah, the delimiter can be a good option; right now it prints everything together. To reproduce, you can run posthob.ipynb.

mehtamohit013 commented 1 year ago

Hi @idomic, can you please share posthob.ipynb or tell me where it is located? I can't seem to find it.

mehtamohit013 commented 1 year ago

Hi, I am running this code:

print(1+2)
print(3+4)
print(1+7)
from tqdm.auto import tqdm
import time
my_list = list(range(100))

with tqdm(total=len(my_list)) as pbar:
    for x in my_list:
        time.sleep(0.01)
        pbar.update(1)
        if x % 20 == 0:
            print(x)
print(1)

Running with papermill: Screenshot_2023-03-17_11-14-43

Running with ploomber-engine on the CLI gives me this output: Screenshot_2023-03-17_11-27-26

Observations:

Commands used

 ploomber-engine rough.ipynb output.ipynb --log-output
 papermill rough.ipynb output.ipynb --log-output 

PS: I am not able to run the notebook @idomic mentioned.

Edit: Updated images and fixed spelling.

idomic commented 1 year ago

@mehtamohit013 A few thoughts:

Also, I can't find documentation of ploomber-engine CLI command

I opened an issue about it last week, I think.

Progress bar of cell 5 is not displayed

If --log-output is passed and the progress bar still isn't displayed, we need to investigate why; it sounds like a bug.

Also ploomber-engine execution time which is around 5-8sec is not consistent and it is slower than papermill 3-4sec

It runs on a different process, which explains the difference, but try profiling it to see what's causing the delay.
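
As a starting point for that comparison, here is a minimal sketch that measures the wall-clock time of a command over several runs. The helper name `time_command` is hypothetical and not part of either tool. Capturing the output keeps terminal rendering out of the measurement, which helps separate execution time from the time spent displaying output.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # capture_output=True keeps terminal rendering out of the measurement
        subprocess.run(cmd, check=True, capture_output=True)
        durations.append(time.perf_counter() - start)
    return durations


# Harmless demonstration: time a trivial Python process
runs = time_command([sys.executable, "-c", "pass"], repeats=2)
print(f"min={min(runs):.3f}s max={max(runs):.3f}s")
```

To compare the two tools, the same helper could be called with the commands from this thread, for example `time_command(["ploomber-engine", "rough.ipynb", "output.ipynb", "--log-output"])` and the equivalent papermill command.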

Let's connect on the notebook I'll help you run it!

edublancas commented 1 year ago

I think the missing output might be that the tqdm progress bar is printed to standard error and we're just displaying standard output. If that's the case, we should ensure we also display standard error in the console.

You can check this with:

import sys

print("printing to stderr", file=sys.stderr)

and see if ploomber-engine displays it
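
To illustrate why this matters, here is a small standard-library sketch (not ploomber-engine code; `run_and_capture` and `cell` are hypothetical names) showing that a tool which only collects stdout never sees what was written to stderr. tqdm's plain progress bar writes to `sys.stderr` by default, so it would end up in the second string.

```python
import contextlib
import io
import sys


def run_and_capture(func):
    """Call func and return (stdout_text, stderr_text) as two separate strings."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        func()
    return out.getvalue(), err.getvalue()


def cell():
    # Stand-in for a notebook cell that writes to both streams
    print("result: 3")
    print("progress: 50%", file=sys.stderr)


stdout_text, stderr_text = run_and_capture(cell)
print("stdout only:", repr(stdout_text))  # 'result: 3\n'
print("stderr only:", repr(stderr_text))  # 'progress: 50%\n'
```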

mehtamohit013 commented 1 year ago

Some clarification regarding performance

Just a minor observation: We cannot pass the file name to which data should be saved in --save-profiling-data. It creates output-profiling-data.csv by default

idomic commented 1 year ago

Just a minor observation: We cannot pass the file name to which data should be saved in --save-profiling-data. It creates output-profiling-data.csv by default

Please open an issue about it, I think there should be an option to pass an argument.
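
One way such an option could behave, sketched with `argparse` purely for illustration (this is an assumption, not how the ploomber-engine CLI is implemented; `build_parser` and `DEFAULT_PROFILING_FILE` are hypothetical names): the flag alone keeps the current default file name, while an explicit value overrides it.

```python
import argparse

# Default file name taken from the behavior described in this thread
DEFAULT_PROFILING_FILE = "output-profiling-data.csv"


def build_parser():
    """Build a parser where --save-profiling-data takes an optional file name."""
    parser = argparse.ArgumentParser(prog="engine-sketch")
    parser.add_argument("input_path")
    parser.add_argument("output_path")
    parser.add_argument(
        "--save-profiling-data",
        nargs="?",                     # the value is optional
        const=DEFAULT_PROFILING_FILE,  # used when the flag has no value
        default=None,                  # used when the flag is absent
        metavar="FILE",
    )
    return parser


parser = build_parser()
args = parser.parse_args(["in.ipynb", "out.ipynb", "--save-profiling-data", "custom.csv"])
print(args.save_profiling_data)  # custom.csv
```

With `nargs="?"`, a bare flag should come after the positional arguments; otherwise the next positional would be consumed as the file name.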

idomic commented 1 year ago

The 5-8 sec that I mentioned above is the time the zsh shell takes to give me a new prompt, so it probably includes the delay in displaying stdout in the shell.

It seems like execution is faster than papermill but displaying the output is slower. We still need to figure out why and how to fix it.

mehtamohit013 commented 1 year ago

I think the missing output might be that the tqdm progress bar is printed to standard error and we're just displaying standard output. If that's the case, we should ensure we also display standard error in the console.

Hi @edublancas, currently ploomber-engine prints the output from stdout only after the cell has finished executing. This is not ideal: the output should be printed to the console as soon as it is written to the notebook's stdout.

I have mentioned more details in PR #66
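
The difference between buffering until the cell finishes and streaming as output is produced can be sketched with a "tee" stream that forwards every write to the console immediately while still recording it. This is only an illustration under simplified assumptions (the cell runs in-process via `exec`; `TeeStream` and `run_cell` are hypothetical names), not the approach taken in the PR.

```python
import contextlib
import io
import sys


class TeeStream:
    """Record everything written and forward it to the console right away."""

    def __init__(self, console):
        self._console = console
        self._buffer = io.StringIO()

    def write(self, text):
        self._buffer.write(text)   # keep a copy for the notebook output
        self._console.write(text)  # show it immediately
        self._console.flush()
        return len(text)

    def flush(self):
        self._console.flush()

    def getvalue(self):
        return self._buffer.getvalue()


def run_cell(source, console=None):
    """Execute cell source, streaming stdout to console; return the captured text."""
    # Resolve the real console before redirecting, to avoid writing to ourselves
    console = console if console is not None else sys.stdout
    tee = TeeStream(console)
    with contextlib.redirect_stdout(tee):
        exec(source, {})
    return tee.getvalue()


captured = run_cell("print(1 + 2)\nprint(3 + 4)")
```

Each `print` reaches the console as it happens, and `captured` still holds the full text for saving into the output notebook.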