add digautoprofiler package to environments

willeppy commented 1 year ago

Hi,

We would like to add the digautoprofiler library to the jupyterhub.

This adds a jupyterlab extension, AutoProfiler, that helps users visualize their data with an interactive visualization sidebar. More information is in our github: https://github.com/cmudig/AutoProfiler

I am the developer of this package and would be happy to answer any questions about it (email: willepp@cmu.edu). Our group at CMU has been working with Wei Xu at BNL on this project.

danielballan commented 1 year ago

Thanks for the PR, @willeppy. We enjoyed your presentation at CSI.

I think this should be merged. But before we merge it, I want a couple of us in the Data Science and Systems Integration (DSSI) Program at NSLS-II to have a basic familiarity with the tool and its uses before we are supporting users.

To start, I tried making a simple DataFrame in the pyiodide environment linked on the project's README. It looks like the DataFrame was detected but not successfully described or visualized. Clicking the extensions refresh button did not help. Can you advise what we should be doing?

dig cmu edu_AutoProfiler_lab_index html

willeppy commented 1 year ago

So unfortunately I don’t think our pyodide demo is working at the moment. Pyodide runs jupyter totally in the browser (no separate python process for the backend) so the way we fetch the data doesn’t work for that demo anymore; I should probably remove it!

Easiest way to try it out is to pip install digautoprofiler then start a jupyter notebook from the command line locally.

On Apr 21, 2023, at 4:01 PM, Dan Allan @.***> wrote:

Thanks for the PR, @willeppyhttps://github.com/willeppy. We enjoyed your presentation at CSI.

I think this should be merged. But before we merge it, I want a couple of us in the Data Science and Systems Integration (DSSI) Program at NSLS-II to have a basic familiarity with the tool and its uses before we are supporting users.

To start, I tried making a simple DataFrame in the pyiodide environmenthttp://dig.cmu.edu/AutoProfiler/lab/index.html linked on the project's README. It looks like the DataFrame was detected but not successfully described or visualized. Clicking the extensions refresh button did not help. Can you advise what we should be doing?

[dig cmu edu_AutoProfiler_lab_index html]https://user-images.githubusercontent.com/2279598/233724248-e969d66c-fc46-44a4-ab0b-6aa99cd44d8e.png

— Reply to this email directly, view it on GitHubhttps://github.com/nsls2-conda-envs/nsls2-collection-tiled/pull/15#issuecomment-1518279269, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADGHTX7BKVGOWORF2FRLZQTXCLRS5ANCNFSM6AAAAAAXHCBHHU. You are receiving this because you were mentioned.Message ID: @.***>

danielballan commented 1 year ago

OK, the binder link also works. I notice that pandas is imported and pd is added to the namespace, conditional on the pane being open, with no change to notebook code or command history.

On a quick glance it looks like this might come from here.

I worry about it creating rare but deeply confusing bugs for some users:

User who defines pd to mean something else and then is very confused when, via a stray click, that variable is now the pandas module.
User who forgets to import pandas as pd, does not realize the AutoProfile has done it for them, then tries to run the notebook in another context or share it and finds that the notebook does not run.

Of course, if the user knows what to look for, these are easily fixed, but I worry about the user who would not anticipate that a front-end extension has silently changed their code in a way they cannot detect.

Have you considered making the extension check for pandas and do nothing until/unless pandas is imported by the user code? For example, something along these lines could work:

>>> import sys as __autoprofiler_private_variable; 'pandas' in __autoprofiler_private_variable.modules
False
>>> import pandas
>>> import sys as __autoprofiler_private_variable; 'pandas' in __autoprofiler_private_variable.modules
True

After the result is captured, del __autoprofiler_private_variable could be run to leave the user namespace exactly as it was. Any invisible code execution at all could potentially create confusing behavior, but a lighter touch such as the above would reduce the impact.

willeppy commented 1 year ago

Hey @danielballan thanks for this comment and sorry for my slow reply, I was traveling.

I agree this can cause subtle bugs. I'm thinking that I will still import pandas, however as __autoprofilerPandas rather than pd so that it is less likely there is a clash with a user defined variable.

With this change, AutoProfiler adds / overrides two variables in the user's notebook

__autoprofilerPandas (deletes the variable after import but 'pandas' will still be in sys modules)
digautoprofiler (was already doing this before)

The reason I'm opting to still import pandas every time is because I use the module to check if other variables are pandas dataframes, which could still be the case even if pandas was never manually imported by the user (such as if they read in a dataframe from a pickle file or get it back from API call).

I think in a perfect world I wouldn't have to add any variables to memory but to get the automatic profiles I need to add a couple. These are both uncommon names so I think unlikely for there to be a clash. Plus notebooks already automatically add some variables like _1 or _2 which store the execution results so some precedent for this strategy.

Tracking this change in this PR: https://github.com/cmudig/AutoProfiler/pull/129

danielballan commented 1 year ago

Thanks @willeppy. I think this addresses the name collision concern well enough. I still have concerns about importing pandas silently in every user notebook, namespace aside. There have been times when I've been debugging issues with dependencies or even sometimes import order. In those scenarios, having pandas and some of its dependencies (like numpy) get imported by a front-end extension would have been very unexpected and confusing.

I see a distinction between adding variables to the user namespace (e.g. _1,_2) and silently running code---especially importing modules---outside of [I]Python's normal mechanisms for doing so.

Fortunately, however, I think that checking sys.modules will cover the cases you mentioned. Once there is a DataFrame anywhere in the runtime, 'pandas' will be in sys.modules, whether the user imported it or not. Here's a pickle example:

>>> import sys
>>> import pickle
>>> 'pandas' in sys.modules
False
>>> with open('df.pickle', 'rb') as file:
...     df = pickle.loads(file.read())
... 
>>> 'pandas' in sys.modules
True

You'll find the same with a DataFrame that came from some API.

willeppy commented 1 year ago

Good point. I have updated AP's checking mechanism to first check if pandas is in sys modules before re-importing pandas and the digautoprofiler module. An important note is that this will only happen if the AutoProfiler panel is open, so if a user never opens the AutoProfiler panel their notebook variables and modules will not be affected. I included an example in my updated PR of how the sys modules are affected: https://github.com/cmudig/AutoProfiler/pull/129#issue-1696802075

danielballan commented 1 year ago

The demonstration in your comment looks good.

Regarding my suggestion (via email, I think) that no code injection at all should happen without some kind of opt-in mechanism, I think I agree that opening the AutoProfiler tab can count as "opting in".

Have you tested the behavior on non-Python kernels? Does it fail gracefully if I open the pane with a notebook running an R or Julia kernel?

willeppy commented 1 year ago

Yeah good point on other language notebooks, It was just failing silently but just added in some explicit checks to make sure we only execute code if connected to a python notebook and gives some user feedback that will only show profiles if connected to python notebook

willeppy commented 1 year ago

Hi i wanted to follow up about getting this merged so Autoprofiler will be included in the Jupyter environment. I've removed all the logging and merged the changed we had discussed and released in the current version (1.0.2).

We currently have not rolled out support for Jupyterlab >= 4.0.0, so not sure which version you all are on in this env (it didn't look like there's a specific version pinned). We can prioritize this support to get this merged if need be

mrakitin commented 4 months ago

This was added to 2024-2.0 via a separate PR and will be available going forward. Thanks for the contribution, @willeppy!

nsls2-conda-envs / nsls2-collection-tiled

add digautoprofiler package to environments #15