posit-dev / positron

Positron, a next-generation data science IDE
Other
2.31k stars 68 forks source link

%pip install pandas requires a kernel restart before being use - refactor to get_pandas that do the import each time, wrapped in a try-except. #3653

Closed ronblum closed 1 month ago

ronblum commented 2 months ago

Positron Version:

Positron Version: 2024.06.1 build 27 Code - OSS Version: 1.90.0 Commit: a893e5b282612ccb2200102957ac38d3c14e5196 Date: 2024-06-26T01:22:29.024Z Electron: 29.4.0 Chromium: 122.0.6261.156 Node.js: 20.9.0 V8: 12.2.281.27-electron.0 OS: Linux x64 6.5.0-41-generic

Steps to reproduce the issue:

EDIT: 2024-07-18 by @jthomasmock

Daniel: I think I know what the issue is.

Pandas is discovered when ipykernel is first loaded, if we install pandas within the current session (eg with %pip install pandas) then even though you are able to use it, the internal ipykernel calls to pandas won't be able to use them as pd_ wasn't found when launching the kernel. > A simple solution is to simply restart the kernel.

Here's what happens when ipykernel is launched the following is evaluated, and pd becomes None as pandas is not installed. If you install pandas now with %pip install pandas, you can run import pandas and it will work, but pd won't be updated until you restart the kernel.

Wasim: I think we should move away from importing third party packages on startup the way we currently do, and instead have functions like get_pandas that do the import each time, wrapped in a try-except.


  1. In the Python Console (cribbed from test plan)
    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 100)))
    column_names = [f"Column_{i+1}" for i in range(100)]
    df.columns = column_names

The df object should appear under the "Data" section

  1. Click on the checkboard to the right of df to view the data.
    • Sometimes, the data explorer tab doesn't open.
    • No errors are emitted, though, in the output.
    • This doesn't happen consistenty.

What did you expect to happen?

A data explorer tab should open.

Were there any error messages in the output or Developer Tools console?

Not in the output.

dfalbel commented 2 months ago

I think I know what the issue is.

Pandas is discovered when ipykernel is first loaded, if we install pandas within the current session (eg with %pip install pandas) then even though you are able to use it, the internal ipykernel calls to pandas won't be able to use them as pd_ wasn't found when launching the kernel. A simple solution is to simply restart the kernel.

Here's what happens when ipykernel is launched the following is evaluated, and pd_ becomes None as pandas is not installed. If you install pandas now with %pip install pandas, you can run import pandas and it will work, but pd_ won't be updated until you restart the kernel.

https://github.com/posit-dev/positron/blob/ff87538651c017bcd442a2e8eb6c4d41a56c8454/extensions/positron-python/python_files/positron/positron_ipykernel/third_party.py#L64

https://github.com/posit-dev/positron/blob/ff87538651c017bcd442a2e8eb6c4d41a56c8454/extensions/positron-python/python_files/positron/positron_ipykernel/third_party.py#L20-L25

This behavior shouldn't be linux specific

seeM commented 2 months ago

I think we should move away from importing third party packages on startup the way we currently do, and instead have functions like get_pandas that do the import each time, wrapped in a try-except.

It’ll mean that users don't have to restart after installing (most of the time; some packages still require a restart), and it’ll make kernel startup a little bit faster since we’ll only import third party packages when they’re needed.

Alternatively, @dfalbel mentioned on Slack:

I think this makes sense! I think it's also fine having to restart the kernel, but we should at least re-surface the error to the UI, with some tips that restarting might fix.

testlabauto commented 2 months ago

Do we feel 100% confident that we have identified the root cause here? I ask because in smoke tests, I am seeing occasional failures where the test double clicks a variables row and no data explorer opens. These tests pre-install all their dependencies.

dfalbel commented 2 months ago

I don't see any other reasons for that to happen. Can you try in the smoket ests if running %view df returns an error?

testlabauto commented 2 months ago

Sure. I will try that. I also just put in retries in case double clicking on the variables row is the real problem.

testlabauto commented 2 months ago

It looks to me like click retries have solved the smoke test problem! Carry on :)

jthomasmock commented 2 months ago

Should we open an issue for Wasim's suggestion to try-expect and/or surface the error per Daniel?

seeM commented 2 months ago

@jthomasmock I made #3800 but then closed it. I think we can use this issue to track the %pip install pandas bug.

jthomasmock commented 1 month ago

I've relabled this issue and added context to original comment.

dfalbel commented 1 month ago

I ended up approaching this by forwarding the error to the user instead of lazily loading pandas as it's a simpler change and also solves other possible issues during initialization that wouldn't be covered by simply loading pandas lazily. See https://github.com/posit-dev/positron/pull/4122 for more info.

testlabauto commented 1 month ago

Verified Fixed

Positron Version(s) : 2024.07.0-113
OS Version          : OSX

Test scenario(s)

Error message appears for user when trying open data frame post pandas install without restart:

Image

Link(s) to TestRail test cases run or created: N/A