spyder-ide / spyder

Official repository for Spyder - The Scientific Python Development Environment
https://www.spyder-ide.org
MIT License

Spyder slows down when running code with big dataframes #16937

Open korosig opened 2 years ago

korosig commented 2 years ago

Hi there, I have installed a new Anaconda 2021.11 with Spyder 5.1.5 on my Windows 10 machine, because my previous Spyder 4.2.5 broke down.

When I open the new Spyder 5.1.5 it works well at first, but then it usually slows down: it stops refreshing the variable sizes in the Variable Explorer, and a simple +, -, / takes 10-30 seconds. Do you have an idea how to solve this problem?

I have uninstalled the previous Anaconda, deleted all Python-related folders, etc. My machine has 64 GB RAM, an Intel Core i9, a 1 TB SSD, and 2x Nvidia 3080 11 GB.

Spyder 5.1.5 | Python 3.9.7 64-bit | Qt 5.9.7 | PyQt5 5.9.2 | Windows 10

Here is a video of the issue: https://youtu.be/mLzyZIW19GQ and the same code in Jupyter: https://youtu.be/UBSXuL4VihM

Dependencies

Mandatory:

atomicwrites >=1.2.0 : 1.4.0 (OK)
chardet >=2.0.0 : 4.0.0 (OK)
cloudpickle >=0.5.0 : 2.0.0 (OK)
cookiecutter >=1.6.0 : 1.7.2 (OK)
diff_match_patch >=20181111 : 20200713 (OK)
intervaltree >=3.0.2 : 3.1.0 (OK)
IPython >=7.6.0 : 7.29.0 (OK)
jedi >=0.17.2;<0.19.0 : 0.18.0 (OK)
jsonschema >=3.2.0 : 3.2.0 (OK)
keyring >=17.0.0 : 23.1.0 (OK)
nbconvert >=4.0 : 6.1.0 (OK)
numpydoc >=0.6.0 : 1.1.0 (OK)
paramiko >=2.4.0 : 2.7.2 (OK)
parso >=0.7.0;<0.9.0 : 0.8.2 (OK)
pexpect >=4.4.0 : 4.8.0 (OK)
pickleshare >=0.4 : 0.7.5 (OK)
psutil >=5.3 : 5.8.0 (OK)
pygments >=2.0 : 2.10.0 (OK)
pylint >=2.5.0;<2.10.0 : 2.9.6 (OK)
pyls_spyder >=0.4.0 : 0.4.0 (OK)
pylsp >=1.2.2;<1.3.0 : 1.2.4 (OK)
pylsp_black >=1.0.0 : None (OK)
qdarkstyle =3.0.2 : 3.0.2 (OK)
qstylizer >=0.1.10 : 0.1.10 (OK)
qtawesome >=1.0.2 : 1.0.2 (OK)
qtconsole >=5.1.0 : 5.1.1 (OK)
qtpy >=1.5.0 : 1.10.0 (OK)
rtree >=0.9.7 : 0.9.7 (OK)
setuptools >=49.6.0 : 58.0.4 (OK)
sphinx >=0.6.6 : 4.2.0 (OK)
spyder_kernels >=2.1.1;<2.2.0 : 2.1.3 (OK)
textdistance >=4.2.0 : 4.2.1 (OK)
three_merge >=0.1.1 : 0.1.1 (OK)
watchdog >=0.10.3 : 2.1.3 (OK)
zmq >=17 : 22.2.1 (OK)

Optional:

cython >=0.21 : 0.29.24 (OK)
matplotlib >=2.0.0 : 3.4.3 (OK)
numpy >=1.7 : 1.20.3 (OK)
pandas >=1.1.1 : 1.3.4 (OK)
scipy >=0.17.0 : 1.7.1 (OK)
sympy >=0.7.3 : 1.9 (OK)

dalthviz commented 2 years ago

Hi @korosig, thank you for the feedback! Watching the videos, it's clear that there is some delay when executing inside Spyder (and nice ending comic, by the way). However, I'm not totally sure what could be happening :/

Is this happening after a certain amount of time has passed since launching Spyder? Do any of your variables have a considerable size?

Also, is there any sample script you can share with us to reproduce this problem on our side?

Any new info that helps us reproduce this is greatly appreciated, let us know!

korosig commented 2 years ago

Hi, here is a little script that shows the error. The parquet could be any kind of parquet data. (I have created a brand new Conda environment with Python 3.8.2 and IPython 7.29.0.)

import pandas as pd
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # silence TensorFlow logging
import tensorflow as tf
import numpy as np
import dask.dataframe as dd
import dask
from datetime import timedelta
import json

# A trivial expression: this evaluates instantly before the dataframe is loaded
timedelta(hours=200)

# Lazily open the big (~30 GB) parquet dataset with Dask
df = dd.read_parquet('dataset/flat_table1116.pq/')

# The same trivial expression: now each evaluation takes 10-30 seconds in Spyder
timedelta(hours=200)

Imported packages: pandas 1.3.4, tensorflow 2.3.0, numpy 1.20.3, dask 2021.10.0, json 2.0.9

The videos:

With Spyder: https://youtu.be/XLkxrfZeQwc
With Jupyter: https://youtu.be/9VVMNxpo2rI
With vanilla Python: https://youtu.be/XdQlICPXv-g
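For anyone trying to reproduce this without the original file, here is a minimal sketch that generates a synthetic parquet dataset with Dask and reads it back the same way the script above does. The path and date range are assumptions (not the reporter's actual data), the resulting size depends on the range you pick, and pyarrow or fastparquet must be installed:

import dask.datasets
import dask.dataframe as dd

# Generate a synthetic timeseries dataframe; widen the date range to push
# the on-disk size toward the scale that triggers the slowdown.
synthetic = dask.datasets.timeseries(start='2000-01-01', end='2005-12-31',
                                     freq='1s', partition_freq='7d')

# Write it to a hypothetical parquet directory, then open it lazily,
# just like the repro script.
synthetic.to_parquet('dataset/synthetic.pq')
df = dd.read_parquet('dataset/synthetic.pq')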

korosig commented 2 years ago

I have downgraded dask:

dask==2021.01.1 : Spyder broke down
dask==2021.02.0 : Spyder broke down
dask==2021.03.0 : Spyder slowed down
......
dask==2021.10.0 : Spyder slowed down

ccordoba12 commented 2 years ago

Hey @korosig, you said:

The parquet could be any kind of parquet data.

I think the problem is precisely related to the size of the dataframe associated with that parquet file. Could you provide us with a parquet file of roughly the same size as the one you're using, so we can run tests with it? Its contents are not important, just its size.

korosig commented 2 years ago

Hey @korosig, you said:

The parquet could be any kind of parquet data.

I think the problem is precisely related to the size of the dataframe associated with that parquet file. Could you provide us with a parquet file of roughly the same size as the one you're using, so we can run tests with it? Its contents are not important, just its size.

~30 GB

I think the problem is precisely related to the size of the dataframe ....

If this is true, do you have an idea why this didn't cause a problem in the other IDEs?

ccordoba12 commented 2 years ago

~30 GB

Oh wow! Then that's almost surely the problem. As I said, if you can provide us with such a file, we'll try to fix the problem.

if this is true, do you have idea why did this not cause a problem in the other IDE

I think that has to do with the Variable Explorer. Each time code is evaluated in the console, we need to generate a representation of the variables defined in it to display them in the Variable Explorer, and that can take a lot of time for such a big dataframe.

korosig commented 2 years ago

I think that has to do with the Variable Explorer. Each time code is evaluated in the console, we need to generate a representation of the variables defined in it to display them in the Variable Explorer, and that can take a lot of time for such a big dataframe.

This means I could use Dask to handle big dataframes, but Spyder doesn't support this (yet).
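If the slowdown really does come from building that representation, a possible workaround in the meantime (just a sketch, and it assumes the Variable Explorer's default option to exclude private, underscore-prefixed variables is enabled) is to bind the big dataframe to a private name so the explorer skips it:

import dask.dataframe as dd

# With the default "exclude private variables" filter, underscore-prefixed
# names are not shown in the Variable Explorer, so Spyder should not have
# to compute a representation for this object after each evaluation.
_df = dd.read_parquet('dataset/flat_table1116.pq/')

# Work with _df as usual; only non-private names pay the display cost.
preview = _df.head()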

korosig commented 2 years ago

Oh wow! Then that's almost surely the problem. As I said, if you can provide us with such a file, we'll try to fix the problem.

I have tried with a smaller parquet file and it works without any delay.

Does that mean Dask + a big parquet file is not compatible with Spyder?

ccordoba12 commented 2 years ago

I have tried with a smaller parquet file and it works without any delay.

Ok, thanks for the confirmation.

Does that mean Dask + a big parquet file is not compatible with Spyder?

I'd say it is; it's just annoying to wait for the console prompt to come back after each evaluation.

But seriously, I think that to fix this we should give up computing the representation I mentioned after a timeout, and only show the results we managed to compute before it. We'll try to do that in the coming months.
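To illustrate the idea (purely a sketch of the timeout pattern, not Spyder's actual implementation): compute each variable's representation in a worker thread and fall back to a cheap placeholder when it doesn't finish in time. A thread running a slow repr can't be cancelled, so a real fix would need something more robust, but the console prompt would come back promptly:

from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Single shared worker; a repr that never returns would keep occupying it.
_executor = ThreadPoolExecutor(max_workers=1)

def repr_with_timeout(value, timeout=0.5):
    # Hand the potentially slow repr off to the worker thread and wait at
    # most `timeout` seconds; otherwise return a placeholder instead.
    future = _executor.submit(repr, value)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return f"<{type(value).__name__}: representation skipped (timed out)>"

With this, a variable whose representation is slow to build would just show the placeholder in the Variable Explorer instead of blocking the console.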

ccordoba12 commented 2 years ago

Note: Here is some code to generate a large dataframe with Dask: https://coiled.io/blog/introducing-the-dask-active-memory-manager/