spyder-ide / spyder

Official repository for Spyder - The Scientific Python Development Environment
https://www.spyder-ide.org
MIT License

Reduce amount of data for DataFrames by sampling #21011

Open uprokevin opened 1 year ago

uprokevin commented 1 year ago

In the side panel (Variable Explorer), when clicking on a large dataframe or large dictionary, the panel and Spyder freeze.

Suggested workaround:

    nmax = 50000
    # On click, visualize a random sample instead of the full dataframe:
    dfbig.sample(n=min(len(dfbig), nmax), replace=False)

Suppose len(dfbig) is 1 million. The dataframe will be sampled down to nmax = 50000 values, and Spyder does not crash.

Same for lists:

    # On click, visualize a truncated slice instead of the full list:
    listbig[:nmax]
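
For illustration, here is a minimal self-contained sketch of the idea; the safe_view helper name and the NMAX threshold are my own choices for the example, not Spyder API:

    import pandas as pd

    NMAX = 50_000  # assumed display threshold, not a Spyder setting

    def safe_view(obj, nmax=NMAX):
        """Reduce obj to at most nmax elements for display.

        Hypothetical helper (not part of Spyder): dataframes are
        sampled without replacement, lists are truncated.
        """
        if isinstance(obj, pd.DataFrame):
            return obj.sample(n=min(len(obj), nmax), replace=False)
        if isinstance(obj, list):
            return obj[:nmax]
        return obj

    # A 1-million-row dataframe is reduced to 50,000 rows before display.
    dfbig = pd.DataFrame({"x": range(1_000_000)})
    print(len(safe_view(dfbig)))  # -> 50000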

Reference:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html

Thanks !

ccordoba12 commented 1 year ago

Hey @uprokevin, thanks for reporting. Could you post a video or animated gif that shows Spyder freezing after opening a big dataframe or dictionary?

I just tested with a one million row/single column dataframe, and Spyder didn't freeze for me.
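
For reference, a dataframe of that shape can be recreated with something like the following (my own snippet, not the exact test used):

    import pandas as pd

    # One million rows, single column, as in the test above.
    df = pd.DataFrame({"a": range(1_000_000)})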

uprokevin commented 1 year ago

Does it handle visualization of 10 million rows with 560 string columns?

I believe sub-sampling is a simple and efficient way to reduce the load in visualization.

ccordoba12 commented 1 year ago

Does it handle visualization of 10 million rows with 560 string columns?

That depends on the amount of memory available in your computer, not on Spyder. That's because we need to make a copy of the dataframe in the IPython console kernel to send and display it in Spyder (which runs in a different process).
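
For what it's worth, you can estimate the size of that copy up front with pandas' own accounting; memory_usage(deep=True) is standard pandas, and the dataframe here is just an example:

    import pandas as pd

    # Example dataframe: one million short strings in a single column.
    df = pd.DataFrame({"s": ["some text"] * 1_000_000})

    # deep=True counts the actual string payloads, not just the pointers.
    size_gb = df.memory_usage(deep=True).sum() / 1e9
    print(f"~{size_gb:.2f} GB would be copied to display this dataframe")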

I believe sub-sampling is a simple and efficient way to reduce the load in visualization.

Sure, this is a good idea too. Thanks for the suggestion, I didn't know about it. We'll try to implement it in Spyder 6.

uprokevin commented 1 year ago

Thanks for considering it. I think visualizing a 1 million row table does not make much sense for a human. At most 100,000 rows would handle most visualization use cases (i.e. finding patterns or wrong columns) and would reduce the memory footprint a lot.
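
As a rough back-of-envelope check of that claim (my own numbers, assuming ~60 bytes per short string cell):

    rows_full, rows_cap, cols = 10_000_000, 100_000, 560
    bytes_per_cell = 60  # assumed average for a short Python string

    full_gb = rows_full * cols * bytes_per_cell / 1e9
    capped_gb = rows_cap * cols * bytes_per_cell / 1e9
    print(f"full copy: ~{full_gb:.0f} GB, capped copy: ~{capped_gb:.1f} GB")
    # full copy: ~336 GB, capped copy: ~3.4 GB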