neuromodulation / py_neuromodulation

Real-time analysis of intracranial neurophysiology recordings.
https://neuromodulation.github.io/py_neuromodulation/
MIT License

Test Polars compatibility and performance #320

Open toni-neurosc opened 1 month ago

toni-neurosc commented 1 month ago

So Polars is a Pandas replacement written in Rust (https://pola.rs/) that can be 10-100x faster than Pandas depending on the operation. However, it's not yet fully compatible with everything; for example, I have read that it can have problems working directly with scikit-learn.

PyNM is using Pandas dataframes to store analysis results, so I think at some point we should at least give Polars a go and see if it would fit the project.
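For a quick taste, a minimal sketch on toy data (column names are made up, not from PyNM): the API is close to pandas, but the query engine runs in Rust.

```python
import polars as pl

# Toy feature table with hypothetical column names
df = pl.DataFrame({"time": [0.0, 0.1, 0.2], "ch1_beta": [0.5, 0.7, 0.6]})

print(df.filter(pl.col("ch1_beta") > 0.55))  # rows where beta power is high
df.write_parquet("features.parquet")         # fast columnar on-disk format
```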

Demo of Plotly Dash with Polars https://www.youtube.com/watch?v=_iebrqafOuM

timonmerk commented 1 month ago

Thanks @toni-neurosc for mentioning that! Nice video, and the speed improvements over pandas are impressive. I guess our main concern would be the time to store data in an existing data frame / array (either using append/concat after feature computation) and then the IO of saving that data frame / array. Thinking about it, the dataframe columns also stay the same throughout the computation. So we could think about saving only the numpy array to disk in real-time with np.save? It might not be a super elegant solution, but after the recording is finished those arrays could still be merged into a single csv / parquet dataframe. I guess it's also less overhead than a database write.
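Something like this, as a rough sketch (file layout and names are hypothetical, not PyNM code):

```python
import glob
import numpy as np
import pandas as pd

# During the recording: dump each iteration's feature vector as its
# own .npy file (cheap append-style write, no dataframe involved).
def save_iteration(features: np.ndarray, i: int) -> None:
    np.save(f"run/iter_{i:06d}.npy", features)

# After the recording: merge everything into a single dataframe.
# The columns stay constant throughout, so the names are only needed
# once, at merge time.
def merge_run(feature_names: list[str]) -> pd.DataFrame:
    files = sorted(glob.glob("run/iter_*.npy"))
    data = np.vstack([np.load(f) for f in files])
    return pd.DataFrame(data, columns=feature_names)
```

The merged frame could then go to csv or parquet with `to_csv` / `to_parquet`.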

toni-neurosc commented 1 month ago

Hi @timonmerk, I opened a discussion about this in #322. I had not considered numpy's .npy format, but it's actually not that crazy, since pretty much anyone who wants to use PyNM is going to be doing their data processing in Python anyway.

In fact, I had already thought about the problem of the intermediate representation of the feature calculation results, which are currently written to a dictionary and then moved into a Pandas dataframe. I think the dictionary representation might be a bit troublesome, and my idea was basically to flatten the nested structure that can arise in some of the feature calculations (e.g. different frequency bands for each channel), hold the order of the features in a separate string array, and return a tuple[list[str], np.ndarray] for each of the features.
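Roughly what I have in mind, as a minimal sketch (the input shape is assumed, not the actual PyNM dict):

```python
import numpy as np

# Flatten a nested {channel: {band: value}} dict into a flat name list
# plus a 1D value array, i.e. tuple[list[str], np.ndarray].
def flatten_features(features: dict) -> tuple[list[str], np.ndarray]:
    names: list[str] = []
    values: list[float] = []

    def walk(node, prefix: str) -> None:
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{prefix}_{key}" if prefix else key)
        else:
            names.append(prefix)
            values.append(float(node))

    walk(features, "")
    return names, np.asarray(values)

# e.g. {"ch1": {"theta": 0.1, "beta": 0.2}} -> ["ch1_theta", "ch1_beta"]
names, arr = flatten_features({"ch1": {"theta": 0.1, "beta": 0.2}})
```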

If we were to do that, maybe we could ditch dataframes altogether. We might still need them in the GUI for visualization, but for sending data around between parts of the program, I think we could stay within numpy the whole time if we wanted.

Then storing to .npy would be quite fast. We just need to save the header in a file separate from the main data array. Plus, numpy supports compression via .npz for sparse data, and it's quite fast according to this benchmark: (benchmark figure: bm_example)
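For illustration, keeping the header separate could look like this (names and file layout are made up):

```python
import json
import numpy as np

# The "header" (feature names) stays constant, so store it once as
# JSON, and the data itself as a plain array.
names = ["ch1_theta", "ch1_beta"]      # hypothetical feature names
data = np.array([[0.1, 0.2]])          # one row per iteration

with open("features_header.json", "w") as f:
    json.dump(names, f)

np.save("features.npy", data)                   # fast, uncompressed
np.savez_compressed("features.npz", data=data)  # compressed .npz variant
```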

timonmerk commented 3 weeks ago

I played with polars a bit for a different project now, and it's quite amazing! The core problem, however, that we currently accumulate all computed features in RAM, still needs to be addressed. Based on my previous calculations, I will try to implement sqlite and save features after every iteration. That option was the fastest and should not create too much overhead.

Also, this should not affect the other examples, since pandas and polars both provide methods to load from a database. It all comes at the cost of not having a human-readable csv file, but we could also save a snippet / head of the features simply for debugging purposes.
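A minimal sketch of the per-iteration write (schema and column names are hypothetical):

```python
import sqlite3
import numpy as np
import pandas as pd

names = ["ch1_theta", "ch1_beta"]  # assumed constant column set
conn = sqlite3.connect("features.db")
cols = ", ".join(f'"{n}" REAL' for n in names)
conn.execute(f"CREATE TABLE IF NOT EXISTS features ({cols})")

# One INSERT + commit per feature vector, flushed after every iteration.
def write_row(values: np.ndarray) -> None:
    marks = ", ".join("?" * len(values))
    conn.execute(f"INSERT INTO features VALUES ({marks})",
                 [float(v) for v in values])
    conn.commit()

write_row(np.array([0.1, 0.2]))

# Offline analysis: bulk read back into a dataframe
# (polars offers read_database for the same purpose).
df = pd.read_sql("SELECT * FROM features", conn)
```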

toni-neurosc commented 3 weeks ago

Coincidentally, earlier this morning, when I erroneously thought I had fixed the RTD, I preemptively opened a new local branch called "no_pandas" where I wanted to eventually: