Closed · toni-neurosc closed this 5 months ago
Great addition @toni-neurosc! It goes in a similar direction to what I just mentioned here: https://github.com/neuromodulation/py_neuromodulation/discussions/322#discussioncomment-9761662

Ideally we will move away from the csv write and only use the sqlite database for that.

Many thanks @toni-neurosc!
While I was working on bursts vectorization, I was checking the profiling results and noticed an unusual amount of time being spent in `pd.Series.__init__` and also in `_add_timestamp`. In total, they accounted for 6 to 7% of the running time:

- `pd.Series.__init__`: 2.5%
- `_add_timestamp`: 3.8%

I realized this was due to the transformation of the dictionary of feature results `feature_dict` into a pd.Series `feature_series`, and to modifying a pd.Series instead of a `dict`, which comes at a much greater cost due to Pandas overhead, possibly having to move the entire series around in memory, etc. This was done, I imagine, to facilitate some of the computations performed on that series, mainly the cortex/subcortex projection. But paying 7% of the running time is too high a cost imho.

Also, using a pd.Series to store a row works in opposition to Pandas logic, which actually stores one Series per column, not per row.
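To put rough numbers on that overhead, here is a micro-benchmark sketch (the feature names and row size are made up for illustration, not taken from py_neuromodulation):

```python
import timeit

import pandas as pd

# Hypothetical feature row, roughly the shape of a per-batch feature_dict
feature_dict = {f"ch1_feature_{i}": float(i) for i in range(100)}

# Old approach: wrap every row of results in a pd.Series
t_series = timeit.timeit(lambda: pd.Series(feature_dict), number=10_000)

# New approach: keep the row as a plain dict (a copy stands in for per-row handling)
t_dict = timeit.timeit(lambda: dict(feature_dict), number=10_000)

print(f"pd.Series per row: {t_series:.3f}s | dict per row: {t_dict:.3f}s")
```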
So what I did was remove the intermediate transformation of `feature_dict` into `feature_series`. Instead of the `Stream.run` function generating a list of pd.Series, it now generates a list of feature dicts, which is then turned into the final result dataframe with `pd.DataFrame.from_records()`.
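A minimal sketch of the resulting pattern (simplified, with a hypothetical `compute_features` helper; the real `Stream.run` does considerably more):

```python
import pandas as pd

def run_sketch(data_batches, compute_features):
    """Accumulate one plain dict per processed batch, then build the frame once."""
    feature_dicts = []
    for timestamp, batch in data_batches:
        feature_dict = compute_features(batch)  # {"feature_name": value, ...}
        feature_dict["time"] = timestamp        # adding the timestamp is a cheap dict write
        feature_dicts.append(feature_dict)
    # A single DataFrame construction at the end; pandas lays the data out per column here
    return pd.DataFrame.from_records(feature_dicts)
```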
To make this work I had to make some very small syntax changes, but the logic of everything is the same, just using `dict` instead of `pd.Series`. Of course, `Projection.project()` is now a bit slower than before after the very naive syntax conversion (there might be a way to speed it up), but overall, shaving 7% off the runtime basically for free compensates for it.

Also, I think removing the reliance on Pandas for handling data during processing is a positive, as it comes with too much overhead. In fact, I'm a proponent of even ditching `dict` and going `np.array` all the way, but getting rid of pd.DataFrame is a good step in that direction.

So, in conclusion:
feature rows are kept as plain `dict`s until the final dataframe generation, which shaves 6 to 7% off the running time basically for free.