Closed · toni-neurosc closed this 5 months ago
Great addition @toni-neurosc! It goes in a similar direction to what I just mentioned here: https://github.com/neuromodulation/py_neuromodulation/discussions/322#discussioncomment-9761662

Ideally we will move away from the csv write and only use the sqlite database for that.

Many thanks @toni-neurosc!
While I was working on bursts vectorization, I was checking the profiling results and noticed an unusual amount of time being spent in `pd.Series.__init__` and also in `_add_timestamp`. In total, they accounted for 6 to 7% of the running time:

- `pd.Series.__init__`: 2.5%
- `_add_timestamp`: 3.8%

I realized this was due to the transformation of the dictionary of feature results `feature_dict` into a pd.Series `feature_series`, and to modifying a pd.Series instead of a `dict`, which comes at a much greater cost due to Pandas overhead, possibly having to move the entire series around in memory, etc. This was done, I imagine, to facilitate some of the computations performed on that series, mainly the cortex/subcortex projection. But paying 7% of the running time is too high a cost imho.

Also, using a pd.Series to store a row works in opposition to Pandas logic, which actually stores one Series per column, not per row.
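To put rough numbers on that overhead, here is a micro-benchmark sketch (the feature names and row size are made up for illustration, not taken from py_neuromodulation):

```python
import timeit

import pandas as pd

# Hypothetical feature row, roughly the shape of a per-batch feature_dict
feature_dict = {f"ch1_feature_{i}": float(i) for i in range(100)}

# Old approach: wrap every row of results in a pd.Series
t_series = timeit.timeit(lambda: pd.Series(feature_dict), number=10_000)

# New approach: keep the row as a plain dict (a copy stands in for per-row handling)
t_dict = timeit.timeit(lambda: dict(feature_dict), number=10_000)

print(f"pd.Series per row: {t_series:.3f}s | dict per row: {t_dict:.3f}s")
```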
So what I did was remove the intermediate transformation of `feature_dict` into `feature_series`. Instead of the `Stream.run` function generating a list of pd.Series, it now generates a list of feature dicts, which is then turned into the final result dataframe with `pd.DataFrame.from_records()`.
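A minimal sketch of the resulting pattern (simplified, with a hypothetical `compute_features` helper; the real `Stream.run` does considerably more):

```python
import pandas as pd

def run_sketch(data_batches, compute_features):
    """Accumulate one plain dict per processed batch, then build the frame once."""
    feature_dicts = []
    for timestamp, batch in data_batches:
        feature_dict = compute_features(batch)  # {"feature_name": value, ...}
        feature_dict["time"] = timestamp        # adding the timestamp is a cheap dict write
        feature_dicts.append(feature_dict)
    # A single DataFrame construction at the end; pandas lays the data out per column here
    return pd.DataFrame.from_records(feature_dicts)
```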
To make this work I had to make some very small syntax changes, but the logic of everything is the same, just using `dict` instead of `pd.Series`. Of course, `Projection.project()` is now a bit slower than before after the very naive syntax conversion (there might be a way to speed it up), but overall, shaving 7% off the runtime basically for free compensates for it.

Also, I think removing the reliance on Pandas for handling data during processing is a positive, as it comes with too much overhead. In fact, I'm a proponent of even ditching `dict` and going `np.array` all the way, but getting rid of pd.DataFrame is a good step in that direction.

So, in conclusion:
feature rows are kept as plain `dict`s until the final dataframe generation, which shaves 6 to 7% off the running time basically for free.