opengridcc / opengrid-dev

Open source building monitoring, analysis and control
Apache License 2.0
26 stars 21 forks

Speed up the fluksoapi.consolidate_sensor method or change data handling #8

Closed saroele closed 9 years ago

saroele commented 10 years ago

The current synchronization of the data is based on consolidation of all available data to a single csv per sensor. The consolidation is extremely slow, probably due to the combine_first method.

There are some solutions:

  1. replace the combine_first with an append operation?
  2. get rid of the idea that we need 1 file per sensor
  3. switch to tmpo
  4. switch to database
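Option 1 could be sketched roughly as follows, assuming each sensor's data is a time-indexed pandas Series (the function name and the keep-newest rule are assumptions for illustration, not the current fluksoapi behaviour):

```python
import pandas as pd

def consolidate_append(old, new):
    """Merge two time-indexed series by appending and de-duplicating.

    A hypothetical replacement for a combine_first-based merge: concat
    avoids the expensive index alignment that combine_first performs,
    which matters on large, mostly non-overlapping time series.
    On duplicate timestamps the value from `new` wins.
    """
    merged = pd.concat([old, new])
    merged = merged[~merged.index.duplicated(keep="last")]
    return merged.sort_index()
```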

Not really a priority now, we can develop some data analysis scripts with the data we have. Premature optimization is the root of all evil...

saroele commented 10 years ago

Hm, it looks like the saving is also part of the reason, or maybe the main reason. Profiling the method would probably be the first thing to do...
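As a starting point for that profiling, a small cProfile harness could look like this (the `consolidate_sensor` usage named in the docstring is the intended target here, but that is an assumption; any callable works):

```python
import cProfile
import io
import pstats

def profile_call(func, *args, **kwargs):
    """Profile a single call and return (result, report).

    Intended target: profile_call(fluksoapi.consolidate_sensor, folder, sensor).
    The report lists the ten most expensive functions by own time (tottime).
    """
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(10)
    return result, buf.getvalue()
```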

Ryton commented 9 years ago

It takes about 60 minutes for fetching & consolidating 70 sensors.

Bad, but not too bad imho, since you only have to do it once every xx weeks.

saroele commented 9 years ago

60 minutes is a long time. Specifically for testing scripts that need recent data, this is not acceptable.
I propose the following:

Ideally, we convert all historical data to tmpo so we have a single data source. If this is too much work, we keep both APIs and build some logic to combine the data into a single dataframe.
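The fallback (keeping both APIs) could be as simple as a precedence merge, sketched here with pandas `combine_first`; the function name and the rule that tmpo wins on overlap are assumptions, not decided behaviour:

```python
import pandas as pd

def merge_sources(csv_ts, tmpo_ts):
    """Combine historical (CSV-based) and recent (tmpo) data for one
    sensor into a single series. Where both sources have a value for
    the same timestamp, the tmpo value takes precedence.
    """
    return tmpo_ts.combine_first(csv_ts).sort_index()
```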

Ryton commented 9 years ago

In that case, tmpo is probably more of a priority than the 1-file consolidation. Unfortunately, that kind of scripting is too high an aim for me, I'll leave it to the experts 😉


Ryton commented 9 years ago

For 1 week (7 days) of data and all OG sensors (70ish), the synchronise.py script takes 555 seconds on a Windows 8 laptop. Per sensor, it fetched 7 files and merged 8 (the 7 new ones plus the original).

The top 9 time-consuming functions (found with cProfile, sorted by the tottime column) are:

```
tottime  percall  cumtime  percall  filename:lineno(function)
172.411  0        172.411  0        index.py:740()
 98.366  0.43     336.149  1.468    index.py:632(_format_native_types)
 77.764  0.143     77.764  0.143    {method 'read' of 'pandas.parser.TextReader'}
 40.921  0         40.921  0        __init__.py:194(u)
 39.662  0.173     39.662  0.173    {pandas.lib.write_csv_rows}
 38.091  0.07      38.124  0.07     parsers.py:936(__init__)
 20.939  0.091    193.35   0.844    {pandas.lib.map_infer}
 10.827  0.02      10.827  0.02     {pandas.tslib.array_to_datetime}
  7.948  0.015     13.493  0.025    parsers.py:2109(_concat_date_cols)
```

870 other functions were called, but these take <5 s each and 37 s in total.

To my untrained eye, there's not much use optimising here... I'd propose that someone - :8ball:, not me :-) - looks into tmpo instead...

saroele commented 9 years ago

Thanks Ryton for these statistics. As moving to tmpo is on the roadmap anyway, I'll give it a try this weekend and maybe we'll have blazing sync times by next week :-)

Enjoy the weekend, Roel


kdebrab commented 9 years ago

Since not all sensors use tmpo yet, I had a look at how to speed up the data synchronization:

  1. Basic profiling with time.time() prints showed me that roughly 60% of time is spent in fluksoapi.save_csv and most of the remaining 40% in fluksoapi.load_csv. Compared to these, the time spent in combine_first is not significant.
  2. Have a look at http://pandas.pydata.org/pandas-docs/stable/io.html#io-perf. Saving and reading hdf files is approximately 10 times faster than saving and reading csv files (or sql)!

So, it seems we can speed up synchronization a lot (a factor of ~10) simply by using hdf files instead of csv files.
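A quick way to sanity-check that claim locally is a round-trip benchmark like the sketch below. Absolute timings depend on the machine, and the HDF part needs the optional PyTables dependency, so it is guarded:

```python
import io
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Build a synthetic minute-resolution sensor series.
idx = pd.date_range("2014-01-01", periods=50_000, freq="min")
df = pd.DataFrame({"sensor": np.random.rand(len(idx))}, index=idx)

# CSV round-trip: text serialisation plus datetime parsing on read.
t0 = time.time()
csv_text = df.to_csv()
pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
csv_time = time.time() - t0

# HDF round-trip: binary store, no datetime re-parsing (requires PyTables).
hdf_time = None
try:
    path = os.path.join(tempfile.mkdtemp(), "sensor.h5")
    t0 = time.time()
    df.to_hdf(path, key="sensor", mode="w")
    pd.read_hdf(path, "sensor")
    hdf_time = time.time() - t0
except ImportError:
    pass  # PyTables not installed; only the CSV timing is measured

print("csv: %.2fs  hdf: %s" % (csv_time, hdf_time))
```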

kdebrab commented 9 years ago

The above code should work. The input files of consolidate_sensor can be csv, hdf or both; the output file type is now always hdf. One can therefore apply it to the existing data folder (the csv files will be removed and replaced by hdf files).
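For reference, the consolidation logic described above could look roughly like this. It is a sketch under assumptions: the file-naming pattern, the keep-newest de-duplication rule, and the `.h5` extension are mine, not necessarily the committed code's:

```python
import glob
import os

import pandas as pd

def consolidate_sensor(folder, sensor):
    """Merge all per-day files for `sensor` (csv and/or hdf) in `folder`
    into a single hdf file, then remove the now-redundant csv files.
    """
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, sensor + "*"))):
        if path.endswith(".csv"):
            frames.append(pd.read_csv(path, index_col=0, parse_dates=True))
        elif path.endswith(".h5"):
            frames.append(pd.read_hdf(path))
    if not frames:
        raise ValueError("no data files found for sensor %r" % sensor)
    merged = pd.concat(frames)
    merged = merged[~merged.index.duplicated(keep="last")].sort_index()
    out = os.path.join(folder, sensor + ".h5")
    merged.to_hdf(out, key="sensor", mode="w")  # needs PyTables
    for path in glob.glob(os.path.join(folder, sensor + "*.csv")):
        os.remove(path)
    return out
```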

In order to check the timing, I synchronized from scratch using the 'Synchronize data' notebook, thus downloading 175 zip files (one for each day), unzipping them into csv files (one for each day and sensor) and consolidating these into 68 hdf files (one for each sensor). The timing of all this with the new code:

Download time: 169.105000019 s (3 min)
Unzip time: 364.179999828 s (6 min)
Consolidate time: 979.518000122 s (16 min)
Total time: 1512.80299997 s (25 min)

That seems acceptable to me.