Closed: saroele closed this issue 9 years ago.
Hm, it looks like the saving is also part of the reason, or maybe the main one. Profiling the method would probably be the first thing to do...
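As a sketch of what that first profiling pass could look like (the `consolidate` function here is a stand-in, not the actual opengrid method):

```python
import cProfile
import io
import pstats

def consolidate():
    # Stand-in for the slow fetch-and-save step being profiled.
    total = 0
    for i in range(100_000):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
consolidate()
profiler.disable()

# Sort by time spent inside each function itself (the 'tottime' column)
# and show only the heaviest entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(9)
print(stream.getvalue())
```

The `tottime` sort is what makes the truly slow functions float to the top, rather than their callers.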
It takes about 60 minutes for fetching & consolidating 70 sensors.
Bad, but not too bad imho, since you only have to do it once every xx weeks.
60 minutes is a long time. Specifically for testing scripts that need recent data, this is not acceptable.
I propose the following: ideally, we convert all historical data to tmpo so we have a single data source. If this is too much work, we keep both APIs and build some logic to combine the data into a single dataframe.
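That combining logic could be as simple as pandas' `combine_first`; a minimal sketch with made-up values, assuming both sources yield a timestamp-indexed series per sensor:

```python
import pandas as pd

# Made-up data: 'historical' from the old csv-based source,
# 'recent' from tmpo, both indexed by timestamp.
historical = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2014-11-01", "2014-11-02", "2014-11-03"]),
)
recent = pd.Series(
    [2.5, 4.0],
    index=pd.to_datetime(["2014-11-02", "2014-11-04"]),
)

# Prefer tmpo values where the sources overlap, fall back to historical.
combined = recent.combine_first(historical)
print(combined)
```

Where both sources have a value for the same timestamp, the caller of `combine_first` wins, so putting tmpo first makes it authoritative.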
In that case, tmpo is probably more of a priority than single-file consolidation. Unfortunately, that kind of scripting is too high an aim for me, I'll leave it to the experts 😉
For 1 week (7 days) of data and all OG sensors (70ish), the synchronise.py script takes 555 seconds on a Windows 8 laptop. Per sensor, it fetched 7 files and merged 8 (7 + the original).

The top 9 time-consuming functions (found with cProfile, sorted by the tottime column) are:

```
 tottime  percall   cumtime  percall  filename:lineno(function)
 172.411    0.000   172.411    0.000  index.py:740()
  98.366    0.430   336.149    1.468  index.py:632(_format_native_types)
  77.764    0.143    77.764    0.143  {method 'read' of 'pandas.parser.TextReader'}
  40.921    0.000    40.921    0.000  __init__.py:194(u)
  39.662    0.173    39.662    0.173  {pandas.lib.write_csv_rows}
  38.091    0.070    38.124    0.070  parsers.py:936(__init__)
  20.939    0.091   193.350    0.844  {pandas.lib.map_infer}
  10.827    0.020    10.827    0.020  {pandas.tslib.array_to_datetime}
   7.948    0.015    13.493    0.025  parsers.py:2109(_concat_date_cols)
```

870 other functions were called, but these take <5 s each and 37 s in total.

To my untrained eye, there's not much use in optimising here... I'd propose that someone :8ball: (not me :-)) looks into tmpo instead...

Thanks Ryton for these statistics. As moving to tmpo is on the roadmap anyway, I'll give it a try this weekend and maybe we have blasting sync times by next week :-)

Enjoy the weekend, Roel
Since not all sensors use tmpo yet, I had a look at how to speed up the data synchronization. It seems like we can speed up synchronization a lot (a factor of 10) simply by using hdf files instead of csv files.
The code above should work. The input files of consolidate_sensor can be csv, hdf or both. The output file type of consolidate_sensor is now always hdf, so it can be applied to the existing data folder (the csv files will be removed and replaced by hdf files).
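A minimal sketch of what such a consolidation step could look like (the real consolidate_sensor signature and behaviour may differ, and `to_hdf` needs the pytables package):

```python
import os
import pandas as pd

def consolidate_sensor_sketch(folder, sensor):
    """Toy version of the consolidation step described above: read any
    csv and hdf files for one sensor, combine them into a single frame,
    write one hdf file and remove the csv files.
    (The real consolidate_sensor API may differ.)"""
    frames = []
    csv_files = []
    for name in sorted(os.listdir(folder)):
        if sensor not in name:
            continue
        path = os.path.join(folder, name)
        if name.endswith(".csv"):
            frames.append(pd.read_csv(path, index_col=0, parse_dates=True))
            csv_files.append(path)
        elif name.endswith(".hdf"):
            frames.append(pd.read_hdf(path, "df"))
    if not frames:
        return None
    # Fold all pieces together; later files fill gaps left by earlier ones.
    combined = frames[0]
    for frame in frames[1:]:
        combined = combined.combine_first(frame)
    out = os.path.join(folder, sensor + ".hdf")
    combined.to_hdf(out, key="df", mode="w")
    # Output is always hdf, so the csv inputs can go.
    for path in csv_files:
        os.remove(path)
    return out
```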
In order to check the timing, I synchronized from scratch using the 'Synchronize data' notebook, thus downloading 175 zip files (one for each day), unzipping them into csv files (one for each day and sensor) and consolidating these into 68 hdf files (one for each sensor). The timing of all this with the new code:
Download time: 169.105000019 s (3 min)
Unzip time: 364.179999828 s (6 min)
Consolidate time: 979.518000122 s (16 min)
Total time: 1512.80299997 s (25 min)
That seems acceptable to me.
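For reference, the csv-versus-hdf difference is easy to reproduce with a toy round trip (file names and data are made up; `to_hdf` needs the pytables package):

```python
import time
import numpy as np
import pandas as pd

# Toy sensor data: one week at minute resolution.
index = pd.date_range("2014-11-15", periods=7 * 24 * 60, freq="min")
df = pd.DataFrame({"value": np.random.rand(len(index))}, index=index)

# Write and read back as csv.
start = time.time()
df.to_csv("sensor.csv")
pd.read_csv("sensor.csv", index_col=0, parse_dates=True)
csv_time = time.time() - start

# Write and read back as hdf (binary, no datetime parsing needed).
start = time.time()
df.to_hdf("sensor.hdf", key="df", mode="w")
pd.read_hdf("sensor.hdf", "df")
hdf_time = time.time() - start

print(f"csv round trip: {csv_time:.3f} s, hdf round trip: {hdf_time:.3f} s")
```

The hdf round trip also restores the DatetimeIndex exactly, while csv has to re-parse every timestamp on each read, which is where most of the csv time goes.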
The current synchronization of the data is based on consolidation of all available data to a single csv per sensor. The consolidation is extremely slow, probably due to the combine_first method.
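To illustrate why repeated combine_first over many daily files gets expensive, and what a cheaper path could look like when the pieces do not overlap, here is a sketch with made-up data (the `pd.concat` route is a suggested alternative, not the current opengrid code):

```python
import numpy as np
import pandas as pd

# Made-up daily files: one series per day, minute resolution.
days = pd.date_range("2014-11-01", periods=30, freq="D")
daily = [
    pd.Series(
        np.random.rand(24 * 60),
        index=pd.date_range(day, periods=24 * 60, freq="min"),
    )
    for day in days
]

# Roughly the current approach: fold everything together with
# combine_first. Each call re-aligns the full accumulated index,
# so the cost grows with every file that is merged in.
combined = daily[0]
for series in daily[1:]:
    combined = combined.combine_first(series)

# Cheaper alternative when the daily pieces do not overlap:
# a single concat followed by one sort.
concatenated = pd.concat(daily).sort_index()
print(len(concatenated))
```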
There are some solutions:
Not really a priority now, we can develop some data analysis scripts with the data we have. Premature optimization is the root of all evil...