davide-f opened this issue 2 years ago
@davide-f, thank you for the explanations regarding performance during the discussion. Preliminary results of testing the workflow for China (this is a UX test rather than profiling):
1) `build_shapes` takes not so long, about 12 hours, but that does not feel very comfortable, especially during the first run, since the script remains silent for all those 12 hours. (Maybe I'm missing some logs?) Could it make sense to attach some progress tracking to the computational loop itself?
2) Regarding the parallelization which you mentioned is not yet working for adding the population: could you please clarify a bit? Is it `imap` that is not working there, or is it this piece which is not under `imap` but nevertheless slow and could be parallelized as well?
3) In `build_osm_network` the limiting stage at the moment is not (yet) `set_substations_ids` or `set_lines_ids` but `fix_overpassing_lines`. Currently only half has been processed after about 30 hours. [It's probably a good idea to switch this option off for the first quick run... :)] It doesn't feel so bad, as it's clear that something is going on and it's possible to estimate the finishing time. However, could the performance of that stage also be taken into consideration when working on performance?
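On points 1) and 2) above, here is a minimal sketch (not the actual PyPSA-Earth code) of how progress reporting can be attached to a parallel loop built on `multiprocessing.Pool.imap`; the worker `add_population` and the `shapes` list are hypothetical stand-ins for the real per-shape computation:

```python
from multiprocessing import Pool

def add_population(shape_id):
    # hypothetical stand-in for the expensive per-shape computation
    return shape_id * 2

def run(shapes, processes=4, report_every=25):
    results = []
    with Pool(processes=processes) as pool:
        # imap yields results one by one as workers finish, so progress
        # can be reported without waiting for the whole map to complete
        for i, res in enumerate(pool.imap(add_population, shapes), start=1):
            results.append(res)
            if i % report_every == 0 or i == len(shapes):
                print(f"processed {i}/{len(shapes)} shapes")
    return results

if __name__ == "__main__":
    run(list(range(100)))
```

In practice a library like tqdm can wrap the `imap` iterator directly to get a live progress bar instead of the manual prints.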
I would like to create a PR making `set_substations_ids` and `set_lines_ids` more efficient. The only thing holding me back is the lack of input and output dataframes. If someone could provide an input dataframe and the expected output (for the given params), that would be very helpful.
Great @mnm-matin !
This task is very interesting and I'm very happy to support you. I have some ideas on how to do it, and it could be good to discuss them. This task should also be quite easy to do. Shall we have a 30-minute chat about it?
I can provide input and output files for any country in the world. I'd recommend starting the debugging with small countries and then testing a large one.
A good large test case could be the US or China; for a small one, maybe Nigeria should do the job. What do you think?
Thanks @davide-f
That sounds great, happy to have a meeting. The input and output files (perhaps over Discord) would be awesome. For `set_substations_ids(buses, distance_crs, tol=2000)`: input is the `buses` dataframe; output is the `buses` dataframe with the added columns.
I will keep the PR limited to just `set_substations_ids`, but the approach should work for `set_lines_ids` as well.
Large or small countries would be nice for benchmarking. Mainly, I require the input and output files just to make sure I'm getting the right results.
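Not the actual implementation, but to illustrate one way the O(n^2) pairwise scan could be avoided: a sketch of a `set_substations_ids`-like grouping using spatial hashing plus union-find, so that only buses in the same or adjacent tol-sized grid cells are ever compared. Coordinates are assumed to be in a projected CRS (units of metres); the function name and signature are hypothetical:

```python
from collections import defaultdict
from math import hypot

def assign_station_ids(coords, tol=2000.0):
    """Give buses closer than `tol` the same station id (hypothetical sketch)."""
    parent = list(range(len(coords)))

    def find(i):
        # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # hash every bus into a tol-sized grid cell: O(n)
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(coords):
        cells[(int(x // tol), int(y // tol))].append(idx)

    # only buses in the same or an adjacent cell can be within tol
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), []):
                    for i in members:
                        if i < j and hypot(coords[i][0] - coords[j][0],
                                           coords[i][1] - coords[j][1]) <= tol:
                            union(i, j)

    # relabel union-find roots as consecutive station ids
    labels, ids = {}, []
    for i in range(len(coords)):
        ids.append(labels.setdefault(find(i), len(labels)))
    return ids
```

Note that, as with any transitive grouping, chains of buses can form clusters wider than `tol`; whether that matches the current behaviour would need to be checked against the reference input/output dataframes.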
To track the needed improvements, these are the current time requirements, in hours, for the US:

```
rule                                  key
download_osm_data         total_time     0.102822
clean_osm_data            total_time     3.603223
build_shapes              total_time     4.601684
build_bus_regions         total_time     0.324454
build_osm_network         total_time    16.631785
build_demand_profiles     total_time     0.059216
build_powerplants         total_time     1.337166
build_renewable_profiles  total_time     0.637599
base_network              total_time     0.105632
add_electricity           total_time     0.059819
simplify_network          total_time     0.211443
cluster_network           total_time     0.019749
solve_network             total_time     0.110048
total_comp_stats          total_time    30.608660
Name: US, dtype: float64
```
The PRs on `build_osm_network` by @mnm-matin can help tackle the major bottleneck. The current PR #650 by @GridGrapher can significantly help break down the computational time of `build_shapes`. The subsequent bottleneck is `clean_osm_data`, in particular the function `set_countryname_by_shape`.
Towards global PyPSA-Earth
In this issue, we track the major requirements needed to successfully run the workflow at complete global scale. I have been running parts of the model using `countries=["Earth"]`, and in the following I summarize some findings; this list is to be populated by additional comments.

- `download_osm_network`: needed some fixes but it generally works; however, I needed to rerun the workflow several times because the procedure got stopped repeatedly due to download limits at the server end. When the user is interested in downloading large areas, it may be better to download the combined/continental chunks rather than each country; however, this leads to less generalizability and to duplication if the user is then interested in smaller areas. Alternatively, some delays can be manually inserted to avoid the problem. Some tests may be needed when larger regions are more in demand.
- `clean_osm_network`: this is the real big deal. Currently, after one day of complete execution with the `names_by_shapes` option disabled, we are still far from done. The procedure is stuck at `africa_shape.contains`. I noticed that functions on polygons are super heavy to execute and we need to work hard on that. Some comments:
  - `split_cells` may be removed in favour of the more general `split_cells_multiple` (for clarity and to avoid duplication);
  - an `ext_country_shapes`, i.e. the unary_union of the country_shapes with the corresponding offshore_shapes, may be used by several scripts.
- `build_osm_network` is going to be another big deal, as the functions `set_substations_ids`, `fix_overpassing_lines` and `set_lines_ids` already take a long time for the Africa model and their complexity is O(n^2); we shall improve that.
- `build_cutout`: works with the feat/era5-monthly-retrieval branch, yet the global cutout is about 300 GB. The new compression features available in atlite for large cutouts may be tested and used.
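On the heavy polygon operations mentioned above: a common mitigation is to reject candidates with a cheap bounding-box test before running the exact (and expensive) containment check. Below is a toy, stdlib-only sketch of that idea; real code would instead use shapely's prepared geometries or an STRtree spatial index, and the function names here are hypothetical:

```python
def bbox(polygon):
    # axis-aligned bounding box of a polygon given as (x, y) tuples
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def point_in_polygon(pt, polygon):
    # minimal ray-casting point-in-polygon test (the "expensive" step)
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def contains_fast(pt, polygon, box):
    xmin, ymin, xmax, ymax = box
    if not (xmin <= pt[0] <= xmax and ymin <= pt[1] <= ymax):
        return False  # cheap rejection: no polygon walk needed
    return point_in_polygon(pt, polygon)
```

When most query points fall outside the shape (as with points scattered worldwide against a single continent polygon), the prefilter answers the vast majority of calls without ever touching the polygon's vertices.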