davide-f opened this issue 2 years ago
@davide-f, thank you for the explanations regarding performance during the discussion. Preliminary results of testing the workflow for China (this is a UX test rather than profiling):
1) `build_shapes` takes not so long, about 12 hours, but that does not feel very comfortable, especially during the first run, since the script remains silent for all those 12 hours. (Maybe I'm missing some logs?) Could it make sense to attach some progress tracking to the computational loop itself?
2) Regarding the parallelization which you mentioned is not yet working for adding the population: could you please clarify a bit? Is it `imap` that is not working there, or is it this piece which is not under `imap` but nevertheless slow and could be parallelized as well?
3) In `build_osm_network` the limiting stage at the moment is not (yet) `set_substations_ids` or `set_lines_ids` but `fix_overpassing_lines`. Currently only half has been processed after about 30 hours. [It's probably a good idea to switch this option off for the first quick run... :)] It doesn't feel so bad, as it's clear that something is going on and it's possible to estimate the finishing time. However, could the performance of that stage also be taken into consideration when working on performance?
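On points 1) and 2) above, here is a minimal sketch (not the actual PyPSA-Earth code) of how progress reporting can be attached to a parallel loop built on `multiprocessing.Pool.imap`; the worker `add_population` and the `shapes` list are hypothetical stand-ins for the real per-shape computation:

```python
from multiprocessing import Pool

def add_population(shape_id):
    # hypothetical stand-in for the expensive per-shape computation
    return shape_id * 2

def run(shapes, processes=4, report_every=25):
    results = []
    with Pool(processes=processes) as pool:
        # imap yields results one by one as workers finish, so progress
        # can be reported without waiting for the whole map to complete
        for i, res in enumerate(pool.imap(add_population, shapes), start=1):
            results.append(res)
            if i % report_every == 0 or i == len(shapes):
                print(f"processed {i}/{len(shapes)} shapes")
    return results

if __name__ == "__main__":
    run(list(range(100)))
```

In practice a library like tqdm can wrap the `imap` iterator directly to get a live progress bar instead of the manual prints.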
I would like to create a PR making `set_substations_ids` and `set_lines_ids` more efficient. The only thing holding me back is the lack of input and output dataframes. If someone could provide an input dataframe and the expected output (for the given params), that would be very helpful.
Great @mnm-matin !
This task is very interesting and I'm very happy to support you. I have some ideas on how to do it, and it could be good to discuss them. This task should also be quite easy to do. Shall we have a 30-minute chat about it?
I can provide input and output files for any country in the world. I'd recommend starting the debugging with small countries and then testing a large one.
A good large test case could be the US or China; for a small one, maybe Nigeria should do the job. What do you think?
Thanks @davide-f
That sounds great, happy to have a meeting. The input and output files (perhaps over Discord) would be awesome. For `set_substations_ids(buses, distance_crs, tol=2000)`: input is the `buses` dataframe; output is the `buses` dataframe with the added columns.
I will keep the PR limited to just `set_substations_ids`, but the approach should work for `set_lines_ids` as well.
Large or small countries would be nice for benchmarking. Mainly, I require the input and output files just to make sure I'm getting the right results.
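Not the actual implementation, but to illustrate one way the O(n^2) pairwise scan could be avoided: a sketch of a `set_substations_ids`-like grouping using spatial hashing plus union-find, so that only buses in the same or adjacent tol-sized grid cells are ever compared. Coordinates are assumed to be in a projected CRS (units of metres); the function name and signature are hypothetical:

```python
from collections import defaultdict
from math import hypot

def assign_station_ids(coords, tol=2000.0):
    """Give buses closer than `tol` the same station id (hypothetical sketch)."""
    parent = list(range(len(coords)))

    def find(i):
        # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # hash every bus into a tol-sized grid cell: O(n)
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(coords):
        cells[(int(x // tol), int(y // tol))].append(idx)

    # only buses in the same or an adjacent cell can be within tol
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), []):
                    for i in members:
                        if i < j and hypot(coords[i][0] - coords[j][0],
                                           coords[i][1] - coords[j][1]) <= tol:
                            union(i, j)

    # relabel union-find roots as consecutive station ids
    labels, ids = {}, []
    for i in range(len(coords)):
        ids.append(labels.setdefault(find(i), len(labels)))
    return ids
```

Note that, as with any transitive grouping, chains of buses can form clusters wider than `tol`; whether that matches the current behaviour would need to be checked against the reference input/output dataframes.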
To track the needed improvements, these are the current time requirements, in hours, for the US:

```
rule                                  key
download_osm_data         total_time     0.102822
clean_osm_data            total_time     3.603223
build_shapes              total_time     4.601684
build_bus_regions         total_time     0.324454
build_osm_network         total_time    16.631785
build_demand_profiles     total_time     0.059216
build_powerplants         total_time     1.337166
build_renewable_profiles  total_time     0.637599
base_network              total_time     0.105632
add_electricity           total_time     0.059819
simplify_network          total_time     0.211443
cluster_network           total_time     0.019749
solve_network             total_time     0.110048
total_comp_stats          total_time    30.608660
Name: US, dtype: float64
```
The PRs on `build_osm_network` by @mnm-matin can help tackle the major bottleneck. The current PR #650 by @GridGrapher can significantly help break down the computational time of `build_shapes`. The subsequent bottleneck is `clean_osm_data`, in particular the function `set_countryname_by_shape`.
Towards global PyPSA-Earth
In this issue, we track the major requirements needed to successfully run the workflow at complete global scale. I have been running parts of the model using `countries=["Earth"]`, and in the following I summarize some findings; this list is to be populated by additional comments.

- `download_osm_network`: needed some fixes but it generally works; however, I needed to rerun the workflow several times because the procedure got stopped repeatedly due to download limits at the server end. When the user is interested in downloading large areas, it may be better to download the combined/continental chunks rather than each country; however, this leads to less generalizability and to duplication if the user is then interested in smaller areas. Alternatively, some delays can be manually inserted to avoid the problem. Some tests may be needed when larger regions are more in demand.
- `clean_osm_network`: this is the real big deal. Currently, after one day of complete execution with the `names_by_shapes` option disabled, we are still far from done. The procedure is stuck at `africa_shape.contains`. I noticed that functions on polygons are super heavy to execute and we need to work hard on that. Some comments:
  - `split_cells` may be removed in favour of the more general `split_cells_multiple` (for clarity and to avoid duplication);
  - an `ext_country_shapes`, i.e. the unary_union of the country_shapes with the corresponding offshore_shapes, may be used by several scripts.
- `build_osm_network` is going to be another big deal, as the functions `set_substations_ids`, `fix_overpassing_lines` and `set_lines_ids` already take a long time for the Africa model and their complexity is O(n^2); we shall improve that.
- `build_cutout`: works with the feat/era5-monthly-retrieval branch, yet the global cutout is about 300 GB. The new compression features available in atlite for large cutouts may be tested and used.
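On the heavy polygon operations mentioned above: a common mitigation is to reject candidates with a cheap bounding-box test before running the exact (and expensive) containment check. Below is a toy, stdlib-only sketch of that idea; real code would instead use shapely's prepared geometries or an STRtree spatial index, and the function names here are hypothetical:

```python
def bbox(polygon):
    # axis-aligned bounding box of a polygon given as (x, y) tuples
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def point_in_polygon(pt, polygon):
    # minimal ray-casting point-in-polygon test (the "expensive" step)
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def contains_fast(pt, polygon, box):
    xmin, ymin, xmax, ymax = box
    if not (xmin <= pt[0] <= xmax and ymin <= pt[1] <= ymax):
        return False  # cheap rejection: no polygon walk needed
    return point_in_polygon(pt, polygon)
```

When most query points fall outside the shape (as with points scattered worldwide against a single continent polygon), the prefilter answers the vast majority of calls without ever touching the polygon's vertices.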