GbotemiB commented 11 months ago

This PR contains a validation notebook that explains the discrepancy found here.

Objective

The Objective of this notebook is to validate OSM data against ENTSOE data using resource files for Austria and North Macedonia.

Results

It was confirmed that the reason for the discrepancy was because the elec.nc file for OSM that was compared against that of PyPSA-Eur contains voltages that are lower than 220kV that was rebased to 220kV. It is advised that comparison should be done using base_network_csv files since there is no rebase of voltage in the csv files.

PS: There is an ongoing conversation about this notebook here

ekatef commented 11 months ago

Hello @GbotemiB! Great improvements! It looks like we are converging :)

Adding a couple of comments aimed to make the code a bit easier to read.

I wonder if if_country_in_entsoe( ) checking function may be a bit of overshoot. We know in advance that the requested countries is in ENTSOE data. Agree that it has been a valuable finding with missed countries in ENSTOE data, but this notebook has quite a narrow purpose.
I have an impression that geospatial_plot( ) can be made probably a bit clearer if
- we would remove # Filter for country code block and avoid using the global variables (at_country_shape etc.) inside the function. I see two possible solutions:
- country code would be transferred as an argument into geospatial_plot( ) and loading data would be done inside geospatial_plot( ) itself.
- we would transfer as arguments a country shape and both osm dataframes

What is your feeling about that?

Has it been a specific reason to start a name of _apply_simplication( ) with an underscore? Do you mean that the function is intended for internal use?

Regarding the plots, I have the following comments.

The plots look amazing. A really great work!
In Geospatial Visualization, it looks like raw and clean geometries are mostly completely overlaid by base data, right? It that is the case, I think we may state it it the introduction to the section. Something like: "In this section we compare topology of raw, clean and base datasets for transmission lines. Script clean_osm_data removes lines outside of the considered region, so only "raw" lines present outside each of the considered countries Inside a country, all three PyPSA-Earth datasets are mostly fully overlapped." Feel free please to improve this draft.
I think, it would be helpful for a reader to add a couple of lines to explain a purpose of Simplified vs Non Simplified and some kind of a conclusion. For example: "ENTSOE-extracted data have a simplified topology, while OSM-extracted lines reproduce topology of a transmission network in quite a detail. To estimate an effect of this simplification, we have applied a simplification procedure to OSM line geometries, as well. Douglas-Peucker simplification algorithm has been applied for that."

GbotemiB commented 11 months ago

I wonder if if_country_in_entsoe( ) checking function may be a bit of overshoot. We know in advance that the requested countries is in ENTSOE data. Agree that it has been a valuable finding with missed countries in ENSTOE data, but this notebook has quite a narrow purpose.

This function descibed here is to make it easy to filter through the entsoe for just AT and MK using the geometry cordinates.

Has it been a specific reason to start a name of _apply_simplication( ) with an underscore? Do you mean that the function is intended for internal use?

Actually, no reason in particular. i will rename the variable name 😀.

I have an impression that geospatial_plot( ) can be made probably a bit clearer if

we would remove # Filter for country code block and avoid using the global variables (at_country_shape etc.) inside the function. I see two possible solutions: country code would be transferred as an argument into geospatial_plot( ) and loading data would be done inside geospatial_plot( ) itself. we would transfer as arguments a country shape and both osm dataframes

Sounds great. I shouldnt be having global variables in my functions.

ekatef commented 11 months ago

@GbotemiB amazing work and very nice presentation. I think, we need a final clean-up round, and that task will be completed.

Find please the comments bellow.

My feeling is that a block with if_country_in_entsoe() can be simplified:
- not sure the function is really needed: it consists of only a single line, while entsoe_ref["if_at"] = if_country_in_entsoe(at_country_shape, entsoe_ref) is not nessesarily more clear as compared with base_data["geometry"].apply(lambda row: row.within(country_df["geometry"][0]))
- the function name is misleading, as well as the comment to applying it:
```
# Determine if Austria (AT) is in the ENTSO-E data.
entsoe_ref["if_at"] = if_country_in_entsoe(at_country_shape, entsoe_ref)
```
  As you have explained, that is filtering rather than actually definition if the country present in the data
- the filtering itself can be probably simplified: could you please check if geo_df.loc[geo_df.within(country_df["geometry"][0])] works?

I'd avoid duplicating comments, like currently is the case in this part:


# Clean the Austria (AT) OSM raw data, addressing non-uniform column names and NaN values.
at_osm_raw_lines = preprocess_raw_data(at_osm_raw_lines)

Clean the North Macedonia (MK) OSM raw data, addressing non-uniform column names and NaN values.

mk_osm_raw_lines = preprocess_raw_data(mk_osm_raw_lines)


My feeling is that it would be more clear to keep both code lines together and have only one comment for both of the lines

3. Note please that the comments in the code should explain **why** or **what for** is the following part needed, not **what** is it doing. On the contrary, names of the functions or methods should indicate what actually they are doing.

In particular, I think that this comment should be revised: `# Applying simplification function to preprocess the dataframe with __apply_simplification func`

4. A suggestions on the naming improvement (that is a really hard part!):
- it would be great to rename `converted_length` into something like `reprojected_length` or  `length_3035`;
- it would be greate to make `create_plot` name more specific.

5. A section `Comparison with Barchart` misses explanations.

6. Would be great to specify also a unit for `tolerance` in the dedicated comment `# Define tolerance level for simplification`: is it in meters?

7. Aren't the bar diagrams duplicated in `Comparison with Barchart` and `Section 2 Applying Simplification`?

8. Great that you have added explanations to the Conclusion. To make our statement more credible, I'd probably add a bit more details: instead to declare that OSM data are more accurate, we may explain that OSM data are more detailed and more accurately represent the actual grid topology.

GbotemiB commented 11 months ago

Aren't the bar diagrams duplicated in Comparison with Barchart and Section 2 Applying Simplification?

The plots are not the same, they look similar, which makes sense because of the effect of simiplication. if you take a look at the Simplified vs Non-Similified Section, you will find that the difference is not the large.

This is the difference in kilometers

GbotemiB commented 11 months ago

@ekatef Thank you for the wonderful comments and review. I have pushed my recent commits effecting the changes you suggested. I will be glad to have you review the notebook with the recent commits.

GbotemiB commented 10 months ago

Hi @ekatef, I have refined the conclusion in the notebook. We can start iterating over the notebook from here making it understandable for a larger audience. You can check my recent commits.

ekatef commented 10 months ago

Hey @GbotemiB and thanks for the nice conclusion draft!

A couple of comments.

Some typo fixing may not harm like ensure that ENTSO is capitalised everywhere.
The statement in the beginning may feel like wishful thinking: "OSM data provides a more detailed and accurate representation". That may be totally correct in its' essence, but a reader should have an opportunity to judge him/herself, using the evidence we provide. The aim of the conclusion is to summarise the evidence provided in the notebook.
Generally, it's better to write the test in a way when each next sentence is connected with the previous one, and the whole text presents some story. That facilitates reading significantly and helps to communicate the intended message.

I'd suggest you to write the next iteration of the conclusion, giving it some structure. Making some preliminary plan may be probably helpful for that. Basing on the notebook content and the draft you have created, I'd suggest the following plan.

Remind briefly which OSM-extracted data have been considered (basically, that has been the whole data preparation workflow).
Explain which types of comparison have been done: topology and quantitative comparison for length values -> effects have been found which looked like possible using of simplification methods on ENTSO data
Check a hypothesis on simplification applying geopandas simplify method (which utilises Douglas-Peucker algorithm) with a quite satisfactory results: simplified OSM topologies get in fact closer to ENTSO data.
Would be great to provide quantitative estimation to lengths comparison: no need to include a lot of numbers into the conclusion, but a reference to an order of magnitude/one or two nicely looking estimations.

Then the main conclusion point would look well-reasoned ;)

ekatef commented 10 months ago

Btw, with a fresh look it feels like clarity of the notebook may be slightly improved:

1) under # Create plot using the simplified dataframe in [26] cell, that is not clear if the plot shows values for the original or the simplified geometry; in both cases there may be duplication there: do we really need it?; 2) in cells [28] and [29], not sure if it's evident which OSM-deriven values have been taken for comparison. I assume there have been simplified ones, but that does not follow directly from the table itself. Would be great to double-check it and add some explanations to the table. 3) does it probably make sense to add such a percentage difference for non-simplified values, as well? 4) as a side note: please keep in mind that it's better show only meaningful digits when presenting numbers: in the tables, in particular. Set-up of the proper number formatting can be helpful for that.

GbotemiB commented 10 months ago

Hi @ekatef, thanks for the review. Based on the recent comments, I have revamped the conclusion based on the suggestions.

Here is the new conclusion.

This analysis aimed to investigate OSM data against ENTSOE data using Austria and North Macedonia. The following process was carried out during the analysis:

A geospatial comparison plot for 220kV and 380kV lines shows the lines topology.
A bar-chart comparison plot comparing the length of the lines in OSM and ENTSOE data.
A hypothesis was tested on the lines by applying the geopandas simplify method, leveraging the Douglas-Peucker algorithm.
A geospatial comparison plot and a bar-chart comparison plot were done based on this hypothesis. This effect of the hypothesis simplified the topologies for OSM data which made the results from OSM data closer to that of ENTSOE data by an estimated percentage difference of 5%
A CRS check was done to verify the length of lines of OSM data by comparing with lines measured on google maps manually.

The analysis showed an estimated percentage difference of 11% between osm-data and ENTSOE-data. After applying the Douglas-peucker algorithm, an estimated percentage difference of about 5% was observed in the analysis.

ekatef commented 9 months ago

Hello @GbotemiB! Thank you for working on that. My feeling is that we are converging.

Comments on the overall structure of the notebook:

Would be nice to clarify for the percentage_differences between what and what they are taken. I assume, that is always a difference between the length in question x and the ENTSOE length x_ensto: percentage_difference = (x - x_enstso)/x_entso. But I may be wrong.
Would be perfect to find some way to decrease a number of displayed digits. Hopefully, it may be fixed by specifying format options for the notebook.

Regarding the Conclusion: now it reads much smoother! A couple of comments:

Not sure it feels perfect to have introductory words (In summary and In conclusion) in the conclusion. I feel that they may do the text "heavier". But that is very subjective 🙂
I think it could help to add a bit of context into the conclusion to remind that we are dealing with PyPSA-Earth data extraction workflow. It would be enough to add a few of words to explain what are these OSM-extracted data about in the beginning and which csv-s do we mean in the end.
I do not understand the third paragraph ("While a detailed quantitative ..."): can't agree that the quantitative estimation is not well presented in the notebook, and not sure which impact is meant. Although, the writing style is very nice.

I think we are very close to finalise. Great work!

GbotemiB commented 9 months ago

Thanks for the review. Based on the comments.

Looks like the formula you are referring to percentage change. I think we have to come to a conclusion on whether to use percentage change or percentage difference.
Thanks for the tip. I found an option with pandas.

GbotemiB commented 9 months ago

Hi @ekatef, I have revised the notebook based on your comments. Ready for another revision 🙃

ekatef commented 9 months ago

Hello @GbotemiB! Perfect!! :D

Great that you have found a neat way to fix numeric precision issue. And I very much like the result of you work.

The only comment left is an answer to your question regarding the percentage change: I think it should be rather a relative change as compared to ENTSO reference data, as we assume them to be standard de facto. Would be also perfect to add a couple of the relative changes values to the conclusion, like: closely align with ENTSO data: being about x1..x2% for Austria and y1..y2% for Northern Macedonia (where x1, x2 correspond to the network which has been made as close as possible to ENSTOE data).

After that, I think the notebook is basically ready to merge.

@pz-max do probably you have any comments or recommendations? 🙂

GbotemiB commented 9 months ago

Thank you @ekatef for the review, I will make the final changes as you have suggested.

ekatef commented 8 months ago

Thank you @ekatef for the review, I will make the final changes as you have suggested.

Great work! I think, the only point left is the very final polishing of the conclusion, and the PR is ready to merge.

Thanks for the amazing contribution! 😄

ekatef commented 8 months ago

Merged 🎉 🎉 🎉 Thanks for the outstanding contribution, @GbotemiB! 🥇 Fantastic work which will be definitely very helpful in future.

GbotemiB commented 8 months ago

Thank you for the amazing support.

pypsa-meets-earth / documentation

Validation for OSM and ENTSOE data using resource files for Austria and North Macedonia #67

Objective

Results

Clean the North Macedonia (MK) OSM raw data, addressing non-uniform column names and NaN values.