network analysis for osm and pypsa-eur data

GbotemiB commented 10 months ago

@pz-max @ekatef

Here is the notebook that was used for the comparison between osm-data and pypsa-eur data. Please kindly review the notebook.

pz-max commented 10 months ago

Hi @GbotemiB , Can you update the notebook to start from 0/1 execution. (Restart kernel)

Noting the code:

entsoe_twkm = sum(entsoe.lines['s_nom'] * entsoe.lines['length'] * entsoe.lines['num_parallel'])
osm_220 = osm.lines[(osm.lines.v_nom > 220.0) & (osm.lines.dc == False)]

Can you remove in sum(entsoe.lines['s_nom'] * entsoe.lines['length'] * entsoe.lines['num_parallel']) the entsoe.lines['s_nom'] as explained in the discord communication? Please also change the same calculation mistake for the OSM bit. We want only km as unit/ not TWkm. Be aware of the unit fixes that are potentially required.

GbotemiB commented 10 months ago

Hello @pz-max Can you help me clarify what you meant by updating the notebook to start from 0/1 execution.

pz-max commented 10 months ago

Sure. For notebooks you can run each cell multiple times (on the left side you see at which cell execution your count is).

If you restart the kernel and delete all outputs and then rerun the full notebook it should be fine and the count should go in the right order.

GbotemiB commented 10 months ago

I have made the changes.

I noticed that the first graph changed. You can look into it

ekatef commented 10 months ago

Hello @GbotemiB and thanks for the PR.

As a comment, it would be nice to keep the same snake-case naming style, to be consistent with the other notebooks.

Regarding the essence of the changes, now the diagram looks much closer to the reference data. That is great improvement! Could you please investigate the effect of the introduced changes on the other plots?

I mean, if you have understood that it was a mistake in the initial version you provided, that is usually a good idea to analyse its' nature and reasons, check which effects it had on the results and introduce all the corrections needed. Errors happen, and that is essential to know how to fix and prevent them ( saying from the experience 😄).

GbotemiB commented 10 months ago

Hi @ekatef,

Taking into consideration the comment you provided, the recent changes made to the notebook only affected the comparison that had to do with transmission capacity where s_nom * length of each line * the number of parallel conductors was introduced. But you can check through to see if something was missed.

I have also adapted the snake-case naming style.

pz-max commented 10 months ago

I have made the changes.

I noticed that the first graph changed. You can look into it

The title says "km_per_length" but you mean maybe "length_in_km"? @GbotemiB

ekatef commented 10 months ago

Hey @GbotemiB! Thanks for making the changes in the name.

Copying here @pz-max comments from our discussion:

Your units are sometimes wrong (e.g. there is a 10^6 devision that is not correct)

The graph titles/lables needs to be fixed

The log image is not really necessary (suggest to remove it)

Comparing the attached image to the image before suggest that something goes wrong.. Needs fixes.

It would be great if you could please address all the points. Regarding the log images, they can make sense if you want to zoom into some details (like difference between the countries). But for log distribution by voltage seems to be kind of redundant.

I have looked in you code and have an impression that it could make sense to revise it a bit to improve readability. In particular, you seem to be using a plenty of variables to do the same analysis, which make it difficult to trace all the calculations. That is the reason why an effect of the corrections you done for the first plot doesn't propagate to the whole code. It would be also great to decrease width of your code: PEP8 specifies the maximum line length at 79 symbols.

That is very needed work which requires quite careful treatment. Looking forward for the updated results :)

GbotemiB commented 10 months ago

Thanks Katia. I will refactor the code to effect the changes. I will also work on the plots to make it interactive, this way the difference between the countries will be observable.

On Thu, Sep 21, 2023, 10:10 Ekaterina @.***> wrote:

Hey @GbotemiB https://github.com/GbotemiB! Thanks for making the changes in the name.

Copying here @pz-max https://github.com/pz-max comments from our discussion:

Your units are sometimes wrong (e.g. there is a 10^6 devision that is not correct)

The graph titles/lables needs to be fixed

The log image is not really necessary (suggest to remove it)

Comparing the attached image to the image before suggest that something goes wrong.. Needs fixes.

It would be great if you could please address all the points. Regarding the log images, they can make sense if you want to zoom into some details (like difference between the countries). But for log distribution by voltage seems to be kind of redundant.

I have looked in you code and have an impression that it could make sense to revise it a bit to improve readability. In particular, you seem to be using a plenty of variables to do the same analysis, which make it difficult to trace all the calculations. That is the reason why an effect of the corrections you done for the first plot doesn't propagate to the whole code. It would be also great to decrease width of your code: PEP8 https://peps.python.org/pep-0008/#maximum-line-length specifies the maximum line length at 79 symbols.

That is very needed work which requires quite careful treatment. Looking forward for the updated results :)

— Reply to this email directly, view it on GitHub https://github.com/pypsa-meets-earth/documentation/pull/64#issuecomment-1729168340, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALUUPPA4UYITIVAQ3R3J2DDX3QAABANCNFSM6AAAAAA45TC23E . You are receiving this because you were mentioned.Message ID: @.***>

ekatef commented 10 months ago

Thanks Katia. I will refactor the code to effect the changes. I will also work on the plots to make it interactive, this way the difference between the countries will be observable.

Super! Looking forward for the revisions 🙂

Regarding the plots: feel free to experiment with interactivity, but I'd keep static versions too, as they are very convenient to quickly get a feeling on what is going on.

ekatef commented 10 months ago

Hello @GbotemiB! Thanks a lot for making the changes. The code looks much cleaner now. Great work!

As a methodological suggestion, I'd also decrease a number of variables. There is the tidy data concept which actually can make life easier.

In short, that means keeping data in dataframes of standartised structure: variables are in columns, observations are in rows, and values are in cells. So, instead of creating entsoe_length_in_km in the first block, you can add an additional column in entsoe_hv dataframe to keep equivalent length of the lines, and re-use this column further. That would eliminate code duplication and allows to downstream changes in your code easily.

Not sure however, if these changes would have significant effect on the result, as it feels like we have a rather grave error somewhere along the workflow which should be relatively easy to catch :)

ekatef commented 10 months ago

@GbotemiB I'd suggest to read and think about data processing in meantime, while now to look into the alternative approaches.

Copying your great summary of the discussion in Discord channel:

analysis of the line dataframe for PyPSA Earth and PyPSA Eur;
checking for CRS error by looking into the same line in OSM / ENTSO / some third-parties data (like google map);
cross-comparison of raw and clean data with AT and MK data;
comparison of PyPSA Eur and PyPSA earth analysis in geometrical terms.

My feeling is that the best next step would be to do comparison for raw and clean OSM data. Reference PyPSA-Eur data are available here.

You may use country_shapes.geojson shape files for AT and MK available in shapes subfolder for each of the countries to filter-out PyPSA-Eur ENTSO data for each of them. Note please that voltage is in V for PyPSA-Earth and in kV for PyPSA-Eur.

GbotemiB commented 9 months ago

Thanks for the feedbacks. I will work on the code and get back to you.

GbotemiB commented 9 months ago

Hey @GbotemiB! 🙂 Thanks for making the adjustments.

Just to be sure: have you used elec.nc or base.nc to make comparison? Don't think it's very important in this case, just for information. Generally, I'd rather keep the original file names using folders to organise them.

I think, we need the final iteration. The code and the plots are still looking great, while explanations could be probably elaborated a bit. (Probably, the hardest part of documentation writing...)

Btw, please feel free to resolve approving review comments, but please do not resolve review comments, if they contain any points to be addressed. Otherwise, we risk to miss some under-addressed points 😄

Have added some comments to the introduction and the conclusion. Happy to elaborate, if something is not clear.

Regarding the comparison, I used elec.nc for both pypsa-earth and pypsa-eur.

ekatef commented 9 months ago

@GbotemiB great contribution with AT-MK notebook!

I can't put the review comments into the code as GitHub doesn't allow to attach in-code comments due to a large diff. It would be great to address the following points.

Code comments

Looking to the line entsoe_ref = gpd.GeoDataFrame(entsoe_ref_df, geometry="geometry", crs="EPSG:4326"), I'm not quite sure that ENTSO geometry is using EPSG:4326 CRS. That is the global CRS, while PyPSA-Europe is mostly using EPSG:3035 focused on Europe. Could you please double-check?
Function high_voltage_df() seems to apply pretty the same procedure for "AT" and "MK". It would be great to remove this duplication using a country name as a parameter.
A code block [8] ("EPSG:4326") seems to be not needed.
Not sure it's a good idea to replace voltage values in block [9] (starting from entsoe_ref['voltage'].replace(400000, 380000, inplace=True)). Generally, the original data should always be kept to avoid distortion during analysis. In our particular case, I think it would be enough to adjust voltage threshold when filtering in high_voltage_df (), instead of modifying the voltages in the data.
Methods chaining can be used to make implementation of simplification more compact (section Section 2 Applying Simplification)

Presentation comments

The title can be probably shortened and made more clear. In particular, I'd replace countries names with full names and remove mentions of "raw" and "clean" which may be a bit too much of details for the title. I think something like "OSM vs ENTSO validation: comparison for Austria and Macedonia". Obviously, feel free to improve.
Introduction should improve objective of the study you done, as well as short presentation of the used data sources.
Section 2 Applying Simplification section needs some explanations. My feeling is that our Discord discussion can provide you some hints on that :)
Agree with the Conclusion, in general. However, some reasoning should be added to explain how does the summary follows from the analysis results.

Do not be discouraged by the number of comments ;) You have done amazing work, and now we need to present its' results as clear as we can.

GbotemiB commented 9 months ago

@GbotemiB great contribution with AT-MK notebook!

I can't put the review comments into the code as GitHub doesn't allow to attach in-code comments due to a large diff. It would be great to address the following points.

Code comments

Looking to the line entsoe_ref = gpd.GeoDataFrame(entsoe_ref_df, geometry="geometry", crs="EPSG:4326"), I'm not quite sure that ENTSO geometry is using EPSG:4326 CRS. That is the global CRS, while PyPSA-Europe is mostly using EPSG:3035 focused on Europe. Could you please double-check?

Function high_voltage_df() seems to apply pretty the same procedure for "AT" and "MK". It would be great to remove this duplication using a country name as a parameter.

A code block [8] ("EPSG:4326") seems to be not needed.

Not sure it's a good idea to replace voltage values in block [9] (starting from entsoe_ref['voltage'].replace(400000, 380000, inplace=True)). Generally, the original data should always be kept to avoid distortion during analysis. In our particular case, I think it would be enough to adjust voltage threshold when filtering in high_voltage_df (), instead of modifying the voltages in the data.

Methods chaining can be used to make implementation of simplification more compact (section Section 2 Applying Simplification)

Presentation comments

The title can be probably shortened and made more clear. In particular, I'd replace countries names with full names and remove mentions of "raw" and "clean" which may be a bit too much of details for the title. I think something like "OSM vs ENTSO validation: comparison for Austria and Macedonia". Obviously, feel free to improve.

Introduction should improve objective of the study you done, as well as short presentation of the used data sources.

Section 2 Applying Simplification section needs some explanations. My feeling is that our Discord discussion can provide you some hints on that :)

Agree with the Conclusion, in general. However, some reasoning should be added to explain how does the summary follows from the analysis results.

Do not be discouraged by the number of comments ;) You have done amazing work, and now we need to present its' results as clear as we can.

Thanks @ekatef for the amazing comments. Please can you elaborate more on option 5 for code comments.

Also, concerning option 4 in code comments, the reason for replacing the values is because we are assuming that 400kV is 380kV. high_voltage_df() is used to filter for only voltages greater 220kV.

GbotemiB commented 9 months ago

Hello @davide-f, Just to confirm. What is the formula for line capacity?

ekatef commented 9 months ago

Hello @davide-f, Just to confirm. What is the formula for line capacity?

Hey @GbotemiB! I'd use this definition from pypsa-earth code ;)

davide-f commented 9 months ago

Hello @GbotemiB :)

Regarding this PR, the description of cells 6-11 of mt at comparison may be slightly improved with some comments or little markdown cells, but they are very good! Congratulations!

Regarding the power calculations, I confirm the formula by Katia that is the official one used in pypsa-earth (and eur). On electrical basis, in a three-phase system, the power is given as s_nom = sqrt(3) v_nom i_nom, where v_nom is the concatenated voltage and i_nom is the nominal phase voltage.

An interesting comparison regarding TWkm would be the following:

For base.nc (or elec.nc is the same), do: sum s_nom * length for every line. The s_nom in the lines is calculated with the formula above. However, in the workflow, we do a basic clustering process where depending on the voltage of a line, we specify the line type of that line. A line type is characterized by a specific i_nom and we use a standard v_nom (220, 300 and 380). For example, a 66kV line will be associated to a 220kV line type and hence its s_nom. This however leads to an implicit ceiling process: all voltages below 220kV (e.g. 66kV) will be recognized as 220kV and so on for the other voltage levels. This may create distortions into the calculation, which is why it may be interesting to investigate the following comparison:
Using the files in resources/base_newtork/all_lines_build_network.csv, calculate the s_nom^bis = sqrt(3) voltage i_nom, where voltage is the value of the column voltage of the csv and the i_nom is calculated according to the linetype: for voltages below 220kV we use the standard type of line 220kV, for those between 220kV and 300kV we use the standard type 330kV and so on. The standard types can be found in the config file under lines->types. To do this, in the notebook, you could create a function like get_i_nom_by_voltage that takes a voltage value and returns the corresponding current. Inside the function, you may have an ordered dictionary where the keys are the voltage levels (e.g. 110, 220, 300) and as values you have the pypsa standard types, e.g. "Al/St 240/40 2-bundle 220.0". then, by looping on the keys in an ordered manner, you may return the i_nom of the first standard type whose voltage key exceeds the v_nom of the line.

In this PR, the first approach may be interesting, while the second one is more for understanding if the 4x factor may be due to that problem. What are your thoughts?

P.S. in the base_network file, it may be interesting to highlight the value of df.voltage.value_counts() to check with voltage values are used and their frequency

ekatef commented 9 months ago

@davide-f thanks a lot for the brilliant review and in-depth comments :)

@GbotemiB I recognise that work on AT-MK notebook is ongoing, but I can still see great progress. Nice structuring of the intro part! My feeling is that it looks much more clear now. So, absolutely agree with Davide that you have done an amazing work.

Regarding calculations of i_nom:

definitions of the line types by voltage are available in pypsa-earth config
i_nom values you can find in PyPSA line types specification.

pz-max commented 8 months ago

Just a note to myself. This is what I will use to review the notebook and create a PR on top if required:

git clone https://github.com/pz-max/documentation.git  # my fork
cd documentation
git remote add GbotemiB https://github.com/GbotemiB/documentation
git fetch documentation
git checkout -b validation GbotemiB/main

pz-max commented 8 months ago

Hi everybody, great analysis @GbotemiB I think the results are close to being conclusive.

My feeling is that we need a notebook that compares the Googlemaps measurement of a power line & the equivalent in our notebook. This notebooks can also play around with the crs to make the lengths ideally match (use the crs of googlemaps & also use a crs that is the physical most correct one)
After the crs is proofed and fixed, we should replicate the AT+MK notebook for one of the extreme cases being either Italy, Germany (preferred because of TUB country), Norway, Poland or France as shown here:

davide-f commented 8 months ago

Just looking at the AT MK notebook again, it may be interesting to add the resources/base_network/..._lines file into the comparison alongside the raw and clean dataset. In the build_osm_network some magic happens and some lines are duplicated, some are dropped and so on. Some duplication may occur here as well.

ekatef commented 8 months ago

@GbotemiB sorry, have missed to answer your question on chaining in Section 2. You can get a feeling on if looking into methods chaining in pandas cheatseet, and get more details when searching by "pandas mathods chaining".

Regarding transformation of 400kV to 380kV, agree with you reasoning. Actually, that is quite closely to the analysis Davide suggests to do (which is a very insightful way to look into the data). However, I think that such transformation should be applied only with a very good reason, while splitting voltage by intervals would be enough for AT-MK analysis.

GbotemiB commented 8 months ago

@GbotemiB sorry, have missed to answer your question on chaining in Section 2. You can get a feeling on if looking into methods chaining in pandas cheatseet, and get more details when searching by "pandas mathods chaining".

Regarding transformation of 400kV to 380kV, agree with you reasoning. Actually, that is quite closely to the analysis Davide suggests to do (which is a very insightful way to look into the data). However, I think that such transformation should be applied only with a very good reason, while splitting voltage by intervals would be enough for AT-MK analysis.

I think creating a function to put the voltages we need into categories should solve the issue of replacing values.

pypsa-meets-earth / documentation