ValueError: invalid literal for int() with base 10: in clean_osm_data

oayana commented 2 years ago

Some cables values (in df_all_lines["cables"]) are not suitable for conversion to int values (eg :3;6).

Workaround: df_all_lines["cables"] = df_all_lines["cables"].astype(str).str.replace(";", ".").astype(float).astype(int) You can use.

pz-max commented 2 years ago

HI @oayana, thanks for opening this issue :) What script did you use? Did it lead to an error?

We splitted at some point one line that had voltage information in this format {110,220) to two lines 110 and 220. So probably we missed splitting the cable as well. I think we need an alternative solution compared to the workaround. Do you want to create a PR and think about a solution?

oayana commented 2 years ago

First of all, you're welcome :)

While I was performing the analysis for Turkey, I got such an error on line 368 of the osm_data_cleaning script. Error code: Traceback (most recent call last): File "C:\Users\muh1\Desktop\pypsa\Pypsa_data\pypsa-africa.snakemake\scripts\tmpb0bbyowa.osm_data_cleaning.py", line 611, in clean_data( File "C:\Users\muh1\Desktop\pypsa\Pypsa_data\pypsa-africa.snakemake\scripts\tmpb0bbyowa.osm_data_cleaning.py", line 493, in clean_data df_all_lines = integrate_lines_df(df_all_lines) File "C:\Users\muh1\Desktop\pypsa\Pypsa_data\pypsa-africa.snakemake\scripts\tmpb0bbyowa.osm_data_cleaning.py", line 371, in integrate_lines_df df_all_lines["cables"] = df_all_lines["cables"].astype("int") File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\generic.py", line 5815, in astype new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors) File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\internals\managers.py", line 418, in astype return self.apply("astype", dtype=dtype, copy=copy, errors=errors) File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\internals\managers.py", line 327, in apply applied = getattr(b, f)(**kwargs) File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\internals\blocks.py", line 591, in astype new_values = astype_array_safe(values, dtype, copy=copy, errors=errors) File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\dtypes\cast.py", line 1309, in astype_array_safe new_values = astype_array(values, dtype, copy=copy) File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\dtypes\cast.py", line 1257, in astype_array values = astype_nansafe(values, dtype, copy=copy) File "C:\Users\muh1.conda\envs\pypsa-africa\lib\site-packages\pandas\core\dtypes\cast.py", line 1174, in astype_nansafe return lib.astype_intsafe(arr, dtype) File "pandas_libs\lib.pyx", line 679, in pandas._libs.lib.astype_intsafe ValueError: invalid literal for int() with base 10: '3;2'

I tried a workaround as a solution. But if you want to find a better solution, of course I would like to help.

pz-max commented 2 years ago

Hi @oayana, Thanks for the error message. So it seems in line 371 is the problem.... Your workaround is currently fixing the bug but introduces a small error. You are setting two lines with i.e. 'cable' = 2, but one line need to have 'cable'=3. Let me explain.

A cable can be only an integer (1,2,3,...). The line with the trouble has 2 lines in parallel running. One with 2 cables one with 3 cables. This indeed lead to the error that (2;3) cannot be converted to an integer.

https://github.com/pypsa-meets-africa/pypsa-africa/blob/509d2ba691c622c2245df63631c7f09c91089355/scripts/osm_data_cleaning.py#L362-L366

As you can see below, we splitted voltages because there was also a semicolon. Split cells function is used in the voltage case to break the lines apart. But let's check out the split_cells function in more detail https://github.com/pypsa-meets-africa/pypsa-africa/blob/509d2ba691c622c2245df63631c7f09c91089355/scripts/osm_data_cleaning.py#L238-L244

What the docstring explains is that we separate the voltage in this function and create an identical line with all previous data. So while we have now two lines with separated voltage without semicolon, we still have the problem that the cable information is not fixed - it has now in each line smt like Line1['cable'] -> (2;3), Line2 ['cable'] -> (2:3)

https://github.com/pypsa-meets-africa/pypsa-africa/blob/509d2ba691c622c2245df63631c7f09c91089355/scripts/osm_data_cleaning.py#L218-L235

Solution:

check by row if voltage and cable cells contain a semicolon
if yes split them in both cells
than safe the information into a new line Hope it makes sense

Goal (simplified):

That's how data was stored: 
Line['voltage'] = (100;200)
Line['cable'] = (1;2)

That's how we need it:
Line1['voltage'] = (100)
Line1['cable'] = (1)
Line2['voltage'] = (200)
Line2['cable'] = (2)

Do you want to be on the contributor list with such a fix? :)

oayana commented 2 years ago

I understood the problem and would like to offer you a solution. I will handle it as soon as possible and get back to you :)

oayana commented 2 years ago

semicolon

Hi Max, I had a chance to look at the df_all_lines dataframe now and as you mentioned some cables and voltages values are semicolon. However, both cables and voltages are not semicolon in the same row(only one of them is semicolon).

You have already produced the solution for this. It is solved with the split_cells function, as you threw into the voltages values. All you have to do is call the split function before parsing the cables values into int:

     df_all_lines = split_cells(df_all_lines, lst_col="cables")

Then, df_all_lines["cables"] = df_all_lines["cables"].astype("int")

pz-max commented 2 years ago

Hi @oayana, I would suggest following these steps 1-7 to create a pull request that can be accepted in PyPSA-Africa (it also helps to check your code and to stream it in): https://www.tomasbeuzen.com/post/git-fork-branch-pull/

I have only one doubt which we need to check to be ready for an accepted PR.

Let's assume one line has a semicolon separated values for voltage and cable at the same time:

Line['voltage'] = (100;200)
Line['cable'] = (1;2)

Then, I believe applying split cells two times could lead to the following:

Line1['voltage'] = (100)
Line1['cable'] = (1)
Line2['voltage'] = (100)
Line2['cable'] = (2)
Line3['voltage'] = (200)
Line3['cable'] = (1)
Line4['voltage'] = (200)
Line4['cable'] = (2)

Instead of the desired solution which I mentioned above (https://github.com/pypsa-meets-africa/pypsa-africa/issues/153#issuecomment-974478190). I would suggest writing a small jupyterscript to test if a one line with the semicolon in both cases can be solved by the split cell function correctly.

Almost there :)

oayana commented 2 years ago

Hi @oayana, I would suggest following these steps 1-7 to create a pull request that can be accepted in PyPSA-Africa (it also helps to check your code and to stream it in): https://www.tomasbeuzen.com/post/git-fork-branch-pull/

I have only one doubt which we need to check to be ready for an accepted PR.

Let's assume one line has a semicolon separated values for voltage and cable at the same time:
Line['voltage'] = (100;200)
Line['cable'] = (1;2)
Then, I believe applying split cells two times could lead to the following:
Line1['voltage'] = (100)
Line1['cable'] = (1)
Line2['voltage'] = (100)
Line2['cable'] = (2)
Line3['voltage'] = (200)
Line3['cable'] = (1)
Line4['voltage'] = (200)
Line4['cable'] = (2)
Instead of the desired solution which I mentioned above (#153 (comment)). I would suggest writing a small jupyterscript to test if a one line with the semicolon in both cases can be solved by the split cell function correctly.

Almost there :)

Hi Max, Maybe this code can provide the solution you want (be careful it only works for cables and voltage).

import pandas as pd
dic = {'line_id': {0: 557828023, 1: 176542356, 2: 305004661, 3: 233127728}, 'voltage': {0: '154000;380000', 1: '380000', 2: '154000;380000', 3: '154000'}, 'cables': {0: '3;2', 1: '0', 2: '3;6', 3: '0'}}
data = pd.DataFrame(dic) #example data

def split_cells_multiple(df,list_col=['cables','voltage']): # split function for cables and voltage
    for i in range(df.shape[0]): 
        sub = df.loc[i,list_col] # for each cables and voltage
        if sub.notnull().all() == True:  # check not both empty
            if [ ";" in s for s in sub].count(True) == len(list_col): # check both contain ";"
                d = [ s.split(';')  for s in sub] #split them
                r = df.loc[i,:].copy()
                df.loc[i,list_col[0]] = d[0][0] # first split  [0]
                df.loc[i,list_col[1]] = d[1][0]
                r[list_col[0]] = d[0][1] # second split [1]
                r[list_col[1]] = d[1][1]
                df = df.append(r) 
    return df # return new frame

data = split_cells_multiple(data)

pypsa-meets-earth / pypsa-earth

ValueError: invalid literal for int() with base 10: in clean_osm_data #153