prio-data / views_transformation_library

The data transformation library used by the views data transformer service

Fix country spatial lag function #23

Closed jimdale closed 2 years ago

Peder2911 commented 2 years ago

I think I found a bug after testing the implementation using this script:

"""
Check that country spatial lags work correctly.
"""
import os
import numpy as np
import pandas as pd
from viewser import Queryset, Column
from views_transformation_library import splag_country

if not os.path.exists(".cache.parquet"):
    data = (Queryset("peder_country_month_skeleton", "country_month")
            .with_column(Column("name", "country", "name"))
        ).publish().fetch()
    data.to_parquet(".cache.parquet")
else:
    data = pd.read_parquet(".cache.parquet")

countries = ("Spain", "France", "Germany")

# Keep only the rows for the three countries under test.
data = data[data["name"].isin(countries)].copy()

for c in countries:
    data[f"is_{c.lower()}"] = (data["name"] == c).values.astype(int)

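# Spatially lag the Spain indicator across neighbouring countries.
data["spain_lagged"] = splag_country.get_splag_country(data[["is_spain"]].astype(float)).values

# Show every row where the lagged value disagrees with the France indicator.
print(data[data["is_france"] != data["spain_lagged"]])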

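This prints 253 mismatching rows, while I expected 0: of the three countries, only France borders Spain, so the lagged Spain indicator should equal is_france on every row. Any idea what causes this?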
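The expectation behind the check can be illustrated with a toy example. This is only a sketch, not the library's implementation: `toy_splag` and the hand-written `NEIGHBOURS` table are hypothetical, the lag is assumed to be a plain sum over neighbours, and the index uses country names rather than whatever identifiers the real data carries.

```python
import pandas as pd

# Hand-written land borders among the three test countries (illustrative only).
NEIGHBOURS = {
    "Spain": ["France"],
    "France": ["Spain", "Germany"],
    "Germany": ["France"],
}


def toy_splag(df: pd.DataFrame, column: str) -> pd.Series:
    """Sum `column` over each country's neighbours within the same month."""
    values = df[column].to_dict()  # {(month, country): value}
    lagged = [
        sum(values.get((month, n), 0.0) for n in NEIGHBOURS[country])
        for month, country in df.index
    ]
    return pd.Series(lagged, index=df.index, name=f"{column}_lagged")


# Small synthetic (month, country) panel.
index = pd.MultiIndex.from_product(
    [[598, 599, 600], list(NEIGHBOURS)], names=["month_id", "country"]
)
df = pd.DataFrame(index=index)
country = df.index.get_level_values("country")
df["is_spain"] = (country == "Spain").astype(float)
df["is_france"] = (country == "France").astype(float)
df["spain_lagged"] = toy_splag(df, "is_spain")

# With intact borders, the lagged Spain indicator matches the France indicator.
mismatches = df[df["is_france"] != df["spain_lagged"]]
print(len(mismatches))
```

Under these assumptions the mismatch frame is empty, which is the result the test script looks for in the real data.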

hhegre commented 2 years ago

"Who will believe me? The Pyrenees have split open from top to bottom, as if an invisible axe had struck from on high and cut down into the deep crevices, cleaving stone and earth all the way to the sea…"

José Saramago, The Stone Raft. Now dated to November 2029.

H

On 8 Dec 2021, at 11:18, Peder G. Landsverk wrote:

@Peder2911 requested changes on this pull request.

Figure out why Spain isn't France's neighbour after month 599.
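For reference, month 599 and November 2029 line up if month ids count calendar months starting with id 1 for January 1980. That starting point is an assumption here, not something stated in this thread, and `month_id_to_date` is a hypothetical helper:

```python
from datetime import date


def month_id_to_date(month_id: int) -> date:
    """First day of the calendar month for a month id, assuming id 1 is January 1980."""
    offset = month_id - 1
    return date(1980 + offset // 12, offset % 12 + 1, 1)


print(month_id_to_date(599))  # 2029-11-01
```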



Peder2911 commented 2 years ago

Interesting. Did you add any hidden magic incantations to the function, Jim? Now that the spirits have awoken, what other insights can we draw from it?

Peder2911 commented 2 years ago

This seems to be an issue with the database rather than the implementation. Will do another check, but given this, the code seems to be working as expected.
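One way to narrow down where the data go wrong is to count mismatches per month and look at the first month that has any. The sketch below uses synthetic data in which the lag stops matching after month 599; `mismatches_per_month` is a hypothetical helper, and it assumes the first index level is the month id.

```python
import pandas as pd


def mismatches_per_month(df: pd.DataFrame, expected: str, actual: str) -> pd.Series:
    """Count rows per month (first index level) where two columns disagree."""
    mismatch = df[expected] != df[actual]
    counts = mismatch.groupby(level=0).sum()
    return counts[counts > 0]


# Synthetic panel: the lag matches through month 599 and is zero afterwards.
index = pd.MultiIndex.from_product(
    [[598, 599, 600, 601], ["Spain", "France", "Germany"]],
    names=["month_id", "country"],
)
df = pd.DataFrame(index=index)
months = df.index.get_level_values("month_id")
df["is_france"] = (df.index.get_level_values("country") == "France").astype(float)
df["spain_lagged"] = df["is_france"].where(months <= 599, 0.0)

counts = mismatches_per_month(df, "is_france", "spain_lagged")
print(counts)
```

In this synthetic case only months 600 and 601 show up, each with a single mismatching row (France). A similar per-month count on the real data would show whether the mismatches start at one specific month.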

Peder2911 commented 2 years ago

As expected, the check passes once the data are restricted to months up to and including 599:

"""
Check that country spatial lags work correctly.
"""
import os
import numpy as np
import pandas as pd
from viewser import Queryset, Column
from views_transformation_library import splag_country

if not os.path.exists(".cache.parquet"):
    data = (Queryset("peder_country_month_skeleton", "country_month")
            .with_column(Column("name", "country", "name"))
        ).publish().fetch()
    data.to_parquet(".cache.parquet")
else:
    data = pd.read_parquet(".cache.parquet")

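# Sort the index so label-based slicing works, then keep months up to and
# including 599 (assumes the first index level is the month id).
data.sort_index(inplace=True)
data = data.loc[:599, :]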

countries = ("Spain", "France", "Germany")

# Keep only the rows for the three countries under test.
data = data[data["name"].isin(countries)].copy()

for c in countries:
    data[f"is_{c.lower()}"] = (data["name"] == c).values.astype(int)

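# Spatially lag the Spain indicator across neighbouring countries.
data["spain_lagged"] = splag_country.get_splag_country(data[["is_spain"]].astype(float)).values

# No row should disagree with the France indicator.
assert (data[data["is_france"] != data["spain_lagged"]]).shape[0] == 0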