Issue with R manipulated/ modified datasets vs newly imported

hanjanirina commented 1 year ago

Hi, I recently encountered an error that I couldn't understand, but I found a workaround for it. I constructed a dataset in R from existing CSV files and from data downloaded through packages. It is a country-year panel dataset, all_merged_conflict.csv.

I ran DisplayTreatment and got this error:
```
Error in DisplayTreatment(unit.id = "iso3c_n", time.id = "year", legend.position = "none",  : 
please convert time id to consecutive integers
```
I did not understand why this would occur, since my year variable is an integer with consecutive values. So at first I gave up and simply used another package for this purpose.

I ran PanelMatch and I got this error:

```
Error in panel_match(lag, time.id, unit.id, treatment, refinement.method,  : 
please convert unit id column to integer or numeric
```

I checked repeatedly and even changed the type from double to integer (just like wbcode2 in the dem data). The unit id is numeric, and converting it with as.integer() did not help either.

However, while creating sample data and code snippets, I found that the functions worked with exactly the same data once it was newly imported from a CSV file. I don't know what is causing this, but the code I used to recreate the problem is below.

Hoping you can figure out the issue for your next release. Thank you for a great package.

#NOTE:
# 1. uncomment Package installation chunk if running for 1st time
# 2. change location accordingly

# Package Installation
# devtools::install_github("insongkim/PanelMatch", ref = "development")

#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# Package Loading

## pacman will apply library() to specified package or install first if needed

if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  reshape, 
  dplyr,
  tidyverse,
  countrycode,
  car, 
  psData,
  validate
)

library(PanelMatch)

# Location of/ folder containing files
# 1. Create your personal path to where folder is located on your machine 
# 2. Change currentpath to that personalized path

nirinapath <- "~/Dropbox (Princeton)/02Projects/KR - New Paper/"
#yourpath <- 
currentpath <- nirinapath

#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

# This section presents different sets of code chunks:
#   
# 1. They create exactly the same dataset (as far as the R user is concerned)
# 2. Set 1 creates the dataset by merging an older dataset with the newer variables we would like. Set 1 produces errors.
# 3. Set 2 exports the dataset as it is from Set 1 to a CSV somewhere, then imports it back. Set 2 works.

#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

# SET 1: Non-functional. This set of chunks will NOT work

## Importing initial dataset: Change to where all_merged_conflict is located

all_merged_conflict <- read_csv(paste0(currentpath, "all_merged_conflict.csv"))

#Downloading Polity scores and creating democracy indicators
## Downloading and selecting polity scores

polity <- psData::PolityGet("http://www.systemicpeace.org/inscr/p5v2018.sav", 
                            vars = NULL,
                            OutCountryID = "iso3c",
                            standardCountryName = FALSE,
                            na.rm = TRUE, 
                            duplicates = "message") %>% 
  dplyr::select(iso3c, p5, year, fragment, democ, autoc, polity, polity2, durable) %>% 
  filter(year > 1998)

polity_dem_indicator <- polity %>% 
  filter( polity2 >= -10 & polity2 <= 10) %>% 
  mutate(polity_democracy = case_when(polity2 >= 6 ~ 1,
                                      polity2 < 6 ~ 0),
         polity_autocracy = case_when(polity2 <= -6 ~ 1,
                                      polity2 > -6 ~ 0),
         polity_anoncracy = case_when(polity2 < 6 & polity2 > -6 ~ 1,
                                      polity2 >= 6 | polity2 <= -6 ~ 0),
         polity_norm = (polity2 + 10)/20)

## Merging with all_merged data & creating country codes numeric equivalents

all_merged_conflict <- all_merged_conflict %>% 
  mutate(all_merged_conflict_flag = 1)

all_merged_polity <- left_join(all_merged_conflict, polity_dem_indicator, by = c("iso3c", "year")) %>%
  group_by(iso3c) %>% 
  mutate(iso3c_r = cur_group_id()) %>% ## iso3c_r is ID for country in form of an integer (has no real connection to other numeric codes) 
  ungroup() %>% 
  mutate(iso3c_n = countrycode(iso3c, "iso3c", "iso3n", warn = TRUE, nomatch = NA),
         year = as.integer(year)) %>% ## iso3c_n is the official iso3 numeric equivalent
  relocate(c("iso3c_n", "iso3c_r"), .after = iso3c) 

## Checking for uniqueness by year and country

rule_polity <- validator(is_unique(iso3c_r, year))
out_polity <- confront(all_merged_polity, rule_polity)
summary(out_polity)[1:7]
violating(all_merged_polity, out_polity)

# Sudan 2011 appears twice for some unknown reason (almost all column values are equal)
# Fix: randomly drop one of the two Sudan 2011 rows
all_merged_polity$randval <- runif(nrow(all_merged_polity))
all_merged_polity <- arrange(all_merged_polity,iso3c_n, year,randval) #%>% filter(!is.na(polity2))
all_merged_polity <- all_merged_polity %>% 
  group_by(iso3c, year) %>% 
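  slice_max(randval, n = 1, with_ties = FALSE) %>% # keep one row per country-year: the one with the largest random value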
  ungroup() %>% 
  filter(!is.na(iso3c_n) & !is.na(year))

### PanelMatch estimator
#### DisplayTreatment: FAIL
DisplayTreatment(unit.id = "iso3c_n",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "polity_democracy", data = all_merged_polity)

#### PanelMatch: FAIL
poli_dem_match <- all_merged_polity %>% 
  PanelMatch::PanelMatch(lag = 3, 
                         unit.id = "iso3c_n",
                         time.id = "year",
                         treatment = "polity_democracy",
                         refinement.method = "ps.match",
                         match.missing = TRUE, 
                         covs.formula = ~GDP_per_capita_current_US_dollar + GDP_per_capita_current_US_dollar2
                         + GDP_per_capita_growth + trade_per_gdp + population_density + military_conflict, 
                         size.match = 5, 
                         qoi = "att", 
                         outcome.var = "tree_cover_loss_value",
                         lead = 0:10, 
                         forbid.treatment.reversal = FALSE)

#### Confirmation: it is not about the variable type
class(all_merged_polity$iso3c_n)
typeof(all_merged_polity$iso3c_n)

#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# SET 2: Functional. This set of chunks works
## Exporting the CSV
write_csv(all_merged_polity, paste0(currentpath, "all_merged_polity_err.csv"))

## Importing the CSV (base read.csv returns a plain data.frame, not a tibble)
all_merged_polity_new <- read.csv(paste0(currentpath, "all_merged_polity_err.csv"))

### PanelMatch estimator
#### DisplayTreatment
DisplayTreatment(unit.id = "iso3c_n",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "polity_democracy", data = all_merged_polity_new)

#### PanelMatch
poli_dem_match <- all_merged_polity_new %>% 
  PanelMatch::PanelMatch(lag = 3, 
                         unit.id = "iso3c_n",
                         time.id = "year",
                         treatment = "polity_democracy",
                         refinement.method = "ps.match",
                         match.missing = TRUE, 
                         covs.formula = ~GDP_per_capita_current_US_dollar + GDP_per_capita_current_US_dollar2
                         + GDP_per_capita_growth + trade_per_gdp + population_density + military_conflict, 
                         size.match = 5, 
                         qoi = "att", 
                         outcome.var = "tree_cover_loss_value",
                         lead = 0:10, 
                         forbid.treatment.reversal = FALSE)

poli_dem_fe <- PanelEstimate(sets = poli_dem_match, data = all_merged_polity_new)

summary(poli_dem_fe)

plot(poli_dem_fe)

regulyagoston commented 1 year ago

I experienced the same issue. However, there is a simpler solution than saving and re-importing. In my experience, the package only accepts a base data.frame object as input for data, so if you convert your data with as.data.frame(mydata), it works.
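To illustrate the difference described above, here is a minimal sketch using a small made-up panel (the columns `unit` and `year` are placeholders, not the data from this issue). It shows that single-column subsetting behaves differently for a tibble and for a base data.frame. Whether PanelMatch's internal checks rely on exactly this kind of subsetting is an assumption, not something confirmed in this thread.

```r
library(tibble)

# Toy panel standing in for the real data (hypothetical values)
tb <- tibble(unit = c(1L, 1L, 2L, 2L),
             year = c(2000L, 2001L, 2000L, 2001L))

class(tb)              # includes "tbl_df" in addition to "data.frame"
class(tb[, "unit"])    # still a tibble, not an integer vector

# Convert to a base data.frame, as suggested above
df <- as.data.frame(tb)

class(df)              # "data.frame" only
class(df[, "unit"])    # a plain integer vector
```

This would also be consistent with the re-import workaround: base `read.csv()` returns a plain data.frame, while `readr::read_csv()` and most dplyr pipelines return or preserve tibbles.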

adamrauh commented 1 year ago

> I experienced the same issue. However, there is a simpler solution than saving and re-importing. In my experience, the package only accepts a base data.frame object as input for data, so if you convert your data with as.data.frame(mydata), it works.

Thanks for posting this! This should be right. We will add an update to provide better warnings/errors about this.
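Until the package reports this more clearly, a guard like the following could be run on the data before calling DisplayTreatment or PanelMatch. This is only a sketch: `ensure_base_data_frame` is a hypothetical helper name, not a PanelMatch function, and the strict class test is an assumption about what the package expects.

```r
# Hypothetical helper (not part of PanelMatch): returns a base data.frame,
# warning when the input carried extra classes such as "tbl_df".
ensure_base_data_frame <- function(data) {
  if (!is.data.frame(data)) {
    stop("`data` must be a data frame")
  }
  if (!identical(class(data), "data.frame")) {
    warning("converting ", paste(class(data), collapse = "/"),
            " to a base data.frame")
    data <- as.data.frame(data)
  }
  data
}
```

For the data in this issue, that would be `all_merged_polity <- ensure_base_data_frame(all_merged_polity)` before the Set 1 calls.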

insongkim / PanelMatch

Issue with R manipulated/ modified datasets vs newly imported #111