ropensci / tradestatistics

R package to access Open Trade Statistics API
https://docs.ropensci.org/tradestatistics
Apache License 2.0

inconsistencies between package versions #41

Closed michalovadek closed 3 years ago

michalovadek commented 3 years ago

I updated the package today (23 August) from GitHub and ran code that I originally wrote earlier in the year on, I believe, a February release of your package. I was surprised to see large discrepancies in the data obtained with the same function between the August and the February versions of the package. Here is an example for just one country pair, but the issue is broader (I need data for EU countries):

[two images: "feb" and "aug"]

I understand the differences can be partially explained by the new methodology for calculating the inflation adjustment. But that doesn't seem to explain everything - certainly not the missing data. Could you help me understand what's going on?

Here is the code that I ran to get the data:

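library(dplyr)   # %>%, filter(), select(), mutate()
library(tibble)  # deframe(), as_tibble()

# ISO-3 codes of the EU-28 member states
ms_iso3 <- tradestatistics::ots_countries %>%
  filter(eu28_member == TRUE) %>%
  select(country_iso) %>%
  deframe()

# bilateral flows between all EU-28 pairs (table "yrp": year, reporter, partner)
trade_data <- tradestatistics::ots_create_tidy_data(reporters = ms_iso3,
                                                    partners = ms_iso3,
                                                    years = 1968:2020,
                                                    max_attempts = 10,
                                                    table = "yrp")

# convert to 2015 dollars and drop rows where reporter and partner coincide
trade_adjusted <- trade_data %>%
  tradestatistics::ots_inflation_adjustment(reference_year = 2015) %>%
  as_tibble() %>%
  mutate(year = as.integer(year)) %>%
  filter(reporter_iso != partner_iso)
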
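The missing data mentioned above can be made concrete by listing, for each reporter-partner pair, the years with no row at all. This is a minimal sketch in base R on a made-up toy table. It assumes the `year`, `reporter_iso` and `partner_iso` columns of the yrp table, and the helper `find_missing_years()` is hypothetical, not part of the package.

```r
# Toy stand-in for the yrp table; the real data would come from
# tradestatistics::ots_create_tidy_data(). Values are made up.
trade_toy <- data.frame(
  year         = c(1990, 1991, 1993, 1990, 1991, 1992, 1993),
  reporter_iso = c("fra", "fra", "fra", "deu", "deu", "deu", "deu"),
  partner_iso  = c("deu", "deu", "deu", "fra", "fra", "fra", "fra"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each reporter-partner pair, list the years in
# `expected_years` that have no row in `d`.
find_missing_years <- function(d, expected_years) {
  pair <- paste(d$reporter_iso, d$partner_iso, sep = "-")
  lapply(split(d$year, pair), function(y) setdiff(expected_years, y))
}

missing_years <- find_missing_years(trade_toy, 1990:1993)
print(missing_years)
```

Running the same check on the output of both package versions would show whether the gaps cluster in particular years.
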
pachadotdev commented 3 years ago

thanks for reporting this!! If I'm not wrong, I explained on tradestatistics.io that the API previously used curated data, but there were methodological reasons in gravity theory to disregard it, so I replaced it with unaltered UN COMTRADE data with some minimal changes (https://tradestatistics.io/data-processing.html#data-cleaning). Now I'll try to reproduce this and check whether the data was updated incorrectly, which may have happened since my laptop has failed a lot (2 mainboard replacements in less than 3 months, and it's a 2021 unit :)

pachadotdev commented 3 years ago

@michalovadek once again, thanks a lot for reporting this. I have just sent an email to UN COMTRADE: the issue that you describe coincides exactly with the years in which trade classifications changed. I shall try to set up a meeting with my master's advisor, present this, and update the data with something not as aggressive as the previous methodology, but that fills these 'gaps'.

I'll do my best to come up with updated data on Friday. It was actually because of the modelling used for the thesis that I created API 2.0 and made it public, so that more people working on gravity theory have access to less processed data

michalovadek commented 3 years ago

no worries, thanks for the clarification. I will check back if you learn more and use IMF data in the meanwhile

pachadotdev commented 3 years ago

> no worries, thanks for the clarification. I will check back if you learn more and use IMF data in the meanwhile

is there a particular reason to prefer IMF data? right now, you can apply the function that I created to convert all the flows to dollars of a certain year, and the function applies to any dataset with imports/exports
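As a rough illustration of what converting flows to dollars of a certain year means, here is a generic sketch: each nominal value is rescaled by the ratio of a price index in the reference year to the index in the flow's own year. The index values are made up and the helper `to_constant_dollars()` is hypothetical. This is not the package's implementation, whose data source and exact method may differ.

```r
# Made-up price index; NOT the data used by the package.
price_index <- data.frame(
  year  = c(2013, 2014, 2015),
  index = c(96, 98, 100)
)

# Toy trade flows in current (nominal) dollars.
flows <- data.frame(
  year                = c(2013, 2014, 2015),
  trade_value_usd_exp = c(960, 1960, 3000)
)

# Hypothetical helper: rescale each nominal value by the ratio of the
# reference-year index to the index of the flow's own year.
to_constant_dollars <- function(values, years, index_table, reference_year) {
  ref_index  <- index_table$index[index_table$year == reference_year]
  year_index <- index_table$index[match(years, index_table$year)]
  values * ref_index / year_index
}

flows$exp_usd_2015 <- to_constant_dollars(
  flows$trade_value_usd_exp, flows$year, price_index, reference_year = 2015
)
print(flows)
```

Because the helper only needs a value column and a year column, the same idea carries over to any table of imports or exports.
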

michalovadek commented 3 years ago

yes, I can still use the inflation adjustment but I would also like a full time series

pachadotdev commented 3 years ago

> yes, I can still use the inflation adjustment but I would also like a full time series

sure, apologies for any inconvenience. I'm sending a very long email explaining each of the steps in my previous methodology, to find some common ground and come up with a complete time series that is not 'biased' (there will be bias depending on whether you take the exporter or the importer as the source)

out of curiosity, do you work with gravity models? I ask because maybe you can give me your opinion on a model that I was testing to consolidate reporter/partner mismatches
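The reporter/partner mismatch can be illustrated by lining up what an exporter reports against what the importing partner reports for the same flow. This sketch uses made-up values and assumes that `trade_value_usd_imp` sits alongside `trade_value_usd_exp` in the yrp table. It only measures the gap and is not the consolidation model mentioned above.

```r
# Toy yrp-style rows; values are made up.
yrp_toy <- data.frame(
  year                = c(2015, 2015),
  reporter_iso        = c("fra", "deu"),
  partner_iso         = c("deu", "fra"),
  trade_value_usd_exp = c(100, 80),
  trade_value_usd_imp = c(95, 110),
  stringsAsFactors = FALSE
)

# Swap reporter and partner in a copy of the table, so that each
# exporter row can be matched with the imports reported by its partner.
mirror <- data.frame(
  year         = yrp_toy$year,
  reporter_iso = yrp_toy$partner_iso,
  partner_iso  = yrp_toy$reporter_iso,
  mirror_imp   = yrp_toy$trade_value_usd_imp,
  stringsAsFactors = FALSE
)

compared <- merge(
  yrp_toy[, c("year", "reporter_iso", "partner_iso", "trade_value_usd_exp")],
  mirror,
  by = c("year", "reporter_iso", "partner_iso")
)

# Gap between the exporter's figure and the importer's figure.
compared$discrepancy <- compared$trade_value_usd_exp - compared$mirror_imp
print(compared)
```
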

michalovadek commented 3 years ago

no, I'm actually using the data for estimating the effect of trade on international relations. I am aware that trade data is tricky, but for my application I am not too picky about the exact trade volumes, I just need a consistent time series that roughly corresponds to reality (I guess French-German trade did not actually peak in 1979)

pachadotdev commented 3 years ago

> no, I'm actually using the data for estimating the effect of trade on international relations. I am aware that trade data is tricky, but for my application I am not too picky about the exact trade volumes, I just need a consistent time series that roughly corresponds to reality (I guess French-German trade did not actually peak in 1979)

thanks! ok, I shall have it ready by Friday. I just emailed my former advisor and there's a way to patch the data without falling on the a-theoretical gravity side

pachadotdev commented 3 years ago

@michalovadek Hi! I would like your opinion here. I did this, which is a part of the previous data consolidation method but less aggressive:

  1. Take sitc1, sitc2 and hs92 for the period 1962-2020 (sitc2 since 1976, hs92 since 1988)
  2. Convert each series to hs92
  3. For each year, paste the three series, group by reporter, partner and commodity and take the maximum export/import value (i.e. if for 1990 a country hasn't implemented hs92, then sitc1/2 will be used)
  4. Save the result
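
Step 3 above can be sketched in base R as follows. The values are made up, the three toy series are assumed to be already converted to hs92 codes (step 2), `commodity_code` is an assumed column name, and reading "paste" as row-binding is an interpretation. This is not the actual processing script.

```r
# Toy series, already expressed in hs92 codes; values are made up.
sitc1 <- data.frame(year = 1990, reporter_iso = "fra", partner_iso = "deu",
                    commodity_code = c("0101", "0102"),
                    trade_value_usd_exp = c(50, 20))
sitc2 <- data.frame(year = 1990, reporter_iso = "fra", partner_iso = "deu",
                    commodity_code = c("0101", "0102"),
                    trade_value_usd_exp = c(70, 0))
hs92  <- data.frame(year = 1990, reporter_iso = "fra", partner_iso = "deu",
                    commodity_code = "0101",
                    trade_value_usd_exp = 65)

# Stack the three series, then keep the maximum value per
# year, reporter, partner and commodity.
stacked <- rbind(sitc1, sitc2, hs92)
consolidated <- aggregate(
  trade_value_usd_exp ~ year + reporter_iso + partner_iso + commodity_code,
  data = stacked,
  FUN = max
)
print(consolidated)
```

In this toy case the commodity missing from the hs92 series is still covered, because the value from the older classifications is kept.
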

example with updated FRA-DEU aggregated data:

library(arrow)
library(dplyr)
library(ggplot2)

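# open the local partitioned dataset; judging by the filter value and the
# gsub() below, partition values keep their "name=" prefix here
d_yrp <- open_dataset("hs92-visualization/yrp", partitioning = c("year","reporter_iso"))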

d_yrp %>% 
  filter(
    reporter_iso == "reporter_iso=fra",
    partner_iso == "deu"
  ) %>% 
  collect() %>% 
  mutate(year = as.numeric(gsub("year=", "", year))) %>% 
  ggplot() +
  geom_line(aes(x = year, y = trade_value_usd_exp))

[image: line plot of trade_value_usd_exp by year for FRA-DEU]

I also contacted UN COMTRADE about some inconsistencies in ISO-3 codes that were not in the COMTRADE raw data as of 2021-02-28 (e.g. Romania appears as ROM or ROU in different years, and the same happens for other countries). It's going to be fixed soon; for now I changed the ISO codes with a short function.
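
A recoding function of this kind could look like the sketch below. Both the lookup vector `iso_fixes` and the helper `fix_iso3()` are hypothetical. Only the ROM/ROU pair comes from this thread, and lowercase codes are assumed because the examples here use "fra" and "deu".

```r
# Hypothetical lookup of superseded codes -> current ISO-3 codes.
# Only the ROM/ROU pair comes from the discussion; extend as needed.
iso_fixes <- c(rom = "rou")

# Hypothetical helper: replace any code found in the lookup, keep the rest.
fix_iso3 <- function(codes, fixes = iso_fixes) {
  codes <- tolower(codes)
  hit <- codes %in% names(fixes)
  codes[hit] <- unname(fixes[codes[hit]])
  codes
}

fixed <- fix_iso3(c("ROM", "rou", "fra"))
print(fixed)
```
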

Let me know what you think, so that I can update the server

michalovadek commented 3 years ago

thanks, I think this should be good. The upward jump in mid 1990s looks pretty steep but is not necessarily wrong

pachadotdev commented 3 years ago

> thanks, I think this should be good. The upward jump in mid 1990s looks pretty steep but is not necessarily wrong

hi @michalovadek, I updated the data on the server after running some checks (e.g. comparing Chile's data in COMTRADE with Chile Customs / Central Bank of Chile) and it looks ok

the steep increases before 1980, 1990 and 2000 are related to the increase in reporting countries, with 1994-2019 being the "best" period for analysis
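
One way to see this effect in a yrp-style table is to count the distinct reporters per year. The sketch below uses a made-up toy table and assumes the `year` and `reporter_iso` columns.

```r
# Toy table; values are made up.
reporters_toy <- data.frame(
  year         = c(1979, 1979, 1980, 1980, 1980, 1980),
  reporter_iso = c("fra", "fra", "fra", "deu", "deu", "ita"),
  stringsAsFactors = FALSE
)

# Number of distinct reporters in each year.
reporters_per_year <- tapply(
  reporters_toy$reporter_iso, reporters_toy$year,
  function(x) length(unique(x))
)
print(reporters_per_year)
```

A jump in this count from one year to the next would point to new reporters rather than to a real change in trade.
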

this version of the data constitutes a reversion to the 2020 methodology, which seemed to be a good methodology, albeit with some "bias". For the rest of the year I shall only add some minor updates for recent years (e.g. 2020 is quite incomplete in the source at the moment)