Closed djhurio closed 5 years ago
This seems like a very useful idea. Will add in the next update. I've also been thinking (perhaps as another package) of making a few helper functions to actually calculate population percentages with the predefined weights by country, among other things.
Thanks! We could probably collaborate and also include tools for sampling error estimation, such as producing estimates of proportions with corresponding confidence intervals estimated from the ESS data.
Great, let's keep in touch!
Hi, I have written some R code to import all SDDF files into R. Maybe this helps. I have used some parts of your code. https://github.com/djhurio/ESS/blob/master/Rcode/16-data-SDDF.R
Hi @djhurio, I'm just about to leave for the holidays but I will check it out once I get back mid January. Maybe we can adjust it to work with essurvey and include it as an additional function.
Let's keep in touch in January!
This would be great! Have a nice holiday!
Hi @cimentadaj
I came here because I was wondering, like @djhurio, if the essurvey package could also download SDDF files. Anthony Damico's lodown package covers them.
Please let me know if you need help coding that into essurvey. It looks like the functions that could be amended to support them are those to download country files: perhaps a 'simple' sddf = TRUE option to import_country would work. Ideally, the code would also show how to merge the country data and its SDDF file.
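A minimal sketch of what such a merge could look like, assuming both files share the cntry and idno identifiers seen in the SDDF files. The data frames and their values are made up for illustration; this is not the essurvey implementation.

```r
# Toy stand-ins for a country file and its SDDF file (all values are made up).
country_data <- data.frame(
  cntry = c("ES", "ES", "ES"),
  idno = c(1, 2, 3),
  happy = c(7, 8, 6),
  stringsAsFactors = FALSE
)
sddf_data <- data.frame(
  cntry = c("ES", "ES", "ES"),
  idno = c(1, 2, 3),
  psu = c(199, 101, 407),
  stratify = c("131", "164", "73"),
  stringsAsFactors = FALSE
)

# Left join: keep every respondent and add the design columns on the right.
merged <- merge(country_data, sddf_data, by = c("cntry", "idno"), all.x = TRUE)
```

Using all.x = TRUE keeps respondents that have no SDDF record, so missing design information shows up as NA instead of silently dropping rows.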
Hi @briatte
This is something that I'd really like to work on, thanks for bringing it up. With the work of @djhurio, I think we have a great start at this, but I've found that it's not completely straightforward because not all country-year combinations have actual SDDF data. For example, Slovakia in round 4.
Perhaps we can work out the implementation here and I'll add it to a new branch as we go.
So, there are two issues right now.
1. Whether it is included as an invisible step in import_country, so that the function downloads and merges the data, or whether there are separate functions to download the SDDF specifically and merge it.
2. What happens when a country-year combination doesn't have SDDF data?
I'm slightly inclined towards making it an invisible step, such that import_country already takes care of downloading the SDDF and merging it, because it takes the user out of the loop. Since downloading the SDDF data does not change the data at all (it will just add columns to the right), I think that doing it invisibly is no problem. If we did any sort of 'transformation' to the data, then I would see the point in breaking it into different steps.
But then what happens when that country doesn't have SDDF data? Add the new columns with empty rows? No columns? I kinda like the idea of creating the columns with empty rows because it signals that it attempted to download the SDDF but it's empty. This, with proper documentation of the sddf argument, could make it clear why this happens.
What do you guys think?
Hi @cimentadaj
I like the idea of silently adding the SDDF file if it exists, but since the download_format already speaks (once) to the user, perhaps we could also communicate whether the SDDF was found and merged during the download process?
As for adding empty columns, does it really make sense? I'm asking because ESS variable names are not perfectly stable (see, e.g., party voted in last national election), and I think empty columns would make sense only if they were.
Dear @djhurio ~ let's hear from you too :)
Hello,
We will not be able to do an invisible merge of ESS data and SDDF data right now, because there are several data conflicts. For example, there are some strange records in the SDDFs which should be deleted before merging, such as records with missing ID numbers or missing country codes. There are also missing cases in the SDDFs: there are records in the ESS data which do not have corresponding records in the SDDFs.
What we can do:
Ok, now I really think we should avoid the silent merging. Merging with incomplete data will generate errors and perhaps duplication. I agree that in the meantime we should create a function to grab the SDDF file, reusing most of the essurvey infrastructure, and when the time comes, perhaps we can integrate it within import_country. To me, the safest thing to do would be to create an import_sddf and create a validation function to merge.
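As a rough sketch of what such a validation step might check, the hypothetical helper below (validate_sddf is not an essurvey function) drops SDDF records with missing keys, removes duplicated keys, and reports ESS records without an SDDF counterpart. The cntry and idno keys are assumed from the SDDF files; the example values are made up.

```r
# Hypothetical helper: clean an SDDF table and report merge problems.
validate_sddf <- function(ess, sddf, keys = c("cntry", "idno")) {
  # Drop SDDF records with a missing ID number or country code.
  complete <- stats::complete.cases(sddf[, keys, drop = FALSE])
  sddf_clean <- sddf[complete, , drop = FALSE]

  # Duplicated keys would duplicate respondents in a merge.
  dup <- duplicated(sddf_clean[, keys, drop = FALSE])

  # ESS records with no corresponding SDDF record.
  ess_key <- do.call(paste, c(ess[, keys, drop = FALSE], sep = "_"))
  sddf_key <- do.call(paste, c(sddf_clean[, keys, drop = FALSE], sep = "_"))
  unmatched <- ess[!ess_key %in% sddf_key, , drop = FALSE]

  list(
    sddf = sddf_clean[!dup, , drop = FALSE],
    n_dropped_missing = sum(!complete),
    n_dropped_duplicates = sum(dup),
    unmatched_ess = unmatched
  )
}

# Made-up example: two records with missing keys, one duplicate.
ess <- data.frame(cntry = c("ES", "ES", "ES"), idno = c(1, 2, 3),
                  stringsAsFactors = FALSE)
sddf <- data.frame(cntry = c("ES", NA, "ES", "ES"), idno = c(1, 2, NA, 1),
                   psu = c(10, 20, 30, 10), stringsAsFactors = FALSE)
report <- validate_sddf(ess, sddf)
```

Returning a report rather than merging directly leaves the decision about what to do with unmatched respondents to the user.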
The first step is to get a mockup of this import_sddf. I'll try to work on this in the next few months. As soon as I have the tiniest example, I'll post it here so we can all work on it.
I also vote for import_sddf.
Ok, so I managed to sit down for a few hours today and got a working version. It's very limited and below are some examples of what it can and cannot do.
# Install the branch version with this:
remotes::install_github("ropensci/essurvey", ref = "sddf", quiet = TRUE, upgrade = TRUE)
library(essurvey)
set_email("cimentadaj@gmail.com")
# All SDDF rounds available for a given country
show_sddf_rounds("Spain")
#> [1] 1 2 3 4 5 6
# When no SDDF rounds available:
show_sddf_rounds("Austria")
#> numeric(0)
There is a problem with compatibility between rounds, given that since round 7 the ESS has been uploading the SDDF files integrated into one single Stata/SPSS file instead of separate files for each country. For example, compare rounds 1:6 with rounds 7:8.
This brings about some issues because we cannot check interactively which countries have SDDF data without downloading the data and filtering. From my perspective, this brings all sorts of problems because we don't know the specific nomenclature they use to name countries within the file or whether it changes over time. Moreover, this breaks the complete pipeline of the package, which simply reads the HTML from the website to figure out who participated.
To implement this we would need to create a separate function to download the SDDF data, load it into memory and filter out which countries are present. I don't like this at all, but I'm not sure whether the ESS is looking to adhere to the previous way of uploading SDDF data. As of now, to test this, show_sddf_rounds only checks for the first 6 rounds.
Below I test the actual import_sddf_country for downloading two rounds from Spain.
tst <- import_sddf_country("Spain", 5:6)
#> Downloading ESS5
#> Downloading ESS6
tst
#> [[1]]
#> # A tibble: 1,885 x 6
#> CNTRY IDNO PSU SAMPPOIN STRATIFY PROB
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 ES 1 199 NA 131 NA
#> 2 ES 2 101 NA 164 NA
#> 3 ES 3 407 NA 73 NA
#> 4 ES 6 394 NA 103 NA
#> 5 ES 8 129 NA 124 NA
#> 6 ES 9 12 NA 41 NA
#> 7 ES 10 149 NA 133 NA
#> 8 ES 11 16 NA 21 NA
#> 9 ES 12 308 NA 154 NA
#> 10 ES 14 321 NA 74 NA
#> # … with 1,875 more rows
#>
#> [[2]]
#> # A tibble: 1,889 x 6
#> cntry idno psu samppoin stratify prob
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 ES 1 239 NA 131 0.0000717
#> 2 ES 2 199 NA 171 0.0000534
#> 3 ES 3 19 NA 101 0.0000738
#> 4 ES 5 345 NA 51 0.0000735
#> 5 ES 6 273 NA 11 0.0000733
#> 6 ES 8 411 NA 74 0.0000641
#> 7 ES 9 81 NA 91 0.0000717
#> 8 ES 10 210 NA 133 0.0000622
#> 9 ES 11 74 NA 91 0.0000716
#> 10 ES 13 163 NA 13 0.0000611
#> # … with 1,879 more rows
The data can now be downloaded and read successfully. However, I see some problems in terms of comparability. First, is it expected that the prob column in round 5 is completely NA? Second, both rounds have different cases in their column names (this is minor). Lastly, can we assume that the same column names are standard across all waves/countries?
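The casing difference could be handled with something like the sketch below, which lowercases every column name before comparing rounds and then measures how much of prob is missing. harmonise_sddf is a hypothetical helper, and the two one-row data frames only mimic the round 5 and round 6 output above.

```r
# Hypothetical helper: lowercase the column names of every SDDF table.
harmonise_sddf <- function(sddf_list) {
  lapply(sddf_list, function(df) {
    names(df) <- tolower(names(df))
    df
  })
}

# One-row stand-ins for the round 5 and round 6 tibbles.
round5 <- data.frame(CNTRY = "ES", IDNO = 1, PSU = 199, PROB = NA_real_,
                     stringsAsFactors = FALSE)
round6 <- data.frame(cntry = "ES", idno = 1, psu = 239, prob = 0.0000717,
                     stringsAsFactors = FALSE)

clean <- harmonise_sddf(list(round5, round6))

# After harmonising, both rounds expose the same column names.
same_names <- identical(names(clean[[1]]), names(clean[[2]]))

# Share of missing selection probabilities in each round.
prob_missing <- vapply(clean, function(df) mean(is.na(df$prob)), numeric(1))
```

A share of 1 for a round flags a prob column that is entirely NA, which is the situation described for round 5.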
This example was one that worked just fine, but when I tried earlier rounds, such as rounds 1:2 for Spain, I found a weird error with the old SPSS files (the earlier files are only available in SPSS) using the read_por function from haven (essurvey uses read_por behind the scenes).
tst <- import_sddf_country("Spain", 1:2)
#> Downloading ESS1
#> Downloading ESS2
#> Error in df_parse_por_file(spec, encoding = "", user_na = user_na): Failed to parse /tmp/RtmpwGGV1p/ESS_Spain/ESS1/ESS1_ES_SDDF.por: Unable to read from file.
I've checked and the files are downloaded successfully; the problem comes when reading the actual .por files with read_por. Any idea why this happens, or has it happened to any of you? I don't have SPSS to test it, but this is equivalent to downloading the data manually and opening it in SPSS. If I run the same code as above but for more recent datasets, then it works (try rounds 5:6).
Finally, I've limited the function to only download rounds which are available. That is, the function raises an error (pointing to the specific round which is not available) if any of the rounds you're asking for does not have SDDF files.
tst <- import_sddf_country("Denmark", 1:3)
#> Error in country_url_sddf(country, rounds): Rounds ESS2 don't have SDDF data available for Denmark
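The availability check described above might be sketched as follows. check_sddf_rounds is a hypothetical stand-in (the real check lives inside country_url_sddf, whose internals are not shown here), and the vector of available rounds is supplied by hand for illustration.

```r
# Hypothetical check: stop if any requested round lacks SDDF data.
check_sddf_rounds <- function(country, rounds, available) {
  missing_rounds <- setdiff(rounds, available)
  if (length(missing_rounds) > 0) {
    stop(
      "Rounds ", paste0("ESS", missing_rounds, collapse = ", "),
      " don't have SDDF data available for ", country,
      call. = FALSE
    )
  }
  invisible(rounds)
}

# Illustrative call: round 2 is assumed to be unavailable for Denmark.
msg <- tryCatch(
  check_sddf_rounds("Denmark", 1:3, available = c(1, 3, 4, 5, 6)),
  error = function(e) conditionMessage(e)
)
```

Failing before any download starts avoids fetching some rounds only to error out on a later one.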
I'm very interested in making this work so that the next step concentrates on building a package for analyzing the ESS. However, it is a bit difficult if we can't get the SDDF files working properly across all rounds/countries. If you feel like it, take it for a spin, use it a bit and let me know what you find wrong/think of it.
Thanks a lot for import_sddf_country, @cimentadaj!
At that stage, we should ask for some comments from @daob, who might know about some of the SDDF inconsistencies above. @ajdamico might also know more.
Thanks @briatte. Also, there's the problem that neither of the downloaded rounds contains a round or wave column. For this particular example it is not important, but whenever we build a 'data validation' function it will surely need to have the round specified to do the merging; we should add it, possibly with the same name and coding used in the comprehensive data (e.g. ESS1 or just 1).
That variable can be inferred from the filenames, right?
Yep, but we should be sure that the column name and coding don't change across rounds.
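A minimal sketch of inferring the round from the file name, assuming every SDDF file name starts with ESS followed by the round number (as in ESS1_ES_SDDF.por). Both helpers are hypothetical, and essround is used as the column name only because the integrated files appear to use it.

```r
# Hypothetical helper: extract the round number from an SDDF file name.
round_from_filename <- function(path) {
  file <- basename(path)
  m <- regmatches(file, regexpr("^ESS[0-9]+", file))
  if (length(m) == 0) {
    stop("Cannot infer the ESS round from '", file, "'", call. = FALSE)
  }
  as.integer(sub("^ESS", "", m))
}

# Hypothetical helper: add the round as a column with a fixed name and coding.
add_round <- function(df, path) {
  df$essround <- round_from_filename(path)
  df
}

sddf <- data.frame(cntry = c("ES", "ES"), idno = c(1, 2),
                   stringsAsFactors = FALSE)
sddf <- add_round(sddf, "/tmp/ESS_Spain/ESS1/ESS1_ES_SDDF.por")
```

Storing the round as a plain integer keeps the coding identical across rounds regardless of how each file labels itself.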
Hi, I have done some very small testing using the data for Croatia. As I understand, currently the SPSS files are downloaded. So it works for me.
However, I have found a problem with the SDDF in SAS format for Croatia. I am not able to read the first 39 rows correctly using haven. Is it the same for you? There is only one SDDF for Croatia, for round 5.
I believe that in the future we should allow choosing the format, SAS or SPSS, for the SDDFs as well. Actually, I am considering pushing the ESS to also publish the SDDFs in CSV format.
@cimentadaj Working this out for the user would help her a lot, right? Would bring lots of value to the package in my view. I'm still working out those issues (there are many) with the French data.
@djhurio That has happened to me too. In my experience, foreign won't fail in those cases. And in my view, the SPSS files are the best choice right now (CSV would also work since, AFAIK, none of the SDDF variables are labelled).
I am importing SDDF data in the following way now: foreign::read.spss; foreign::read.spss; haven::read_sas.
See the full code if needed: https://github.com/djhurio/ESS/blob/master/Rcode/16-data-SDDF.R
The SDDF has been published as an integrated data file for the last two rounds. We could assume this is how the SDDFs will be published in the future. I do not have any other information available. So we should probably create the SDDF import function taking this assumption into account.
My idea is that we need to drop the country argument from the download_sddf and import_sddf functions. I propose to make a function import_sddf with arguments rounds and ess_email. It will download the SDDFs for the specified rounds for all countries available. What do you think?
I have been in touch by email with someone at ESS recently, and can confirm that SDDFs should now get published as all-country, round-level files.
As far as I understand there are four situations to take into account:
My guess is that @cimentadaj's solution (always download SDDF information for the countries the user is interested in, and handle the merge) is the best one, but I did not find the time to help on that yet.
I did, however, get one of the SDDFs for France corrected (it could not be merged before that; there might be other such problematic cases in old editions).
Hmm, I see. @briatte summarized it well. So one experimental way to do this would be to only read in the complete SDDF for each round (the user could then just filter out any country they want specifically or keep the complete round, what do you think?). For the recent rounds this is already done, and for the older rounds this could mean listing and downloading all the separate country files and then rbind-ing them. Can we be 100% sure that each country SDDF has the same columns and in the same order, @djhurio? If that is the case, then it's doable.
I'm also worried about mixing foreign and haven dependencies. I'm going to try to make a reproducible example showing where the haven problems are coming from, and perhaps they can fix it.
Here are the variables in each SDDF file that I downloaded in February 2019:
2002/ESS1_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_FR_SDDF.rds :idno,cntry,psu,samppoin,stratify,prob
2002/ESS1_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_IL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_LU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2002/ESS1_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_FR_SDDF.rds :idno,cntry,psu,samppoin,stratify,prob
2004/ESS2_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_LU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2004/ESS2_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_FR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_RU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2006/ESS3_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_FR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_RU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2008/ESS4_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_BE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_BG_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_CH_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_CY_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_EE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_FR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_GR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_HR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_IL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_LT_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_NL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_NO_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_PL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_PT_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_RU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_SE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_SK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2010/ESS5_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_AL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
# this one is weird but has the common variables
2012/ESS6_BE_SDDF.rds :name,edition,proddate,essround,cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_BG_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_CH_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_CY_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_EE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_FR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_IS_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_IT_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_LT_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_NL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_NO_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_PL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
# this one is weird but has the common variables
2012/ESS6_PT_SDDF.rds :name,edition,proddate,essround,cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_RU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_SE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_SK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
2012/ESS6_XK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob
# files below are round-level
2014/ESS7SDDFe01_1.rds :name,essround,edition,proddate,cntry,idno,psu,domain,stratify,prob
2016/ESS8SDDFe01_1.rds :name,essround,edition,proddate,cntry,idno,psu,domain,stratum,prob
I could not test reading the files with haven vs. foreign because I collected the data with @ajdamico's R code, which converts the files on the fly, but I strongly suspect that some files will be problematic with haven ("invalid multibyte whatever"). It's happened to me several times while reading old SPSS files with haven.
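Given the listing above, where the French files put idno before cntry and two round 6 files carry extra columns, a row-bind could be sketched by restricting every table to the columns they all share. bind_sddf is a hypothetical helper and the one-row data frames are made up.

```r
# Hypothetical helper: keep only the columns shared by every SDDF table,
# in the order of the first table, and stack the rows.
bind_sddf <- function(sddf_list) {
  common <- Reduce(intersect, lapply(sddf_list, names))
  do.call(rbind, lapply(sddf_list, function(df) df[, common, drop = FALSE]))
}

# Made-up one-row tables mimicking the three layouts in the listing.
cz <- data.frame(cntry = "CZ", idno = 1, psu = 5, samppoin = NA,
                 stratify = "1", prob = 0.001, stringsAsFactors = FALSE)
fr <- data.frame(idno = 1, cntry = "FR", psu = 7, samppoin = NA,
                 stratify = "2", prob = 0.002, stringsAsFactors = FALSE)
be <- data.frame(name = "ESS6", edition = "1.0", proddate = "x", essround = 6,
                 cntry = "BE", idno = 1, psu = 9, samppoin = NA,
                 stratify = "3", prob = 0.003, stringsAsFactors = FALSE)

all_sddf <- bind_sddf(list(cz, fr, be))
```

One caveat under this approach: the round-level files list domain instead of samppoin, and round 8 lists stratum instead of stratify, so intersecting names would silently drop those columns unless they are renamed first.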
It would be great to have those functions available as well:
import_sddf_rounds(rounds, ess_email = NULL)
import_all_sddf_rounds(ess_email = NULL)
download_sddf_rounds(rounds, ess_email = NULL, output_dir = getwd(), format = "stata")
Ok, so good news. I just wrote some code to read all the SDDF files available to figure out which errors are occurring when reading the SPSS files. The error is present only in rounds 1:4 and the error is the same for all countries/rounds:
#> Error in df_parse_por_file(spec, encoding = "", user_na = user_na): Failed to parse /Users/cimentadaj/Downloads/ESS1_SI_SDDF.spss/ESS1_SI_SDDF.por: Unable to read from file.
Luckily for us, this is just a mistake in the file extension! All SPSS files are actually .sav files rather than .por files.
Here's an example using the SDDF file for Slovenia in round 1:
haven::read_por("/Users/cimentadaj/Downloads/ESS1_SI_SDDF.spss/ESS1_SI_SDDF.por")
#> Error in df_parse_por_file(spec, encoding = "", user_na = user_na): Failed to parse /Users/cimentadaj/Downloads/ESS1_SI_SDDF.spss/ESS1_SI_SDDF.por: Unable to read from file.
haven::read_sav("/Users/cimentadaj/Downloads/ESS1_SI_SDDF.spss/ESS1_SI_SDDF.por")
#> # A tibble: 1,519 x 6
#> cntry idno psu samppoin stratify prob
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 SI 1 149 NA 04-1 0.00000755
#> 2 SI 2 149 NA 04-1 0.00000755
#> 3 SI 3 149 NA 04-1 0.00000755
#> 4 SI 4 149 NA 04-1 0.00000755
#> 5 SI 5 149 NA 04-1 0.00000755
#> 6 SI 6 149 NA 04-1 0.00000755
#> 7 SI 7 93 NA 07-4 0.00000755
#> 8 SI 8 94 NA 07-4 0.00000755
#> 9 SI 9 94 NA 07-4 0.00000755
#> 10 SI 10 94 NA 07-4 0.00000755
#> # … with 1,509 more rows
I have yet to implement this in the sddf branch, but I think that with this fix the SDDF download functions can mimic the import_ and download_ functions for countries and rounds. That is, there could be both import_sddf_country and import_sddf_round together with their corresponding download_ functions, just as @djhurio proposes. Of course, I'm not sure whether overriding the file extension from .por to .sav is the wisest idea, because at some point a .por file could actually be a .por file rather than a .sav file.
I'm gonna try to communicate this to the ESS directly. Feel free to also push for this through your own channels. Btw, does anyone have access to a copy of SPSS where they can test whether the .por file is actually read as .sav? That way we can argue that the extension is wrong in SPSS as well.
Everyone, I want to get @BernStZi involved in the thread. He's been an expert in the sampling/weighting of the ESS in the past. He suggested being included, as he can answer some of the questions on the nature of the SDDF data. I'm going to try to implement the changes I discussed above in the sddf branch soon so we can test it. @BernStZi, I'll prepare some questions regarding the nature of the data that seem a bit weird to me.
Hi everyone, I worked for some time for the ESS as an expert on sampling & weighting (mostly). So if you have any questions regarding the nature of the SDDFs or on the sampling designs of ESS countries from rounds 1 - 7, I might be able to help you.
@cimentadaj, I believe there is no need to have IBM SPSS Statistics to distinguish .sav files from .por files. The file command in Linux is handy for this. For example, we can see that all .sav files are in SPSS data format (binary).
djhurio@Skyforger ~/Dropbox/Darbs/ESS9-SWEP-2017-2019/ESS-git/data/SDDF $ file *.sav
ESS5_BE_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_BG_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_CH_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_CY_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_CZ_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_DE_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_DK_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_EE_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_ES_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_FI_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_FR_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_GB_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_GR_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_HR_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_HU_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_IE_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_IL_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_LT_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_NL_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_NO_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_PL_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_PT_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_RU_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_SE_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_SI_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_SK_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS5_UA_SDDF.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
ESS7SDDFe1_1.sav: SPSS System File TICS 64-bit MS Windows 22.0.0.0 \002
ESS8SDDFe01_1.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0 \002
It looks like genuine .por files are plain text files, while most of the files named .por in rounds 1 to 4 are actually SPSS system files.
djhurio@Skyforger ~/Dropbox/Darbs/ESS9-SWEP-2017-2019/ESS-git/data/SDDF $ file *.por
ESS1_CZ_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1_DE_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1_DK_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1_ES_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1-FI-SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1_FR_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS1_GB_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1_HU_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1_IE_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1_IL_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1_LU_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS1_SI_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_CZ_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_DE_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_ES_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_FI_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_FR_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS2_GB_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_HU_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_IE_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_LU_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_SI_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS2_UA_SDDF.por: SPSS System File MS Windows Release 14.0.1 \002
ESS3_DE_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS3_DK_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS3_ES_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS3_FI_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS3_FR_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS3_GB_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS3_HU_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS3_IE_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS3_RU_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS3_SI_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS3_UA_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_CZ_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_DE_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_DK_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_ES_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_FI_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_FR_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_GB_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_HU_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_RU_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_SI_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS4_UA_SDDF.por: SPSS System File MS Windows Release 15.0.1 \002
ESS6_AL_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_BE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_BG_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_CH_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_CY_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_CZ_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_DE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_DK_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_EE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_ES_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_FI_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_FR_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_GB_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_HU_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_IE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_IS_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_IT_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_LT_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_NL_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_NO_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_PL_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_PT_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_RU_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_SE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_SI_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_SK_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_UA_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_XK_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS7SDDFe1_1.por: ISO-8859 text, with CRLF line terminators
Now I see why foreign works but haven fails on wrongly named SPSS data files.
haven decides whether to use read_sav or read_por based on the file extension. See the code from haven.R:
ext <- tolower(tools::file_ext(file))
switch(ext,
  sav = read_sav(file, user_na = user_na),
  zsav = read_sav(file, user_na = user_na),
  por = read_por(file, user_na = user_na),
  stop("Unknown extension '.", ext, "'", call. = FALSE)
)
foreign does it in a much smarter way: it checks the file contents rather than the extension. See the code from spss.c:
if (0 == strncmp("$FL2", buf, 4)) {
    fclose(fp);
    ans = read_SPSS_SAVE(filename);
} else {
    if (!is_PORT(fp)) {
        fclose(fp);
        error(_("file '%s' is not in any supported SPSS format"),
              filename);
    }
    fclose(fp);
    ans = read_SPSS_PORT(filename);
}
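The same content-based check can be sketched in base R. This is only an illustration, not code from foreign or haven: the helper name sniff_spss_format is hypothetical, the "$FL2" signature is the one tested in the spss.c excerpt, and the "SPSS PORT FILE" test is a heuristic that assumes the conventional portable-file splash text appears within the first 200 bytes.

```r
# Hypothetical helper: guess the SPSS format from the file contents,
# ignoring the file extension entirely.
sniff_spss_format <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  head_bytes <- readBin(con, what = "raw", n = 200L)

  # System (.sav) files start with the 4-byte signature "$FL2"
  if (length(head_bytes) >= 4L &&
      identical(head_bytes[1:4], charToRaw("$FL2"))) {
    return("sav")
  }

  # Heuristic for portable (.por) files: keep printable ASCII only,
  # then look for the conventional splash text.
  codes <- as.integer(head_bytes)
  printable <- head_bytes[codes >= 32L & codes <= 126L]
  if (grepl("SPSS PORT FILE", rawToChar(printable), fixed = TRUE)) {
    return("por")
  }

  "unknown"
}
```

A reader function could then dispatch on the returned string instead of on tools::file_ext(), which is what goes wrong for the mislabelled SDDF files.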
Nice hunting, @djhurio
@BernStZi, FWIW, thanks a lot for your work on the ESS, one of the rare European surveys out there with a proper weighting structure! The SDDF files are still a bit of a mess, as the discussion above highlights, but still, this survey is exceptionally high-quality.
I've just pushed the SDDF changes to the master branch of essurvey on GitHub. You can use the functions show_sddf_rounds and import_sddf_cntrounds after installing with devtools::install_github("ropensci/essurvey"). Currently, only rounds 1:6 are supported because the two newest rounds are integrated; I'll have to write separate code for those rounds.
Give it a test if you can, because I'd really like this to be robust for the next essurvey release.
Excellent, thanks!
I've tested the function using the loop below:
library(essurvey)
library(magrittr)
library(tibble)
library(tidyr)
d <- essurvey::show_countries()
d <- tidyr::crossing(country = d, round = 1:6) %>%
  tibble::add_column(file = list(NULL))
for (j in unique(d$country)[11:38]) {
  for (i in essurvey::show_sddf_rounds(j)) {
    cat(j, "Round", i, "\n")
    x <- try(
      essurvey::import_sddf_country(j, i, "f.briatte@ed.ac.uk"),
      silent = TRUE
    )
    if ("try-error" %in% class(x)) {
      d$file[[ which(d$country == j & d$round == i) ]] <- "error"
    } else {
      d$file[[ which(d$country == j & d$round == i) ]] <- x
    }
  }
}
The tests show that only a few SDDF files are unreadable under the current implementation: those are the SDDFs for France, Rounds 1, 2, 3.
The error that shows up in all cases is:
Invalid time string (length=8): 0-------
Perhaps SDDF files can be more safely read from their SPSS versions? In which case, all that would be needed is switching download_sddf_country to use format = 'spss' rather than 'stata'.
The results of my test, including all SDDF data that I managed to download with the current version of the package, are there:
Thanks a lot @briatte! I've spotted the problem and it's related to the .por and .sav problems I outlined above.
For France 1:3 it seems that the .por files are ACTUALLY .por files and not .sav files as described above. See the examples below:
library(essurvey)
library(haven)
set_email("cimentadaj@gmail.com")
dwn_dir <- download_sddf_country("France", 1, output_dir = tempdir(),
format = 'spss')
#> All files saved to /tmp/RtmpRkSnJs
all_dirs <- list.files(dwn_dir, full.names = TRUE)
# The file is a `.por` file.
all_dirs
#> [1] "/tmp/RtmpRkSnJs/ESS_France/ESS1/ESS1_FR_SDDF.por"
#> [2] "/tmp/RtmpRkSnJs/ESS_France/ESS1/ESS1.zip"
#> [3] "/tmp/RtmpRkSnJs/ESS_France/ESS1/sddf-documentation.pdf"
# Read using `read_sav`
read_sav(all_dirs[1])
#> Invalid time string (length=8): 0-------
#> Error in df_parse_sav_file(spec, encoding, user_na): Failed to parse /tmp/RtmpRkSnJs/ESS_France/ESS1/ESS1_FR_SDDF.por: The file's timestamp string is invalid.
# Read using `read_por`
read_por(all_dirs[1])
#> # A tibble: 1,503 x 6
#> IDNO CNTRY PSU SAMPPOIN STRATIFY PROB
#> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
#> 1 10001 FR 130 NA "" Inf
#> 2 10002 FR 401 NA "" Inf
#> 3 10003 FR 108 NA "" Inf
#> 4 10004 FR 711 NA "" Inf
#> 5 10005 FR 102 NA "" Inf
#> 6 10006 FR 404 NA "" Inf
#> 7 10007 FR 623 NA "" Inf
#> 8 10008 FR 908 NA "" Inf
#> 9 10009 FR 909 NA "" Inf
#> 10 10010 FR 608 NA "" Inf
#> # … with 1,493 more rows
On the other hand, if I run EXACTLY the same code as above for Spain, the file has a .por extension but can only be read successfully with read_sav.
library(essurvey)
library(haven)
set_email("cimentadaj@gmail.com")
dwn_dir <- download_sddf_country("Spain", 1, output_dir = tempdir(),
format = 'spss')
all_dirs <- list.files(dwn_dir, full.names = TRUE)
# The file is a `.por` file.
all_dirs
#> [1] "/tmp/RtmpCpqs02/ESS_Spain/ESS1/ESS1_ES_SDDF.por"
#> [2] "/tmp/RtmpCpqs02/ESS_Spain/ESS1/ESS1.zip"
#> [3] "/tmp/RtmpCpqs02/ESS_Spain/ESS1/sddf-documentation.pdf"
# Read using `read_sav`
read_sav(all_dirs[1])
#> # A tibble: 1,729 x 6
#> cntry idno psu samppoin stratify prob
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 ES 2 105903019 NA 211 0.00000789
#> 2 ES 5 105903019 NA 211 0.0000118
#> 3 ES 8 105903019 NA 211 0.00000789
#> 4 ES 13 200301002 NA 421 0.0000116
#> 5 ES 15 200301002 NA 421 0.0000116
#> 6 ES 22 200301002 NA 421 0.00000578
#> 7 ES 23 200301002 NA 421 0.0000231
#> 8 ES 25 200305002 NA 421 0.00000558
#> 9 ES 27 200305002 NA 421 0.00000743
#> 10 ES 31 200305002 NA 421 0.0000112
#> # … with 1,719 more rows
# Read using `read_por`
read_por(all_dirs[1])
#> Error in df_parse_por_file(spec, encoding = "", user_na = user_na): Failed to parse /tmp/RtmpCpqs02/ESS_Spain/ESS1/ESS1_ES_SDDF.por: Unable to read from file.
I'm not very happy with the strategy of reading .por files with read_sav, precisely because of this problem. @djhurio or @BernStZi, is there any way we can contact the ESS so that they save the files in the format that matches the extension? I just want to make sure that the .por files are actual .por files and not .sav files. Along the same lines, can someone try reading these files with SPSS? Does it work? Does SPSS know how to read .por files that are really .sav files? @djhurio suggested using foreign, which can read them successfully, but that would be a problem for the package's existing dependencies because everything uses haven.
Any ideas?
Yes, I will inform the ESS Data Archive about this issue.
I thought of creating an alternative to haven::read_spss that would detect POR and SAV files from the file contents. Something like this:
read_spss_2 <- function(file, encoding = NULL, user_na = FALSE) {
  # Read only the first line of the file as a single string
  x <- data.table::fread(input = file, sep = "", nrows = 1, header = FALSE)$V1
  if (grepl("SPSS PORT FILE", x, useBytes = TRUE)) {
    haven::read_por(file = file, user_na = user_na)
  } else if (grepl("SPSS DATA FILE", x, useBytes = TRUE)) {
    haven::read_sav(file = file, encoding = encoding, user_na = user_na)
  } else {
    stop("Unknown file format")
  }
}
However, I realised that haven::read_por does not read POR files correctly. See for example:
> as.data.table(haven::read_por("data/SDDF/ESS1_FR_SDDF.por"))
IDNO CNTRY PSU SAMPPOIN STRATIFY PROB
1: 10001 FR 130 NA Inf
2: 10002 FR 401 NA Inf
3: 10003 FR 108 NA Inf
4: 10004 FR 711 NA Inf
5: 10005 FR 102 NA Inf
---
1499: 11499 FR 709 NA Inf
1500: 11500 FR 126 NA Inf
1501: 11501 FR 612 NA Inf
1502: 11502 FR 813 NA Inf
1503: 11503 FR 408 NA Inf
> as.data.table(foreign::read.spss("data/SDDF/ESS1_FR_SDDF.por"))
IDNO CNTRY PSU SAMPPOIN STRATIFY PROB
1: 10001 FR 130 NA 1.651397e-05
2: 10002 FR 401 NA 8.236820e-07
3: 10003 FR 108 NA 1.651397e-05
4: 10004 FR 711 NA 7.193571e-06
5: 10005 FR 102 NA 5.061940e-07
---
1499: 11499 FR 709 NA 4.306910e-07
1500: 11500 FR 126 NA 1.651397e-05
1501: 11501 FR 612 NA 4.427837e-06
1502: 11502 FR 813 NA 1.933014e-06
1503: 11503 FR 408 NA 3.924291e-06
I tried reading the data with PSPP and it can open all of the files, so I assume it detects the file type from the file content.
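A quick sanity check along these lines can flag a misread prob column before it is used for weighting. The function name check_prob is hypothetical, and the rule that usable inclusion probabilities are finite and lie in (0, 1] is an assumption about what a valid prob column should look like.

```r
# Hypothetical check: count implausible values in an inclusion-probability column.
check_prob <- function(prob) {
  c(
    missing      = sum(is.na(prob)),
    non_finite   = sum(is.infinite(prob)),
    out_of_range = sum(is.finite(prob) & (prob <= 0 | prob > 1))
  )
}

# Values in the style of the haven and foreign read-outs for France Round 1
check_prob(c(Inf, Inf, Inf))
check_prob(c(1.651397e-05, 8.236820e-07, 7.193571e-06))
```

Any non-zero count suggests that the file should be re-read with another reader or inspected by hand.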
From both your posts, I think the solution involves the following:
1. Falling back on foreign when haven fails.
2. […] .por files as .sav files.
3. […] .por files being misread by haven.
Step 1 should be easy enough:
read_sav <- function(file, quiet = TRUE) {
  # Try haven first; keep the path in `file` so it is still available on failure
  out <- try(haven::read_sav(file), silent = TRUE)
  if (inherits(out, "try-error")) {
    if (!quiet) {
      warning("File ", basename(file), " read with `foreign::read.spss` instead of `haven::read_sav`")
    }
    foreign::read.spss(file)
  } else {
    out
  }
}
@djhurio seems to have handled Step 2.
I can handle Step 3 if needed.
Example issues with reading POR files using haven, including one by @cimentadaj that looks very much like the problem we're having here:
* […] haven does not use the very last version of ReadStat
* […] .por issue (old, closed)
I have contacted the ESS and have received the following answer:
The files were created using the “save outfile”-command instead of the “export outfile”-command in SPSS. When using “save outfile” the file type is actually SAV (SPSS data) as you mentioned below. We will replace the listed “.por”-files with “.sav”-files (unicode) in near future. Please note that we no longer create and support .por files in future data releases.
Cool! One problem off the list.
Also -- I've checked the prob columns in the SDDF data that I posted earlier, and two files are weird:
* Switzerland Round 6 -- all weights equal to 1
* UK Round 4 -- all weights missing
In the RDS file that I posted earlier, those are rows 210 and 226 respectively (check the tibble stored in the file column).
This is great @djhurio, thanks!
As for @briatte and @djhurio's solution of falling back on foreign, I'm a bit concerned about the output. Since haven outputs tibbles, I wouldn't want to break the consistent output. We could simply wrap the foreign call with tibble and test it (also taking care of the column classes that haven assigns, which store the labelled columns from Stata in a custom labelled class in R). If you think it's a good idea, I'll build on your previous examples to write this fallback mechanism.
@briatte I was also thinking of opening an issue with haven based on @djhurio's example, because it seems haven is not reading these files as correctly as foreign does. Would you have some time, or should I go for it?
@briatte I did some digging.
> Also -- I've checked the prob columns in the SDDF data that I posted earlier, and two files are weird:
> * Switzerland Round 6 -- all weights equal to 1
Switzerland had, or at least that is how they explained it, a single-stage equal-probability design. That is probably the reason why they simply decided to set the prob column to one, which equals the (scaled) design weight for such a design.
> * UK Round 4 -- all weights missing
I don't know why they didn't include the prob column here. The UK didn't have an equal-probability design, but the design weights must be included in the main data set.
Also, I noticed that the design weights in the main data set can differ, for some countries, from those that you get if you compute them from the SDDF files. In particular the SDDFs of early rounds (1 - 4) can have some quality issues.
@cimentadaj
Concerning your first point, yes, I do believe a foreign fallback would be useful. Although technically, the fallback should be offered by… haven itself, since the issue is related to haven (or ReadStat), not to essurvey.
I'll open an issue on the haven repo, submitting the file as an example.
@BernStZi
Thanks a lot for digging. I understand that Switzerland Round 6 is a non-issue, while UK Round 4 is a weird issue, since indeed, the main data file ESS4GB has a valid dweight column.
I am more concerned about your second point, since you are basically saying that the best strategy is to use dweight (from the main data files) rather than the SDDF files.
Does that mean that, instead of using, e.g.
ess4gb_design <- svydesign(
  ids = ~ psu,
  strata = ~ stratify,
  probs = ~ prob,
  data = ess4gb,
  nest = TRUE
)
… the user should rather use:
ess4gb_design <- svydesign(
  ids = ~ idno,
  weights = ~ dweight,
  data = ess4gb
)
… since that column carries the same information, and possibly better information?
Again, many thanks for your help. The ESS is such a high-quality dataset that I'm eager to learn exactly what the best method to weight it is.
I have seen inconsistencies between sampling probabilities (from the SDDFs) and design weights (from survey data). I believe in most of those cases trimming of design weights has been applied.
dweight is taken from the survey data and dweight2 is computed from prob.
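A minimal sketch of how such a comparison could be set up, using made-up data for one country-round. It assumes that dweight2 is the inverse of prob rescaled to mean 1; the rescaling is an assumption made here so that both weights are on the same scale, and it may not match how the plotted values were produced.

```r
# Made-up example data for one country-round
sddf <- data.frame(idno = 1:5, prob = c(0.001, 0.002, 0.004, 0.002, 0.001))
main <- data.frame(idno = 1:5, dweight = c(1.43, 0.71, 0.36, 0.71, 1.43))

# Raw design weight is 1 / prob; rescale so the weights average 1 (assumption)
sddf$dweight2 <- (1 / sddf$prob) / mean(1 / sddf$prob)

# Merge on the respondent id and compare the two versions
cmp <- merge(main, sddf, by = "idno")
cmp$ratio <- cmp$dweight / cmp$dweight2
cmp
```

Ratios that depart from 1 mark respondents whose published dweight differs from the weight derived from the SDDF.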
@briatte
Yes, my general recommendation is to use dweight from the main data set as design weights. My impression was that, at least for the early rounds, dweight was constructed with more care and might not originate solely from the prob variable in the published SDDFs.
Also, prob contains the inclusion probabilities of the gross sample, i.e. formally they should be corrected for the loss in sample size from gross to net sample. Using dweight implicitly assumes a uniform response process, because of the scaling to the net sample size, i.e. you correct with a constant factor.
For a two-stage stratified design you could use:
ess4gb_design <-
  svydesign(
    ids = ~ psu + idno,
    strata = ~ stratify,
    weights = ~ dweight,
    nest = TRUE,
    data = ess4gb
  )
However, the survey package will treat the above design as a single-stage cluster sample, ignoring any sampling beyond the first stage (i.e. instead of ids = ~ psu+idno you can simply use ids = ~ psu). If you want to consider all sampling stages for variance estimation, you would need to use the fpc argument and supply it with the inclusion probabilities of the sampling units at each sampling stage, e.g. fpc = ~ prob1 + prob2, where prob1 are the inclusion probabilities of the PSUs and prob2 those of the secondary sampling units (SSUs). The information on the inclusion probabilities at all sampling stages is collected by the ESS but, mostly for reasons of disclosure control, is not included in the SDDFs that are released to the public.
The real challenge in analysing ESS data is not so much about the correct use of weights but considering the complex sampling design when doing variance estimation. The SDDF is, with its information on clustering and stratification, mainly necessary for SE or variance estimation. For correct point estimation you wouldn't need it.
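To illustrate the point with made-up numbers: the weighted mean is the same whether or not the clusters are known, but the standard error is not. The sketch below uses base R and a simple with-replacement ("ultimate cluster") linearisation for a single stratum. It is an illustration under those assumptions, not a replacement for svydesign, and the helper name se_wmean is hypothetical.

```r
# Hypothetical helper: linearised SE of a weighted mean, treating `psu` as the
# primary sampling units of a single stratum (with-replacement approximation).
se_wmean <- function(y, w, psu) {
  est <- sum(w * y) / sum(w)
  z <- w * (y - est) / sum(w)     # linearised variable
  z_psu <- tapply(z, psu, sum)    # PSU totals of the linearised variable
  n_psu <- length(z_psu)
  sqrt(n_psu / (n_psu - 1) * sum((z_psu - mean(z_psu))^2))
}

# Made-up data: 4 PSUs of 5 respondents, answers identical within a PSU
y   <- rep(1:4, each = 5)
w   <- rep(1, 20)
psu <- rep(1:4, each = 5)

point_estimate       <- sum(w * y) / sum(w)          # needs no design information
se_ignoring_clusters <- se_wmean(y, w, seq_along(y)) # every respondent is a PSU
se_with_clusters     <- se_wmean(y, w, psu)          # uses the SDDF-style PSU id
```

With strongly clustered answers like these, the standard error that respects the PSUs comes out larger than the one that ignores them, while the point estimate is unchanged.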
@djhurio
Your plot is very interesting. I also think that most differences are due to the trimming procedure, which would explain the observed patterns for most rounds and countries. But there are also the cases of Israel Round 1, France Round 1 and UK Rounds 2 - 3, which I think cannot be attributed solely to the trimming. In particular, France Round 1 looks very weird.
I think there were fewer quality controls in place back then, and I cannot say for sure whether the prob variable in the public SDDFs of these early rounds is the source for dweight in the main data file.
I still have summaries of the sampling designs for ESS Round 1 - 7 that hold the necessary information to specify svydesign objects with regard to sampling stages, stratification, sampling domains, etc. Maybe the ESS can even publish these summaries, which were compiled from the Sampling Sign-off Forms back to Round 1. I found these summaries to be very useful, as they reduced the need to dig through dozens of different documents to find the necessary information, especially if you are using data from multiple rounds and countries. (You would hope that you could deduce the sampling design from the structure of the SDDF alone, but this is not always the case.)
@BernStZi, thanks a lot for this explanation! However, I do not fully agree with you regarding the design weights. What I have learned, and have always assumed, is that design weights are derived directly from the sampling probabilities. Namely, dweight should be equal to 1 / prob. The name of those weights indicates that they are derived purely from the sample design.
I agree that design weights alone are not sufficient in the case of non-response. Well, you can apply them, but better results can be gained by applying so-called non-response corrections to the weights. This is usual practice. However, those corrected weights cannot be called design weights, as they are derived using extra information beyond what the sampling design provides.
Everyone, these are some very interesting comments. I'm very happy that we're discussing this. As I told @BernStZi (and @djhurio at the beginning of this thread), the idea is that once we get these SDDF files ready, the next step is to explore developing a package to analyze the ESS data. Essentially, this means compiling all survey design information for each round/country and building an API so that essurvey can download the SDDF, merge it and automatically create a survey object based on the survey information we have for that round/country.
Keep it up and let's see if we can finish up the SDDF function soon.
@BernStZi -- Thank you very much for your detailed answer, which is extremely useful since, as you note, this information is not easily deducible from the ESS documentation:
> I still have summaries of the sampling designs for ESS Round 1 - 7 that hold the necessary information to specify svydesign objects with regard to sampling stages, stratification, sampling domains, etc. Maybe the ESS can even publish these summaries, which were compiled from the Sampling Sign-off Forms back to Round 1.
I think we can suggest that to the ESS team when we email them the list of things that we want to tell them, based on our discussion (weird differences between design weights and SDDF files in Israel Round 1, France Round 1 and UK Rounds 2-3, missing prob in UK Round 4).
@cimentadaj -- The information that shows up in this discussion could be summarised in many ways:
* […] haven does for col_types)
* […] parse_sddf or create_svydesign function
P.S. I've reported the issue identified by @djhurio in an earlier comment -- it's a ReadStat issue, so I reported it there. I also posted it on the haven repo for reference. See cross-refs below.
I'm happy to report that the issue in ReadStat reported above is now fixed in the dev branch, which means that the next version of ReadStat/haven should solve the problem.
Thanks a lot to @evanmiller.
Awesome, thanks @briatte! I will be busy until the end of June; I'll get back to this then. So, just to organize ourselves, here's the TODO list I've summarized (keep adding if I forgot anything):
* […] foreign
* […] read_* function OR wait until the ESS saves .por files as .sav, as @djhurio has already told them about this.
When that's done, we could potentially read all SDDF files from rounds 1:6. Additionally:
* […] import_sddf_* functions to read rounds 7:8, which are integrated
I also think we could harmonize the SDDF column names to lower case, because the case differs between countries/rounds.
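Harmonising the column names could be as simple as lower-casing them after import. A minimal sketch in base R, using made-up stand-ins for two SDDF files whose columns differ in case (as in the France and Spain outputs earlier in the thread); the helper name lower_names is hypothetical.

```r
# Hypothetical helper: lower-case all column names of a data frame
lower_names <- function(df) {
  names(df) <- tolower(names(df))
  df
}

# Made-up stand-ins for two SDDF files with different column-name case
fr <- data.frame(IDNO = 10001, CNTRY = "FR", PSU = 130, PROB = 1.651397e-05)
es <- data.frame(cntry = "ES", idno = 2, psu = 105903019, prob = 0.00000789)

# After harmonising, the files can be stacked; rbind matches columns by name
sddf_all <- rbind(lower_names(fr), lower_names(es))
```

Without the lower-casing step, rbind would refuse to combine the two data frames because their column names do not match.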
I've just submitted a PR that does nothing other than fall back on foreign when haven fails to read a file. I have not updated the package documentation, but the code will issue a warning to the user if that happens.
I've tested the code on all SDDF files currently available, using the loop below: everything gets read correctly. The only problematic files are France 1, 2 and 3.
library(essurvey) # remotes::install_github('briatte/essurvey')
dir.create("ess-sddf", showWarnings = FALSE)
for (i in show_countries()) {
  for (j in show_sddf_rounds(i)) {
    cat(i, j, "\n")
    f <- paste0("ess-sddf/", i, j, ".rds")
    if (!file.exists(f)) {
      x <- import_sddf_country(i, j, format = "spss")
      readr::write_rds(x, f)
    }
  }
}
Hi everyone, I merged the changes introduced by @briatte that add a fallback mechanism for reading ESS data with foreign. This means that we can read any SDDF data from rounds 1:6. This is a great addition! Thanks @briatte. I will try to push forward with reading rounds 7:8 so that we can finish this in the next month or so. Feel free to test the new work by installing the latest essurvey with devtools::install_github("ropensci/essurvey") and downloading your SDDF data with essurvey::import_sddf_country. Let us know of any problems you might find.
It would be useful to make a function or functions for downloading Integrated sample design data files (SDDF) from the ESS website. They are useful for computing sampling errors. This is an example of SDDF for the round 7.