ropensci / essurvey

Download data from the European Social Survey
https://docs.ropensci.org/essurvey

Integrated sample design data file (SDDF) #9

Closed djhurio closed 5 years ago

djhurio commented 6 years ago

It would be useful to have a function, or functions, for downloading the integrated sample design data files (SDDF) from the ESS website. They are useful for computing sampling errors. This is an example of an SDDF for round 7.

cimentadaj commented 6 years ago

This seems like a very useful idea. I will add it in the next update. I've also been thinking (perhaps as another package) of making a few helper functions to actually calculate population percentages with the predefined weights by country, among other things.

djhurio commented 6 years ago

Thanks! We could probably collaborate and also include tools for sampling error estimation, such as producing estimates of proportions with corresponding confidence intervals from the ESS data.

cimentadaj commented 6 years ago

Great, let's keep in touch!

djhurio commented 5 years ago

Hi, I have written some R code to import all SDDF files into R. Maybe this helps. I have used some parts of your code. https://github.com/djhurio/ESS/blob/master/Rcode/16-data-SDDF.R

cimentadaj commented 5 years ago

Hi @djhurio, I'm just about to leave for the holidays but I will check it out once I get back mid January. Maybe we can adjust it to work with essurvey and include it as an additional function.

Let's keep in touch in January!

djhurio commented 5 years ago

This would be great! Have a nice holiday!

briatte commented 5 years ago

Hi @cimentadaj

I came here because I was wondering, like @djhurio, if the essurvey package could also download SDDF files. Anthony Damico's lodown package covers them.

Please let me know if you need help coding that into essurvey. It looks like the functions that could be amended to support them are those to download country files: perhaps a 'simple' sddf = TRUE option to import_country would work. Ideally, the code would also show how to merge the country data and its SDDF file.

cimentadaj commented 5 years ago

Hi @briatte

This is something that I'd really like to work on, thanks for bringing it up. With the work of @djhurio, I think we have a great start at this, but I've found that it's not completely straightforward because not all country-round combinations have actual SDDF data. For example, Slovakia in round 4.

Perhaps we can work out the implementation here and I'll add it to a new branch as we go.

So, there are two issues right now.

I'm slightly inclined to make it an invisible step, such that import_country already takes care of downloading the SDDF and merging it, because that takes the user out of the loop. Since merging the SDDF does not change the existing data at all (it just adds columns to the right), I think that doing it invisibly is no problem. If we did any sort of 'transformation' to the data, then I would see the point in breaking it into different steps.

But then what happens when that country doesn't have SDDF data? Add the new columns with empty rows? No columns? I kind of like the idea of creating the columns with empty rows because it signals that the function attempted to download the SDDF but found nothing. This, together with proper documentation of the sddf argument, could make it clear why this happens.

What do you guys think?

briatte commented 5 years ago

Hi @cimentadaj

I like the idea of silently adding the SDDF file if it exists, but since the download_format already speaks (once) to the user, perhaps we could also communicate whether the SDDF was found and merged during the download process?

As for adding empty columns, does it really make sense? I'm asking because ESS variable names are not perfectly stable (see, e.g., party voted for in the last national election), and I think empty columns would make sense only if they were.

Dear @djhurio ~ let's hear from you too :)

djhurio commented 5 years ago

Hello,

We will not be able to do an invisible merge of ESS data and SDDF data right now, because there are several data conflicts. For example, there are some strange records in the SDDFs which should be deleted before merging, such as records with missing ID numbers or missing country codes. There are some missing cases in the SDDFs as well: there are records in the ESS data which do not have corresponding records in the SDDFs.

What we can do:

  1. I will work with the ESS data archive to fix those data conflicts. However, I am not sure they will be able to fix everything. If there are conflicting data now, there may also be conflicting data in future rounds. That means some kind of data validation procedure should be in place. And if there is a conflict, what should we do? Should we return an error, or should we return the data without SDDF information and a warning? It becomes too complex.
  2. Meanwhile, we can work on a separate function for SDDF downloads. I believe it is better to download ESS data and SDDF data separately. Afterwards we can think about some kind of ESS data merger function with data validation, etc.

cimentadaj commented 5 years ago

Ok, now I really think we should avoid the silent merging. Merging with incomplete data will generate errors and perhaps duplication. I agree that in the meantime we should create a function to grab the SDDF file, reusing most of the essurvey infrastructure, and when the time comes, perhaps we can integrate it within import_country. To me, the safest thing to do would be to create an import_sddf and create a validation function to merge.

The first step is to get a mockup of this import_sddf. I'll try to work on this in the next few months. As soon as I have the tiniest example, I'll post it here so we can all work on it.
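
A minimal sketch of what such a validation step before merging could look like, based on the conflicts listed above (missing ID numbers or country codes, ESS records without an SDDF counterpart). The function name `merge_sddf_checked` and its arguments are hypothetical and not part of essurvey; the key columns `cntry` and `idno` are assumed to be present in both data sets.

```r
# Hypothetical helper (not part of essurvey): merge ESS respondent data with
# an SDDF only after some basic validation of the SDDF records.
merge_sddf_checked <- function(ess, sddf, keys = c("cntry", "idno")) {
  names(ess) <- tolower(names(ess))
  names(sddf) <- tolower(names(sddf))

  # Drop SDDF records whose key is missing or empty
  key_missing <- function(x) is.na(x) | (is.character(x) & !nzchar(x))
  bad <- Reduce(`|`, lapply(sddf[keys], key_missing))
  sddf <- sddf[!bad, , drop = FALSE]

  # Refuse to merge if a key combination appears more than once
  if (anyDuplicated(sddf[keys]) > 0) {
    stop("SDDF contains duplicated ", paste(keys, collapse = "/"), " records")
  }

  # Left join: every ESS record is kept, SDDF columns are added to the right
  merged <- merge(ess, sddf, by = keys, all.x = TRUE, sort = FALSE)

  # Warn about ESS records without SDDF information
  make_key <- function(d) do.call(paste, c(unname(as.list(d[keys])), sep = "\r"))
  unmatched <- !(make_key(ess) %in% make_key(sddf))
  if (any(unmatched)) {
    warning(sum(unmatched), " ESS record(s) have no matching SDDF record")
  }
  merged
}
```

Whether a conflict should raise an error or only a warning is exactly the open design question in this thread; the sketch errors on duplicates and warns on unmatched records, but either choice could be swapped.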

briatte commented 5 years ago

I also vote for import_sddf.

cimentadaj commented 5 years ago

Ok, so I managed to sit down for a few hours today and got a working version. It's very limited and below are some examples of what it can and cannot do.

# Install the branch version with this:
remotes::install_github("ropensci/essurvey", ref = "sddf", quiet = TRUE, upgrade = TRUE)

library(essurvey)
set_email("cimentadaj@gmail.com")

# All SDDF rounds available for a given country
show_sddf_rounds("Spain")
#> [1] 1 2 3 4 5 6

# When no SDDF rounds available:
show_sddf_rounds("Austria")
#> numeric(0)

There is a problem with compatibility between rounds, given that since round 7 the ESS has been uploading the SDDF files integrated into one single Stata/SPSS file instead of separate files for each country. For example, compare rounds 1:6 with rounds 7:8 here

This brings about some issues because we cannot check interactively which countries have SDDF data without downloading the data and filtering. From my perspective, this brings all sorts of problems because we don't know the specific nomenclature they use to name countries within the file or whether it changes over time. Moreover, this breaks the complete pipeline of the package, which simply reads the HTML from the website to figure out who participated.

To implement this we would need to create a separate function to download the SDDF data, load it into memory and filter out which countries are present. I don't like this at all but I'm not sure whether the ESS is looking to adhere to the previous way of uploading SDDF data. As of now, to test this, show_sddf_rounds only checks for the first 6 rounds.

Below I test the actual import_sddf_country for downloading two rounds from Spain.

tst <- import_sddf_country("Spain", 5:6)
#> Downloading ESS5
#> Downloading ESS6

tst 
#> [[1]]
#> # A tibble: 1,885 x 6
#>    CNTRY  IDNO   PSU SAMPPOIN STRATIFY  PROB
#>    <chr> <dbl> <dbl>    <dbl> <chr>    <dbl>
#>  1 ES        1   199       NA 131         NA
#>  2 ES        2   101       NA 164         NA
#>  3 ES        3   407       NA 73          NA
#>  4 ES        6   394       NA 103         NA
#>  5 ES        8   129       NA 124         NA
#>  6 ES        9    12       NA 41          NA
#>  7 ES       10   149       NA 133         NA
#>  8 ES       11    16       NA 21          NA
#>  9 ES       12   308       NA 154         NA
#> 10 ES       14   321       NA 74          NA
#> # … with 1,875 more rows
#> 
#> [[2]]
#> # A tibble: 1,889 x 6
#>    cntry  idno   psu samppoin stratify      prob
#>    <chr> <dbl> <dbl>    <dbl> <chr>        <dbl>
#>  1 ES        1   239       NA 131      0.0000717
#>  2 ES        2   199       NA 171      0.0000534
#>  3 ES        3    19       NA 101      0.0000738
#>  4 ES        5   345       NA 51       0.0000735
#>  5 ES        6   273       NA 11       0.0000733
#>  6 ES        8   411       NA 74       0.0000641
#>  7 ES        9    81       NA 91       0.0000717
#>  8 ES       10   210       NA 133      0.0000622
#>  9 ES       11    74       NA 91       0.0000716
#> 10 ES       13   163       NA 13       0.0000611
#> # … with 1,879 more rows

The data can now be downloaded and read successfully. However, I see some problems in terms of comparability. First, is it expected that the prob column in round 5 is completely NA? Second, the two rounds use different cases in their column names (this is minor). Lastly, can we assume that the same column names are standard across all waves/countries?

This example was one that worked just fine, but when I tried earlier rounds, such as rounds 1:2 for Spain, I found a weird error with the old SPSS files (the earlier files are only available in SPSS format) using the read_por function from haven (essurvey uses read_por behind the scenes).
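
The first two comparability checks can be sketched in base R. This is only an illustration: `standardise_sddf` and `all_na_columns` are hypothetical helpers, and the toy data frames copy the first two rows of the two tibbles shown above rather than the real files.

```r
# Hypothetical helpers, not part of essurvey.
# Lower-case the column names so rounds can be compared by name.
standardise_sddf <- function(x) {
  names(x) <- tolower(names(x))
  x
}

# Names of the columns that contain nothing but NA
all_na_columns <- function(x) {
  names(x)[vapply(x, function(col) all(is.na(col)), logical(1))]
}

# Toy stand-ins for the round 5 and round 6 output above
round5 <- data.frame(CNTRY = "ES", IDNO = 1:2, PSU = c(199, 101),
                     SAMPPOIN = NA, STRATIFY = c("131", "164"), PROB = NA)
round6 <- data.frame(cntry = "ES", idno = 1:2, psu = c(239, 199),
                     samppoin = NA, stratify = c("131", "171"),
                     prob = c(0.0000717, 0.0000534))

sddf <- lapply(list(round5, round6), standardise_sddf)
identical(names(sddf[[1]]), names(sddf[[2]]))  # same columns after lower-casing?
lapply(sddf, all_na_columns)                   # which columns are empty per round?
```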

tst <- import_sddf_country("Spain", 1:2)
#> Downloading ESS1
#> Downloading ESS2
#> Error in df_parse_por_file(spec, encoding = "", user_na = user_na): Failed to parse /tmp/RtmpwGGV1p/ESS_Spain/ESS1/ESS1_ES_SDDF.por: Unable to read from file.

I've checked and the files are downloaded successfully, the problem comes when reading the actual .por files with read_por. Any idea why this happens/has it happened to any of you? I don't have SPSS to test it but this is equivalent to downloading the data manually and opening it in SPSS. If I run the same code as above but for more recent datasets, then it works (try rounds 5:6).

Finally, I've limited the function to only download rounds which are available. That is, the function raises an error (pointing to the specific round which is not available) if any of the rounds you're asking for does not have SDDF files.

tst <- import_sddf_country("Denmark", 1:3)
#> Error in country_url_sddf(country, rounds): Rounds ESS2 don't have SDDF data available for Denmark
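
As an illustration only, a check of this kind could look like the sketch below. It is not the actual `country_url_sddf` code; `check_sddf_rounds` and its `available` argument are hypothetical, and the rounds passed as available in the example are an assumption made for the demonstration.

```r
# Hypothetical re-creation of the availability check (not essurvey internals).
check_sddf_rounds <- function(country, rounds, available) {
  missing_rounds <- setdiff(rounds, available)
  if (length(missing_rounds) > 0) {
    stop("Rounds ", paste0("ESS", missing_rounds, collapse = ", "),
         " don't have SDDF data available for ", country, call. = FALSE)
  }
  invisible(rounds)
}

# Capture the message instead of aborting the script
msg <- tryCatch(
  check_sddf_rounds("Denmark", 1:3, available = c(1, 3, 4, 5, 6)),
  error = function(e) conditionMessage(e)
)
msg
```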

I'm very interested in making this work so that the next step concentrates on building a package for analyzing the ESS. However, it is a bit difficult if we can't get the SDDF files working properly across all rounds/countries. If you feel like it, take it for a spin, use it a bit and let me know what you find wrong/think of it.

briatte commented 5 years ago

Thanks a lot for import_sddf_country, @cimentadaj !

At that stage, we should ask for some comments from @daob, who might know about some of the SDDF inconsistencies above. @ajdamico might also know more.

cimentadaj commented 5 years ago

Thanks @briatte . Also, there's the problem that neither of the downloaded rounds contains a round or wave column. For this particular example, it is not important but whenever we build a 'data validation' function it will surely need to have the round specified to do the merging; we should add it possibly with the same name that is available in the comprehensive data with the same coding (e.g. ESS1 or just 1).

briatte commented 5 years ago

That variable can be inferred from the filenames, right?

cimentadaj commented 5 years ago

Yep, but we should be sure that the column name and coding don't change across rounds.
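
A small sketch of inferring the round from the file name, assuming names that start with `ESS` followed by the round number (as in `ESS1_ES_SDDF.por`). The helper names are hypothetical, and the column name `essround` is an assumption chosen for illustration.

```r
# Hypothetical helper: extract the round number from an SDDF file name.
# Returns NA for names that do not start with "ESS<number>".
round_from_filename <- function(path) {
  file <- basename(path)
  has_round <- grepl("^ESS[0-9]+", file)
  out <- rep(NA_integer_, length(file))
  out[has_round] <- as.integer(sub("^ESS([0-9]+).*$", "\\1", file[has_round]))
  out
}

# Hypothetical helper: add the round as a column (name `essround` is assumed)
add_round <- function(x, path) {
  x$essround <- round_from_filename(path)
  x
}

round_from_filename(c("/tmp/ESS_Spain/ESS1/ESS1_ES_SDDF.por", "ESS6_ES_SDDF.por"))
```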

djhurio commented 5 years ago

Hi, I have done some very small testing using the data for Croatia. As I understand it, the SPSS files are currently the ones downloaded, so it works for me.

However, I have found a problem with the SDDF in SAS format for Croatia. I am not able to read the first 39 rows correctly using haven. Is it the same for you? There is only one SDDF for Croatia, for round 5.

I believe that in the future we should allow users to choose the format (SAS or SPSS) for the SDDFs as well. Actually, I am considering pushing the ESS to also publish the SDDFs in CSV format.

briatte commented 5 years ago

@cimentadaj Working this out for the user would help her a lot, right? Would bring lots of value to the package in my view. I'm still working out those issues (there are many) with the French data.

@djhurio That has happened to me too. In my experience, foreign won't fail in those cases. And in my view, the SPSS files are the best choices right now (CSV would also work since, AFAIK, none of the SDDF variables are labelled).

djhurio commented 5 years ago

I am importing SDDF data in the following way now:

See the full code if needed: https://github.com/djhurio/ESS/blob/master/Rcode/16-data-SDDF.R

The SDDF has been published as an integrated data file for the last two rounds. We could assume this is how the SDDFs will be published in the future. I do not have any other information available. So we should probably create the SDDF import function with this assumption in mind.

My idea is that we need to drop the country argument from the download_sddf and import_sddf functions. I propose to make a function import_sddf with arguments rounds and ess_email. It will download the SDDFs for the specified rounds for all available countries. What do you think?

briatte commented 5 years ago

I have been in touch by email with someone at ESS recently, and can confirm that SDDFs should now get published as all-country, round-level files.

As far as I understand there are four situations to take into account:

  1. User wants 1+ country SDDF file(s), ESS offers that (older rounds)
  2. User wants 1+ country SDDF file(s), ESS offers only rounds (recent rounds)
  3. User wants all SDDF for 1+ round(s), ESS offers only country SDDFs (older rounds)
  4. User wants all SDDF for 1+ round(s), ESS offers that (recent rounds)

My guess is that @cimentadaj's solution (always download SDDF information for the countries the user is interested in, and handle the merge) is the best one, but I did not find the time to help on that yet.

I did, however, get one of the SDDFs for France corrected (it could not be merged before that; there might be other such problematic cases in old editions).

cimentadaj commented 5 years ago

Hmm, I see. @briatte summarized it well. So one experimental way to do this would be to only read in the complete SDDF for each round (the user could then just filter out any country they want specifically, or keep the complete round; what do you think?). For the recent rounds this is already done, and for the older rounds this could mean listing and downloading all the separate country files and then rbind-ing them. Can we be 100% sure that each country SDDF has the same columns, in the same order, @djhurio? If that is the case, then it's doable.
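
To illustrate the "read the complete round, then filter" idea, here is a minimal sketch operating on a round-level SDDF that is already loaded into R. The helper names are hypothetical, and the sketch assumes a `cntry` column holding two-letter codes such as "ES"; how country names map to those codes is left open.

```r
# Hypothetical helpers, not part of essurvey.
# List the country codes present in a round-level SDDF.
sddf_countries <- function(sddf) {
  sort(unique(as.character(sddf$cntry)))
}

# Keep only the requested countries; fail loudly if one is absent.
filter_sddf <- function(sddf, codes) {
  absent <- setdiff(codes, sddf_countries(sddf))
  if (length(absent) > 0) {
    stop("No SDDF records for: ", paste(absent, collapse = ", "), call. = FALSE)
  }
  sddf[sddf$cntry %in% codes, , drop = FALSE]
}

# Toy data standing in for a round-level file
round_sddf <- data.frame(cntry = c("ES", "ES", "SI"), idno = c(1, 2, 1))
sddf_countries(round_sddf)
filter_sddf(round_sddf, "ES")
```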

cimentadaj commented 5 years ago

I'm also worried about mixing foreign and haven dependencies. I'm going to try to make some reproducible example on where the haven problems are coming from and perhaps they can fix it.

briatte commented 5 years ago

Here are the variables in each SDDF file that I downloaded in February 2019:

2002/ESS1_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_FR_SDDF.rds :idno,cntry,psu,samppoin,stratify,prob 
2002/ESS1_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_IL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_LU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2002/ESS1_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_FR_SDDF.rds :idno,cntry,psu,samppoin,stratify,prob 
2004/ESS2_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_LU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2004/ESS2_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_FR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_RU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2006/ESS3_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_FR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_RU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2008/ESS4_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_BE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_BG_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_CH_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_CY_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_EE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_FR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_GR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_HR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_IL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_LT_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_NL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_NO_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_PL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_PT_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_RU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_SE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_SK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2010/ESS5_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_AL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
# this one is weird but has the common variables
2012/ESS6_BE_SDDF.rds :name,edition,proddate,essround,cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_BG_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_CH_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_CY_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_CZ_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_DE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_DK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_EE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_ES_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_FI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_FR_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_GB_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_HU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_IE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_IS_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_IT_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_LT_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_NL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_NO_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_PL_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
# this one is weird but has the common variables
2012/ESS6_PT_SDDF.rds :name,edition,proddate,essround,cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_RU_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_SE_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_SI_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_SK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_UA_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
2012/ESS6_XK_SDDF.rds :cntry,idno,psu,samppoin,stratify,prob 
# files below are round-level
2014/ESS7SDDFe01_1.rds :name,essround,edition,proddate,cntry,idno,psu,domain,stratify,prob 
2016/ESS8SDDFe01_1.rds :name,essround,edition,proddate,cntry,idno,psu,domain,stratum,prob 

I could not test reading the files with haven vs. foreign because I collected the data with @ajdamico's R code, which converts the files on the fly, but I strongly suspect that some files will be problematic with haven ("invalid multibyte whatever"). It's happened to me several times while reading old SPSS files with haven.
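
Given the listing above (different column order in the early French files, extra metadata columns for BE and PT in round 6), stacking country files is safer when columns are selected by name rather than by position. The sketch below is a hypothetical helper using base R only; it assumes the six common variables are present and would stop on the round-level files, which use `domain` and `stratum` instead.

```r
# The six variables shared by the per-country SDDF files listed above
sddf_vars <- c("cntry", "idno", "psu", "samppoin", "stratify", "prob")

# Hypothetical helper: harmonise a list of country SDDFs and stack them.
bind_sddf <- function(files) {
  harmonised <- lapply(files, function(x) {
    names(x) <- tolower(names(x))
    missing_vars <- setdiff(sddf_vars, names(x))
    if (length(missing_vars) > 0) {
      stop("Missing SDDF variable(s): ", paste(missing_vars, collapse = ", "))
    }
    x <- x[sddf_vars]                        # select by name: order and extras no longer matter
    x$stratify <- as.character(x$stratify)   # keep one type across files
    x
  })
  do.call(rbind, harmonised)
}

# Toy files: one with idno first, one with extra metadata columns
fr <- data.frame(idno = 1:2, cntry = "FR", psu = c(10, 11), samppoin = NA,
                 stratify = c("1", "2"), prob = c(0.001, 0.002))
be <- data.frame(name = "ESS6", essround = 6, cntry = "BE", idno = 1, psu = 5,
                 samppoin = NA, stratify = 3, prob = 0.004)
combined <- bind_sddf(list(fr, be))
combined
```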

djhurio commented 5 years ago

It would be great to have those functions available as well:

import_sddf_rounds(rounds, ess_email = NULL)
import_all_sddf_rounds(ess_email = NULL)
download_sddf_rounds(rounds, ess_email = NULL, output_dir = getwd(), format = "stata")

cimentadaj commented 5 years ago

Ok, so good news. I just wrote some code to read all available SDDF files to figure out which errors occur when reading the SPSS files. The error is present only in rounds 1:4, and the error is the same for all countries/rounds:

#> Error in df_parse_por_file(spec, encoding = "", user_na = user_na): Failed to parse /Users/cimentadaj/Downloads/ESS1_SI_SDDF.spss/ESS1_SI_SDDF.por: Unable to read from file.

Luckily for us, this is just a mistake in the file extension! All SPSS files are actually .sav files rather than .por files.

Here's an example using the SDDF file for Slovenia in round 1:


haven::read_por("/Users/cimentadaj/Downloads/ESS1_SI_SDDF.spss/ESS1_SI_SDDF.por")
#> Error in df_parse_por_file(spec, encoding = "", user_na = user_na): Failed to parse /Users/cimentadaj/Downloads/ESS1_SI_SDDF.spss/ESS1_SI_SDDF.por: Unable to read from file.

haven::read_sav("/Users/cimentadaj/Downloads/ESS1_SI_SDDF.spss/ESS1_SI_SDDF.por")
#> # A tibble: 1,519 x 6
#>    cntry  idno   psu samppoin stratify       prob
#>    <chr> <dbl> <dbl>    <dbl> <chr>         <dbl>
#>  1 SI        1   149       NA 04-1     0.00000755
#>  2 SI        2   149       NA 04-1     0.00000755
#>  3 SI        3   149       NA 04-1     0.00000755
#>  4 SI        4   149       NA 04-1     0.00000755
#>  5 SI        5   149       NA 04-1     0.00000755
#>  6 SI        6   149       NA 04-1     0.00000755
#>  7 SI        7    93       NA 07-4     0.00000755
#>  8 SI        8    94       NA 07-4     0.00000755
#>  9 SI        9    94       NA 07-4     0.00000755
#> 10 SI       10    94       NA 07-4     0.00000755
#> # … with 1,509 more rows

I have yet to implement this in the sddf branch but I think that with this fix the SDDF download functions can mimic the import_ and download_ functions for countries and rounds. That is, there could be both import_sddf_country and import_sddf_round together with their corresponding download_ functions just as @djhurio proposes. Of course, I'm not sure whether overriding the file extension of .por to .sav is the wisest idea because at some point a .por file could actually be a .por file rather than a .sav file.

I'm gonna try to communicate this to the ESS directly. Feel free to also push for this through your own channels. Btw, does anyone have access to a copy of SPSS where they can test whether the .por file is actually read as .sav? That way we can argue that the extension is wrong in SPSS as well.

cimentadaj commented 5 years ago

Everyone, I want to get @BernStZi involved in the thread. He has been an expert in the sampling/weighting of the ESS in the past. He suggested being included as he can answer some of the questions on the nature of the SDDF data. I'm going to try to implement the changes I discussed above in the sddf branch soon so we can test it. @BernStZi, I'll prepare some questions regarding aspects of the data that seem a bit weird to me.

BernStZi commented 5 years ago

Hi everyone, I worked for some time for the ESS as an expert on sampling & weighting (mostly). So if you have any questions regarding the nature of the SDDFs or on the sampling designs of ESS countries from rounds 1 - 7, I might be able to help you.

djhurio commented 5 years ago

@cimentadaj, I believe there is no need to have IBM SPSS Statistics to distinguish .sav files from .por files. The file command in Linux is handy for this. For example we can see that all .sav files are in SPSS data format (binary).

djhurio@Skyforger ~/Dropbox/Darbs/ESS9-SWEP-2017-2019/ESS-git/data/SDDF $ file *.sav
ESS5_BE_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_BG_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_CH_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_CY_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_CZ_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_DE_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_DK_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_EE_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_ES_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_FI_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_FR_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_GB_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_GR_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_HR_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_HU_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_IE_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_IL_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_LT_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_NL_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_NO_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_PL_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_PT_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_RU_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_SE_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_SI_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_SK_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS5_UA_SDDF.sav:  SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002
ESS7SDDFe1_1.sav:  SPSS System File TICS 64-bit MS Windows 22.0.0.0         \002
ESS8SDDFe01_1.sav: SPSS System File TICS 64-bit MS Windows 25.0.0.0         \002

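It looks like genuine .por files are plain text files. Most of the files with a .por extension from rounds 1 to 4 are reported as SPSS system files instead: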

djhurio@Skyforger ~/Dropbox/Darbs/ESS9-SWEP-2017-2019/ESS-git/data/SDDF $ file *.por
ESS1_CZ_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1_DE_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1_DK_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1_ES_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1-FI-SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1_FR_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS1_GB_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1_HU_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1_IE_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1_IL_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1_LU_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS1_SI_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_CZ_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_DE_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_ES_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_FI_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_FR_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS2_GB_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_HU_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_IE_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_LU_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_SI_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS2_UA_SDDF.por: SPSS System File MS Windows Release 14.0.1               \002
ESS3_DE_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS3_DK_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS3_ES_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS3_FI_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS3_FR_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS3_GB_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS3_HU_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS3_IE_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS3_RU_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS3_SI_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS3_UA_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_CZ_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_DE_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_DK_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_ES_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_FI_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_FR_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_GB_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_HU_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_RU_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_SI_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS4_UA_SDDF.por: SPSS System File MS Windows Release 15.0.1               \002
ESS6_AL_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_BE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_BG_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_CH_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_CY_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_CZ_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_DE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_DK_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_EE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_ES_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_FI_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_FR_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_GB_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_HU_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_IE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_IS_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_IT_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_LT_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_NL_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_NO_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_PL_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_PT_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_RU_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_SE_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_SI_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_SK_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_UA_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS6_XK_SDDF.por: ISO-8859 text, with CRLF line terminators
ESS7SDDFe1_1.por: ISO-8859 text, with CRLF line terminators

djhurio commented 5 years ago

Now I see why foreign works but haven fails on wrongly named SPSS data files.

haven decides whether to use read_sav or read_por based on the file extension. See the code from haven.R:

  ext <- tolower(tools::file_ext(file))
  switch(ext,
    sav = read_sav(file, user_na = user_na),
    zsav = read_sav(file, user_na = user_na),
    por = read_por(file, user_na = user_na),
    stop("Unknown extension '.", ext, "'", call. = FALSE)
  )

foreign does it in a much smarter way. See the code from spss.c:

    if (0 == strncmp("$FL2", buf, 4)) {
        fclose(fp);
        ans = read_SPSS_SAVE(filename);
    } else {
        if (!is_PORT(fp)) {
            fclose(fp);
            error(_("file '%s' is not in any supported SPSS format"),
                  filename);
        }
        fclose(fp);
        ans = read_SPSS_PORT(filename);
    }
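
The same kind of content-based check can be sketched in plain R. This is only a sketch: the helper name `detect_spss_format` is hypothetical and not part of haven, foreign or essurvey. It tests for the `$FL2` prefix used in the C code above, and it assumes that a genuine portable file carries the "SPSS PORT FILE" banner near its start. Anything else is reported as unknown.

```r
# Hypothetical helper: guess the SPSS file type from the first bytes of the
# file instead of trusting the file extension.
detect_spss_format <- function(path, n = 200L) {
  bytes <- readBin(path, what = "raw", n = n)

  # System (.sav) files start with the magic string "$FL2".
  if (length(bytes) >= 4L && identical(bytes[1:4], charToRaw("$FL2"))) {
    return("sav")
  }

  # rawToChar() fails on embedded NUL bytes, so replace them with spaces
  # before searching the header as text.
  bytes[bytes == as.raw(0)] <- as.raw(0x20)
  header <- rawToChar(bytes)

  # Assumption: portable (.por) files contain this banner in their header.
  if (grepl("SPSS PORT FILE", header, fixed = TRUE, useBytes = TRUE)) {
    return("por")
  }

  "unknown"
}
```

A reader could then dispatch on the returned value, calling haven::read_sav for "sav" and haven::read_por for "por", whatever the extension says.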

briatte commented 5 years ago

Nice hunting, @djhurio

@BernStZi, FWIW, thanks a lot for your work on the ESS, one of the rare European surveys out there with a proper weighting structure! The SDDF files are still a bit of a mess, as the discussion above highlights, but still, this survey is exceptionally high-quality.

cimentadaj commented 5 years ago

I've just pushed the SDDF changes to the master branch of essurvey on GitHub. You can use the functions show_sddf_rounds and import_sddf_cntrounds with devtools::install_github("ropensci/essurvey"). Currently, only rounds 1:6 are supported because the two newest rounds are integrated files; I'm gonna have to write some separate code for these rounds.

Give it a test if you can because I'd really like to have this robust for the next essurvey release.

briatte commented 5 years ago

Excellent, thanks!

I've tested the function using the loop below:

library(essurvey)
library(magrittr)
library(tibble)
library(tidyr)

d <- essurvey::show_countries()
d <- tidyr::crossing(country = d, round = 1:6) %>%
  tibble::add_column(file = list(NULL))

for (j in unique(d$country)[11:38]) {

  for (i in essurvey::show_sddf_rounds(j)) {

    cat(j, "Round", i, "\n")
    x <- try(
      essurvey::import_sddf_country(j, i, "f.briatte@ed.ac.uk"),
      silent = TRUE
    )
    if ("try-error" %in% class(x)) {
      d$file[[ which(d$country == j & d$round == i) ]] <- "error"
    } else {
      d$file[[ which(d$country == j & d$round == i) ]] <- x
    }

  }

}

The tests show that only a few SDDF files are unreadable under the current implementation: those are the SDDFs for France, Rounds 1, 2, 3.

The error that shows up in all cases is:

Invalid time string (length=8): 0-------

Perhaps SDDF files can be more safely read from their SPSS versions? In which case, all that would be needed is switching download_sddf_country to use format = 'spss' rather than 'stata'.

The results of my test, including all SDDF data that I managed to download with the current version of the package, are here:

https://f.briatte.org/temp/ess-sddf-test/

cimentadaj commented 5 years ago

Thanks a lot @briatte! I've spotted the problem and it's related to the .por and .sav problems I outlined above.

For France 1:3 it seems that the .por files are ACTUALLY .por files and not .sav files as described above. See the example below:

library(essurvey)
library(haven)

set_email("cimentadaj@gmail.com")

dwn_dir <- download_sddf_country("France", 1, output_dir = tempdir(),
                                 format = 'spss')

#> All files saved to /tmp/RtmpRkSnJs
all_dirs <- list.files(dwn_dir, full.names = TRUE)

# The file is a `.por` file.
all_dirs
#> [1] "/tmp/RtmpRkSnJs/ESS_France/ESS1/ESS1_FR_SDDF.por"      
#> [2] "/tmp/RtmpRkSnJs/ESS_France/ESS1/ESS1.zip"              
#> [3] "/tmp/RtmpRkSnJs/ESS_France/ESS1/sddf-documentation.pdf"

# Read using `read_sav`
read_sav(all_dirs[1])
#> Invalid time string (length=8): 0-------
#> Error in df_parse_sav_file(spec, encoding, user_na): Failed to parse /tmp/RtmpRkSnJs/ESS_France/ESS1/ESS1_FR_SDDF.por: The file's timestamp string is invalid.

# Read using `read_por`
read_por(all_dirs[1])
#> # A tibble: 1,503 x 6
#>     IDNO CNTRY   PSU SAMPPOIN STRATIFY  PROB
#>    <dbl> <chr> <dbl>    <dbl> <chr>    <dbl>
#>  1 10001 FR      130       NA ""         Inf
#>  2 10002 FR      401       NA ""         Inf
#>  3 10003 FR      108       NA ""         Inf
#>  4 10004 FR      711       NA ""         Inf
#>  5 10005 FR      102       NA ""         Inf
#>  6 10006 FR      404       NA ""         Inf
#>  7 10007 FR      623       NA ""         Inf
#>  8 10008 FR      908       NA ""         Inf
#>  9 10009 FR      909       NA ""         Inf
#> 10 10010 FR      608       NA ""         Inf
#> # … with 1,493 more rows

On the other hand, if I run EXACTLY the same code for Spain, I get a file with a .por extension that is only read successfully with read_sav.

library(essurvey)
library(haven)

set_email("cimentadaj@gmail.com")

dwn_dir <- download_sddf_country("Spain", 1, output_dir = tempdir(),
                                 format = 'spss')
all_dirs <- list.files(dwn_dir, full.names = TRUE)

# The file is a `.por` file.
all_dirs
#> [1] "/tmp/RtmpCpqs02/ESS_Spain/ESS1/ESS1_ES_SDDF.por"      
#> [2] "/tmp/RtmpCpqs02/ESS_Spain/ESS1/ESS1.zip"              
#> [3] "/tmp/RtmpCpqs02/ESS_Spain/ESS1/sddf-documentation.pdf"

# Read using `read_sav`
read_sav(all_dirs[1])
#> # A tibble: 1,729 x 6
#>    cntry  idno       psu samppoin stratify       prob
#>    <chr> <dbl>     <dbl>    <dbl> <chr>         <dbl>
#>  1 ES        2 105903019       NA 211      0.00000789
#>  2 ES        5 105903019       NA 211      0.0000118 
#>  3 ES        8 105903019       NA 211      0.00000789
#>  4 ES       13 200301002       NA 421      0.0000116 
#>  5 ES       15 200301002       NA 421      0.0000116 
#>  6 ES       22 200301002       NA 421      0.00000578
#>  7 ES       23 200301002       NA 421      0.0000231 
#>  8 ES       25 200305002       NA 421      0.00000558
#>  9 ES       27 200305002       NA 421      0.00000743
#> 10 ES       31 200305002       NA 421      0.0000112 
#> # … with 1,719 more rows

# Read using `read_por`
read_por(all_dirs[1])
#> Error in df_parse_por_file(spec, encoding = "", user_na = user_na): Failed to parse /tmp/RtmpCpqs02/ESS_Spain/ESS1/ESS1_ES_SDDF.por: Unable to read from file.

I'm not very happy with the strategy of reading .por files with read_sav, precisely because of this problem. @djhurio or @BernStZi, is there any way we can contact the ESS so they can save the files accordingly? I just want to make sure the .por files are actual .por files and not .sav files. Along the same lines, can someone try reading these files with SPSS? Does it work? Does SPSS know how to read .por files which are really .sav files? @djhurio suggested using foreign, which can read them successfully, but that would be a dependency problem for the package because everything uses haven.

Any ideas?

djhurio commented 5 years ago

Yes, I will inform the ESS Data Archive about this issue.

I thought about creating an alternative to haven::read_spss that would detect POR and SAV files from the file content. Something like this:

read_spss_2 <- function(file, encoding = NULL, user_na = FALSE) {
  # Read the first line of the file and dispatch on its content
  x <- data.table::fread(input = file, sep = "", nrows = 1, header = FALSE)$V1
  if (grepl("SPSS PORT FILE", x, useBytes = TRUE)) {
    haven::read_por(file = file, user_na = user_na)
  } else if (grepl("SPSS DATA FILE", x, useBytes = TRUE)) {
    haven::read_sav(file = file, encoding = encoding, user_na = user_na)
  } else {
    stop("Unknown file format")
  }
}

However, I realised that haven::read_por does not read POR files correctly. See for example:

> as.data.table(haven::read_por("data/SDDF/ESS1_FR_SDDF.por"))
       IDNO CNTRY PSU SAMPPOIN STRATIFY PROB
   1: 10001    FR 130       NA           Inf
   2: 10002    FR 401       NA           Inf
   3: 10003    FR 108       NA           Inf
   4: 10004    FR 711       NA           Inf
   5: 10005    FR 102       NA           Inf
  ---                                       
1499: 11499    FR 709       NA           Inf
1500: 11500    FR 126       NA           Inf
1501: 11501    FR 612       NA           Inf
1502: 11502    FR 813       NA           Inf
1503: 11503    FR 408       NA           Inf
> as.data.table(foreign::read.spss("data/SDDF/ESS1_FR_SDDF.por"))
       IDNO CNTRY PSU SAMPPOIN STRATIFY         PROB
   1: 10001    FR 130       NA          1.651397e-05
   2: 10002    FR 401       NA          8.236820e-07
   3: 10003    FR 108       NA          1.651397e-05
   4: 10004    FR 711       NA          7.193571e-06
   5: 10005    FR 102       NA          5.061940e-07
  ---                                               
1499: 11499    FR 709       NA          4.306910e-07
1500: 11500    FR 126       NA          1.651397e-05
1501: 11501    FR 612       NA          4.427837e-06
1502: 11502    FR 813       NA          1.933014e-06
1503: 11503    FR 408       NA          3.924291e-06

I tried to read the data with PSPP. It can open all of the files, so I assume it detects the file type from the file content.

briatte commented 5 years ago

From both your posts, I think the solution involves the following:

  1. Implement a reader that falls back onto foreign when haven fails.
  2. Ask the ESS whether they could re-upload the .por files as .sav files.
  3. Open an issue about .por files being misread by haven.

Step 1 should be easy enough:

read_sav <- function(x, quiet = TRUE) {
  # Store the result separately so that `x` still holds the file path for the fallback
  out <- try(haven::read_sav(x), silent = TRUE)
  if (inherits(out, "try-error")) {
    if (!quiet) warning("File ", basename(x), " read with `foreign::read.spss` instead of `haven::read_sav`")
    foreign::read.spss(x)
  } else {
    out
  }
}

@djhurio seems to have handled Step 2.

I can handle Step 3 if needed.

Example issues with reading POR files using haven, including one by @cimentadaj that looks very much like the problem we're having here:

djhurio commented 5 years ago

I have contacted the ESS and have received the following answer:

The files were created using the “save outfile”-command instead of the “export outfile”-command in SPSS. When using “save outfile” the file type is actually SAV (SPSS data) as you mentioned below. We will replace the listed “.por”-files with “.sav”-files (unicode) in near future. Please note that we no longer create and support .por files in future data releases.

briatte commented 5 years ago

Cool! One problem off the list.

Also -- I've checked the prob columns in the SDDF data that I posted earlier, and two files are weird:

* Switzerland Round 6 -- all weights equal to 1
* UK Round 4 -- all weights missing

In the RDS file that I posted earlier, those are rows 210 and 226 respectively (check the tibble stored in the file column).
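
This kind of check on a prob column can be scripted. The sketch below is only illustrative: `check_prob` is a hypothetical helper, and the assumption that valid inclusion probabilities lie in (0, 1] is mine rather than something taken from the ESS documentation.

```r
# Hypothetical helper: flag suspicious patterns in a vector of inclusion
# probabilities (the `prob` column of an SDDF file).
check_prob <- function(prob) {
  finite <- prob[is.finite(prob)]
  c(
    all_missing  = all(is.na(prob)),                      # e.g. no usable values at all
    any_infinite = any(is.infinite(prob)),                # e.g. values misread as Inf
    all_equal_1  = length(finite) > 0 && all(finite == 1),
    out_of_range = any(finite <= 0 | finite > 1)          # assumption: valid range is (0, 1]
  )
}
```

Calling it on the prob column of each imported file returns four named flags per file, which makes files with missing, constant or infinite probabilities easy to spot.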

cimentadaj commented 5 years ago

This is great @djhurio, thanks!

Regarding @briatte's and @djhurio's suggestion to fall back on foreign, I'm a bit concerned about the output. Since haven outputs tibbles, I wouldn't want to break that consistency. We could simply wrap the foreign call with tibble and test it (also taking care of the column classes that haven assigns, since it stores the labelled columns from Stata in a custom labelled class in R). If you think it's a good idea, I'll build on your previous examples to implement this fallback mechanism.

@briatte I was also thinking of opening an issue with haven using @djhurio's example, because it seems haven is not reading these files as correctly as foreign does. Would you have some time or should I go for it?
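
One way to keep the output consistent is to run both readers through the same post-processing step. The sketch below is generic and only illustrates the pattern: `read_with_fallback` and its arguments are hypothetical names, and the readers are passed in as functions so the logic can be tested without any SPSS files.

```r
# Hypothetical helper: try a primary reader, fall back to a second reader on
# error, and apply the same post-processing to whichever result comes back.
read_with_fallback <- function(path, primary, fallback, post = identity) {
  out <- tryCatch(
    primary(path),
    error = function(e) {
      warning("Primary reader failed for ", basename(path),
              "; using fallback reader instead.", call. = FALSE)
      fallback(path)
    }
  )
  # Same post-processing for both readers, so the output type is consistent.
  post(out)
}
```

In the package this could be wired up with primary = haven::read_sav, fallback = function(p) foreign::read.spss(p, to.data.frame = TRUE) and post = tibble::as_tibble. That wiring is an assumption on my side and does not restore haven's labelled classes.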

BernStZi commented 5 years ago

@briatte I did some digging

Also -- I've checked the prob columns in the SDDF data that I posted earlier, and two files are weird:

* Switzerland Round 6 -- all weights equal to 1

Switzerland had, or at least that is how they explained it, a single-stage equal-probability design. That is probably the reason why they simply decided to set the prob column to one, which equals the (scaled) design weight for such a design.

* UK Round 4 -- all weights missing

I don't know why they didn't include the prob column here. The UK didn't have an equal-probability design, but the design weights must be included in the main data set.

Also, I noticed that the design weights in the main data set can differ, for some countries, from those that you get if you compute them from the SDDF files. In particular the SDDFs of early rounds (1 - 4) can have some quality issues.

briatte commented 5 years ago

@cimentadaj

  1. Concerning your first point, yes, I do believe a foreign fallback would be useful. Although technically, the fallback should be offered by… haven itself, since the issue is related to haven (or ReadStat), not to essurvey.

  2. I'll open an issue on the haven repo, submitting the file as an example.

@BernStZi

Thanks a lot for digging. I understand that Switzerland Round 6 is a non-issue, while UK Round 4 is a weird issue, since indeed, the main data file ESS4GB has a valid dweight column.

I am more concerned about your second point, since you are basically saying that the best strategy is to use dweight (from the main data files) rather than the SDDF files.

Does that mean that, instead of using, e.g.

ess4gb_design <- svydesign(
  ids = ~ psu,
  strata = ~ stratify,
  probs = ~ prob,
  data = ess4gb,
  nest = TRUE
)

… the user should rather use:

ess4gb_design <- svydesign(
  ids = ~ idno,
  weights = ~ dweight,
  data = ess4gb
)

… since that column carries the same information, and possibly better information?

Again, many thanks for your help. The ESS is such a high-quality dataset that I'm eager to learn exactly what is the best method to weigh it.

djhurio commented 5 years ago

I have seen inconsistency between sampling probabilities (from the SDDFs) and design weights (from survey data). I believe in most of those cases trimming of design weights has been applied.

[Image: plot comparing dweight and dweight2]

dweight is taken from the survey data and dweight2 is computed from the prob.

BernStZi commented 5 years ago

@briatte

Yes, my general recommendation is to use dweight from the main data set as design weights. My impression was that, at least for the early rounds, dweight was constructed with more care and might not originate solely from the prob variable in the published SDDFs.
Also, prob contains the inclusion probabilities of the gross sample, i.e. formally they should be corrected for the loss in sample size from gross to net sample. Using dweight implicitly assumes a uniform response process because of the scaling to the net sample size, i.e. you correct with a constant factor.

For a two-stage stratified design you could use:

ess4gb_design <- svydesign(
  ids = ~ psu + idno,
  strata = ~ stratify,
  weights = ~ dweight,
  nest = TRUE,
  data = ess4gb
)

However, the survey package will treat the above design as a single-stage cluster sample, ignoring any sampling beyond the first stage (i.e. instead of ids = ~ psu+idno you can simply use ids = ~ psu). If you want to consider all sampling stages for variance estimation you would need to use the fpc argument and supply it with the inclusion probabilities of the sampling units at each sampling stage, e.g. fpc = ~ prob1 + prob2, where prob1 are the inclusion probabilities of the PSUs and prob2 those of the secondary sampling units (SSUs). The information on the inclusion probabilities at all sampling stages is collected by the ESS but, mostly for reasons of disclosure control, is not included in the SDDFs that are released to the public.

The real challenge in analysing ESS data is not so much about the correct use of weights but considering the complex sampling design when doing variance estimation. The SDDF is, with its information on clustering and stratification, mainly necessary for SE or variance estimation. For correct point estimation you wouldn't need it.

@djhurio

Your plot is very interesting. I also think that most differences are due to the trimming procedure, which would explain the observed patterns for most rounds and countries. But there are also the cases of Israel Round 1, France Round 1 and UK Rounds 2 - 3, which I think cannot be attributed solely to the trimming. In particular, France Round 1 looks very weird. I think there were fewer quality controls in place back then, and I cannot say for sure whether the prob variable in the public SDDFs of these early rounds is the source for dweight in the main data file.

I still have summaries of the sampling designs for ESS Round 1 - 7 that hold the necessary information to specify svydesign objects with regard to sampling stages, stratification, sampling domains, etc. Maybe the ESS can even publish these summaries, which were compiled from the Sampling Sign-off Forms back to Round 1. I found these summaries to be very useful, as they reduced the need to dig through dozens of different documents to find the necessary information, especially if you are using data from multiple rounds and countries. (You would hope that you could deduce the sampling design from the structure of the SDDF alone, but this is not always the case.)

djhurio commented 5 years ago

@BernStZi, thanks a lot for this explanation! However, I do not fully agree with you regarding the design weights. What I have learned and have always assumed is that design weights are derived directly from the sampling probabilities. Namely, dweight should be equal to 1 / prob. The name of those weights indicates that they are derived purely from the sample design.

I agree that design weights cannot be applied in case of non-response. Well, you can, but better results can be gained by applying so-called non-response corrections to the weights. This is a usual practice. However, those corrected weights cannot be called design weights, as they are derived using extra information beyond what the sampling design provides.
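
The relationship dweight = 1 / prob can be written down as a short sketch. The function name `design_weight` is hypothetical, and rescaling the weights to a mean of 1 is an assumption about how the published dweight is normalised, not something confirmed in this thread. The prob values are the rounded ones printed in the Spain round 1 output above.

```r
# Hypothetical helper: derive design weights from inclusion probabilities.
design_weight <- function(prob, scale = TRUE) {
  # Missing, infinite or non-positive probabilities give a missing weight.
  w <- ifelse(is.finite(prob) & prob > 0, 1 / prob, NA_real_)
  if (scale) {
    # Assumption: weights are rescaled so that they average to 1.
    w <- w / mean(w, na.rm = TRUE)
  }
  w
}

# Rounded prob values as printed for Spain, round 1.
prob <- c(0.00000789, 0.0000118, 0.00000789, 0.0000116)
dweight2 <- design_weight(prob)
```

Comparing dweight2 with the dweight column of the main data file, unit by unit, would then show where the two diverge, for example because of trimming.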

cimentadaj commented 5 years ago

Everyone, these are some very interesting comments. I'm very happy that we're discussing this. As I told @BernStZi (and @djhurio at the beginning of this thread), the idea is that once we get these SDDF files ready, the next step is to explore developing a package to analyze the ESS data. Essentially, this means compiling all survey design information for each round/country and building some API so that essurvey can download the SDDF, merge it and automatically create a survey object based on the survey information we have for that round/country.

Keep it up and let's see if we can finish up the SDDF function soon.
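
The merge step of that idea could look roughly like the sketch below. It is only an illustration: `merge_sddf` is a hypothetical name, the join keys cntry and idno are assumed to identify respondents in both files, and the toy data frames are made up for the example (only the psu, stratify and prob values echo the Spain output above).

```r
# Hypothetical helper: attach SDDF design variables to the main survey data.
merge_sddf <- function(main, sddf, keys = c("cntry", "idno")) {
  # Column-name case differs between files, so normalise it first.
  names(main) <- tolower(names(main))
  names(sddf) <- tolower(names(sddf))
  stopifnot(all(keys %in% names(main)), all(keys %in% names(sddf)))
  # Left join: keep every respondent from the main file.
  merge(main, sddf, by = keys, all.x = TRUE)
}

# Toy data, made up for illustration only.
main <- data.frame(cntry = "ES", idno = c(2, 5, 8), dweight = c(1.1, 0.7, 1.1))
sddf <- data.frame(
  CNTRY = "ES", IDNO = c(2, 5, 8, 13),
  PSU = c(105903019, 105903019, 105903019, 200301002),
  STRATIFY = c("211", "211", "211", "421"),
  PROB = c(0.00000789, 0.0000118, 0.00000789, 0.0000116)
)
merged <- merge_sddf(main, sddf)
```

The merged data frame has the columns needed by the svydesign calls discussed above (psu, stratify, dweight), so it could be passed to them as the data argument.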

briatte commented 5 years ago

@BernStZi -- Thank you very much for your detailed answer, which is extremely useful since, as you note, this information is not easily deducible from the ESS documentation:

I still have summaries of the sampling designs for ESS Round 1 - 7 that hold the necessary information to specify svydesign objects with regard to sampling stages, stratification, sampling domains, etc. Maybe the ESS can even publish these summaries, which were compiled from the Sampling Sign-off Forms back to Round 1.

I think we can suggest that to the ESS team when we email with the list of things that we want to tell them, based on our discussion (weird differences between design weights and SDDF files in Israel Round 1, France Round 1 and UK Rounds 2-3, missing prob in UK Round 4).

@cimentadaj -- The information that shows up in this discussion could be summarised in many ways:

P.S. I've reported the issue identified by @djhurio in an earlier comment -- it's a ReadStat issue, so I reported it there. I also posted it on the haven repo for reference. See cross-refs below.

briatte commented 5 years ago

I'm happy to report that the issue in ReadStat reported above is now fixed in the dev branch, which means that the next version of ReadStat/haven should solve the problem.

Thanks a lot to @evanmiller.

cimentadaj commented 5 years ago

Awesome, thanks @briatte! I will be busy until the end of June, I'll get back to this then. So just to organize ourselves here's the TODO list I've summarized (keep adding if I forgot anything):

When that's done, then we could potentially read all SDDF files from rounds 1:6. Additionally:

I also think we could harmonize the SDDF column names to lower case because the case differs between countries/rounds.
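
A minimal sketch of that harmonisation, assuming the imported SDDF files are held in a list of data frames. The name `harmonise_sddf` is hypothetical, and keeping only the columns shared by all files is my own simplification.

```r
# Hypothetical helper: lower-case the column names of each SDDF data frame,
# keep the columns they all share, and stack them into one data frame.
harmonise_sddf <- function(sddf_list) {
  lowered <- lapply(sddf_list, function(x) {
    x <- as.data.frame(x)
    names(x) <- tolower(names(x))
    x
  })
  # Columns present in every file, in the order of the first file.
  common <- Reduce(intersect, lapply(lowered, names))
  do.call(rbind, lapply(lowered, function(x) x[, common, drop = FALSE]))
}
```

Files with extra columns (such as name, edition and proddate) are reduced to the shared set. Files that use a different name for the same variable, such as stratum instead of stratify, would need an explicit rename first.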

briatte commented 5 years ago

I've just submitted a PR that does nothing other than fall back onto foreign when haven fails to read a file. I have not updated the package documentation, but the code will issue a warning to the user if that happens.

I've tested the code on all SDDF files currently available, using the loop below: everything gets read correctly. The only problematic files are France 1, 2 and 3.

library(essurvey) # remotes::install_github('briatte/essurvey')
dir.create("ess-sddf", showWarnings = FALSE)

for (i in show_countries()) {
  for (j in show_sddf_rounds(i)) {
    cat(i, j, "\n")
    f <- paste0("ess-sddf/", i, j, ".rds")
    if (!file.exists(f)) {
      x <- import_sddf_country(i, j, format = "spss")
      readr::write_rds(x, f)
    }
  }
}

cimentadaj commented 5 years ago

Hi everyone, I merged the changes introduced by @briatte on having a fall-back mechanism for reading ESS data with foreign. This means that we can read any SDDF data from rounds 1:6. This is a great addition! Thanks @briatte. I will try to push forward reading rounds 7:8 so that we can finish this in the next month or so. Feel free to test the new work by installing the latest essurvey with devtools::install_github("ropensci/essurvey") and downloading your SDDF data with essurvey::import_sddf_country. Let us know of any problems you might find.