ropensci / essurvey

Download data from the European Social Survey
https://docs.ropensci.org/essurvey

New Maintainer Wanted :-) #57

Open maelle opened 1 year ago

maelle commented 1 year ago

Or new maintainer team. :smile_cat:

Because of #56, a whole overhaul of the package is needed.

If you're interested, please comment in the issue. For more info, see

maelle commented 1 year ago

@ropensci/admin

etiennebacher commented 1 year ago

Hello, I can't say that I will have the time or knowledge to be a maintainer, but I have used essurvey in the past so I'd like to contribute back to it.

However, before thinking about whether I should be a maintainer, I'm not even sure it's technically possible anymore to write a package that automatically downloads the ESS datasets. First, there's no API. Second, from what I can see in the current code, the old way was "easy" in the sense that each dataset had its own fixed download URL. That no longer seems to be the case: all "Download" buttons now lead to the same URL: https://ess-search.nsd.no/en/download

This means that it is not possible to distinguish one dataset from the other based on the URL. Maybe we could perform the POST request to trigger the download ourselves using httr/httr2, but it seems that the parameters passed in the POST request are random and change at each download:

[image: screenshot of the POST request parameters]

There is also a GET request made but the parameters are also random.

Therefore, it seems to me that the only way to bring essurvey back to life is to contact the organization that manages the data directly and ask whether they plan to build an API, or at least to make it possible to download the datasets programmatically. But maybe I'm wrong and there's a way to do it. If someone finds a solution and doesn't have time to implement it, I could try to integrate it into the package.

maelle commented 1 year ago

Wow, thanks a lot @etiennebacher for the digging! So yes, I suppose a new maintainer would need to contact the data provider first.

cimentadaj commented 1 year ago

@djhurio can help as he has contact with the ESS organizing team. @etiennebacher, you're right as we're a bit lost on how the data is now being downloaded. I haven't looked into how it works now but you're on the right track.

gorcha commented 1 year ago

Hi all,

@etiennebacher it looks like programmatically accessing metadata and downloading data can be done on the new data portal using a GraphQL API, but data downloads need a few steps. There are API docs here but they're not particularly illuminating; I've just been following the request flow in dev tools to see what it does. It looks a bit fiddly but very doable.

It also needs an OAuth2 authentication flow, which is a bit of a pain. I can't see any option to register an OAuth2 client app through the ESS data portal so that would probably need help from the ESS organising team.

I'm a bit time-strapped at the moment but would be more than happy to help get something up and running!

maelle commented 1 year ago

Thanks @gorcha!! Should I go ahead and give you write access to this repository? (if you decide to become the maintainer, you'd get admin access)

etiennebacher commented 1 year ago

@gorcha I see the GraphQL queries, but I don't know how to reproduce them. Some elements look random, like the datafileId. As you said, the NSD API docs are not helpful.

For reference, here are the steps I follow:

Here's the first GraphQL query for me:

[image: screenshot of the first GraphQL query]

And the second:

[image: screenshot of the second GraphQL query]

I have more or less the same requests when I use the "Data Wizard". The only difference is that there are more arguments because we specify which variables/years/countries we want. I don't know how to mimic this from R. If you have an example, I'd be curious to see how you do it.

gorcha commented 1 year ago

Hi @maelle - sure! I won't have a chance to look at it for a few weeks though.

Hi @etiennebacher, the datafile IDs are retrieved as part of some earlier GraphQL queries. In addition to watching the requests I've been prodding the JavaScript a bit to figure out some of the details.

Here's a dump of the process flow for both the Data Portal and Data Wizard from a bit of poking around, with the GraphQL operation name and arguments for each step. One note - instance and agencyId are used a lot, and from what I can see they are always "PUBLISHED" and "INT_ESSERIC" respectively.

I haven't had a go at replicating it in R yet, but the GraphQL calls are all just JSON, so there'll be some parsing shenanigans and some dealing with HTTP responses. The trickiest part will probably be OAuth.
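To illustrate what "just JSON" means here, the following is a minimal sketch (in Python, purely for illustration) of how one GraphQL call could be packaged as an HTTP POST. It follows the usual GraphQL-over-HTTP convention of a JSON body with `query` and `variables` keys. The endpoint URL, the `build_graphql_request` helper, and the use of an OAuth2 bearer token in the `Authorization` header are all assumptions; the thread does not confirm any of them for the ESS portal. The query string is the polling query quoted further down in this comment.

```python
import json
import urllib.request

# Hypothetical endpoint: the thread does not give the real GraphQL URL.
GRAPHQL_ENDPOINT = "https://example.org/graphql"

# Polling query as quoted later in this comment.
POLL_QUERY = (
    "query ($id: ID!) { ESS { pollWizardDownload(id: $id) "
    "{ isFinished url __typename } __typename } }"
)


def build_graphql_request(query, variables, access_token, endpoint=GRAPHQL_ENDPOINT):
    """Build (but do not send) a GraphQL-over-HTTP POST request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumption: the OAuth2 flow yields a bearer token.
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )


# Placeholder ID and token, for illustration only.
request = build_graphql_request(POLL_QUERY, {"id": "some-wizard-download-id"}, "dummy-token")
print(request.get_method(), json.loads(request.data)["variables"])
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, since it would only work once the OAuth2 client registration question is resolved.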

Data Portal

Study metadata

graphql: query studyMetadata($id: ID!, $instance: Instance!, $agencyId: Agency!)

Retrieve metadata for the study (e.g. ESS 2010), which is used to populate the page (the study ID is in the page URL). The set of datafiles for the study are part of the returned object, including the datafile ID, description etc.

Get download URL

graphql: query download($datafileId: ID!, $version: Int!, $agencyId: String!, $format: DownloadFormat!, $instance: String!)

Get the download URL. This uses the datafileId, version, agencyId and instance values returned from studyMetadata. format is "CSV", "SAV" etc.

Download

https://stessdissprodwe.blob.core.windows.net/data/download/2/download_dataset_*

Download the dataset from the URL returned by the download query.

Register the download

graphql: mutation registerSingleDownload($datafileUrn: String!, $datafileTitle: String!, $studyUrn: String!, $format: DownloadFormat!, $orderedBy: ID!)

This registers downloads so they show up in the "previous downloads" section (not sure if there is any other purpose for this). datafileUrn and studyUrn are created from metadata using the datafile/study ID, version and agency (agency is always int.esseric):

'urn:ddi:int.esseric:' + t.id + ':' + t.version

format is CSV, SAV, etc., and orderedBy is the ID of the current user (which is retrieved via a separate couple of graphql queries).
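The URN construction above translates directly to other languages. As a small sketch in Python (the function name `make_urn` is made up for illustration, and the datafile ID and version are borrowed from the wizard example payload further down):

```python
AGENCY = "int.esseric"  # per the thread, the agency is always int.esseric


def make_urn(item_id, version, agency=AGENCY):
    """Mirror the expression 'urn:ddi:int.esseric:' + t.id + ':' + t.version."""
    return f"urn:ddi:{agency}:{item_id}:{version}"


print(make_urn("ffc43f48-e15a-4a1c-8813-47eda377c355", 73))
# urn:ddi:int.esseric:ffc43f48-e15a-4a1c-8813-47eda377c355:73
```

The same helper would serve for both datafileUrn and studyUrn, since the two differ only in which ID and version are passed in.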

Data Wizard

Get variable metadata

graphql: query topLevelConceptualVariablesGroups($id: ID!, $version: Int)

graphql: query variableMetadata($id: ID!, $version: Int, $instance: Instance!, $agencyId: Agency!)

graphql: query codeComparisonTable($input: CodeComparisonInput!)

Variable metadata (high level conceptual groups, variable details) are retrieved for the entire ESS series. This works similarly to the study metadata/data files - the top level conceptual variable groups are accessed from the series metadata (using the series ID). The returned object contains variable details (ID, name, etc.), and further variable info is accessed through additional graphql queries.

Get country metadata

graphql: query countryCoverageTable($id: ID!, $version: Int)

Possible country combinations are returned as a "countryCoverageTable" object (based on the series ID) that contains datafile IDs, country IDs, and a flag telling us whether the country exists for each datafile.

Make the wizard generate the file

graphql: query ($input: WizardDownloadInput!)

This asks the wizard to create the file. The WizardDownloadInput tells it what datafiles, countries and variables to include and ties it to the current user with orderedBy. For example:

{
  "variables": {
    "input": {
      "agencyId": "INT_ESSERIC",
      "datafiles": [
        {
          "countries": null,
          "id": "ffc43f48-e15a-4a1c-8813-47eda377c355",
          "version": 73
        },
        {
          "countries": [
            "AT"
          ],
          "id": "b2b0bf39-176b-4eca-8d26-3c05ea83d2cb",
          "version": 248
        }
      ],
      "format": "SAV",
      "instance": "PUBLISHED",
      "variables": [
        "netuse",
        "netusoft",
        "netustm",
        "nwspol",
        "nwsppol",
        "nwsptot",
        "pplfair",
        "pplhlp",
        "ppltrst",
        "rdpol",
        "rdtot",
        "tvpol",
        "tvtot"
      ]
    }
  }
}

The response gives the ID for the wizard download to be used in the following steps.
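A payload like the one above could be assembled programmatically along these lines. This is only a sketch: `build_wizard_input` is a hypothetical helper, the key names are copied from the example payload, and orderedBy is left out because the example does not show where it goes. What a `null` countries value means (presumably "no country filter") is a guess and is not stated in the thread.

```python
import json


def build_wizard_input(datafiles, variables, file_format="SAV"):
    """Assemble the variables object for the wizard download query.

    `datafiles` is a list of (datafile_id, version, countries) tuples, where
    countries is a list of country codes or None, as in the example payload.
    """
    return {
        "variables": {
            "input": {
                "agencyId": "INT_ESSERIC",  # always this value, per the thread
                "datafiles": [
                    {"countries": countries, "id": datafile_id, "version": version}
                    for datafile_id, version, countries in datafiles
                ],
                "format": file_format,
                "instance": "PUBLISHED",  # always this value, per the thread
                "variables": list(variables),
            }
        }
    }


payload = build_wizard_input(
    [
        ("ffc43f48-e15a-4a1c-8813-47eda377c355", 73, None),
        ("b2b0bf39-176b-4eca-8d26-3c05ea83d2cb", 248, ["AT"]),
    ],
    ["netuse", "ppltrst"],
)
print(json.dumps(payload, indent=2))
```

The datafile IDs and versions would come from the earlier metadata queries rather than being hard-coded as they are here.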

Register the download

graphql: mutation registerWizardDownload($datafiles: [WizardDataFileDownloadInput!]!, $variables: [String!]!, $orderedBy: ID!, $format: DownloadFormat!)

Similar to the Data Portal download registration, but I can't see this being used in the UI anywhere (there are no previous downloads displayed for the data wizard).

Poll the wizard download

While the file is being generated/prepared it is repeatedly polled to see if it's ready yet. This returns an isFinished indicator and the URL for the download.

graphql: query ($id: ID!) { ESS { pollWizardDownload(id: $id) { isFinished url __typename } __typename } }

Download

https://stessdissprodwe.blob.core.windows.net/data/download/2/start_generate_datafile_job_*

Once pollWizardDownload returns isFinished: true, the data is downloaded from the URL.
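The poll-then-download step could look roughly like the sketch below. The `send` argument is a caller-supplied function (hypothetical, since the authenticated transport is still an open question) that performs the GraphQL request and returns the decoded JSON. The response shape is assumed to be the standard GraphQL `data` envelope wrapping the fields selected in the query. The polling interval and attempt limit are arbitrary choices.

```python
import time

POLL_QUERY = (
    "query ($id: ID!) { ESS { pollWizardDownload(id: $id) "
    "{ isFinished url __typename } __typename } }"
)


def wait_for_wizard_download(send, download_id, interval=2.0, max_attempts=30,
                             sleep=time.sleep):
    """Poll until isFinished is true, then return the download URL."""
    for _ in range(max_attempts):
        response = send(POLL_QUERY, {"id": download_id})
        # Assumed response shape: {"data": {"ESS": {"pollWizardDownload": {...}}}}
        status = response["data"]["ESS"]["pollWizardDownload"]
        if status["isFinished"]:
            return status["url"]
        sleep(interval)
    raise TimeoutError(f"wizard download {download_id} not ready after {max_attempts} polls")


# Usage with a fake transport that reports "not ready" once, then "ready".
replies = iter([
    {"data": {"ESS": {"pollWizardDownload": {"isFinished": False, "url": None}}}},
    {"data": {"ESS": {"pollWizardDownload": {"isFinished": True,
                                             "url": "https://example.org/file.zip"}}}},
])
url = wait_for_wizard_download(lambda query, variables: next(replies), "some-id",
                               sleep=lambda seconds: None)
print(url)
```

Passing `send` and `sleep` in as arguments keeps the loop testable without network access or real waiting.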

etiennebacher commented 1 year ago

Thanks a lot for all these explanations! I don't have the skills to make this work; I've never used GraphQL before so I've no idea where to start. I'll try to help in other ways.

maelle commented 1 year ago

@gorcha I've now invited you to the ropensci organization and to a team with write access to this repository. Note that you'll need to enable 2FA for your GitHub account if that's not already the case, see https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication + https://ropensci.org/blog/2022/05/16/requiring-2fa-for-the-ropensci-github-organization/ for context

@etiennebacher if you start contributing (thanks already for the convo here!!) please ping me so I might grant you access too.

gorcha commented 1 year ago

Thanks @maelle!

No worries @etiennebacher, any help with testing and documentation would be super helpful 🙂 I'll ping you once I get started if there's anything specific. Thanks!

LukasWallrich commented 1 year ago

@gorcha Just to say that I'd be happy to help as well. I've loved using this package in both teaching and research and would be glad to see it brought back to life :)

LukasWallrich commented 1 year ago

Just to throw that out there: I now saw that the ESS data is CC BY-NC-SA 4.0 ... and not very large (10 waves of apparently only ~15 MBs). So an alternative to OAuth and API issues might just be to host the data in a separate GitHub repo, accessed by the package?

gorcha commented 1 year ago

Hey @LukasWallrich,

Great idea, I hadn't even thought of that! Will check it out :)

djhurio commented 1 year ago

Dear all, thank you for your involvement in this issue with the ESS data. Regarding the idea by @LukasWallrich, please note that the ESS data for each round is published in several releases and versions (to my knowledge, data editions are numbered according to a release.version scheme). Usually there are two to three data releases for each round, where each new release contains data from more countries than the previous one. Each release can have several versions because of corrections made to the data. For example, there were three data releases for the round 9 data, and currently only the first release of the round 10 data has been made available (more releases are in the pipeline).

   essround edition   proddate     N
1:      R07     2.2 2018-12-01 40185
2:      R08     2.2 2020-12-10 44387
3:      R09     3.1 2021-02-17 49519
4:      R10     1.2 2022-06-28 18060

If you host the data on a separate repo then there should be a process to monitor if new data edition (release.version) for each round has been published by the ESS.
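The comparison part of such a monitoring process might look like the sketch below. The function names are made up for illustration, the hosted editions are taken from the table above, and the "published" R10 edition is a hypothetical future release. How the currently published editions would be obtained from the ESS is not addressed here.

```python
def parse_edition(edition):
    """Turn a 'release.version' string such as '3.1' into a comparable tuple (3, 1)."""
    release, version = edition.split(".")
    return int(release), int(version)


def rounds_needing_update(hosted, published):
    """Return rounds whose published edition is newer than, or missing from, the hosted copy."""
    return sorted(
        rnd
        for rnd, edition in published.items()
        if rnd not in hosted or parse_edition(edition) > parse_edition(hosted[rnd])
    )


hosted = {"R07": "2.2", "R08": "2.2", "R09": "3.1", "R10": "1.2"}
published = dict(hosted, R10="2.0")  # hypothetical future release, for illustration only
print(rounds_needing_update(hosted, published))  # ['R10']
```

Comparing editions as integer tuples rather than as strings or floats keeps an edition like 1.10 ordered after 1.9.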

etiennebacher commented 1 year ago

Hello, any news about this?

@gorcha did you get a chance to try making the GraphQL requests from R? Or is the plan to use a separate repo to host the data and keep it updated with future releases?

FYI, I contacted the organization that manages the ESS data. They said that providing clear API documentation is something they want to do, but they didn't give me a timeline for this.

Berzils commented 9 months ago

Dear colleagues,

Wishing you a Merry Christmas, I would like to share two comments. Firstly, it is evident that the ESS managers consider the package a low priority. Despite promises of an API, more than six months have passed without any developments. I propose two solutions:

  1. Retrieve all available data from the portal, including versions, corrections, etc., and place them in a repository accessible to us. This aligns with the solution suggested by @LukasWallrich and supported by @djhurio (Sveiki, Mārtiņ!), who stated, "(...) there should be a process to monitor if new data edition (release.version) for each round has been published by the ESS." I'm for this, but unfortunately I don't know how to do that. Otherwise, I would have done it already.

  2. Download the entire ESS to your computer. This is what I have done, and the file size is approximately 200 MB when zipped.

Best regards,

Jānis