outbreak-info / outbreak.info-resources

A curated repository of metadata of resources on COVID-19 and SARS-CoV-2
MIT License

[DATASET, etc.] Create Dataverse parser #93

Closed flaneuse closed 4 years ago

flaneuse commented 4 years ago

More specific version of #10. Basic goal: find all COVID-19 / SARS-CoV-2 datasets in Dataverse, map them to our Dataset schema, and integrate them into api.outbreak.info/resources.

Write parser.py

  1. Get all datasets/files related to COVID-19 or SARS-CoV-2 using the Dataverse Search API.

     Note: you'll probably also need to traverse each of the 9 individual Dataverses (like Harvard, China Data Lab) to get their datasets/files.

  2. Export the metadata in Schema.org JSON-LD format. Alternatively, grab the data by crawling each dataset URL and extracting the <script type="application/ld+json"> tag, which you can view in Google's Structured Data Testing Tool. (See the sketch after this list.)

     Note: I only briefly skimmed the API guide, so I'm not certain whether you can grab this directly from the search API call, or whether you'll need one search call to get all the COVID-related IDs and then a second API call to get the metadata.

  3. Coerce each dataset to fit our Dataset schema. Mainly, this means double-checking that the cardinality is correct. If any of these schema.org types appear, they should be mapped:

     | schema.org       | outbreak.info |
     | ---------------- | ------------- |
     | ScholarlyArticle | Publication   |
     | Book             | Publication   |
  4. Add a tag to indicate the provenance:

     curatedBy: {
       '@type': 'Organization',
       'identifier': 'dataverse',
       'url': 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VO7SUO', // link to the original metadata page on Dataverse
       'name': 'Harvard Dataverse', // or whichever dataverse the record came from
       'curationDate': <date run, in YYYY-MM-DD>
     }
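
A minimal sketch of steps 1–2, assuming the Harvard Dataverse Search API (/api/search) and metadata-export API (/api/datasets/export?exporter=schema.org); the query string and page size are placeholders, not final choices:

```python
import requests

BASE = "https://dataverse.harvard.edu"
QUERY = "covid-19 OR sars-cov-2"  # placeholder search term


def search_datasets(query=QUERY, per_page=100):
    """Page through the Dataverse Search API and yield dataset hits."""
    start = 0
    while True:
        resp = requests.get(
            f"{BASE}/api/search",
            params={"q": query, "type": "dataset", "per_page": per_page, "start": start},
        )
        resp.raise_for_status()
        data = resp.json()["data"]
        for item in data["items"]:
            yield item
        start += per_page
        if start >= data["total_count"]:
            break


def export_schema_org(global_id):
    """Fetch the Schema.org JSON-LD export for one dataset (one request per DOI)."""
    resp = requests.get(
        f"{BASE}/api/datasets/export",
        params={"exporter": "schema.org", "persistentId": global_id},
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    for hit in search_datasets():
        doi = hit.get("global_id")
        if doi:
            metadata = export_schema_org(doi)
            print(metadata.get("name"))
```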

Write upload.py and dump.py to control how the data are uploaded.
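
Not prescribed above, but for orientation, here's a rough sketch of what upload.py might look like, assuming the BioThings SDK's BaseSourceUploader interface and a hypothetical load_annotations() entry point in parser.py (dump.py would similarly subclass one of the SDK's dumper classes):

```python
import biothings.hub.dataload.uploader as uploader

# hypothetical parser entry point; parser.py is assumed to yield one
# schema-coerced Dataset document (a dict) at a time
from .parser import load_annotations


class DataverseUploader(uploader.BaseSourceUploader):

    name = "dataverse"

    def load_data(self, data_folder):
        # hand each parsed Dataset document to the BioThings hub for indexing
        for doc in load_annotations(data_folder):
            yield doc
```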

Related examples:

Test deployment in the BioThings Hub; deploy on api.outbreak.info

@marcodarko / @newgene can help once the parser.py, upload.py, dump.py are written

juliamullen commented 4 years ago

Quick note on scraping: traversing each individual dataverse seems unnecessary. Searching simply on the terms "covid-19+sars-cov-2" returns all 91 results, but searching for dataverses on those terms and then searching within those dataverses returns fewer results (it would appear that some datasets matching "covid-19+sars-cov-2" belong to dataverses that don't themselves match "covid-19+sars-cov-2").

Unless there are additional Dataverse servers beyond dataverse.harvard.edu, I think this one endpoint, with pagination if the results grow larger, should give all relevant datasets.

juliamullen commented 4 years ago

Exporting data seems to work in most cases (although it is a separate request for each dataset), but there's a single exception (doi:10.18130/V3/Z6524P) which will require the URL hit + tag extraction.
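
For that fallback case, a minimal sketch of the URL hit + tag extraction, assuming BeautifulSoup (not specified in this thread) and assuming the record's landing page follows the dataverse.harvard.edu dataset.xhtml URL pattern:

```python
import json

import requests
from bs4 import BeautifulSoup


def scrape_json_ld(persistent_id, base="https://dataverse.harvard.edu"):
    """Fetch a dataset landing page and pull the schema.org JSON-LD out of its <script> tag."""
    resp = requests.get(f"{base}/dataset.xhtml", params={"persistentId": persistent_id})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    tag = soup.find("script", type="application/ld+json")
    return json.loads(tag.string) if tag else None


# e.g., the one exception noted above:
# metadata = scrape_json_ld("doi:10.18130/V3/Z6524P")
```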

flaneuse commented 4 years ago

Ah, okay @juliamullen -- it wasn't entirely clear to me whether the 9 "dataverses" that had hits contained unique datasets, or whether their datasets were rolled into the search results. I think the Harvard Dataverse search covers all the other dataverses' datasets, but I've never confirmed that.

Actually... looking into it more, the math seems off:

- COVID-19 Data Collection: 102 datasets
- MIT Lincoln: 1
- China Data Lab: 18
- COVID Survey Research: 1
- Population Council: 7
- Coronavirus Europe Data: 4
- Sentiment Analysis: 1
- ...

It's weird though, because I checked the ones with only a couple of hits, and they're in the query you cite above. It may be that SARS-CoV-2 + COVID-19 isn't 100% inclusive and that the curated repos have additional datasets that people have hand-tagged.

It's certainly reasonable to start with just the API query you're using, which captures most of the datasets, and then go back and refine later if needed. And/or focus only on combining the API call with whatever is in COVID-19 Data Collection plus de-duplication, since that seems to be the largest source of datasets.

juliamullen commented 4 years ago

Well, I just thought to search "covid-19+sars-cov-2+covid19", which has apparently already added two results, bringing us to 93. But my question is: do we want to cast a wide net? I could include all datasets from dataverses that match the search term, at the risk of over-including, though I'm not sure how great that risk is. If the dataverse matches, do all of the datasets belonging to it match?

flaneuse commented 4 years ago

Yeah, that's definitely the balance... my hand-wavy answer is that we want to be as inclusive as possible while also trying to minimize unrelated results. In practice, I'm not sure what that looks like. I'm definitely okay tabling this issue and just running with a reasonable API query until we discover we're missing a large number of datasets. Getting an exhaustive list from Dataverse may not be worth the effort.

FYI, if you want to go down the synonym route, I've put together a synonym list already.
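
For illustration, a sketch of how that synonym list could feed the search call sketched earlier; the synonym values below are placeholders, and the terms are joined with an explicit OR rather than the plus-separated form in the URLs above:

```python
# placeholder synonyms; substitute the curated synonym list here
SYNONYMS = ["covid-19", "covid19", "sars-cov-2", "2019-ncov"]


def build_query(synonyms=SYNONYMS):
    """Combine the synonyms into one Search API query, e.g. 'covid-19 OR covid19 OR ...'."""
    return " OR ".join(synonyms)


# feed it into the search sketch above, e.g.:
# for hit in search_datasets(query=build_query()):
#     ...
```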

juliamullen commented 4 years ago

I gave both “finding all matches” and “finding everything in a dataverse for all dataverses that match” a try. With the full synonym list, the first one pulled in 101 datasets and 44 files. The latter pulled 137 datasets and 3369 (??) files. The dataverses that matched seemed to be named things like “Covid-19” and “ncov2019” so I don’t get the sense that whatever is in them is completely irrelevant.

Together they have 115 unique “global_id”s, with about a dozen sitting outside the intersection on each side. Now that I think of it, one of those is probably “”, so it's really 114. I don't think “files” have global_ids, but I haven't checked. I don't think I can get a schema.org representation of anything without a global_id. The data export in schema.org format is a little time-intensive because each one is its own request.
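
A tiny sketch of that merge/de-duplication step, assuming each strategy returns a list of Search API items carrying a global_id field (possibly empty):

```python
def merge_results(*result_lists):
    """Merge hits from multiple search strategies, de-duplicating on global_id
    and dropping records with an empty global_id (e.g. files)."""
    seen = {}
    for results in result_lists:
        for item in results:
            global_id = item.get("global_id", "")
            if global_id and global_id not in seen:
                seen[global_id] = item
    return list(seen.values())


# e.g. merged = merge_results(all_matches, everything_in_matching_dataverses)
```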

I’m a little intimidated by this dataset coercion step, but I think I see how to do it based on the other examples you’ve linked, so I'm going to move forward on that for the ones I can get the schema.org representation of.

flaneuse commented 4 years ago

Sounds like you're off to a great start! I saw the "file" count jump... we'll need to figure out whether these are worth including as well.

Ping me if you want to have a call/chat about the coercion step -- we can walk through an example record together if you want. Our schema is based on schema.org, so it should be pretty simple to convert whatever Dataverse spits out in schema.org format. The main things (off the top of my head) that you'll need to do are (there's a rough sketch after this list):

  1. Check that things that are supposed to be lists are lists
  2. Check that Objects are objects when they should be (for instance, you might have to turn author.affiliation from a string into an object like: {@type: "Organization", name: "Scripps Research"})
  3. If there are any ScholarlyArticles or Books (don't think so...), change them to @type: "Publication"
  4. Figure out what to do with "files"
  5. Double-check that all the fields in the schema that we could populate actually are populated. For instance, you might need to add an @type: "Dataset" or the like, and you will need to add curatedBy to tag where we got the data from.
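
A rough sketch of those coercion steps, assuming the input is the Schema.org JSON-LD export from Dataverse; everything beyond the fields named above (author, affiliation, curatedBy, @type) is illustrative rather than prescribed:

```python
from datetime import date

TYPE_MAP = {"ScholarlyArticle": "Publication", "Book": "Publication"}


def as_list(value):
    """Coerce a possibly-scalar value to a list (the cardinality check)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def coerce_dataset(record, dataverse_name="Harvard Dataverse", url=None):
    """Map one Dataverse schema.org record onto the outbreak.info Dataset schema."""
    doc = dict(record)

    # 3. remap ScholarlyArticle/Book; default everything else to Dataset
    doc["@type"] = TYPE_MAP.get(doc.get("@type"), "Dataset")

    # 1./2. authors should be a list, and string affiliations should become Organization objects
    authors = []
    for author in as_list(doc.get("author")):
        if isinstance(author, dict) and isinstance(author.get("affiliation"), str):
            author["affiliation"] = {"@type": "Organization", "name": author["affiliation"]}
        authors.append(author)
    doc["author"] = authors

    # 5. provenance tag
    doc["curatedBy"] = {
        "@type": "Organization",
        "identifier": "dataverse",
        "url": url,  # link to the original metadata page on Dataverse
        "name": dataverse_name,
        "curationDate": date.today().isoformat(),  # YYYY-MM-DD
    }
    return doc
```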

Also maybe helpful: here are the datasets we currently have in the API, as examples: https://api.outbreak.info/resources/query?q=@type:Dataset

juliamullen commented 4 years ago

Current work is visible in the repo I created for the dataverse parser.

flaneuse commented 4 years ago

FYI: example of a dataverse record (not COVID-19) which has funding information: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VO0UNV