joph / Covid19-Austria


Wikidata as Datasource #2

Open LibrErli opened 4 years ago

LibrErli commented 4 years ago

Hi @joph,

Idea

Your visualization and calculation of the increase in the number of cases and clinical tests for COVID-19 could also be based on data stored in Wikidata. That would make it easy to use your script for any other country or region for which COVID-19 data is stored there. (The tables on Wikipedia are structured very differently from page to page.) Furthermore, each statement can be enriched and documented with full references (e.g. to store or show the differing numbers published or counted by different organisations).

Example about the number of cases stored in Wikidata

e.g. number of cases in https://www.wikidata.org/wiki/Q86847911:

SELECT ?numberOfCases ?pointInTime WHERE {
  wd:Q86847911 p:P1603 ?numberOfCasesStmt.
  ?numberOfCasesStmt ps:P1603 ?numberOfCases;
    pq:P585 ?pointInTime.
}
ORDER BY (?pointInTime)

[Try it](https://w.wiki/Kcr)

Retrieve this data in R:

#http://www.r-bloggers.com/sparql-with-r-in-less-than-5-minutes/

library(SPARQL) # SPARQL querying package
library(ggplot2)

endpoint <- "https://query.wikidata.org/sparql"
query <- 'SELECT ?numberOfCases ?pointInTime WHERE {\n\n  wd:Q86847911 p:P1603 ?numberOfCasesStmt. \n  ?numberOfCasesStmt ps:P1603 ?numberOfCases; \n                     pq:P585 ?pointInTime.\n  \n}\nORDER BY ?pointInTime'
useragent <- paste("WDQS-Example", R.version.string) # TODO adjust this; see https://w.wiki/CX6

qd <- SPARQL(endpoint,query,curl_args=list(useragent=useragent))
df <- qd$results

Further development

At the moment there are no properties for the "number of recoveries" or the "number of clinical tests". The [WikiProject COVID-19](https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID-19/Data_models/Outbreaks) is discussing the data model, and there are two new property proposals:

In the meantime I have started adding data about the number of clinical tests using the current data model, like this:

SELECT ?numberOfTests ?pointInTime WHERE {
  wd:Q86847911 p:P1114 ?numberOfStmt.
  ?numberOfStmt ps:P1114 ?numberOfTests;
    pq:P805 wd:Q86901049;
    pq:P585 ?pointInTime.
}
ORDER BY (?pointInTime)

https://w.wiki/Kcs

Overview and links about data related to covid-19 on Wikidata

joph commented 4 years ago

Thanks a lot for that comment! In principle Wikidata is a much better data source than Wikipedia (Wikipedia data is unstructured, changes in format, and, as you pointed out, the tables are not comparable between countries or regions). However, Wikipedia is updated really quickly with all the information provided by the government. For that reason I'll stick with it for the moment - but keep me posted on updates on Wikidata! For a generic SARS-CoV script that is applicable to all countries, Wikidata is definitely the better source. This package, however, is mainly intended to give a very quick update on the Austrian situation.

LibrErli commented 4 years ago

So feel free to make a Wikidata-based version. I will keep the data on Wikidata up to date, and I am also adding Wayback Machine links to the ministry website to make the data verifiable.

joph commented 4 years ago

Thanks again for the query and everything! So in principle this works. However, to be honest, I have quite some trouble with the Wikidata query language. E.g. what would a query for the Italian cases look like? Moving from one country to another seems to involve quite some research on the identifiers in the Wikidata database, right?

There is https://github.com/CSSEGISandData/COVID-19 which has global level data updated quickly - but not about tested individuals. For a start, I may move there for international data.

LibrErli commented 4 years ago

List of all Wikidata-Items about covid-19 by country

Here is the query to get all Wikidata-Items about covid-19 by country or territory:

SELECT ?covid19_perCountry ?covid19_perCountryLabel ?country ?countryLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?covid19_perCountry wdt:P361 wd:Q83741704;
    wdt:P17 ?country.
}
ORDER BY ?countryLabel

https://w.wiki/KhL

Change the SPARQL-Query for number of cases in other countries

The Wikidata ID in the first column of the result of the query above is what you use to get the number of cases, if they are already stored on Wikidata. It is the subject of the first triple in the SPARQL query: wd:[Wikidata ID of the country's COVID-19 item] p:P1603 ?numberOfCasesStmt.

e.g. Italy (unfortunately the data is not complete for every day):

SELECT ?numberOfCases ?pointInTime WHERE {
  wd:Q84104992 p:P1603 ?numberOfCasesStmt.
  ?numberOfCasesStmt ps:P1603 ?numberOfCases;
    pq:P585 ?pointInTime.
}
ORDER BY (?pointInTime)

https://w.wiki/KhM
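Since only the item ID changes between countries, the swap could be wrapped in a small R helper. This is a minimal sketch, not code from the repository: the function name build_cases_query is hypothetical, and the query text is the one shown above with the item ID filled in.

```r
# Hypothetical helper (not part of the repository): build the
# "number of cases" query for any country's COVID-19 item.
build_cases_query <- function(item_id) {
  # Guard against typos: Wikidata item IDs are "Q" followed by digits.
  if (!grepl("^Q[0-9]+$", item_id)) {
    stop("item_id must look like 'Q86847911'")
  }
  sprintf(
    paste(
      "SELECT ?numberOfCases ?pointInTime WHERE {",
      "  wd:%s p:P1603 ?numberOfCasesStmt.",
      "  ?numberOfCasesStmt ps:P1603 ?numberOfCases;",
      "    pq:P585 ?pointInTime.",
      "}",
      "ORDER BY ?pointInTime",
      sep = "\n"
    ),
    item_id
  )
}

query_at <- build_cases_query("Q86847911")  # Austria
query_it <- build_cases_query("Q84104992")  # Italy
```

The resulting string can be passed as the query argument in the R snippet from the first comment.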

Query the number of cases for all countries

SELECT ?countryLabel ?pointInTime ?numberOfCases WITH {
  SELECT ?covid19_perCountry ?covid19_perCountryLabel ?country ?countryLabel WHERE {
    ?covid19_perCountry wdt:P361 wd:Q83741704;
      wdt:P17 ?country.
  }
} AS %covid19Country
WHERE {
  INCLUDE %covid19Country
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?covid19_perCountry p:P1603 ?numberOfCasesStmt.
  ?numberOfCasesStmt ps:P1603 ?numberOfCases;
    pq:P585 ?pointInTime.
}
ORDER BY ?countryLabel ?pointInTime

https://w.wiki/KhT

joph commented 4 years ago

Very cool, thanks a lot! I'm going to move to wikidata soon for the Austrian data at least. Wikipedia is hell... are you responsible for the data-set on wikidata? There are two entries with the same timestamp unfortunately (for number of infections). Or is this something I can comment on directly on wikidata? (sorry, this is a very new world for me).

LibrErli commented 4 years ago

That sounds great. I am not responsible for the data set on Wikidata (it's open [CC0] to read and edit in the widest sense ;-) ), but I did add most of the quantitative data on the Austrian item.

LibrErli commented 4 years ago

Oh sorry, the duplicate statement for the number of cases in Austria yesterday (17 March) was actually my fault. It seems I added both the number published at 8 am and the one published at 3 pm. I removed the one from 8 am. At the moment, points in time in Wikidata can only be stored with day precision (maybe that's a slight disadvantage).
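Because of the day precision, two reports from the same day end up with the same timestamp, so a script may want to guard against this on its side. The following is a rough sketch with made-up numbers; it assumes pointInTime has already been converted to Date and that the counts are cumulative, so the largest value per day is the latest report. The function name dedup_by_day is hypothetical.

```r
# Toy data with made-up numbers, mimicking the shape of qd$results.
df <- data.frame(
  pointInTime   = as.Date(c("2020-03-16", "2020-03-17", "2020-03-17")),
  numberOfCases = c(100, 120, 150)
)

# Keep one row per date. Assumption: counts are cumulative, so the
# largest value on a given day is the latest report.
dedup_by_day <- function(df) {
  out <- aggregate(numberOfCases ~ pointInTime, data = df, FUN = max)
  out[order(out$pointInTime), ]
}

clean <- dedup_by_day(df)
```

If the counts were not cumulative, picking the maximum would be the wrong rule and the duplicates would have to be resolved on Wikidata itself.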

joph commented 4 years ago

Again, thanks a lot. Unfortunately I am having quite some trouble with the SPARQL library in R which is extremely buggy. May take some time to get it up and running completely. You can find the code in the function get_wikidata_at(). However, it crashes on my machine.

LibrErli commented 4 years ago

Did you try adjusting the code line useragent <- paste("WDQS-Example", R.version.string) # TODO adjust this; see https://w.wiki/CX6 and putting another string in place of the given example, e.g. 'covid-19 in Austria'?

joph commented 4 years ago

Yes. The code actually runs if executed standalone, but once put in the function it crashes. This is really strange behaviour. Just to say that it will take more time to implement it.
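One possible way around the SPARQL package, assuming the crash comes from the package and not from the endpoint: the Wikidata Query Service also answers plain HTTP GET requests with the query passed in the query parameter and format=json selecting JSON output. The sketch below only builds such a request URL with base R; the helper name wdqs_url is hypothetical, and fetching and parsing the response is left to whichever HTTP/JSON client is at hand.

```r
# Hypothetical helper: build a GET URL for the Wikidata Query Service.
wdqs_url <- function(query, endpoint = "https://query.wikidata.org/sparql") {
  # reserved = TRUE also escapes characters such as "?", "{" and ":".
  paste0(endpoint, "?format=json&query=",
         utils::URLencode(query, reserved = TRUE))
}

query <- "SELECT ?numberOfCases ?pointInTime WHERE {
  wd:Q86847911 p:P1603 ?numberOfCasesStmt.
  ?numberOfCasesStmt ps:P1603 ?numberOfCases;
    pq:P585 ?pointInTime.
}
ORDER BY ?pointInTime"

request_url <- wdqs_url(query)
```

Whatever client sends the request should still set a descriptive user agent, as the TODO in the original snippet asks.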

LibrErli commented 4 years ago

Short update: Wikidata now has two new properties:

For Austria I still have to convert the data about clinical tests to the new model within the next few hours:

Here is a new SPARQL query that fetches all of this data for Austria, restricted to data published by the Gesundheitsministerium (indicated in the statement reference):

SELECT DISTINCT ?numberOfCases ?numberOfDeaths ?numberOfTests ?numberOfRecov ?pointInTime WHERE {
  VALUES ?covidCountry {
    wd:Q86847911
  }
  OPTIONAL {
    ?covidCountry p:P1603 ?numberOfCasesStmt.
    ?numberOfCasesStmt ps:P1603 ?numberOfCases;
      pq:P585 ?pointInTime;
      prov:wasDerivedFrom ?refNode.
    ?refNode pr:P123 wd:Q1006381.
  }
  OPTIONAL {
    ?covidCountry p:P1120 ?numberOfDeathsStmt.
    ?numberOfDeathsStmt ps:P1120 ?numberOfDeaths;
      pq:P585 ?pointInTime;
      prov:wasDerivedFrom ?refNodeDeath.
    ?refNodeDeath pr:P123 wd:Q1006381.
  }
  OPTIONAL {
    ?covidCountry p:P8011 ?numberOfTestsStmt.
    ?numberOfTestsStmt ps:P8011 ?numberOfTests;
      pq:P585 ?pointInTime;
      prov:wasDerivedFrom ?refNodeTest.
    ?refNodeTest pr:P123 wd:Q1006381.
  }
  OPTIONAL {
    ?covidCountry p:P8010 ?numberOfRecovStmt.
    ?numberOfRecovStmt ps:P8010 ?numberOfRecov;
      pq:P585 ?pointInTime;
      prov:wasDerivedFrom ?refNodeRecov.
    ?refNodeRecov pr:P123 wd:Q1006381.
  }
}
ORDER BY DESC (?pointInTime)

Try it

joph commented 4 years ago

Thanks! I hope I find the time on the weekend to work on it. Is there a plan to make the data complete from February on?

LibrErli commented 4 years ago

Data for Austria is currently complete from 26 February onwards: https://w.wiki/L7b. I don't have an overview for other countries; maybe I will have time within the next few days to align some tables from different Wikipedias with the new data model options in Wikidata. Do you have any preferred countries (Italy, Brazil?)