Open LibrErli opened 4 years ago
Thanks a lot for that comment! In principle, Wikidata is a much better data source than Wikipedia (Wikipedia data is unstructured, its format changes, and, as you pointed out, tables are not comparable between countries or regions). However, Wikipedia is updated really quickly with all the information provided by the government. For that reason I'll stick to it for the moment - but keep me posted on updates on Wikidata! For a generic SARS-CoV script that is applicable to all countries, Wikidata is definitely the better source. This package, however, is mainly intended to give a very quick update on the Austrian situation.
I will keep you informed about the data model discussion on Wikidata.
For Austria, I added the total number of clinical tests day by day in the way mentioned above - see https://w.wiki/Kcs
Query combining the number of cases and the number of tests: https://w.wiki/KdU
So feel free to make a Wikidata-based version. I will keep the data on Wikidata up to date, and I am also adding Wayback Machine links to the ministry website to make the data verifiable.
Thanks again for the query and everything! So in principle this works - however, to be honest, I have quite some trouble with the Wikidata query language. E.g. what would a query for the Italian cases look like? Moving from one country to another seems to involve quite some research on the identifiers in the Wikidata database, right?
There is https://github.com/CSSEGISandData/COVID-19 which has global level data updated quickly - but not about tested individuals. For a start, I may move there for international data.
Here is the query to get all Wikidata items about COVID-19 by country or territory:
    SELECT ?covid19_perCountry ?covid19_perCountryLabel ?country ?countryLabel WHERE {
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
      ?covid19_perCountry wdt:P361 wd:Q83741704;
                          wdt:P17 ?country.
    }
    ORDER BY ?countryLabel
The Wikidata ID in the first column of that result is the one to use in the earlier query to get the number of cases, if it is already stored on Wikidata. It is the subject of the first triple in the SPARQL query:

    wd:<Wikidata ID of the country's COVID-19 item> p:P1603 ?numberOfCasesStmt.

E.g. Italy (unfortunately the data is not stored completely day by day):
    SELECT ?numberOfCases ?pointInTime WHERE {
      wd:Q84104992 p:P1603 ?numberOfCasesStmt.
      ?numberOfCasesStmt ps:P1603 ?numberOfCases;
                         pq:P585 ?pointInTime.
    }
    ORDER BY (?pointInTime)
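
The identifier lookup can also be narrowed to a single country, which avoids scanning the full list. This is a minimal sketch, assuming `wd:Q38` (the Wikidata item for Italy) as the country; swap in another country's item to look up its COVID-19 item instead:

```sparql
# Find the COVID-19 item for one country.
# wd:Q38 (Italy) is an assumption chosen for illustration.
SELECT ?covid19_perCountry ?covid19_perCountryLabel WHERE {
  ?covid19_perCountry wdt:P361 wd:Q83741704;  # same "part of" pattern as the listing query
                      wdt:P17 wd:Q38.         # country fixed instead of left as a variable
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

The item returned in the first column can then be used as the subject of the `p:P1603` pattern in the case-count query.
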
    SELECT ?countryLabel ?pointInTime ?numberOfCases WITH {
      SELECT ?covid19_perCountry ?covid19_perCountryLabel ?country ?countryLabel WHERE {
        ?covid19_perCountry wdt:P361 wd:Q83741704;
                            wdt:P17 ?country.
      }
    } AS %covid19Country
    WHERE {
      INCLUDE %covid19Country
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
      ?covid19_perCountry p:P1603 ?numberOfCasesStmt.
      ?numberOfCasesStmt ps:P1603 ?numberOfCases;
                         pq:P585 ?pointInTime.
    }
    ORDER BY ?countryLabel ?pointInTime
Very cool, thanks a lot! I'm going to move to wikidata soon for the Austrian data at least. Wikipedia is hell... are you responsible for the data-set on wikidata? There are two entries with the same timestamp unfortunately (for number of infections). Or is this something I can comment on directly on wikidata? (sorry, this is a very new world for me).
That sounds great. I am not responsible for the data set on Wikidata (because it's open [CC0] to read and edit in the widest sense ;-) ), but of course I have added most of the quantitative data on the Austrian item.
    SELECT ?numberOfCases ?pointInTime ?archiveURL WHERE {
      wd:Q86847911 p:P1603 ?numberOfCasesStmt.
      ?numberOfCasesStmt ps:P1603 ?numberOfCases;
                         pq:P585 ?pointInTime;
                         prov:wasDerivedFrom ?nOfCasesRef.
      ?nOfCasesRef pr:P123 wd:Q1006381. # published by Austrian Fed. Ministry
      OPTIONAL { ?nOfCasesRef pr:P1065 ?archiveURL. } # fetch archiveURL if available
    }
    ORDER BY (?pointInTime)
Oh sorry, the duplicate statement for the number of cases in Austria yesterday (17 March) was my fault: it seems I added both the number published at 8 am and the one at 3 pm. I removed the one from 8 am. At the moment, a point in time on Wikidata can only be stored with day precision (maybe that's a minor disadvantage).
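
Because several figures can end up on the same date, a consumer of the data may want to collapse them in the query itself. This is a sketch that keeps only the highest case count per date for the Austrian item. Taking the maximum is an assumption that only makes sense while the stored counts are cumulative, and `?maxNumberOfCases` is a name introduced here for illustration:

```sparql
# One row per date: the highest case count stored for that date.
SELECT ?pointInTime (MAX(?numberOfCases) AS ?maxNumberOfCases) WHERE {
  wd:Q86847911 p:P1603 ?numberOfCasesStmt.
  ?numberOfCasesStmt ps:P1603 ?numberOfCases;
                     pq:P585 ?pointInTime.
}
GROUP BY ?pointInTime
ORDER BY ?pointInTime
```

This does not repair the data on Wikidata; it only hides same-day duplicates from the result.
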
Again, thanks a lot. Unfortunately I am having quite some trouble with the SPARQL library in R which is extremely buggy. May take some time to get it up and running completely. You can find the code in the function get_wikidata_at(). However, it crashes on my machine.
Did you try adjusting this line of code

    useragent <- paste("WDQS-Example", R.version.string) # TODO adjust this; see https://w.wiki/CX6

and replacing the example string with another one, e.g. 'covid-19 in Austria'?
Yes. The code actually runs if executed stand-alone, but once put in the function it crashes. This is really strange behaviour. Just to say that it will take more time to implement it.
Short update: Wikidata now has two new properties: number of recoveries (P8010) and number of clinical tests (P8011).
For Austria, I will migrate the data about clinical tests to the new model within the next few hours.
Here is a new SPARQL query that fetches all of these data for Austria, restricted to data published by the Gesundheitsministerium (the health ministry, as indicated in the statement reference):
    SELECT DISTINCT ?numberOfCases ?numberOfDeaths ?numberOfTests ?numberOfRecov ?pointInTime WHERE {
      VALUES ?covidCountry {
        wd:Q86847911
      }
      OPTIONAL {
        ?covidCountry p:P1603 ?numberOfCasesStmt.
        ?numberOfCasesStmt ps:P1603 ?numberOfCases;
                           pq:P585 ?pointInTime;
                           prov:wasDerivedFrom ?refNode.
        ?refNode pr:P123 wd:Q1006381.
      }
      OPTIONAL {
        ?covidCountry p:P1120 ?numberOfDeathsStmt.
        ?numberOfDeathsStmt ps:P1120 ?numberOfDeaths;
                            pq:P585 ?pointInTime;
                            prov:wasDerivedFrom ?refNodeDeath.
        ?refNodeDeath pr:P123 wd:Q1006381.
      }
      OPTIONAL {
        ?covidCountry p:P8011 ?numberOfTestsStmt.
        ?numberOfTestsStmt ps:P8011 ?numberOfTests;
                           pq:P585 ?pointInTime;
                           prov:wasDerivedFrom ?refNodeTest.
        ?refNodeTest pr:P123 wd:Q1006381.
      }
      OPTIONAL {
        ?covidCountry p:P8010 ?numberOfRecovStmt.
        ?numberOfRecovStmt ps:P8010 ?numberOfRecov;
                           pq:P585 ?pointInTime;
                           prov:wasDerivedFrom ?refNodeRecov.
        ?refNodeRecov pr:P123 wd:Q1006381.
      }
    }
    ORDER BY DESC (?pointInTime)
Thanks! I hope I find the time on the weekend to work on it. Is there a plan to make the data complete from February on?
Data for Austria is currently complete from 26 February onward: https://w.wiki/L7b. I don't have an overview for other countries; maybe I will have time within the next few days to align some tables from different Wikipedias with the new data model options in Wikidata. Do you have any preferred countries (Italy, Brazil?)
Hi @joph,
Idea
Your visualization and calculation of the increase in the number of cases and the number of clinical tests for the COVID-19 disease could also be based on data stored on Wikidata. That would make it easy to use your script for any other country or region for which COVID-19 data is stored there. (The tables on Wikipedia are structured very differently across these pages.) Furthermore, each statement can be enriched and documented with a full set of references (e.g. to store or show different numbers published or counted by different organisations).
Example about the number of cases stored in Wikidata
e.g. number of cases in https://www.wikidata.org/wiki/Q86847911:
[Try it](https://w.wiki/Kcr)
Retrieve this data in R:
Further development
At the moment there are no properties for "number of recoveries" or "number of clinical tests". The [WikiProject Covid-19](https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID-19/Data_models/Outbreaks) is discussing the data model, and there are two new property proposals:
In the meantime, I started adding data about the number of clinical tests using the current data model in this way:
https://w.wiki/Kcs
Overview and links about data related to covid-19 on Wikidata