Closed scholzling closed 3 years ago
Play with queries here
The problem is, everything on Wikidata is an Item with a specific ID, the Q-number. In order to get any information about something you need to know this Q-number of the item you are looking for.
So it's two queries, which can maybe be combined into one later:
To find the identifier for an item, we search for the item and copy the Q-number of the result that sounds like it’s the item we’re looking for (based on the description, for example). To find the identifier for a property, we do the same, but search for “P:search term” instead of just “search term”, which limits the search to properties.
Then query the ?gender property of the resulting object.
Working query for Angela Merkel. It returns all humans with a label matching "Merkel":
SELECT DISTINCT ?item ?queryByTitle WHERE {
  VALUES ?type { wd:Q5 }
  ?item wdt:P31 ?type .
  ?item rdfs:label ?queryByTitle .
  FILTER(REGEX(?queryByTitle, "Merkel", "i"))
}
LIMIT 10
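A minimal sketch of how the query above could be built and sent from Python. The function names, the User-Agent string, and the use of urllib (instead of requests) are assumptions for illustration. The search term is still interpreted as a regular expression by SPARQL, so only the string quoting is escaped here.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_human_label_query(term, limit=10):
    """Return a SPARQL query for humans (Q5) whose label matches `term`."""
    # Escape backslashes and double quotes so the term stays inside the string literal.
    safe = term.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?item ?label WHERE {\n"
        "  ?item wdt:P31 wd:Q5 .\n"
        "  ?item rdfs:label ?label .\n"
        f'  FILTER(REGEX(?label, "{safe}", "i"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_query(query):
    """Send the query to the endpoint and return the parsed JSON (needs network access)."""
    url = SPARQL_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # The User-Agent value is a placeholder chosen for this example.
    request = urllib.request.Request(url, headers={"User-Agent": "streetnames-example/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


print(build_human_label_query("Merkel"))
```

Calling run_query(build_human_label_query("Merkel")) would then perform the actual request. A regex filter over all labels is probably slow, so the small limit is worth keeping.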
Gentle introduction to SPARQL: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/A_gentle_introduction_to_the_Wikidata_Query_Service
New idea: first search for the name via the MediaWiki API using wbsearchentities,
take the Q-number of the first result, and use that Q-number in the SPARQL query.
Wikimedia query:
https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Douglas%20Adams&language=de
Ok... super complicated and slow but anyway:
https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Douglas%20Adams&language=de
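From the result, take the id field and request the entity:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42
From that response, use the value at claims > P21 > mainsnak > datavalue > value > id and search for it:
https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Q6581097&language=de
The label of the result is the gender.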
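The two lookups in the JSON responses could be sketched like this. The helper names are made up, and the sample dictionaries are hand-trimmed to the keys used here, not real API output.

```python
def first_search_id(search_response):
    """Return the id of the first wbsearchentities hit, or None if there is none."""
    hits = search_response.get("search", [])
    return hits[0]["id"] if hits else None


def gender_id(entities_response, qid):
    """Return the Q-number of the first P21 claim of `qid` that carries a value, or None."""
    claims = entities_response["entities"][qid].get("claims", {})
    for claim in claims.get("P21", []):
        # "datavalue" can be absent, e.g. for "unknown value" statements.
        value = claim["mainsnak"].get("datavalue", {}).get("value", {})
        if "id" in value:
            return value["id"]
    return None


# Hand-trimmed sample responses (illustrative only, not real API output).
search_sample = {"search": [{"id": "Q42", "label": "Douglas Adams"}]}
entity_sample = {"entities": {"Q42": {"claims": {"P21": [
    {"mainsnak": {"datavalue": {"value": {"id": "Q6581097"}}}}
]}}}}

qid = first_search_id(search_sample)
print(qid, gender_id(entity_sample, qid))
```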
6a81af0aa048104ef9f3d1bf0d8cc4157266ddef only works when there is really no ambiguity about the name's origin.
"Michael Schumacher" will give male, but "Schumach" will fail, because the first search result will be the family name.
Test query:
SELECT ?Stra_e ?Stra_eLabel WHERE {
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],de". }
?Stra_e wdt:P31 wd:Q79007.
?Stra_e wdt:P131 wd:Q64.
?Stra_e wdt:P138 ?person.
}
LIMIT 100
The query is more or less implemented. It works as follows:
A query is made to retrieve 50 results for a given name.
To find the "human" among these results, we go through every element of the list and run another query with the element's ID.
This query returns the properties of the element.
We check whether the property P31, which translates to "instance of", has the value Q5, which stands for human.
If it is not a human, the next element in the initial list is queried the same way until a "human" is found.
Once a human is found, the gender property P21 can be retrieved. BUT the value of that property is also just an ID.
To get the name or label of this ID, we make a third query.
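The loop described above could be sketched roughly like this. fetch_entity is a hypothetical callable (for example a thin wrapper around a wbgetentities request), and the Q-numbers in the fake lookup table are placeholders. Unlike a check of only the first P31 claim, this version looks at all of them and tolerates entities without claims.

```python
def is_human(entity):
    """True if any P31 ("instance of") claim of the entity points to Q5 (human)."""
    for claim in entity.get("claims", {}).get("P31", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if value.get("id") == "Q5":
            return True
    return False


def find_first_human(candidate_ids, fetch_entity):
    """Return (qid, entity) for the first candidate that is a human, else None.

    `fetch_entity` is a hypothetical callable mapping a Q-number to its entity dict.
    """
    for qid in candidate_ids:
        entity = fetch_entity(qid)
        if is_human(entity):
            return qid, entity
    return None


# Offline demonstration with a fake lookup table instead of real requests.
# The keys and the non-human class are placeholders.
fake_entities = {
    "Q1": {"claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q101352"}}}}]}},
    "Q2": {"claims": {}},
    "Q3": {"claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}]}},
}
match = find_first_human(["Q1", "Q2", "Q3"], fake_entities.get)
print(match[0] if match else "no human found")
```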
Roughly follows the logic of https://github.com/oklab-cottbus/streetnames-cb/issues/3#issuecomment-782611819
The last query, which gets the actual name/label of the gender, can be cached (Wikidata knows 46, though).
Query to get all gender identities (Q48264):
SELECT ?item ?itemLabel WHERE {
?item wdt:P31 wd:Q48264.
SERVICE wikibase:label { bd:serviceParam wikibase:language "de,en". }
}
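A sketch of how the result of such a query could be turned into a lookup table, so the third request per street is not needed. It assumes the standard SPARQL JSON results layout with ?item and ?itemLabel. The sample is hand-written, and which Q-numbers the real query returns depends on how they are classified in Wikidata.

```python
def gender_label_cache(sparql_json):
    """Map Q-numbers to labels from a SPARQL JSON result with ?item and ?itemLabel."""
    cache = {}
    for binding in sparql_json["results"]["bindings"]:
        # ?item comes back as a full entity URI; the Q-number is its last path segment.
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        cache[qid] = binding["itemLabel"]["value"]
    return cache


# Hand-written sample in the SPARQL JSON results layout (not real output).
sample = {"results": {"bindings": [
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q6581097"},
     "itemLabel": {"type": "literal", "value": "männlich"}},
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q6581072"},
     "itemLabel": {"type": "literal", "value": "weiblich"}},
]}}

labels = gender_label_cache(sample)
# Fall back to a marker when a gender Q-number is not in the cache.
print(labels.get("Q6581097", "unknown"))
```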
other interesting items regarding gender:
Current state of the query as of 421a04d16c7cbe3a89fed29b0a6517124e970704:
The basis of the query is outlined in https://github.com/oklab-cottbus/streetnames-cb/issues/3#issuecomment-782611819
First we filter the street name for common street suffixes and prefixes like "An", "Am", "Straße", "Ring", etc. The current list of these stopwords was manually derived from the word frequency of all the street names in names-magdeburg.csv, computed in R:
df <- read.csv("names-magdeburg.csv")
str(df)
names <- df$Name
names_line <- paste(names, collapse = " ")
library(stringr)
names_line_rep <- str_replace_all(names_line,"-"," ")
names_count <- str_split(names_line_rep," ")
table(names_count)
sort(table(names_count))
Output (most frequent words, ascending):
    Siedlung   Steinwiese          Alt
           6            6            7
Olvenstedter         Otto     Heinrich
           7            7            8
         von      Wilhelm Birkenweiler
           8            8           10
         den         Ring       Kleine
          10           11           12
         Zur     Chaussee           Im
          14           15           15
         Zum           An        Platz
          15           27           29
         der    Privatweg           Am
          31           52           93
         Weg       Straße
         120          387
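For reference, the same kind of count could be sketched in Python, so the whole pipeline stays in one language. The function name and the sample street names are made up. It splits on any whitespace, which differs slightly from the single-space split in the R snippet.

```python
from collections import Counter


def word_frequencies(names):
    """Count how often each word occurs across all street names."""
    counts = Counter()
    for name in names:
        # Treat hyphens as spaces, as in the R snippet, then split on whitespace.
        counts.update(name.replace("-", " ").split())
    return counts


# Made-up street names for illustration (not taken from names-magdeburg.csv).
sample_names = ["Am Beispielweg", "Muster-Straße", "Am Muster Platz"]
for word, count in word_frequencies(sample_names).most_common(3):
    print(word, count)
```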
From this we derived a Python function that deletes these stopwords and hyphens.
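import re

def replace(name):
    # Stopwords taken from the word-frequency table ("Chaussee" spelled as it appears there).
    suffix = ["Straße", "Weg", "Am ", "Chaussee", "Im ", "An ", "Platz", "Zum ", "Zur ", "Kleine ", "Ring", "Siedlung"]
    for string in suffix:
        # Remove every case-insensitive occurrence of the stopword.
        name = re.sub("(?i)" + string, "", name)
    # Replace hyphens with spaces.
    name = re.sub("-", " ", name)
    return name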
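Because the function above removes the stopwords anywhere in the name, it can also cut into words that merely contain them (for example "Weg" inside a longer name). A possible variant, only a sketch with made-up names, removes suffixes only at the end of a word and prefixes only as whole words:

```python
import re

# Same stopwords as above, split by how they occur (this split is an assumption).
SUFFIXES = ["straße", "weg", "chaussee", "platz", "ring", "siedlung"]
PREFIX_WORDS = ["Am", "Im", "An", "Zum", "Zur", "Kleine"]


def strip_street_words(name):
    """Remove street suffixes at word ends and prefix words, then tidy whitespace."""
    name = name.replace("-", " ")
    # Suffixes must end a word: "Venusweg" matches, "Wegener" does not.
    name = re.sub(r"(?i)(" + "|".join(SUFFIXES) + r")\b", "", name)
    # Prefixes must be whole words: "Am Ring" matches, "Sebastian" does not.
    name = re.sub(r"(?i)\b(" + "|".join(PREFIX_WORDS) + r")\b", "", name)
    return " ".join(name.split())


print(strip_street_words("Otto-von-Guericke-Straße"))
```

Whether this is better for the real data would need checking against names-magdeburg.csv.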
With this filtered name the first query is made.
name = replace(streetname)
result = requests.get(base_url, params={"action": "wbsearchentities", "search": name,"limit":"50","language": "de","format
This query returns at most 50 objects, which may or may not be the person we are looking for.
It gets tricky if we search for just a first name or family name.
The first result for Martin could, for example, be Martinshorn.
In this case it would be simple to just check whether the entity is a human and, if it is not, try the next result in the list. The idea is that if we don't get the real person, we at least get some person of the same gender.
And we do just that. In Wikidata every entity has an ID, or Q-number, and is an instance of some other entity.
In Wikidata terms, the entity has a P31 property, which translates to "instance of".
So for every result we got, we check whether the ID is an instance of human.
for x in result.json()['search']:
    id = x['id']
    result = requests.get(base_url, params={"action": "wbgetentities", "ids": id, "language": "de", "format": "json"})
    if result.json()['entities'][id]['claims']['P31'][0]['mainsnak']['datavalue']['value']['id'] == "Q5":
Programmatically, we pipe the Q-number of each result of the first query into a second query, which returns all the properties of that Q-number. Then we check whether the P31 of this result has the value Q5, which stands for human.
If it does, we can use that result to get all sorts of data. Primarily we want the gender.
We find the gender under the property P21. The value of this property represents the gender as a Q-number.
gender_id = result.json()['entities'][id]['claims']['P21'][0]['mainsnak']['datavalue']['value']['id']
To get the actual string or label of this Q-Number we have to make a third query.
gender_result = requests.get(base_url, params={"action": "wbsearchentities", "search":gender_id, "language":"de","format":
gender = gender_result.json()['search'][0]['label']
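As an alternative to searching for the Q-number, the label could probably be requested directly with wbgetentities, restricted to labels in one language. This is only a sketch with made-up helper names, and the sample response is hand-trimmed.

```python
def label_params(qid, language="de"):
    """Parameters for a wbgetentities request that returns only labels."""
    return {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels",
        "languages": language,
        "format": "json",
    }


def extract_label(response, qid, language="de"):
    """Return the label of `qid` in `language`, or None if it is missing."""
    labels = response.get("entities", {}).get(qid, {}).get("labels", {})
    return labels.get(language, {}).get("value")


# Hand-trimmed sample response (illustrative only, not real API output).
sample = {"entities": {"Q6581097": {"labels": {"de": {"language": "de", "value": "männlich"}}}}}
print(extract_label(sample, "Q6581097"))
```

With requests, the call would look like requests.get(base_url, params=label_params(gender_id)), and extract_label would be applied to its .json() result.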
Along with this we get some additional information that may help determine whether the result represents the person related to the street name.
row = {"Name":[streetname],
"Gender":[gender],
"Information":[description],
"searched_name":[name],
"matched_name":[matched_name],
"date_of_birth":[date_of_birth],
"date_of_death":[date_of_death],
"ethnic_group":[ethnic_group]}
return(row)
The major flaw of this approach is that we never really know whether the entity is the right one. It also seems that 50 results is a little too high to get reliable data. With each further result it becomes more of a gamble whether the entity has anything to do with the person. In some cases it will find a person for a street that is not named after anybody.
example of shady results: (4e5751c93a4917192ece3fa7b18298be77a938e6)
Venusweg,female,US-amerikanische Pornodarstellerin,Venus,Angelica Costello,+1978-06-05T00:00:00Z,NA,indigenous peoples of the United States
Himbeerweg,female,Jungsteinzeitliche Moorleiche aus Schweden,Himbeer,Luttra Woman,-3125-00-00T00:00:00Z,-3100-00-00T00:00:00Z,sodium
Also, the ethnic_group property might not be very reliable.
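One way to make the results less of a gamble might be to look at fewer candidates and to keep only hits where the search term matched the label itself. This sketch assumes that each wbsearchentities hit carries a "match" object with a "type" key. The cutoff of 5 is an arbitrary example value, and the sample hits are invented.

```python
def plausible_candidates(search_hits, max_candidates=5, allowed_match_types=("label",)):
    """Keep only early hits whose match came from an allowed field.

    Hits without a "match" object are dropped.
    """
    kept = []
    for hit in search_hits[:max_candidates]:
        if hit.get("match", {}).get("type") in allowed_match_types:
            kept.append(hit)
    return kept


# Invented sample hits with placeholder Q-numbers.
sample_hits = [
    {"id": "Q100", "label": "Martin Beispiel", "match": {"type": "label", "text": "Martin Beispiel"}},
    {"id": "Q200", "label": "Someone Else", "match": {"type": "alias", "text": "Martin"}},
]
print([hit["id"] for hit in plausible_candidates(sample_hits)])
```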
Testing a SPARQL query for retrieving information about a specific person.
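Such a query could be built like this once the Q-number is known. This is a sketch, not the query from the repository: the function name is made up, and the selection of properties (P21 gender, P569 date of birth, P570 date of death) is an assumption based on the fields collected above.

```python
import re


def person_query(qid):
    """Build a SPARQL query for gender, birth date and death date of one item."""
    # Only accept plain Q-numbers so nothing else ends up in the query text.
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Q-number: {qid!r}")
    return (
        "SELECT ?genderLabel ?birth ?death WHERE {\n"
        f"  OPTIONAL {{ wd:{qid} wdt:P21 ?gender . }}\n"
        f"  OPTIONAL {{ wd:{qid} wdt:P569 ?birth . }}\n"
        f"  OPTIONAL {{ wd:{qid} wdt:P570 ?death . }}\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "de,en". }\n'
        "}"
    )


print(person_query("Q42"))
```

The OPTIONAL blocks keep a row in the result even when one of the properties is missing, for example the date of death of a living person.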