oklab-cottbus / streetnames-cb

MIT License

Create Wikidata Sparql query #3

Closed scholzling closed 3 years ago

scholzling commented 3 years ago

Testing a SPARQL query for retrieving information about a specific person.

scholzling commented 3 years ago

Play with queries here

scholzling commented 3 years ago

The problem is that everything on Wikidata is an item with a specific ID, the Q-number. To get any information about something, you need to know the Q-number of the item you are looking for.

So it's two queries, which can maybe be combined into one later:

  1. Get the Q-number of the Person

To find the identifier for an item, we search for the item and copy the Q-number of the result that sounds like it’s the item we’re looking for (based on the description, for example). To find the identifier for a property, we do the same, but search for “P:search term” instead of just “search term”, which limits the search to properties.

  2. Get the ?gender property of the resulting object
scholzling commented 3 years ago

Working query for Angela Merkel. It returns all humans whose label contains "Merkel":

SELECT DISTINCT ?item ?name WHERE {
  ?item wdt:P31 wd:Q5 .
  ?item rdfs:label ?name .
  FILTER(REGEX(?name, "Merkel", "i"))
}
LIMIT 10
scholzling commented 3 years ago

Gentle introduction to SPARQL: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/A_gentle_introduction_to_the_Wikidata_Query_Service

scholzling commented 3 years ago

New idea: first search for the name via the MediaWiki API using wbsearchentities to get the Q-number, then use the Q-number of the first result in a SPARQL query against Wikidata (source)

https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Douglas%20Adams&language=de
scholzling commented 3 years ago

Ok... super complicated and slow but anyway:

https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Douglas%20Adams&language=de

take the id field of the first result

https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42

use claims:P21:mainsnak:datavalue:value:id

https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Q6581097&language=de

the label of the first result will be the gender
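A minimal sketch of the three steps above, working on hand-written stand-ins for the JSON replies so it runs without network access. The helper names (`api_url`, `first_search_id`, `gender_id`) are made up for illustration; only the URL parameters and the JSON paths come from this thread.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.wikidata.org/w/api.php"


def api_url(action, **params):
    """Build an API URL; format=json is appended so the reply is machine-readable."""
    query = {"action": action, **params, "format": "json"}
    return BASE_URL + "?" + urlencode(query)


def first_search_id(search_reply):
    """Step 1: take the id field of the first wbsearchentities hit."""
    return search_reply["search"][0]["id"]


def gender_id(entity_reply, qid):
    """Step 2: read the Q-number stored in the first P21 claim of the entity."""
    claim = entity_reply["entities"][qid]["claims"]["P21"][0]
    return claim["mainsnak"]["datavalue"]["value"]["id"]


# Hand-written stand-ins for the replies, trimmed to the fields used here.
search_reply = {"search": [{"id": "Q42", "label": "Douglas Adams"}]}
entity_reply = {"entities": {"Q42": {"claims": {"P21": [
    {"mainsnak": {"datavalue": {"value": {"id": "Q6581097"}}}}]}}}}

qid = first_search_id(search_reply)
print(api_url("wbsearchentities", search="Douglas Adams", language="de"))
print(api_url("wbgetentities", ids=qid))
print(api_url("wbsearchentities", search=gender_id(entity_reply, qid), language="de"))
```

Each printed URL corresponds to one of the three requests; in the real pipeline the reply of one request feeds the next.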

scholzling commented 3 years ago

6a81af0aa048104ef9f3d1bf0d8cc4157266ddef only works when there is really no ambiguity about the name's origin.

"Michael Schumacher" will give male, but "Schumach" will fail, because the first search result will be the family name.

scholzling commented 3 years ago

Test query:

SELECT ?Stra_e ?Stra_eLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],de". }
  ?Stra_e wdt:P31 wd:Q79007.
  ?Stra_e wdt:P131 wd:Q64.
  ?Stra_e wdt:P138 ?person.
}
LIMIT 100
scholzling commented 3 years ago

The query is more or less implemented. It works as follows:

A query is made to retrieve 50 results for a given name.

To find the "human" among these results, we go through every element of the list and run another query with the element's ID.

This query returns the properties of the element.

We check in this result whether the property P31, which translates to "instance of", is Q5, which stands for human.

If it is not a human, the next element in the initial list is queried the same way until a "human" is found.

Once the human is found, the gender property P21 can be retrieved. BUT the value of the property is also just an ID.

To get the name or the label of this ID, we make a third query.

This roughly follows the logic of https://github.com/oklab-cottbus/streetnames-cb/issues/3#issuecomment-782611819
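The control flow described above can be sketched as a small loop. This is only an illustration: `first_human`, `fetch_entity` and `is_human` are hypothetical names, and the fake entity store uses a simplified shape instead of the real API reply so the example runs offline.

```python
def first_human(search_hits, fetch_entity, is_human):
    """Walk the search hits in order and return (qid, entity) for the first human.

    search_hits:  the 'search' list of a wbsearchentities reply
    fetch_entity: callable that maps a Q-number to its entity data (second query)
    is_human:     callable that decides whether an entity is an instance of Q5
    Returns None if no hit qualifies.
    """
    for hit in search_hits:
        qid = hit["id"]
        entity = fetch_entity(qid)
        if is_human(entity):
            return qid, entity
    return None


# Placeholder ids and a simplified entity shape, for illustration only.
FAKE_ENTITIES = {
    "Q_FAMILY_NAME": {"instance_of": ["Q_NOT_A_HUMAN"]},
    "Q_PERSON": {"instance_of": ["Q5"]},
}
hits = [{"id": "Q_FAMILY_NAME"}, {"id": "Q_PERSON"}]

match = first_human(hits, FAKE_ENTITIES.get, lambda e: "Q5" in e["instance_of"])
print(match)
```

Passing the two callables in keeps the loop independent of how the requests are made, which also makes it easy to try a smaller result limit.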

blueonyx commented 3 years ago

The last query, which gets the actual name/label of the gender, can be cached (Wikidata knows 46 of them, though).

query to get all gender identities (Q48264):

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q48264.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de,en". }
}
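A rough sketch of the caching idea, assuming the query above is sent to the SPARQL endpoint once and its reply comes back in the standard SPARQL JSON results format. `labels_from_sparql` and `GenderLabels` are hypothetical names, and the sample rows are hand-written, not real query output. Because it is not certain that every P21 value shows up in that result, the sketch falls back to a per-ID lookup and remembers the answer.

```python
def labels_from_sparql(reply):
    """Turn a SPARQL JSON reply with ?item and ?itemLabel into {Q-number: label}."""
    labels = {}
    for row in reply["results"]["bindings"]:
        # Entity URIs end in the Q-number, e.g. .../entity/Q6581097
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        labels[qid] = row["itemLabel"]["value"]
    return labels


class GenderLabels:
    """Label lookup that only calls fetch_label for ids it has not seen yet."""

    def __init__(self, preloaded, fetch_label):
        self._labels = dict(preloaded)
        self._fetch_label = fetch_label  # hypothetical callable: Q-number -> label

    def get(self, qid):
        if qid not in self._labels:
            self._labels[qid] = self._fetch_label(qid)
        return self._labels[qid]


# Illustrative reply, trimmed to the fields used above.
sample_reply = {"results": {"bindings": [
    {"item": {"value": "http://www.wikidata.org/entity/Q6581097"},
     "itemLabel": {"value": "männlich"}},
]}}

genders = GenderLabels(labels_from_sparql(sample_reply), lambda qid: "unknown " + qid)
print(genders.get("Q6581097"))
```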

other interesting items regarding gender:

scholzling commented 3 years ago

Current state of Query as in 421a04d16c7cbe3a89fed29b0a6517124e970704:

The basis of the query is outlined in https://github.com/oklab-cottbus/streetnames-cb/issues/3#issuecomment-782611819

  1. Filter the street names
  2. Query the name and get 50 results
  3. Check via query whether it's a human
  4. Get the properties
  5. Query gender_id for the written-out gender label
  6. Output data

1. Filtering

First we filter the street name for common street suffixes and prefixes such as "An", "Am", "Straße", "Ring", etc. The current list of these words was derived manually from the word frequency of all the street names in names-magdeburg.csv, computed in R:

library(stringr)

# Load the street names and join them into one long string
df <- read.csv("names-magdeburg.csv")
str(df)
names <- df$Name
names_line <- paste(names, collapse = " ")

# Treat hyphens as word separators, then split into single words
names_line_rep <- str_replace_all(names_line, "-", " ")
names_count <- str_split(names_line_rep, " ")

# Count every word and list the counts in ascending order
table(names_count)
sort(table(names_count))

Output:

            Siedlung           Steinwiese                  Alt
                   6                    6                    7
        Olvenstedter                 Otto             Heinrich
                   7                    7                    8
                 von              Wilhelm         Birkenweiler
                   8                    8                   10
                 den                 Ring               Kleine
                  10                   11                   12
                 Zur             Chaussee                   Im
                  14                   15                   15
                 Zum                   An                Platz
                  15                   27                   29
                 der            Privatweg                   Am
                  31                   52                   93
                 Weg               Straße
                 120                  387
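Since the rest of the pipeline is in Python, the same word count could also be done there. A small sketch using `collections.Counter`; the function name `word_counts` and the sample street names are made up for illustration. Unlike the R version it drops empty strings that come from repeated separators.

```python
import re
from collections import Counter


def word_counts(street_names):
    """Count how often each word occurs, treating spaces and hyphens as separators."""
    counts = Counter()
    for name in street_names:
        counts.update(word for word in re.split(r"[ -]+", name) if word)
    return counts


# Made-up sample input; the real input would be the Name column of the CSV.
sample = ["Am Ring", "Otto-von-Guericke-Straße", "Kleine Straße", "Am Weg"]
for word, count in word_counts(sample).most_common(3):
    print(word, count)
```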

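From this we derived a Python function that deletes these stopwords and hyphens.

import re

def replace(name):
    # Stopwords taken from the word-frequency table above
    suffix = ["Straße", "Weg", "Am ", "Chaussee", "Im ", "An ", "Platz", "Zum ", "Zur ", "Kleine ", "Ring", "Siedlung"]

    # Hyphens become spaces, so "An-der-..." is seen as separate words
    name = re.sub("-", " ", name)

    for string in suffix:
        # Case-insensitive removal of every occurrence of the stopword
        name = re.sub(re.escape(string), "", name, flags=re.IGNORECASE)

    return name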
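One thing to watch with plain substring removal: "An " also matches inside "Sebastian ", and "Ring" inside "Thüringer". A possible variant, not what the commit does, anchors prefixes and suffixes at word boundaries. The name `strip_street_words` and the split into two lists are assumptions for this sketch.

```python
import re

# Words that stand on their own in front of the name
PREFIXES = ["Am", "An", "Im", "Zum", "Zur", "Kleine"]
# Words that end a name, either standalone or glued on ("Goethestraße")
SUFFIXES = ["Straße", "Weg", "Chaussee", "Platz", "Ring", "Siedlung"]

# Prefix: whole word followed by whitespace. Suffix: must end at a word boundary.
_PREFIX_RE = re.compile(r"\b(?:" + "|".join(PREFIXES) + r")\s+", re.IGNORECASE)
_SUFFIX_RE = re.compile(r"(?:" + "|".join(SUFFIXES) + r")\b", re.IGNORECASE)


def strip_street_words(name):
    """Remove street stopwords without cutting into the middle of other words."""
    name = name.replace("-", " ")
    name = _PREFIX_RE.sub("", name)
    name = _SUFFIX_RE.sub("", name)
    # Collapse the whitespace left behind by the removals
    return " ".join(name.split())


for street in ["Sebastian-Bach-Straße", "Thüringer Straße", "Goethestraße", "Am Dom"]:
    print(street, "->", strip_street_words(street))
```

A name that consists only of stopwords, such as "Am Ring", ends up empty, so the caller would have to skip those.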

2. First query

With this filtered name the first query is made.

name = replace(streetname)
result = requests.get(base_url, params={"action": "wbsearchentities", "search": name,"limit":"50","language": "de","format

This query returns at most 50 objects, which may or may not be the person we are looking for. It gets tricky if we search for just a first name or a family name. The first result for Martin could, for example, be Martinshorn.

In this case it is simple to check whether the entity is a human and, if it is not, to try the next result in the list. The idea is that if we don't get the real person, we at least get some person who has the same gender.

3. Human?

And we do just that. In Wikidata every entity has an ID, or Q-number, and is an instance of some other entity. In Wikidata terms, the item has a P31 property, which translates to "instance of". So for every result we got, we check whether the ID is an instance of human.

for x in result.json()['search']:
    id = x['id']
    result = requests.get(base_url, params={"action": "wbgetentities", "ids":id ,"language":"de","format":"json"})

    if result.json()['entities'][id]['claims']['P31'][0]['mainsnak']['datavalue']['value']['id'] == "Q5":

Programmatically, we pipe every Q-number from the results of the first query into a second query, which returns all the properties of that Q-number. Then we check whether the P31 of this result has the value Q5, which stands for human.

If it has, we can use that result to get all sorts of data. Primarily we want the gender.
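The check above looks only at the first P31 claim and raises a KeyError when an entity has no P31 at all or when a claim carries no datavalue (which happens for "no value" and "unknown value" statements). A more forgiving sketch, with the hypothetical helpers `claim_item_ids` and `is_human`, collects all item IDs of a property first:

```python
def claim_item_ids(entity, property_id):
    """Return the Q-numbers of all claims of one property, skipping unusable claims.

    entity is the dict found under reply['entities'][qid] of a wbgetentities reply.
    """
    ids = []
    for claim in entity.get("claims", {}).get(property_id, []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        # Claims without a datavalue or with a non-item value are ignored
        if datavalue and isinstance(datavalue.get("value"), dict):
            item_id = datavalue["value"].get("id")
            if item_id:
                ids.append(item_id)
    return ids


def is_human(entity):
    """True if any P31 (instance of) claim points to Q5 (human)."""
    return "Q5" in claim_item_ids(entity, "P31")


# Hand-written entity trimmed to the fields used here; the first claim has no value.
entity = {"claims": {"P31": [
    {"mainsnak": {"snaktype": "novalue"}},
    {"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}},
]}}
print(is_human(entity))
```

The same helper can read P21 for the gender, for example `claim_item_ids(entity, "P21")`.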

4. Get all the Properties

The gender we will find under the property P21. The value of this property represents the gender as a Q-Number.

gender_id = result.json()['entities'][id]['claims']['P21'][0]['mainsnak']['datavalue']['value']['id']

To get the actual string or label of this Q-Number we have to make a third query.

5. Did you just query my gender?

gender_result = requests.get(base_url, params={"action": "wbsearchentities", "search":gender_id, "language":"de","format":

gender = gender_result.json()['search'][0]['label']

Along with this we get some additional information that may help determine whether the result represents the person the street name refers to.
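Searching for the Q-number works, but it depends on the search ranking. An alternative that might be worth trying is to ask wbgetentities for the label directly with `props=labels` and `languages=de`. The sketch below only builds the parameters and reads the reply; `label_params` and `label_from_reply` are hypothetical helper names, and the sample reply is hand-written.

```python
def label_params(qid, lang="de"):
    """Parameters for a wbgetentities request that returns only the labels of one item."""
    return {"action": "wbgetentities", "ids": qid, "props": "labels",
            "languages": lang, "format": "json"}


def label_from_reply(reply, qid, lang="de"):
    """Read the label in the given language, or None if the item has no such label."""
    labels = reply.get("entities", {}).get(qid, {}).get("labels", {})
    entry = labels.get(lang)
    return entry["value"] if entry else None


# Hand-written reply, trimmed to the fields used above.
sample_reply = {"entities": {"Q6581097": {"labels": {
    "de": {"language": "de", "value": "männlich"}}}}}

print(label_params("Q6581097"))
print(label_from_reply(sample_reply, "Q6581097"))
```

The parameters could be passed to `requests.get(base_url, params=...)` in the same way as the other requests in this comment.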

6. Output

row = {"Name": [streetname],
       "Gender": [gender],
       "Information": [description],
       "searched_name": [name],
       "matched_name": [matched_name],
       "date_of_birth": [date_of_birth],
       "date_of_death": [date_of_death],
       "ethnic_group": [ethnic_group]}
return row

The major flaw of this approach is that we never really know whether the entity is the right one. It also seems that 50 results is a little too high to get reliable data. With each further result it becomes more of a gamble whether the entity has anything to do with the person. In some cases it will find a person for a street that is not named after anybody.

example of shady results: (4e5751c93a4917192ece3fa7b18298be77a938e6)

Venusweg,female,US-amerikanische Pornodarstellerin,Venus,Angelica Costello,+1978-06-05T00:00:00Z,NA,indigenous peoples of the United States

Himbeerweg,female,Jungsteinzeitliche Moorleiche aus Schweden,Himbeer,Luttra Woman,-3125-00-00T00:00:00Z,-3100-00-00T00:00:00Z,sodium

Also, the ethnic_group property might not be very reliable (see "sodium" in the second row).
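One cheap plausibility check, offered only as an idea: keep a row only if the searched name actually appears in the matched label. In both shady rows above the matched_name does not contain the searched_name, so they would be dropped. The function name `label_mentions` is made up, and the rule would also throw away legitimate matches that were found through an alias.

```python
import re


def label_mentions(searched_name, matched_label):
    """True if every word of the searched name occurs as a whole word in the label."""
    wanted = re.findall(r"\w+", searched_name.casefold())
    present = set(re.findall(r"\w+", matched_label.casefold()))
    return bool(wanted) and all(word in present for word in wanted)


# The first two pairs are the shady rows from this comment; the third is the
# "Merkel" search from earlier in the thread.
pairs = [("Venus", "Angelica Costello"),
         ("Himbeer", "Luttra Woman"),
         ("Merkel", "Angela Merkel")]
for searched, matched in pairs:
    print(searched, "/", matched, "->", label_mentions(searched, matched))
```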