usaybia / usaybia-data

Data for interreligious interaction in Near Eastern texts
MIT License

Match persons with dates to Wikidata and VIAF #149

Open nathangibson opened 3 years ago

nathangibson commented 3 years ago

@hannafriedel Just to document what we already discussed:

Use the branch content/uri-matching

In OpenRefine:

  1. Import the project in data/persons/openrefine
  2. Use facets to filter the rows to a specific subset, e.g. persons with ISO death dates
  3. Use the WD_reconcile column to reconcile against Wikidata. For the type you can use "human" or something more specific if you've filtered accordingly.
  4. Add relevant columns for matching such as "date of death"
  5. Select clear matches. Leave unclear ones unmatched.
  6. After each major change, re-export the project to data/persons/openrefine
  7. Let me know if you need to add more columns. At some point we'll probably want Arabic names and occupations.
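
For reference, steps 3 and 4 correspond roughly to the queries OpenRefine sends to a reconciliation service. The following is only a minimal sketch of that query format, not something taken from the project: the sample rows and the `build_queries` helper are hypothetical, while `Q5` ("human") and `P570` ("date of death") are the Wikidata identifiers for the type and property mentioned above. No network call is made.

```python
import json


def build_queries(rows, type_id="Q5", limit=5):
    """Build a reconciliation query batch keyed q0, q1, ... (hypothetical helper)."""
    queries = {}
    for i, row in enumerate(rows):
        query = {"query": row["name"], "type": type_id, "limit": limit}
        # Only send the death date as an extra matching property if we have one.
        if row.get("death"):
            query["properties"] = [{"pid": "P570", "v": row["death"]}]
        queries[f"q{i}"] = query
    return queries


# Sample rows for illustration only, not taken from the project data.
rows = [
    {"name": "Hunayn ibn Ishaq", "death": "0873"},
    {"name": "Unknown physician", "death": ""},
]
queries = build_queries(rows)
payload = json.dumps(queries, ensure_ascii=False)
print(payload)
```

Restricting the type and adding the death date is what narrows the candidate list, which is why filtering to persons with ISO death dates first makes the matches easier to judge.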

You can find the reconciliation service for VIAF at http://refine.codefork.com/. (Please duplicate the LHOM Name column for each additional reconciliation service.)

OpenRefine documentation: https://openrefine.org/documentation.html (you can also check out YouTube videos)

nathangibson commented 3 years ago

Okay @hannafriedel, I finally figured it out: there is a better way to add OpenRefine columns from another spreadsheet! (Of course.) It uses the cross function.

  1. Download the Google spreadsheet and open it as a project. (I've already done this so you can just import the project https://github.com/usaybia/usaybia-data/blob/content/uri-matching/data/persons/openrefine/persons-text-v0.6.0-dev-2021-07-26.openrefine.tar.gz in your branch.)
  2. In your first spreadsheet (where you're doing the reconciling), click on the ID column > Edit column ... > Add column based on this column ...
  3. Use the following formula, but change the cells["Affiliation (all)"] to match the name of the column you want to import from the other sheet: cell.cross("persons text v0.6.0-dev 2021-07-26","ID").cells["Affiliation (all)"].value[0].

How this works: The cross function matches your current column (cell) to the one from the project and column specified ("persons text v0.6.0-dev 2021-07-26","ID"). Then cells["Affiliation (all)"] gets the column you want as an object and value[0] extracts the value from it.
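
To make the lookup concrete, here is a rough Python equivalent of what the cross expression does. This is a sketch only: `cross_lookup` and the sample rows are hypothetical, and the real work happens inside OpenRefine, not in Python.

```python
def cross_lookup(current_rows, other_rows, key, column):
    """Copy `column` from other_rows into current_rows by matching on `key`."""
    # Index the other project by key; one key may match several rows,
    # just as cross can return more than one row.
    index = {}
    for row in other_rows:
        index.setdefault(row[key], []).append(row)

    result = []
    for row in current_rows:
        matches = index.get(row[key], [])
        # Taking matches[0] mirrors value[0]: use the first matching row.
        value = matches[0][column] if matches else None
        result.append({**row, column: value})
    return result


# Hypothetical sample data standing in for the two OpenRefine projects.
reconciling = [{"ID": "P1"}, {"ID": "P2"}, {"ID": "P3"}]
source = [
    {"ID": "P1", "Affiliation (all)": "example affiliation A"},
    {"ID": "P2", "Affiliation (all)": "example affiliation B"},
]
merged = cross_lookup(reconciling, source, "ID", "Affiliation (all)")
```

Rows with no counterpart in the other project (here `P3`) simply get an empty value in this sketch.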

If you have any trouble let me know. I can always upload my version if that helps.

nathangibson commented 3 years ago

I forgot to mention, some columns you might want to import for matching are

The Arabic names especially might help with ones you weren't able to match otherwise.

hannafriedel commented 3 years ago

Hi Nathan, I tried reconciling the Arabic names, but it produced only a few more matches. (I reconciled the Arabic name only where I had not already matched the English name.)

nathangibson commented 3 years ago

@hannafriedel great work!

  1. Where there are multiple matches or no matches, what is the status? Do you need to do any more research? Is there a way for me to find the ones that need my input to decide?
  2. For matched Wikidata persons, please also import
    1. birth dates
    2. floruit/active dates for persons who have no birth/death dates
    3. place of birth
    4. place of death
    5. place of residence
    6. (any other common place types)
    7. VIAF ids if you don't already have them
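
For orientation, the fields listed above map to the following Wikidata properties. The sketch below builds a SPARQL query for them; it is an illustration only, not the method used in this thread (OpenRefine can add such columns from reconciled values directly). The `build_sparql` helper and the variable names are hypothetical, and `Q42` is a placeholder QID, not a person from this dataset. The `wd:` and `wdt:` prefixes are assumed to be predefined, as they are on the Wikidata Query Service.

```python
# Wikidata property IDs for the requested fields.
PROPERTIES = {
    "birth_date": "P569",
    "floruit": "P1317",
    "birth_place": "P19",
    "death_place": "P20",
    "residence": "P551",
    "viaf_id": "P214",
}


def build_sparql(qids):
    """Return a SPARQL query fetching the optional fields for the given QIDs."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    # Every field is OPTIONAL so persons lacking a value are still returned.
    optionals = "\n".join(
        f"  OPTIONAL {{ ?person wdt:{pid} ?{name} . }}"
        for name, pid in PROPERTIES.items()
    )
    select_vars = " ".join(f"?{name}" for name in PROPERTIES)
    return (
        f"SELECT ?person {select_vars} WHERE {{\n"
        f"  VALUES ?person {{ {values} }}\n"
        f"{optionals}\n"
        f"}}"
    )


query = build_sparql(["Q42"])  # placeholder QID
print(query)
```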

I think the next major step will be matching against smaller, specialized databases that don't have OpenRefine APIs. We'll need to talk about how to do that.

hannafriedel commented 3 years ago

  1. a) I ran the reconciliation on every cell, so the persons without any matches are ones the reconciliation could not match.
     b) For the persons with multiple matches, I did a basic look-over to see whether I could find the correct match by death date, occupation, relation, etc., but there was not enough information to decide.

I think that in both cases some persons can be assigned a single match through intensive research, including in Usaybia's writing. I have avoided this so far in the hope that you have a more elegant, less time-consuming solution, but I can start on it any time. I do not really know which cells you could help with, so maybe wait until I have reduced the number of affected cells.

  2. I included these columns and also work location and "educated at". I immediately converted the VIAF IDs into VIAF matches, so they are not visible anymore.

nathangibson commented 3 years ago

> 1. a) I ran the reconciliation on every cell, so the persons without any matches are ones the reconciliation could not match.
>    b) For the persons with multiple matches, I did a basic look-over to see whether I could find the correct match by death date, occupation, relation, etc., but there was not enough information to decide.
>
> I think that in both cases some persons can be assigned a single match through intensive research, including in Usaybia's writing. I have avoided this so far in the hope that you have a more elegant, less time-consuming solution, but I can start on it any time. I do not really know which cells you could help with, so maybe wait until I have reduced the number of affected cells.

OK, this makes sense. It may be that as we reconcile against specialized databases we will find some of the info (or links) we need for the unmatched persons. I'll post info here about how to reconcile with other databases. But tagging ch. 8 #162 is also a priority, so you can work on that for now and come back to this when you are bored :-)

> 2. I included these columns and also work location and "educated at". I immediately converted the VIAF IDs into VIAF matches, so they are not visible anymore.

Perfect! 👍

nathangibson commented 3 years ago

Next step:

Please work on the unmatched persons to try to find matches. You can try

  1. Different spellings, like without diacritics
  2. Adding more information (when there are too many matches)
  3. Looking at LHOM for more context (see column Refs 1st link)
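
For the first suggestion, a spelling variant without diacritics can be generated automatically. This is a minimal sketch using Python's standard `unicodedata` module. The `strip_diacritics` helper is hypothetical, and dropping the ʿayn and hamza signs is an assumption about what "without diacritics" should cover.

```python
import unicodedata

# Modifier letters used for ayn (U+02BF) and hamza (U+02BE) in transliteration.
# They are not combining marks, so they have to be dropped explicitly.
AYN_HAMZA = {"\u02bf", "\u02be"}


def strip_diacritics(name: str) -> str:
    """Return `name` without combining marks and without ayn/hamza signs."""
    # NFD splits letters such as a-macron into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", name)
    kept = [
        ch
        for ch in decomposed
        if not unicodedata.combining(ch) and ch not in AYN_HAMZA
    ]
    return unicodedata.normalize("NFC", "".join(kept))


variant = strip_diacritics("Ibn Ab\u012b U\u1e63aybi\u02bfa")
print(variant)
```

In OpenRefine the variants could go into a duplicated name column, so the original spelling stays available for other reconciliation services.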

You could work on these as 2 or more subsets such as

  1. Persons with no matches (best candidate's score = 0)
  2. Persons with multiple matches, or whatever other useful filters you find.
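
The same split can be expressed outside OpenRefine, for example on an exported table. The sketch below is only an illustration: the `candidate_scores` field and the `partition_by_candidates` helper are hypothetical names, and in OpenRefine itself the equivalent would be facets on the reconciliation columns.

```python
def partition_by_candidates(rows):
    """Split rows into no-match, multiple-candidate, and remaining subsets."""
    no_match, multiple, other = [], [], []
    for row in rows:
        scores = row["candidate_scores"]
        # A row with no candidates at all counts as best score 0.
        best = max(scores, default=0)
        if best == 0:
            no_match.append(row)
        elif len(scores) > 1:
            multiple.append(row)
        else:
            other.append(row)
    return no_match, multiple, other


# Hypothetical sample rows; the scores are made up for illustration.
sample = [
    {"name": "A", "candidate_scores": []},
    {"name": "B", "candidate_scores": [0]},
    {"name": "C", "candidate_scores": [71, 64]},
    {"name": "D", "candidate_scores": [100]},
]
no_match, multiple, other = partition_by_candidates(sample)
```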

If you find useful information but not enough to decide you can add it to a notes column.

nathangibson commented 3 years ago

Note to self: We may be able to use https://github.com/cmharlow/isni-reconcile to get ISNIs (where not available from VIAF/Wikidata).