nypl-registry / registry-ingest

Agent Clustering #19

Closed thisismattmiller closed 8 years ago

thisismattmiller commented 8 years ago

After serialization, it should be possible to cluster terms together based on their shared normalized values. For example, these three records all share the normalized name "boston public library":

[
    {
        "_id": "ObjectId(5669f18d1d7d67523cb6a999)",
        "birth": false,
        "dbpedia": false,
        "death": false,
        "gettyId": false,
        "lcId": "n79117036",
        "nameControlled": "Boston public library",
        "nameNormalized": [
            "boston public library"
        ],
        "registry": "temp144978369397863931",
        "source": "catalog20294782",
        "type": "corporate",
        "useCount": 181,
        "viaf": 212721938,
        "viafAll": [
            212721938
        ],
        "wikidata": false
    },
    {
        "_id": "ObjectId(5669f2be1d7d67523cb82790)",
        "birth": false,
        "dbpedia": false,
        "death": false,
        "gettyId": 500305150,
        "lcId": false,
        "nameControlled": "Boston., Public Library",
        "nameNormalized": [
            "boston public library",
            "boston public library galatea collection",
            "prouty louise",
            "st james methodist episcopal church new york",
            "joan of arc collection boston public library"
        ],
        "registry": "temp14497839988712932841",
        "source": "catalog20190637",
        "type": "corporate",
        "useCount": 33,
        "viaf": 311426235,
        "viafAll": [
            311426235
        ],
        "wikidata": false
    },
    {
        "_id": "ObjectId(567866891d7d67523cf371c7)",
        "birth": false,
        "dbpedia": false,
        "death": false,
        "gettyId": false,
        "lcId": false,
        "nameControlled": "Boston Public Library",
        "nameNormalized": [
            "boston public library"
        ],
        "registry": "temp1450731145902868491",
        "source": "archivesCollectionDb221",
        "type": "corpname",
        "useCount": 1,
        "viaf": 316395379,
        "viafAll": [
            316395379
        ],
        "wikidata": false
    }
]
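
A rough sketch of how candidate clusters could be found from records like these: index every agent under each of its `nameNormalized` values and keep the names that map to more than one record. The helper name `groupByNormalizedName` is hypothetical and the records are trimmed to the relevant fields; this is not the code in the repository.

```javascript
// Hypothetical helper: index agent records under each of their normalized names.
function groupByNormalizedName(agents) {
  const groups = new Map();
  for (const agent of agents) {
    for (const name of agent.nameNormalized || []) {
      if (!groups.has(name)) groups.set(name, []);
      groups.get(name).push(agent);
    }
  }
  return groups;
}

// The three sample records, trimmed to the fields used here.
const agents = [
  {
    registry: "temp144978369397863931",
    nameControlled: "Boston public library",
    nameNormalized: ["boston public library"],
  },
  {
    registry: "temp14497839988712932841",
    nameControlled: "Boston., Public Library",
    nameNormalized: [
      "boston public library",
      "boston public library galatea collection",
      "prouty louise",
      "st james methodist episcopal church new york",
      "joan of arc collection boston public library",
    ],
  },
  {
    registry: "temp1450731145902868491",
    nameControlled: "Boston Public Library",
    nameNormalized: ["boston public library"],
  },
];

// A name shared by two or more records is a candidate cluster.
const candidates = [...groupByNormalizedName(agents)].filter(
  ([, members]) => members.length > 1
);

for (const [name, members] of candidates) {
  console.log(name, members.map((m) => m.registry));
}
```

Sharing a name only makes records candidates for merging; whether they really describe the same agent still has to be decided separately.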
thisismattmiller commented 8 years ago

https://github.com/nypl-registry/registry-ingest/blob/5e62117bddd756ea272f49d99c608c5b6d4563f0/lib/serialize_utils.js#L279 This job looks for records with the same normalized name and then does its best to decide whether they should be merged. If they are merged, it adds all of the VIAF IDs and normalized names to the best (most complete) record. The second pass of resource serialization will then use only the new merged record.
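
To illustrate the merge step only, here is a minimal sketch. It assumes the cluster has already been judged mergeable (that decision is not shown), and it assumes "most complete" means the most identifier fields that are not `false`, with `useCount` as a tie-breaker. Both rules, and the names `completeness` and `mergeCluster`, are assumptions for illustration and not what `serialize_utils.js` necessarily does.

```javascript
// Fields counted toward "completeness" (an assumption for this sketch).
const ID_FIELDS = ["lcId", "gettyId", "viaf", "wikidata", "dbpedia", "birth", "death"];

// Number of fields that carry a value rather than false/undefined.
function completeness(agent) {
  return ID_FIELDS.filter((f) => agent[f] !== false && agent[f] !== undefined).length;
}

// Merge a cluster into its most complete record; return it plus the ids to drop.
function mergeCluster(cluster) {
  const ranked = [...cluster].sort(
    (a, b) =>
      completeness(b) - completeness(a) || (b.useCount || 0) - (a.useCount || 0)
  );
  const [best, ...rest] = ranked;

  // Copy the arrays so the input records are not mutated.
  const merged = {
    ...best,
    viafAll: [...best.viafAll],
    nameNormalized: [...best.nameNormalized],
  };

  for (const other of rest) {
    for (const id of other.viafAll) {
      if (!merged.viafAll.includes(id)) merged.viafAll.push(id);
    }
    for (const name of other.nameNormalized) {
      if (!merged.nameNormalized.includes(name)) merged.nameNormalized.push(name);
    }
  }
  return { merged, deleted: rest.map((r) => r.registry) };
}

// The three "boston public library" records, trimmed to the fields used here.
const cluster = [
  { registry: "temp144978369397863931", lcId: "n79117036", gettyId: false,
    viaf: 212721938, viafAll: [212721938], useCount: 181,
    nameNormalized: ["boston public library"] },
  { registry: "temp14497839988712932841", lcId: false, gettyId: 500305150,
    viaf: 311426235, viafAll: [311426235], useCount: 33,
    nameNormalized: ["boston public library", "boston public library galatea collection",
      "prouty louise", "st james methodist episcopal church new york",
      "joan of arc collection boston public library"] },
  { registry: "temp1450731145902868491", lcId: false, gettyId: false,
    viaf: 316395379, viafAll: [316395379], useCount: 1,
    nameNormalized: ["boston public library"] },
];

const result = mergeCluster(cluster);
console.log(result.merged.viafAll, result.deleted);
```

Under these assumed rules the first record is kept, because it ties with the second on completeness and has the higher `useCount`, and the other two are reported as deleted.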

thisismattmiller commented 8 years ago

clusterByName | totalAgents: 4225784 totalDeleted: 11810