rootsdev / genscrape

JavaScript library that aids in scraping person data off of genealogy websites
MIT License

Expose the originating entity ID #50

Closed justincy closed 7 years ago

justincy commented 7 years ago

For example, the FamilySearch Family Tree person ID or the Find A Grave memorial number.

We will set the persons' IDs to be the ID we want to expose (instead of the auto-incrementing IDs we've been using).

Related to https://github.com/rootsdev/genscrape/issues/33

justincy commented 7 years ago

I'm having trouble deciding between using IDs and Identifiers. IDs are advantageous because there's not the expectation that they're a URI. FamilySearch data has IDs already. The IDs in the tree are what we want but the IDs in records are not useful at all. Should we add our own?

The other option is to add a http://gedcomx.org/Primary Identifier. But why replicate data and functionality of the ID attribute we already have?

For now, we'll generate more useful IDs instead of auto-incrementing IDs. And we'll leave FS record IDs as is until we think of a better alternative.

justincy commented 7 years ago

An even better idea is to do both: a simple ID format for the id attribute and a URI format for an Identifier.
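
A rough sketch of what "both" might look like on a person object, assuming the GedcomX JSON convention of an identifiers map keyed by identifier type with an array of values. The helper name withIds, the sample ID, and both example.org URIs are placeholders for illustration only; the thread has not settled on an identifier type or URI format at this point.

```javascript
// Hypothetical helper: returns a copy of the person with both a simple id
// and a URI-style Identifier. The identifier type and URI are passed in by
// the caller because neither has been decided yet.
function withIds(person, simpleId, identifierType, identifierUri) {
  return Object.assign({}, person, {
    id: simpleId,
    identifiers: { [identifierType]: [identifierUri] }
  });
}

// Placeholder values, not real IDs or a real identifier type.
const person = withIds(
  { names: [] },
  'EXAMPLE-ID',
  'http://example.org/PlaceholderType',
  'http://example.org/records/EXAMPLE-ID'
);

console.log(JSON.stringify(person, null, 2));
```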

justincy commented 7 years ago

Maybe we should create our own Identifier type http://genscrape.??/Primary since we're not entirely sure what the intended use of http://gedcomx.org/Primary is.

justincy commented 7 years ago

To make the identifiers valuable, they must be deterministic and comparable. Thus, for records available on multiple domains (e.g. Ancestry and findmypast, which each run multiple domains), we need to choose one domain as the standard. Also, for websites that use query params (Ancestry, Find A Grave, findmypast, etc.), we need to convert the URLs into a custom format without query params.
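
One way this normalization could work, as a sketch only: map alternate domains to a chosen standard, keep just the query params that identify the record, and emit them in a fixed order. The example.com hosts, the param names db and id, and the function name canonicalKey are all made up for illustration and do not reflect any real site's URL scheme.

```javascript
// Hypothetical mapping from alternate domains to the one chosen as standard.
const CANONICAL_HOSTS = {
  'www.example.co.uk': 'www.example.com',
  'www.example.com.au': 'www.example.com'
};

// Build a deterministic key from a URL:
//  1. map the host to its canonical form,
//  2. keep only the query params that identify the record,
//  3. emit them in the order given by idParams, joined by '/'.
function canonicalKey(url, idParams) {
  const parsed = new URL(url);
  const host = CANONICAL_HOSTS[parsed.hostname] || parsed.hostname;
  const parts = idParams.map(function (name) {
    const value = parsed.searchParams.get(name);
    if (value === null) {
      throw new Error('Missing identifying param: ' + name);
    }
    return value;
  });
  return host + '/' + parts.join('/');
}

// Different domain, different param order, extra tracking param:
// both URLs should produce the same key.
const keyA = canonicalKey('https://www.example.co.uk/record?db=1&id=42&ref=x', ['db', 'id']);
const keyB = canonicalKey('https://www.example.com/record?id=42&db=1', ['db', 'id']);
console.log(keyA === keyB);
```

Because the output no longer depends on which domain or param order the user happened to visit, two scrapes of the same record can be compared with a plain string equality check.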

justincy commented 7 years ago

The GedcomX spec says that http://gedcomx.org Identifiers "MUST resolve to the instance of Subject to which the identifier applies." We can't make that guarantee so we should probably use a custom URI syntax.

I propose genscrape://{gensiteId}/{recordId}.

What about websites like FamilySearch and Ancestry where historical records need to be differentiated from family tree profiles? Does it matter? Do we need to programmatically differentiate between them or just ensure uniqueness?

If we just need to ensure uniqueness then we can let each scraper prepend a token to the recordId where applicable. For example, the Ancestry tree scraper could prepend tree while the record scraper prepends record.
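
A minimal sketch of the uniqueness-only approach, assuming each scraper passes its own optional token. The function name genscrapeUri, the ':' separator between token and record ID, and the sample values are assumptions; the thread does not specify them.

```javascript
// Build genscrape://{gensiteId}/{recordId}. When a token such as 'tree' or
// 'record' is supplied, it is prepended to the record ID so that IDs from
// different scrapers on the same site cannot collide.
// Assumes gensiteId, token, and recordId are already URL-safe strings.
function genscrapeUri(gensiteId, recordId, token) {
  const fullId = token ? token + ':' + recordId : recordId;
  return 'genscrape://' + gensiteId + '/' + fullId;
}

// Placeholder site and record IDs for illustration.
console.log(genscrapeUri('ancestry', '123', 'tree'));
console.log(genscrapeUri('ancestry', '123', 'record'));
console.log(genscrapeUri('findagrave', '456'));
```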

If there's any programmatic reason for being able to differentiate then we would need something more standardized and perhaps add a new piece to the genscrape Identifier URI:

genscrape://{gensiteId}/{type}/{recordId}

But that's opening a can of worms by approaching a description of a site's collection hierarchy. Let's not go there.

Would we ever want to link genscrape output back to the scraper that generated it? At the moment we can theoretically use the about URL of the SourceDescription to see which scraper matches it, but we could use the Identifier to do something more direct.

justincy commented 7 years ago

I have a better idea. Instead of {gensiteId} we do {scraperId} where {scraperId} ::= {gensiteId}:{type}. Thus we have genscrape://{scraperId}/{recordId}.
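
To illustrate, here is a sketch of building and parsing identifiers in the genscrape://{scraperId}/{recordId} form. The function names and the sample site, type, and record values are hypothetical, and the parser assumes that neither gensiteId nor type contains ':' or '/'.

```javascript
// Build genscrape://{scraperId}/{recordId}, where scraperId is
// {gensiteId}:{type}.
function buildIdentifier(gensiteId, type, recordId) {
  return 'genscrape://' + gensiteId + ':' + type + '/' + recordId;
}

// Split an identifier back into its pieces. A regular expression is used
// instead of a generic URL parser, which may treat the text after the
// colon as a port number.
function parseIdentifier(uri) {
  const match = /^genscrape:\/\/([^:\/]+):([^\/]+)\/(.+)$/.exec(uri);
  if (!match) {
    throw new Error('Not a genscrape identifier: ' + uri);
  }
  return {
    gensiteId: match[1],
    type: match[2],
    scraperId: match[1] + ':' + match[2],
    recordId: match[3]
  };
}

// Placeholder values for illustration.
const uri = buildIdentifier('familysearch', 'tree', 'ABCD-123');
const parsed = parseIdentifier(uri);
console.log(uri);
console.log(parsed.scraperId);
```

Since the scraperId is a fixed part of the URI, output could be linked back to the scraper that produced it without matching on the SourceDescription's about URL.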