Closed justincy closed 7 years ago
I'm having trouble deciding between using IDs and Identifiers. IDs are advantageous because there's not the expectation that they're a URI. FamilySearch data has IDs already. The IDs in the tree are what we want but the IDs in records are not useful at all. Should we add our own?
The other option is to add a http://gedcomx.org/Primary Identifier. But why replicate data and functionality of the ID attribute we already have?
For now, we'll generate more useful IDs instead of auto-incrementing IDs. And we'll leave FS record IDs as is until we think of a better alternative.
An even better idea is to do both: a simple ID format for the id attribute and a URI format for an Identifier.
Maybe we should create our own Identifier type http://genscrape.??/Primary since we're not entirely sure what the intended use of http://gedcomx.org/Primary is.
To make the identifiers valuable, they must be deterministic and comparable. Thus for records available on multiple domains (e.g. Ancestry and findmypast, which each run multiple domains) we need to choose one domain as the standard. Also, for websites that use query params (Ancestry, Find A Grave, findmypast, etc.) we need to convert the URLs into a custom format without query params.
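A minimal sketch of that normalization in Python, under stated assumptions: the domains, the alias table `CANONICAL_SITE`, and the query parameter name `recordId` are all hypothetical placeholders, not values taken from any real site or from genscrape itself. It only illustrates collapsing several domains into one site ID and pulling the record ID out of the query string.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical alias table: every domain a site runs maps to one canonical site ID.
CANONICAL_SITE = {
    "records.example.com": "example",
    "records.example.co.uk": "example",
}

# Hypothetical name of the query parameter that carries the record ID.
RECORD_PARAM = "recordId"


def normalize(url):
    """Return (site_id, record_id) for a record URL.

    The result does not depend on which alias domain was used or on
    any other query parameters, so two URLs for the same record compare equal.
    """
    parts = urlsplit(url)
    site_id = CANONICAL_SITE[parts.hostname]  # hostname is already lowercased
    query = parse_qs(parts.query)
    record_id = query[RECORD_PARAM][0]
    return site_id, record_id
```

With this, a `.com` URL and a `.co.uk` URL for the same record, with parameters in any order, would both normalize to the same pair.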
The GedcomX spec says that http://gedcomx.org Identifiers "MUST resolve to the instance of Subject to which the identifier applies." We can't make that guarantee so we should probably use a custom URI syntax.
I propose genscrape://{gensiteId}/{recordId}.
What about websites like FamilySearch and Ancestry where historical records need to be differentiated from family tree profiles? Does it matter? Do we need to programmatically differentiate between them or just ensure uniqueness?
If we just need to ensure uniqueness then we can let each scraper prepend a token to the recordId where applicable. For example, the Ancestry tree scraper could prepend tree while the record scraper prepends record.
If there's any programmatic reason for being able to differentiate then we would need something more standardized and perhaps add a new piece to the genscrape Identifier URI:
genscrape://{gensiteId}/{type}/{recordId}
But that's opening a can of worms by approaching a description of a site's collection hierarchy. Let's not go there.
Would we ever want to link genscrape output back to the scraper that generated it? At the moment we can theoretically use the about URL of the SourceDescription to see which scraper matches it, but we could use the Identifier to do something more direct.
I have a better idea. Instead of {gensiteId} we do {scraperId} where {scraperId} ::= {gensiteId}:{type}. Thus we have genscrape://{scraperId}/{recordId}.
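As a rough sketch of the genscrape://{scraperId}/{recordId} format in Python: the function names are hypothetical, the example values `ancestry`, `tree`, and `record` are only illustrative, and percent-encoding the record ID is an assumption added here so that a record ID containing `/` cannot be confused with the path separator. None of this is confirmed as the genscrape implementation.

```python
from urllib.parse import quote, unquote

SCHEME_PREFIX = "genscrape://"


def make_scraper_id(gensite_id, type_):
    """Build {scraperId} ::= {gensiteId}:{type}."""
    return f"{gensite_id}:{type_}"


def make_identifier(scraper_id, record_id):
    """Build genscrape://{scraperId}/{recordId}.

    Assumption: the record ID is percent-encoded (including '/').
    """
    return f"{SCHEME_PREFIX}{scraper_id}/{quote(record_id, safe='')}"


def parse_identifier(identifier):
    """Split an identifier back into (gensite_id, type, record_id)."""
    if not identifier.startswith(SCHEME_PREFIX):
        raise ValueError(f"not a genscrape identifier: {identifier}")
    scraper_id, _, record_id = identifier[len(SCHEME_PREFIX):].partition("/")
    gensite_id, _, type_ = scraper_id.partition(":")
    return gensite_id, type_, unquote(record_id)
```

Because the scraper ID is the first segment, parsing an identifier gives a direct route back to the scraper that produced it, without matching the about URL of the SourceDescription.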
For example, FamilySearch Family Tree person ID or Find A Grave memorial number.
We will set the persons' IDs to be the ID we want to expose (instead of the auto-incrementing IDs we've been using).
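One way to picture the "do both" approach is the sketch below, which builds a person as a plain Python dict in the shape of GedcomX JSON, where `identifiers` maps an identifier type to a list of URIs. The helper name `make_person` is hypothetical, and the identifier type is passed in as a parameter because the thread leaves open whether to use http://gedcomx.org/Primary or a custom genscrape type.

```python
def make_person(record_id, identifier_uri, identifier_type):
    """Return a minimal GedcomX-style person carrying both kinds of ID.

    record_id       -- the simple ID exposed through the id attribute
    identifier_uri  -- the URI-form identifier
    identifier_type -- the Identifier type URI (undecided in this thread)
    """
    return {
        "id": record_id,
        "identifiers": {identifier_type: [identifier_uri]},
    }


# Illustrative values only.
person = make_person(
    "12345",
    "genscrape://ancestry:tree/12345",
    "http://gedcomx.org/Primary",
)
```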
Related to https://github.com/rootsdev/genscrape/issues/33