rootsdev / genscrape

JavaScript library that aids in scraping person data off of genealogy websites
MIT License

Schema #3

Closed justincy closed 9 years ago

justincy commented 9 years ago

Two choices:

  1. A basic schema such as what gen-search uses, or a version of it that allows multiple assertions (which roots-search will use in the future).
  2. A more complete schema that accounts for non-vital facts, sources, more relationships, etc.
justincy commented 9 years ago

Using a common schema will inherently lead to some data loss. Add an option to get the complete source-specific schema.

justincy commented 9 years ago

Another thought is to use the source schema by default and have an option for converting it into a shared schema.

justincy commented 9 years ago

Or maybe we devise a way to have the conversion in a separate lib/plugin so that users could choose which schema they wanted the data converted into. I prefer this method, but I don't like the idea of giving myself more work.

justincy commented 9 years ago

The other thing I like about this proposal is that it cleanly separates the job of scraping from that of conversion. Right now the two are munged together.

dovy commented 9 years ago

Amen, MVC.

justincy commented 9 years ago

Design options:

1. Conversion libs wrap genscrape

Conversion libs would wrap genscrape to catch the data events, run the data through converters, and then pass the data on.

First you would include both genscrape and the conversion lib.

<script src="genscrape.js"></script>
<script src="genscrape-gedcom-converter.js"></script>

Then instead of calling genscrape directly you would call the converter.

genscrapeGedcom();

You could set it up so that you just called genscrape(); as normal, but that would limit you to only one converter at a time.
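
To make the wrapping concrete, here is a rough sketch of what such a converter lib might do internally. It assumes genscrape's on('data') event interface shown later in this thread; genscrapeGedcom's internals and the convertToGedcom function are hypothetical.

// Rough sketch of a wrapping converter lib (hypothetical internals).
function genscrapeGedcom(){
  var emitter = genscrape();
  return {
    on: function(event, callback){
      if(event === 'data'){
        // Intercept the data event, convert, then pass the data on
        emitter.on('data', function(data){
          callback(convertToGedcom(data)); // convertToGedcom is made up
        });
      } else {
        emitter.on(event, callback);
      }
      return this;
    }
  };
}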

The biggest disadvantage of this option is that genscrape would not be forced to have conversion built into its design. There is value in designing genscrape with an API that makes conversion natural.

2. Allow for a plugin lib to be configured at runtime

genscrape(url, gedxConverter);

or

// register a converter
genscrape.converters('gensearch', gensearchConverter);

// tell it which converter to use
genscrape(url, 'gensearch');

This makes the API more explicit and requires less JS hackery and black magic than option 1).
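
A minimal sketch of how that registry might work internally; only the genscrape.converters(name, fn) registration call above comes from the proposal, the rest is an assumption.

// Hypothetical internals for the converter registry in option 2.
var registry = {};

genscrape.converters = function(name, converter){
  registry[name] = converter;
};

// Inside genscrape, just before emitting the data event,
// the named converter (if any) would be applied:
function applyConverter(data, name){
  var converter = registry[name];
  return converter ? converter(data) : data;
}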

You could consider a third option that has all the converters built in instead of requiring them to be registered, but if you allowed custom converters (which you should), then you would essentially be in the same situation as option 2.

justincy commented 9 years ago

3. A completely separate conversion library

When discussing option 1) I said there is an advantage in having conversion be a natural part of genscrape's API. It can also be a disadvantage, because it increases complexity.

genscrape is built to be asynchronous. Conversion is synchronous. Conversion doesn't have to be part of genscrape's event interface. We could build a separate lib that just takes in 3 parameters (data, source schema, destination schema) and returns the converted data. Then you just call it when receiving a data event from genscrape.

genscrape().on('data', function(data){
  convert(data, 'sourceSchema', 'destinationSchema');
});

It would be valuable for genscrape to tell you what the source schema was. We could change the data into a response object with other metadata in addition to the source data.

genscrape().on('data', function(res){
  convert(res.data, res.schema, 'destinationSchema');
});
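
One way to sketch that standalone lib is a table of converter functions keyed by source and destination schema. The convert(data, from, to) signature matches the examples above; the register function and key format are assumptions.

// Hypothetical standalone conversion lib.
var converters = {};

// Register a converter for a (source, destination) schema pair.
function register(from, to, fn){
  converters[from + '>' + to] = fn;
}

// Look up and run the converter, or fail loudly if none exists.
function convert(data, from, to){
  var fn = converters[from + '>' + to];
  if(!fn){
    throw new Error('No converter from ' + from + ' to ' + to);
  }
  return fn(data);
}
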
justincy commented 9 years ago

For option 3), there is the concern that supporting conversion from any schema to any other would be a nightmare. That's true if the number of schemas gets large enough: with n schemas, any-to-any conversion requires on the order of n² converters.

There are some schemas which wouldn't need to be supported as a destination because there's no value (in general) in converting data into that format. For example, FS Historical Records. The data is in a schema called SORD. There's no reason for anyone to convert Ancestry data, or data from any other website, into that format.

Also, not all websites will have a unique schema. Most websites which we might write a scraper for will have simple names, events, and relationships data. We won't need separate schemas for Find A Grave, BillionGraves, Fold3, and Archives.

robhoare commented 9 years ago

A while back I was working on something similar (but I've had no time for it since). I created a simple bookmarklet that the user clicks to save the current rendered page (not just the source) to a server API; somebody with better JS skills than me could save it to Dropbox/OneDrive/ownCloud directly from the page instead (no server needed).
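
For illustration, a bookmarklet along those lines might look like the sketch below; the endpoint URL is a placeholder and the payload shape is an assumption.

// Hypothetical bookmarklet: POST the rendered page to a capture API.
javascript:(function(){
  var xhr = new XMLHttpRequest();
  xhr.open('POST', 'https://example.com/api/capture'); // placeholder endpoint
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(JSON.stringify({
    url: location.href,
    // outerHTML captures the rendered DOM, not just the original source
    html: document.documentElement.outerHTML
  }));
})();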

So, I saved the whole page, to process on the server side (or, if it's in something like Dropbox, it could be processed on the client side later). This means that if the scraper goes wrong (as it will, as pages change), it can be re-run against the captured full copy of the data, possibly months later. I wrote some basic scrapers in PHP, but this process allows the scrapers to be written in anything, anywhere (and for multiple scraper providers to be available; you just point them at the stored data).

Extracting (scraping) from FamilySearch (records data) is trivial now that they embed JSON structures in the page (see the lines starting with var person and var record), and Ancestry has something similar in the correctionsMetadata variable. But many other sites will need constant maintenance of the scraping code.
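
As a rough sketch of that kind of extraction (the exact variable names and formatting on the live pages are assumptions based on the description above):

// Scan inline scripts for an embedded `var <name> = {...};` structure
// and parse it. Deliberately simple and brittle: a sketch, not production
// code, and it assumes the embedded structure is valid JSON.
function extractEmbeddedJson(varName){
  var scripts = document.getElementsByTagName('script');
  var pattern = new RegExp('var\\s+' + varName + '\\s*=\\s*(\\{[\\s\\S]*?\\});');
  for(var i = 0; i < scripts.length; i++){
    var match = scripts[i].textContent.match(pattern);
    if(match){
      return JSON.parse(match[1]);
    }
  }
  return null;
}

var person = extractEmbeddedJson('person'); // e.g. on a FamilySearch record page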

My usage is different from yours: I want to capture the whole source with provenance, and then interpret it to build relationships from, at multiple levels, all the way up to a tree. But it does have a large overlap with what you're doing (it just doesn't need to be real-time). Like this: [attached diagram not preserved]

Sorry this went on a bit long and is a bit off the target of the issue, but I think there's a lot in common between what we're trying to do. Just seen your update: no, I don't see any reason why you'd want to support converting to any target schema, if the common schema is flexible enough.

justincy commented 9 years ago

Thanks Robert! That gives me a lot to consider.

> allow the user to build similar relationships between different records, with justification notes, which builds up linked data that can be used for reports (such as, but not limited to, genealogical trees)

That's actually my long-term vision for RootsSearch.

justincy commented 9 years ago

I've decided to drop the idea of full/converted schemas for now. May be worth revisiting in the future.