rootsdev / genscrape

JavaScript library that aids in scraping person data off of genealogy websites
MIT License

Schema #3

Closed justincy closed 9 years ago

justincy commented 9 years ago

Two choices:

  1. A basic schema such as what gen-search uses, or a version of it that allows multiple assertions (which roots-search will use in the future).
  2. A more complete schema that accounts for non-vital facts, sources, more relationships, etc.
justincy commented 9 years ago

Using a common schema will inherently lead to some data loss. Add an option to get the complete source-specific schema.

justincy commented 9 years ago

Another thought is to use the source schema by default and have an option for converting it into a shared schema.

justincy commented 9 years ago

Or maybe we devise a way to have the conversion in a separate lib/plugin so that users could choose which schema they wanted the data converted into. I prefer this method, but I don't like the idea of giving myself more work.

justincy commented 9 years ago

The other thing I like about this proposal is that it cleanly separates the job of scraping from that of conversion. Right now the two are munged together.

dovy commented 9 years ago

Amen, MVC.

justincy commented 9 years ago

Design options:

1. Conversion libs wrap genscrape

Conversion libs would wrap genscrape to catch the data events, run the data through converters, and then pass the data on.

First you would include both genscrape and the conversion lib.

<script src="genscrape.js"></script>
<script src="genscrape-gedcom-converter.js"></script>

Then instead of calling genscrape directly you would call the converter.

genscrapeGedcom();

You could set it up so that you just called genscrape(); as normal, but that would limit you to only one converter at a time.
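
To make the wrapping concrete, here is a rough sketch of what such a converter lib might do internally. It assumes genscrape's on('data') event interface shown later in this thread; genscrapeGedcom's internals and the convertToGedcom function are hypothetical.

// Rough sketch of a wrapping converter lib (hypothetical internals).
function genscrapeGedcom(){
  var emitter = genscrape();
  return {
    on: function(event, callback){
      if(event === 'data'){
        // Intercept the data event, convert, then pass the data on
        emitter.on('data', function(data){
          callback(convertToGedcom(data)); // convertToGedcom is made up
        });
      } else {
        emitter.on(event, callback);
      }
      return this;
    }
  };
}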

The biggest disadvantage of this option is that genscrape would not be forced to have conversion built into its design. There is value in designing genscrape with an API that makes conversion natural.

2. Allow for a plugin lib to be configured at runtime

genscrape(url, gedxConverter);

or

// register a converter
genscrape.converters('gensearch', gensearchConverter);

// tell it which converter to use
genscrape(url, 'gensearch');

This makes the API more explicit and requires less JS hackery and black magic than option 1).
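
A minimal sketch of how that registry might work internally; only the genscrape.converters(name, fn) registration call above comes from the proposal, the rest is an assumption.

// Hypothetical internals for the converter registry in option 2.
var registry = {};

genscrape.converters = function(name, converter){
  registry[name] = converter;
};

// Inside genscrape, just before emitting the data event,
// the named converter (if any) would be applied:
function applyConverter(data, name){
  var converter = registry[name];
  return converter ? converter(data) : data;
}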

You could consider a third option that has all the converters built in instead of requiring them to be registered, but if you allowed custom converters (which you should), then you would essentially be in the same situation as option 2.

justincy commented 9 years ago

3. A completely separate conversion library

When discussing option 1) I said there is an advantage in having conversion be a natural part of genscrape's API. It can also be a disadvantage, because it increases complexity.

genscrape is built to be asynchronous. Conversion is synchronous. Conversion doesn't have to be part of genscrape's event interface. We could build a separate lib that just takes in 3 parameters (data, source schema, destination schema) and returns the converted data. Then you just call it when receiving a data event from genscrape.

genscrape().on('data', function(data){
  convert(data, 'sourceSchema', 'destinationSchema');
});

It would be valuable for genscrape to tell you what the source schema was. We could change the data into a response object with other metadata in addition to the source data.

genscrape().on('data', function(res){
  convert(res.data, res.schema, 'destinationSchema');
});
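
One way to sketch that standalone lib is a table of converter functions keyed by source and destination schema. The convert(data, from, to) signature matches the examples above; the register function and key format are assumptions.

// Hypothetical standalone conversion lib.
var converters = {};

// Register a converter for a (source, destination) schema pair.
function register(from, to, fn){
  converters[from + '>' + to] = fn;
}

// Look up and run the converter, or fail loudly if none exists.
function convert(data, from, to){
  var fn = converters[from + '>' + to];
  if(!fn){
    throw new Error('No converter from ' + from + ' to ' + to);
  }
  return fn(data);
}
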
justincy commented 9 years ago

For option 3), there is the concern that supporting conversion from any schema to any other would be a nightmare. That's true if the number of schemas gets large enough: with n schemas, any-to-any conversion requires on the order of n² converters.

There are some schemas which wouldn't need to be supported as a destination because there's no value (in general) in converting data into that format. For example, FS Historical Records. The data is in a schema called SORD. There's no reason for anyone to convert Ancestry data, or data from any other website, into that format.

Also, not all websites will have a unique schema. Most websites which we might write a scraper for will have simple names, events, and relationships data. We won't need separate schemas for Find A Grave, BillionGraves, Fold3, and Archives.

robhoare commented 9 years ago

A while back I was working on something similar (but I've had no time for it since). I created a simple bookmarklet that the user clicks to save the current rendered page (not just the source) to a server API; somebody with better JS skills than me could save it to Dropbox/OneDrive/ownCloud directly from the page instead (no server needed).
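
For illustration, a bookmarklet along those lines might look like the sketch below; the endpoint URL is a placeholder and the payload shape is an assumption.

// Hypothetical bookmarklet: POST the rendered page to a capture API.
javascript:(function(){
  var xhr = new XMLHttpRequest();
  xhr.open('POST', 'https://example.com/api/capture'); // placeholder endpoint
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(JSON.stringify({
    url: location.href,
    // outerHTML captures the rendered DOM, not just the original source
    html: document.documentElement.outerHTML
  }));
})();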

So, I saved the whole page, to process on the server side (or, if it's in something like Dropbox, it could be processed on the client side later). This means that if the scraper goes wrong (as it will, as pages change), it can be re-run against the captured full copy of the data, possibly months later. I wrote some basic scrapers in PHP, but this process allows the scrapers to be written in anything, anywhere (and for multiple scraper providers to be available; you just point them at the stored data).

Extracting (scraping) from FamilySearch (records data) is trivial now that they embed JSON structures in the page (see the lines starting with var person and var record), and Ancestry has something similar in the correctionsMetadata variable. But many other sites will need constant maintenance of the scraping code.
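
As a rough sketch of that kind of extraction (the exact variable names and formatting on the live pages are assumptions based on the description above):

// Scan inline scripts for an embedded `var <name> = {...};` structure
// and parse it. Deliberately simple and brittle: a sketch, not production
// code, and it assumes the embedded structure is valid JSON.
function extractEmbeddedJson(varName){
  var scripts = document.getElementsByTagName('script');
  var pattern = new RegExp('var\\s+' + varName + '\\s*=\\s*(\\{[\\s\\S]*?\\});');
  for(var i = 0; i < scripts.length; i++){
    var match = scripts[i].textContent.match(pattern);
    if(match){
      return JSON.parse(match[1]);
    }
  }
  return null;
}

var person = extractEmbeddedJson('person'); // e.g. on a FamilySearch record page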

My usage is different from yours: I want to capture the whole source with provenance, and then interpret it to build relationships from, at multiple levels, all the way up to a tree. But it does have a large overlap with what you're doing (it just doesn't need to be real-time). Like this: [attached diagram not preserved]

Sorry this went on a bit long and is a bit off the target of the issue, but I think there's a lot in common between what we're trying to do. Just seen your update: no, I don't see any reason why you'd want to support converting to any target schema, if the common schema is flexible enough.

justincy commented 9 years ago

Thanks Robert! That gives me a lot to consider.

> allow the user to build similar relationships between different records, with justification notes, which builds up linked data that can be used for reports (such as, but not limited to, genealogical trees)

That's actually my long-term vision for RootsSearch.

justincy commented 9 years ago

I've decided to drop the idea of full/converted schemas for now. May be worth revisiting in the future.