historical-data / schema

Microdata schema for historical data.
historical-data.org

Applicability of historical-data as a data transfer solution? #34

Open DallanQ opened 12 years ago

DallanQ commented 12 years ago

The particular use-case I have in mind for historical-data is when someone finds a record about an ancestor on a website somewhere, they click on a bookmarklet / browser extension to copy that information to their online tree. Ideally the website would have described the record using the historical-data specification, which would enable the browser extension to extract the data from the web page and use it to pre-fill form fields in a pop-up window. The user would modify the form fields as desired, specify on which website their tree was located, and which person in their tree to add the extracted information to, and click save. In this scenario ideally the user would be able to copy all of the information from the record onto their tree, not just the standard birth/christening/marriage/death/burial dates and places.
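The extraction step described above could be sketched roughly as follows. This is only an illustration in Python (a real bookmarklet or extension would run in the browser), and the sample markup, the itemtype URL, and the property names name, birthDate, and birthPlace are assumptions for the example, not taken from the historical-data spec. It only relies on the standard microdata attributes itemscope and itemprop.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ItempropExtractor(HTMLParser):
    """Collect the text found inside elements that carry an itemprop attribute.

    Simplifications: nested items are flattened into one dict, repeated
    properties are concatenated, and the markup is assumed to close every
    non-void tag explicitly.
    """

    def __init__(self):
        super().__init__()
        self._open = []    # one entry per open element: its itemprop name or None
        self._chunks = {}  # itemprop name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._open.append(dict(attrs).get("itemprop"))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Attribute the text to the innermost enclosing itemprop, if any.
        for name in reversed(self._open):
            if name:
                self._chunks.setdefault(name, []).append(data)
                break

    def fields(self):
        # Join fragments and collapse runs of whitespace.
        return {name: " ".join("".join(parts).split())
                for name, parts in self._chunks.items()}


def extract_fields(html):
    """Return a dict of itemprop name -> text, ready to pre-fill a form."""
    parser = ItempropExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields()


# Hypothetical record page; the itemtype URL and property names are made up.
SAMPLE_HTML = """
<div itemscope itemtype="http://example.org/Person">
  <span itemprop="name">John Smith</span>
  born <span itemprop="birthDate">1 Jan 1850</span>
  in <span itemprop="birthPlace">Leeds, England</span>
</div>
"""

form_defaults = extract_fields(SAMPLE_HTML)
```

The resulting dict would be used as the default values of the pop-up form, which the user could then edit before saving to their tree.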

Do others envision this as a possible use-case for historical-data? If so, do we need/want to come up with an agreed-upon lexicon for names of events/facts beyond the standard birth, marriage, christening, death, and burial? Say census, immigration, religion, occupation, etc?

stoicflame commented 12 years ago

I'm game.

davebarney commented 12 years ago

That's a great use-case and justification for a more comprehensive schema. In fact, the vision for our chrome extension is to include this very case (we've discussed an "export to Gedcom" option, as well as direct API integration with online services).

One of our themes for a simple and basic schema has been to improve search and we felt we captured the primary fields that would accomplish this. Our non-goal of becoming a complete data-transfer solution was mainly because we have observed the sometimes controversial and very long-winded debates around GedcomX and we didn't want to open a can of worms. We didn't want to replace GedcomX or really even spend significant time ironing out all the details of the fields that would bring little to no value for search. However, for this use-case, I see tremendous value in having a more complete and comprehensive schema.

How can we accomplish this 1) in a short period of time, and 2) in a manner that would be acceptable in the broader community? One of the time constraints we have is the effort to merge this with Schema.org, which is happening sometime in June. Once a part of Schema.org, changes and additions, while still possible, take longer as they go to a council with representatives from Google, Bing, Yahoo, Yandex, Ask, etc, etc. They are search and schema experts, but not domain experts. The best chance to make big changes is right now.

davebarney commented 12 years ago

Dallan, what about you putting a proposal together for all the key missing fields and we can go from there? While this group is certainly not a perfect community representation, we have people from FamilySearch, Ancestry, Geni, WeRelate, Google, and BYU. If this group can quickly agree, it seems reasonable that it will represent the broader community.

NatAtGeni commented 12 years ago

Is there some reason the existing Events can't be used for these additional events? That was the original intent.

davebarney commented 12 years ago

I think so, for most of these, but we should come up with standard naming.

Also, for things like "source", it's not quite an event.

NatAtGeni commented 12 years ago

I thought HistoricalRecord was for sources.

DallanQ commented 12 years ago

I'm thinking of a shared lexicon for event names, and an agreement that facts (e.g., religion, occupation) are to be represented by Events with the fact value stored in the "description" field. Data providers wouldn't be required to follow the naming convention, but if they did, then their data would transfer better. I don't know if it would need to be part of the official spec. Tomorrow I'll post a list of candidate event names that have been found in a significant number of GEDCOMs submitted to WeRelate. I don't think that should be the final list, but it could be a good place to start.
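As a rough illustration of the convention proposed above, a fact could be carried as an Event whose type comes from the shared lexicon and whose value sits in the description field. This is a minimal Python sketch, not the schema itself: the type and description names come from this thread, while the date and place fields and the helper fact_as_event are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    type: str                          # name from the shared lexicon, e.g. "Occupation"
    description: Optional[str] = None  # for facts, the fact value goes here
    date: Optional[str] = None         # assumed field, kept as free text
    place: Optional[str] = None        # assumed field, kept as free text


def fact_as_event(fact_name: str, value: str) -> Event:
    """Represent a fact (religion, occupation, ...) as an Event."""
    return Event(type=fact_name, description=value)


# A fact: the value lives in the description.
occupation = fact_as_event("Occupation", "Blacksmith")

# An ordinary event: no fact value, just when and where it happened.
census = Event(type="Census", date="1880", place="Ohio")
```

A consumer that recognizes the type name can map the event to the right field in the user's tree; one that does not can still show the type and description as free text.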

I think HistoricalRecord works for sources, if I understand correctly that the intent of HistoricalRecord is to include both record-specific information (e.g., who is listed in the record) as well as general bibliographic information (e.g., publication information about the book or record collection in which the record is found). If that understanding is correct, I think we're ok there. We could go into a lot more detail, but sources are a bit of a quagmire. There have been a number of efforts in this area; e.g., http://gencontent.wikispaces.com/, and it's probably not something that we could reach an agreement on anytime soon.

For context, the reason I started this thread is I'm working on another project in conjunction with some people at FamilySearch (open-source; should be sufficiently ready for people to play around with in a couple of months) where the goal is to provide a free service where an online tree (e.g., Geni, FamilySearch, Ancestry, WeRelate) can submit what is known about an ancestor and get recommendations for online collections to search to fill in what is not known. The user could then click on one of the recommendations to be taken to a filled-in search form for that collection (filled-in is the ideal, though sometimes the user will have to fill it in themselves). I'd like to use historical-data to close the loop: make it possible to write a browser plugin to copy the information from the record back into the user's tree. I think this loop: enter what I know, get recommendations, find possible records, copy them back into my tree, get more recommendations, will make it much easier for new genealogists to start playing the game.

DallanQ commented 12 years ago

Here are some candidates for a shared lexicon of event names beyond the standard birth, christening, marriage, death, and burial events. Following each candidate is the number of times it appeared in a sample of 7000 gedcoms. I'm not suggesting that we use this list as-is (some of the names overlap others and could probably be removed), but perhaps it would be a good starting point for discussion.

Residence 582247
Civil registration 263487
Occupation 247017
Census 218247
Ancestral file number 102223
Baptism 64077
Birth registration 58047
Address 37465
Cause of death 32284
Social security number 27074
Religion 25308
Death registration 24316
Divorce 18132
Immigration 15564
Arrival 13897
Color 13671
Military 13161
Education 11848
Reference number 11405
FamilySearch Id 8505
Marriage license 8261
Graduation 6443
Emigration 5781
Physical description 5162
Probate 5096
Property 4635
Obituary 4610
Will 4608
Stillborn 4107
Employment 3651
Number of children 3624
Departure 3474
Confirmation 3177
Illness 2704
Naturalization 2540
Medical 2413
Funeral 2301
Cremation 1914
Marriage contract 1666
Adoption 1487
Newspaper 1465
Nationality 1415
Engagement 1295
Land 1128
Race 908
House number 872
Marriage banns 736

stoicflame commented 12 years ago

Here's the enumerated list we're using for GEDCOM X:

https://github.com/FamilySearch/gedcomx/blob/master/gedcomx-common/src/main/java/org/gedcomx/types/FactType.java

This list is kind of a big conglomeration of the controlled vocabulary we're using at FamilySearch. This is the Java code, but in serialized form, the types are identified by URI, e.g. http://gedcomx.org/Adoption. Using URIs conforms to the way that RDF specifies controlled vocabularies, and it works really well because it's highly extensible and, if done right, it's self-documenting.
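To make the URI idea concrete, here is a small Python sketch of mapping a type name to its URI and back. Only the http://gedcomx.org/ namespace and the Adoption example come from the comment above; the function names and the example.org namespace are hypothetical.

```python
from typing import Optional

GEDCOMX_NS = "http://gedcomx.org/"


def type_to_uri(name: str, namespace: str = GEDCOMX_NS) -> str:
    """Build the URI that identifies a type name within a namespace."""
    return namespace + name


def uri_to_type(uri: str, namespace: str = GEDCOMX_NS) -> Optional[str]:
    """Return the local type name, or None if the URI is from another namespace."""
    if uri.startswith(namespace) and len(uri) > len(namespace):
        return uri[len(namespace):]
    return None


adoption_uri = type_to_uri("Adoption")

# Extensibility: anyone can mint a type in a namespace they control
# (hypothetical namespace), without colliding with the shared vocabulary.
custom_uri = type_to_uri("BarnRaising", namespace="http://example.org/events/")
```

A consumer that only knows the shared vocabulary gets None back from uri_to_type for the custom type and can fall back to treating it as free text.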

I'd be thrilled if schema.org adopted the GEDCOM X controlled vocabulary for historical event types. Is that an unreasonable suggestion? To me, it seems reasonable to leverage existing standards efforts in an effort to reduce duplication of work and to consolidate resources.

RobertGardner commented 12 years ago

The good news is that schema.org is adopting a mechanism where the type name lexicon is controlled outside of schema.org. Here's the proposed spec: http://www.w3.org/wiki/WebSchemas/ExternalEnumerations

So I think we can do Dallan's proposal for a shared lexicon for event names and, even better, we don't need to finalize it for the schema.org inclusion.

Do people agree this meets the immediate requirement? Perhaps we can start a separate effort to define standard names for HistoricalRecord.type and Event.type?

DallanQ commented 12 years ago

I'm happy to go with @stoicflame's list from FactType.java. I'd like to suggest some modifications to it.

I'm thinking we would want to add the following fact/event names:

IDs

I think we need to figure out a standard way to represent IDs: something to say that a person is known by id X at source Y. I think we'll see more of these as time goes on. For example:

What would you think about a general "Reference number" event name, with the id source and value in the description field; e.g., AncestralFileNumber:xxxx, FamilySearchId:xxxx, GedcomUUID:xxxx, or UniversalId:xxxx ?
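A consumer of such a "Reference number" event would need to split the description back into the id source and the value. A minimal Python sketch, assuming the source:value layout suggested above; the function name, the Reference type, and the sample id value are made up for illustration.

```python
from typing import NamedTuple


class Reference(NamedTuple):
    source: str  # who issued the id, e.g. "FamilySearchId"
    value: str   # the id itself


def parse_reference(description: str) -> Reference:
    """Split 'Source:value' on the first colon; the value may contain colons."""
    source, sep, value = description.partition(":")
    source, value = source.strip(), value.strip()
    if not sep or not source or not value:
        raise ValueError(f"not in 'source:value' form: {description!r}")
    return Reference(source, value)


# Hypothetical id value, for illustration only.
ref = parse_reference("FamilySearchId:ABCD-123")
```

Splitting on the first colon only means an id that is itself a URI or contains colons survives intact.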

If you agree, then you could add ReferenceNumber to FactType.java, and remove GedcomUUID and UniversalId from FactType.java, since they're both specific types of IDs.

BTW, what is "UniversalId" in FactType.java, and why is it currently listed under "Facts generally applicable within the scope of a couple"? I've never heard of it before.

Potential renames

Finally, I think some of the fact/event names in FactType.java might be more commonly known under different names. We may want to consider renaming them. I don't feel strongly about these (and I think I disagree with the last example), but I thought I would bring them up:

DallanQ commented 12 years ago

I'd be happy to be involved in a separate effort to define names for Event.type and HistoricalRecord.type. @stoicflame I assume you're interested as well. Anyone else?

DallanQ commented 12 years ago

A couple of events in FactType.java seem specific to FamilySearch. Do we want to have them be part of the general list of event names?

Also, the only term in the list that I wasn't already familiar with is "Flourish". I had to look it up here:

http://blog.dearmyrtle.com/2011/09/genealogy-gems-podcast-episode-117.html

According to this web page, flourish is a pretty uncommon term. Do we want to include it?

stoicflame commented 12 years ago

Wow, @DallanQ, your comments are fantastic. I really appreciate that scrutiny; we haven't had the time to really scrub this list yet; it has just been a dump.

I'll open up an issue over at the GEDCOM X project so we can track the work you're suggesting.

benwbrum commented 12 years ago

I'd be interested in helping out with HistoricalRecord.type. I'm working on a database of parish registers of interest to genealogists and we'd like to add microformats to our information display as much as possible. I am quite new to microformats, however.

benwbrum commented 12 years ago

We've been discussing this a bit over on issue #35. Looking over FactType.java, I find myself wondering what the difference is between "AdultChristening", "Baptism", and "Christening", how a person transcribing individual records from a register recording "Baptisms" would decide which to assign, and how a researcher viewing a record would know whether it was an appropriate proxy for birth date.

DallanQ commented 12 years ago

I think Baptism v Christening simply depends upon what the particular religion calls the infant naming ordinance. In my opinion, when adult christenings are intermingled with christenings in a parish register, I'd label all of them as christenings, since there's no way for the transcriber to distinguish. The genealogist could then decide to record the event in their family tree as AdultChristening if they had other information to tell them that the christening record was for a non-infant.

In other words, I view AdultChristening as a sub-type of Christening, and Baptism as a synonym for Christening. We probably ought to maintain both Christening and Baptism as separate event names because that's what different religions call them and genealogists seem to be particular about recording the event with the name it was labeled in the record.

benwbrum commented 12 years ago

Time for a Venn diagram! So you'd say that Christening and Baptism are synonymous, and that both might be candidates for interpretation as "near-birth-event", but AdultChristening would not be a candidate for such interpretation.
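In lieu of a Venn diagram, the relationships proposed in the last few comments can be written down as a tiny lookup. This is a Python sketch of that reading only (Baptism as a synonym of Christening, AdultChristening as a sub-type, and only the plain christening treated as a possible proxy for birth); the table and function names are hypothetical and nothing here is part of FactType.java.

```python
# Synonym -> canonical name.
SYNONYMS = {"Baptism": "Christening"}

# Sub-type -> parent type.
PARENT = {"AdultChristening": "Christening"}

# Canonical names that may stand in for a birth date when none is known.
NEAR_BIRTH = {"Christening"}


def canonical(name):
    """Resolve a synonym to its canonical event name."""
    return SYNONYMS.get(name, name)


def is_a(name, ancestor):
    """True if name equals ancestor, is a synonym of it, or is a sub-type of it."""
    target = canonical(ancestor)
    current = canonical(name)
    while current is not None:
        if current == target:
            return True
        parent = PARENT.get(current)
        current = canonical(parent) if parent is not None else None
    return False


def near_birth_candidate(name):
    """Sub-types such as AdultChristening deliberately do not inherit this."""
    return canonical(name) in NEAR_BIRTH
```

With these tables a search for Christening events would also match Baptism and AdultChristening records, while only Christening and Baptism would be offered as an estimate of the birth date.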

It's been a dozen years since I did serious genealogy, as I got distracted by family history. How much difficulty is introduced by the differences in baptismal theology between 1) the people devising the original record formats, 2) the people being baptized (and those doing the baptizing), 3) those transcribing records and designing the record databases, and 4) the researchers themselves?

DallanQ commented 12 years ago

I don't know. I subscribe to the idea that "less is generally more" in genealogy data models -- that if we make the model too complex, people either won't use it or will misunderstand and misuse it. I think we should do what we can to name events so that searches will be better, but in the end we are as you say at the mercy of the record originators, transcribers, researchers, etc.

stoicflame commented 12 years ago

@DallanQ I've opened up the following issues that we can use to track the work to apply the suggestions you made above:

We'll use those threads to discuss and apply your suggestions and continue the discussion about how to apply them to the microformat here.

stoicflame commented 12 years ago

BTW, what is "UniversalId" in FactType.java, and why is it currently listed under "Facts generally applicable within the scope of a couple"? I've never heard of it before.

I honestly don't know. You'd think I would, huh? :-)

Like I said, the list of those fact types was compiled by digging through all the types we knew about at FamilySearch and dumping them. There wasn't much rigor applied.

I'll see if I can find out more about the UniversalId...

EssyGreen commented 12 years ago

@DallanQ - I'm coming in a bit late to this discussion so forgive me being a dimwit here but don't we already have a whole load of fact types from GEDCOM 5.5 which can effectively be used for this purpose (with tweaks as/where necessary)? I wasn't aware that GEDCOM X was about to throw these out so er .. why the need for a separate issue?

I'd also just like to reiterate the need for a Role in this event ... finding say a baptism record with Mother/Father/Child, the user would prolly like to specify how each of the people relate to their tree so this would need to be added as a Role during the drag 'n' drop process.

Also, I think it's worth considering that historical data does not necessarily correlate to genealogical data. For example, a Census is historical data which provides genealogical information such as the Residence, Occupation, Birth, Marital Status, etc. of the people recorded; similarly a military/prison/hospital (historical) record may contain various bits of information about a person's life. Hence, as a user of your system I would want to be able to relate/link a historical record to many "genealogical" records potentially for many different people.

DallanQ commented 12 years ago

@EssyGreen yes, gedcom 5.5 defines a number of fact types, and additional fact types not defined in gedcom 5.5 have been found in gedcoms because the gedcom 5.5 list was not very comprehensive. I assume that the list provided by @stoicflame is the list of fact types proposed for GedcomX. The proposal here is to use this list also for Historical-Data. I think this is a great idea. I reviewed the list and suggested that a few additional fact types be added or renamed based upon fact types that I've seen while analyzing about 7000 gedcoms. That's why @stoicflame opened the issues back on GedcomX.

I believe that roles for the people mentioned in a historical record can generally be handled either by (a) having the HistoricalRecord object link to a Person object, and having the Person object contain parent/child/spouse/sibling/relatedTo properties that point to additional Person objects mentioned in the record, or (b) having the HistoricalRecord object link to a Family object, with parent/child properties that point to the Person objects mentioned in the record. Whether a separate "role" property is needed in addition to these properties would be a separate issue from this one.
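Option (a) can be sketched as follows, to show how roles fall out of the relationship properties without a separate "role" field. This is a Python illustration under stated assumptions: the parent/child/spouse property names follow the comment above, while the about link from the record to its principal person, the helper function, and the sample names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Person:
    name: str
    parent: List["Person"] = field(default_factory=list)
    child: List["Person"] = field(default_factory=list)
    spouse: List["Person"] = field(default_factory=list)


@dataclass
class HistoricalRecord:
    type: str
    about: Person  # hypothetical name for the link to the principal person


def implied_roles(record: HistoricalRecord) -> List[Tuple[str, str]]:
    """List (role, name) pairs implied by the principal's relationship properties."""
    principal = record.about
    roles = [("principal", principal.name)]
    for role in ("parent", "child", "spouse"):
        for person in getattr(principal, role):
            roles.append((role, person.name))
    return roles


# A baptism record naming the child and both parents (made-up names).
child = Person("Ann Smith", parent=[Person("John Smith"), Person("Mary Smith")])
baptism = HistoricalRecord(type="Baptism", about=child)
roles = implied_roles(baptism)
```

A browser extension could show this list and let the user match each person to someone in their tree. Whether that is enough, or a dedicated role property is still needed, is the separate question noted above.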

EssyGreen commented 12 years ago

I'm confused ... what do you mean by a "Historical Record"? Isn't this just a Source? If so, then isn't this already covered by the Record Model? If not, then can you clarify what you mean/give some examples and explain the context in which they would be used?

DallanQ commented 12 years ago

Historical-data is a project to standardize a microdata schema for historical/genealogical information appearing in webpages. This issue that you're commenting on is part of that project. The Historical-data project is related to, but separate from, GedcomX. You can find the list of schemas being proposed by Historical-data here. HistoricalRecord is one of those schemas.

EssyGreen commented 12 years ago

I'm sorry but I've looked at the schema and I still don't understand how this is different from a normal Source except that it is obviously your particular implementation with your own specific tailored fields/attributes etc. I'm not knocking it but I just can't tell the difference between that and a genealogical Source document.