historical-data / schema

Microdata schema for historical data.
historical-data.org

Changes for merging historical-data.org into schema.org #32

Closed RobertGardner closed 12 years ago

RobertGardner commented 12 years ago

Here are the changes we are proposing for merging historical-data.org into schema.org. The changes were the following:

First commit:

Second commit:

There are a few outstanding questions in here that should be resolved before finalizing.

danbri commented 12 years ago

What kinds of value does historicalCollection take?

RobertGardner commented 12 years ago

We've been thinking of publishing (on historical-data.org) a list of suggested collection names, but for now it's left up to the vendor. We were thinking "1940 U.S. Census" as an example. FamilySearch.org, for example, organizes their data into a series of collections depending on where they got the data. If we can get the industry to standardize on some collection names, like the census data, then we could also do cross-site correlations.

RobertGardner commented 12 years ago

The properties 'contributor', 'modifiedDate', 'sources', and 'historicalCollection' all refer to information about the publisher or provenance of the record rather than the record itself. I propose we combine those four into a new Type, 'SourceInfo', and add this to each of the types instead of 4 separate ones.

More concretely, we'd add Type SourceInfo -> Thing: contributor, historicalCollection, modifiedDate, sources; remove these from Person, Event, etc., and add a SourceInfo property in their place.
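To make the proposal above concrete, here is a minimal sketch of the shape it describes, written as Python dataclasses purely for illustration. The property names come from this thread; the `sourceInfo` attribute name, the use of plain strings for values, and the example person are assumptions, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceInfo:
    """Hypothetical grouping of the four provenance properties."""
    contributor: Optional[str] = None
    historicalCollection: Optional[str] = None
    modifiedDate: Optional[str] = None  # kept as a plain string in this sketch
    sources: List[str] = field(default_factory=list)


@dataclass
class Person:
    name: str
    # One property standing in for the four separate provenance properties.
    # The attribute name "sourceInfo" is an assumption for this sketch.
    sourceInfo: Optional[SourceInfo] = None


person = Person(
    name="Example Person",
    sourceInfo=SourceInfo(historicalCollection="1940 U.S. Census"),
)
```

The same `SourceInfo` value could then be attached to Event and the other types without repeating four properties on each.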

ninjudd commented 12 years ago

Are we going to have both spouse and spouses on Person?

RobertGardner commented 12 years ago

That's a good catch. All properties in schema.org can have more than one value, so there's no benefit to having both. I vote to keep spouse and eliminate the newly-proposed spouses. Those that want to represent multiple spouses can just do so -- it might be difficult for them to discover how to do it, but the mechanism is there.

DallanQ commented 12 years ago

Looking at Person on schema.org, it appears that siblings and parents are both deprecated in favor of sibling and parent. Perhaps we should drop spouses and just use the existing spouse field on Person?

danbri commented 12 years ago

Yes, that makes sense.

There was a recent fix/clarification we made btw. As originally released, Schema.org had a notion that English language plurality on property names was a good way of indicating cardinality. So 'spouse' on that reading somewhat implied that a person could only be assigned one spouse. This was wrong on so many levels, and we've abandoned that model now. But you can see the consequences in that we now have a few cases, eg. see 'actors' vs 'actor' in http://schema.org/Movie where the legacy form is kept in the schema documentation.

See the properties from Movie:

actor (Person): A cast member of the movie, TV series, season, or episode, or video.
actors (Person): A cast member of the movie, TV series, season, or episode, or video. (legacy spelling; see singular form, actor)

From now on, we just spell properties in a singular way, and repeat them if needed.
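As a rough illustration of "singular name, repeated if needed", the sketch below groups repeated property names into lists, which is roughly how a consumer might read such data. The `group_properties` helper and the sample values are hypothetical and not taken from any schema.org tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_properties(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect every value seen for a property name, in order of appearance."""
    grouped = defaultdict(list)
    for name, value in pairs:
        grouped[name].append(value)
    return dict(grouped)


# The singular 'actor' property is simply repeated once per cast member.
movie = group_properties([
    ("name", "Example Movie"),
    ("actor", "First Actor"),
    ("actor", "Second Actor"),
])
```

Every property ends up as a list, so a single-valued property is just a list of length one and no plural name is needed.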

danbri commented 12 years ago

DallanQ - we were typing at same time, it seems.

Yes, I think Person's 'spouse' field should express what you need now.

RobertGardner commented 12 years ago

I've removed spouses. What about events and marriages? Should those be 'event' and 'marriage'?

danbri commented 12 years ago

if the property is repeated, for each individual event and marriage, then yes, singular form. c.f. http://www.w3.org/wiki/WebSchemas/Singularity

RobertGardner commented 12 years ago

I fixed all the plural/singular issues, using singular property names everywhere but putting "(s)" in the description to indicate that multiple values are allowed/expected.

DallanQ commented 12 years ago

A few thoughts after reviewing the changes this morning (sorry for the length):

(1) You may want to add a generic "event" field on Family like you have on Person, for use in tracking additional events like Marriage Banns

(2) Since ISO 8601 doesn't support approximate dates, do you want to add "startDateModifier" and "endDateModifier" fields to Event, where people could specify modifiers like about, before, after, etc.? (I'm not sure if this is worth it.)

(3) On EllisIsland.org, the immigration record contains the following fields: ethnicity, last place of residence, date of arrival, age at arrival, marital status, ship of travel, port of departure, and manifest line number. I'm trying to figure out how these would be represented. Would they all be separate Event's under a Person's generic "event" field, with the event name being the fact name (e.g., ethnicity), the event description being the fact value (e.g., Scottish), and if there is a date (e.g., date of arrival) or place (e.g., last place of residence) associated with the fact then it would be stored in the event startDate or location field respectively?

(4) If a person has multiple marriages, say 1890 to Jane and 1910 to Jill, I can see how to represent the marriages and spouses on the Person object, but I can't figure out how to tie the marriage event and spouse together -- that is, how to specify that the 1890 marriage event was to Jane, and the 1910 marriage event was to Jill. One way to tie them together would be to add a "family" field to Person that would reference a Family object. I'm not sure if this is worth it. It would make the model more complex, since now spouses and children could be found either attached directly to Person or attached to a Family attached to Person.

(5) I'm trying to understand how SourceInfo.historicalCollection and SourceInfo.source should be used. Continuing with the EllisIsland.org example, it appears that the SourceInfo object is used to store metadata about the item: the name of the item (Passenger Record), a link to the original manifest image, etc. In addition, we might want information about the collection: collection name, copyright owner, publisher, etc. You're capturing collection name with historicalCollection, but instead of just capturing the collection name, what would you think about having historicalCollection reference a CreativeWork so the other information could also be captured?

(6) I'm not sure the field name SourceInfo.historicalCollection is sufficiently general: if the data came from an online book, SourceInfo might contain the page number where the information is located, and SourceInfo.historicalCollection might reference the book. A book can be thought of as a collection (of pages), but you might prefer to use a more-general field name like SourceInfo.source. SourceInfo.source doesn't work because you're currently using SourceInfo.source to reference the historical record. You might want to rename source to historicalRecord and historicalCollection to source, but I'm not sure it is worth it at this point.

RobertGardner commented 12 years ago

Just a quick question on dates. Others to be looked at later.

ISO 8601 doesn't support approximate dates, but the Library of Congress's Extended Date and Time Format, which extends ISO 8601, does allow uncertain and approximate dates. We were hoping schema.org would adopt that extension or at least support dates in that format. Any plans?

danbri commented 12 years ago

Take a look at the Time and DateTime requests in http://www.w3.org/wiki/WebSchemas/GoodRelations (another proposal; to add more ecommerce stuff).

Also http://www.w3.org/wiki/WebSchemas/EventSchemaUpdate http://lists.w3.org/Archives/Public/public-vocabs/2012May/0056.html for discussion of improved event descriptions.

I think adding approx dates would be great for a lot of cultural heritage, bibliographic etc use cases too. Just a question of coordination with related updates and getting the details right...

RobertGardner commented 12 years ago

We would of course propose simply adopting the Library of Congress extended date/time format, specified here: http://www.loc.gov/standards/datetime/pre-submission.html It's compatible with ISO 8601 in the sense that ISO 8601 formats are accepted. It extends ISO 8601 by adding support for approximate and uncertain dates: 2012-05-12~ is "approximately May 12, 2012" and 2012-05-12? means "May 12, 2012, but that's uncertain". And so on.
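For anyone wanting to experiment with these qualifiers, here is a small sketch of reading them. It assumes the reading of the Library of Congress draft in which a trailing `?` marks an uncertain date, a trailing `~` marks an approximate one, and `?~` marks both. It handles only plain year, year-month, and year-month-day strings; the `parse_qualified_date` name is made up for this example, and the rest of the format is ignored.

```python
import re
from typing import NamedTuple

# Year, optional month, optional day, then an optional qualifier suffix.
_DATE = re.compile(r"^(\d{4}(?:-\d{2}(?:-\d{2})?)?)(\?~|\?|~)?$")


class QualifiedDate(NamedTuple):
    date: str          # the ISO 8601 part, left unparsed
    uncertain: bool    # True when the suffix contains '?'
    approximate: bool  # True when the suffix contains '~'


def parse_qualified_date(text: str) -> QualifiedDate:
    """Split a date string into its ISO 8601 part and its qualifier flags."""
    match = _DATE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported date string: {text!r}")
    date, suffix = match.group(1), match.group(2) or ""
    return QualifiedDate(date, "?" in suffix, "~" in suffix)
```

A plain ISO 8601 date passes through with both flags false, which matches the point that existing dates remain valid under the extension.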

stoicflame commented 12 years ago

Hi all.

Sorry for the late reply. I know my comment is last-minute. My thanks to @RobertGardner for putting this together and for everybody's comments.

I'd like to remove the SourceInfo type and the historicalCollection property because I think they're superfluous and add unneeded complexity. Is there any reason why HistoricalRecord can't describe a collection? So to find out the title of the collection in which a Person or Family is found, just look for the title of the source of that Person or Family.

The properties 'contributor', 'modifiedDate', 'sources', and 'historicalCollection' all refer to information about the publisher or provenance of the record rather than the record itself.

Is that true? I always thought contributor and modifiedDate on a Person applied to the data on that Person and not to that Person's source...

DallanQ commented 12 years ago

@stoicflame 's comment makes me think that I don't understand how HistoricalRecord is intended to be used. So my comments below may not be valid.

I distinguish what I call records or items -- things that contain factual information about a single individual or a small handful of individuals -- from sources or collections -- things that contain bibliographic information about a set of records/items. Sources/collections are CreativeWork's in Schema.org parlance. HistoricalRecord seems to have properties of both -- a HistoricalRecord contains information about both the record/item as well as the source/collection in which it is found. If that is the intent -- to combine information about the record/item and source/collection into a single entity -- then perhaps Person ought to link directly to HistoricalRecord, instead of linking to a SourceInfo which contains links to HistoricalRecord's? (I'm clearly not understanding the role of SourceInfo vs HistoricalRecord.)

danbri commented 12 years ago

Can I suggest we deal with this via examples?

An example in which a single person and/or family was described by multiple sources might help clarify the issues? Or in general in which there was a network of several related entities...

RobertGardner commented 12 years ago

Here's an example of how I understand SourceInfo would be used.

Suppose we have a record of Robert Gardner, Jr, born in 1820, as indicated by a birth certificate from Microfilm Repository X. Suppose Frank Wright extracted this information on June 3, 2010.

The Person item would contain Robert's name, birthdate, place, etc., as expected. The SourceInfo associated with this Person item would have

SourceInfo {
  contributor: Frank Wright
  dateModified: June 3, 2010
  historicalCollection: Microfilm Repository X
  source: HistoricalRecord describing the birth certificate
}

Note that there is no good place to put "Frank Wright" or "June 3, 2010" or "Microfilm Repository X" in the HistoricalRecord. We could, of course, add them in there and then eliminate SourceInfo entirely in favor of simply 'source'.

stoicflame commented 12 years ago

Note that there is no good place to put "Frank Wright" or "June 3, 2010" or "Microfilm Repository X" in the HistoricalRecord.

Wait, I'm seeing that HistoricalRecord has contributor and dateModified and name inherited from CreativeWork. Wouldn't you put them there?

We could, of course, add them in there and then eliminate SourceInfo entirely in favor of simply 'source'.

Yeah, that's kind of what I was hoping for.

RobertGardner commented 12 years ago

Addressing @DallanQ from yesterday:

1) Generic 'event' on Family: good idea. Done.
2) Approx Dates: I think schema.org is going to address this separately.
3) Immigration Record: This schema wasn't really designed as a complete data transfer solution. It focuses on what would be of most interest to a search engine or simple plug-in application. It would be nice to have an extension mechanism for all this data, but other than simply adding undocumented properties, we don't have anything.
4) Tying spouse & marriage: Interesting problem. Should we add 'person' to HistoricalEvent? Right now it has attendee and performer, but those don't seem right. 'person' seems a bit odd, though. I suppose another solution would be to require the spouses and marriages to be specified in exactly the same order.
5) SourceInfo/HistoricalRecord: see above discussion
6) same
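The "same order" idea for tying spouses to marriages could be sketched roughly as below. This is only an illustration of positional pairing, using the Jane/Jill example from earlier in the thread; the `pair_by_order` helper is hypothetical, and it assumes a consumer actually preserves the order in which repeated properties appear, which nothing in this discussion guarantees.

```python
from typing import List, Tuple


def pair_by_order(spouses: List[str], marriages: List[str]) -> List[Tuple[str, str]]:
    """Pair the i-th spouse with the i-th marriage, relying purely on position."""
    if len(spouses) != len(marriages):
        # Positional pairing is ambiguous once the two lists differ in length.
        raise ValueError("spouse and marriage lists must have the same length")
    return list(zip(spouses, marriages))


pairs = pair_by_order(["Jane", "Jill"], ["1890", "1910"])
```

The length check shows the weak point of this approach: a spouse with no recorded marriage event, or the reverse, leaves the pairing undefined, which an explicit link on the event would avoid.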

RobertGardner commented 12 years ago

Note that there is no good place to put "Frank Wright" or "June 3, 2010" or "Microfilm Repository X" in the HistoricalRecord.

Wait, I'm seeing that HistoricalRecord has contributor and dateModified and name inherited from CreativeWork. Wouldn't you put them there?

Those are there, but according to their descriptions: 'contributor' in CreativeWork says "Secondary contributor to the CreativeWork." That's not Frank Wright. 'dateModified' says "the date on which the CreativeWork was most recently modified." That would be sometime in 1820, not June 3, 2010.

stoicflame commented 12 years ago

Those are there, but according to their descriptions: 'contributor' in CreativeWork says "Secondary contributor to the CreativeWork." That's not Frank Wright. 'dateModified' says "the date on which the CreativeWork was most recently modified." That would be sometime in 1820, not June 3, 2010.

Fair enough.

But then I see a bigger problem because there's an inconsistency in the way that we use the contributor and dateModified properties. On a Person, it means the person who is responsible for contributing the information about the person. On HistoricalRecord it carries a different meaning as you've described above.

Couldn't we just decide that in the context of a HistoricalRecord, contributor and dateModified do not mean what they mean on CreativeWork?

If not, we might need to consider renaming contributor and dateModified on the other types and then adding those properties to HistoricalRecord, no?

RobertGardner commented 12 years ago

Note that there is no good place to put "Frank Wright" or "June 3, 2010" or "Microfilm Repository X" in the HistoricalRecord. We could, of course, add them in there and then eliminate SourceInfo entirely in favor of simply 'source'.

I've been playing around with this and I like it but there are some things to iron out. One thing we're worried about at Google is making sure we can clearly distinguish between a Person from a genealogy site and a Person from a social site. If we have a rule that is "If the Person contains a HistoricalRecord in its sources then it's genealogical," then I think we can meet that goal. So that's good.
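The rule above could be sketched as a small predicate. The dictionary layout here (a `type` key and a `source` list on each item) is an assumption made for the example, not a format defined by schema.org or by this repository.

```python
from typing import Any, Dict


def is_genealogical(person: Dict[str, Any]) -> bool:
    """True when at least one of the person's sources is a HistoricalRecord."""
    sources = person.get("source", [])
    return any(
        isinstance(item, dict) and item.get("type") == "HistoricalRecord"
        for item in sources
    )


# Hypothetical parsed items: one from a genealogy site, one from a social site.
genealogy_person = {
    "type": "Person",
    "name": "Robert Gardner, Jr",
    "source": [{"type": "HistoricalRecord", "name": "Birth certificate"}],
}
social_person = {"type": "Person", "name": "Example User"}
```

A Person with no sources at all falls through as non-genealogical, so the rule only works if genealogy sites reliably mark up at least one HistoricalRecord.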

However, it's a bit difficult figuring out how to handle 'dateModified' and 'contributor', where 'dateModified' is the date the Person itself would have been modified and 'contributor' is the person who contributed the record (say, found it in the microfilm). One answer is to say we just don't need these in schema.org. Another is to find good names for them. And another is to put 'dateModified' back into Person.

I'm not a domain expert, but I'm leaning toward "just don't put them in the schema" since we are not trying to create a data exchange standard but a rich subset that improves search and plug-in tools.

RobertGardner commented 12 years ago

Couldn't we just decide that in the context of a HistoricalRecord, contributor and dateModified do not mean what they mean on CreativeWork?

It's not a good idea to have them mean different things in different contexts. That confuses people. And presumably, those are there for a good reason. For example, if you wanted to express the date that the birth certificate itself was modified, where would you put it if that field had a different meaning in HistoricalRecord?

stoicflame commented 12 years ago

I'm not a domain expert, but I'm leaning toward "just don't put them in the schema" since we are not trying to create a data exchange standard but a rich subset that improves search and plug-in tools.

Oooo. I like that suggestion.

That will solve the disparity problem with CreativeWork, too.

But I'm interested in knowing whether anybody can provide a good use case where these properties would be needed for a search engine...

RobertGardner commented 12 years ago

I need to merge this pull request right now so that schema.org can make their blog announcement. This doesn't mean it's final, but it seems like we're pretty close. We probably still want to address the issue of tying spouse and marriage together.

I made the change of deleting SourceInfo and using 'source' as a collection of HistoricalRecord instead. I'm merging now. Any further discussion needs to happen through Issues instead of comments on this pull request.