Include region information on HistoricalDocuments

davebarney commented 13 years ago

I was at BYU yesterday and met with some professors. Dr. Bill Barrett, long-time genealogist and historical documents expert, had a very good suggestion.

Since many/most historical documents will be images of original documents or photographs, we should include region information as part of the schema.

For example, suppose a scanned census record is attached as a historical document to some event (e.g. a birth). Since census records typically have 50 names per page, it would be helpful to provide the coordinates of the region of the image that contains the relevant information.

Another example is a photograph. If the photograph is a group picture, it would be useful to include the region where the person of interest is located.

ninjudd commented 13 years ago

This is a good suggestion.

Thinking about this, I realized that a single HistoricalRecord may be referenced using sources from multiple HistoricalPerson objects, or may refer to multiple people in the persons field. One way I can think of to support multiple "regions" with microdata would be to add a superRecord field as well and allow records to be nested.

In addition, we'd also need fields like:

regionX
regionY
regionWidth
regionHeight
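As a rough sketch of how those four fields might hang together, here is a small Python structure. The field names come from the list above; the pixel units, the top-left origin, the `fits_within` helper, and the sample names and coordinates are all assumptions made for illustration, not part of any agreed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangle on a scanned image.

    Assumption: values are pixels measured from the top-left corner.
    """
    regionX: int
    regionY: int
    regionWidth: int
    regionHeight: int

    def fits_within(self, image_width: int, image_height: int) -> bool:
        """Return True if the rectangle lies entirely inside the image."""
        return (
            self.regionX >= 0
            and self.regionY >= 0
            and self.regionWidth > 0
            and self.regionHeight > 0
            and self.regionX + self.regionWidth <= image_width
            and self.regionY + self.regionHeight <= image_height
        )


# Hypothetical census page: one record (and one region) per person.
regions_by_person = {
    "person-1": Region(regionX=120, regionY=340, regionWidth=900, regionHeight=40),
    "person-2": Region(regionX=120, regionY=380, regionWidth=900, regionHeight=40),
}
```

Keeping one region per nested record is one way to answer the "which person does this region belong to" question, since each record already names its person.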

davebarney commented 13 years ago

Good thoughts!

I added Dr. Bill Barrett to the project specifically to weigh in on this discussion.

stoicflame commented 13 years ago

The trick is adding properties to the HistoricalRecord that are intended to be applied to the image (or whatever) being referenced by the HistoricalRecord.

Personally, I think the qualifiers belong on the URI of the image, e.g.:

http://domain/path/image.jpg#x_pixels=123&y_pixels=456
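For illustration, a consumer could split such a URI into the plain image URL and the qualifiers with the Python standard library. This is only a sketch: the `x_pixels`/`y_pixels` names are taken from the example URI above, the function name is made up, and treating the fragment as `key=value` pairs is an assumption rather than an established convention.

```python
from urllib.parse import parse_qs, urldefrag


def split_region_fragment(uri):
    """Split an image URI into (image_url, qualifiers).

    Assumption: the fragment is formatted like a query string,
    e.g. "x_pixels=123&y_pixels=456", with integer values.
    """
    image_url, fragment = urldefrag(uri)
    qualifiers = {key: int(values[0]) for key, values in parse_qs(fragment).items()}
    return image_url, qualifiers


image_url, qualifiers = split_region_fragment(
    "http://domain/path/image.jpg#x_pixels=123&y_pixels=456"
)
print(image_url)   # the image URL without the fragment
print(qualifiers)  # the parsed coordinates
```

Note that a fragment is not sent to the server as part of an HTTP request, so with this form the client, not the image host, would have to interpret the coordinates.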

ninjudd commented 13 years ago

Interesting idea. That still doesn't specify which person the coordinates refer to though. I wonder if microdata provides a general solution for this type of problem.

On Oct 7, 2011, at 8:35 AM, Ryan Heaton <reply@reply.github.com> wrote:

The trick is adding properties to the HistoricalRecord that are intended to be applied to the image (or whatever) being referenced by the HistoricalRecord.

Personally, I think the qualifiers belong on the URI of the image, e.g.:

http://domain/path/image.jpg#x_pixels=123&y_pixels=456

stoicflame commented 13 years ago

That still doesn't specify which person the coordinates refer to though.

I guess I'm struggling to see why we would try to model that kind of detail at the level of microdata. To tell the truth, I'm struggling to understand why we even care about "region" information at the level of microdata. Why do we think it's the responsibility of search engines to understand regions of images? It seems like if the provider cares about providing specifics about region info, then they'll provide a URL to the relevant region of the image instead of a URL to the whole image. Why does the search engine care?

So if I want to provide microdata for a record that describes a specific region of one of my images, I'll partition up that image in my own proprietary way, and provide the URL to the region, which might look like this:

http://domain/path/image.jpg?x_pixels=123&y_pixels=456

instead of just a URL to the image, which might look like:

http://domain/path/image.jpg

And if I want to partition up an image in such a way that there is one record per person, then that would allow me to definitively specify the region of the image where the person is found.

davebarney commented 13 years ago

Here is a good example of why a search engine might care about coordinates: http://news.google.com/newspapers?id=DyBPAAAAIBAJ&sjid=aE0DAAAAIBAJ&pg=7056,2571560&dq=pearl+harbor&hl=en

There is a valid point that the micro-format has a limited scope, intended to help organize information on the web for search engines, but we also want to make sure it is rich enough for advanced (possibly future) features of search engines. Also, microformat mark-up is not intended just for search engines, but also for applications, browser plugins, etc. Just think of a browser plugin that highlights the marked-up region of an image when one clicks on it.

The region information is an area I really think is worth specifying in the microformat. Of course all fields in a microformat are optional, but for those who have the information available, why not have a standard way for that information to be published?

stoicflame commented 13 years ago

Pretty cool. So can you describe the architecture of this newspaper link you sent so we can look into applying that pattern here? Here are my questions:

Who's providing the image?
Did Google index microdata in order to discover that image?
What does that microdata look like?

davebarney commented 13 years ago

Sorry for delayed response - was out on vacation.

The example I provided was to illustrate a possible use of the micro-format. As far as I know, Google news archive does not use a micro-format. However, I can imagine many uses of images from the web just like this where the region can be highlighted.

stoicflame commented 13 years ago

I can imagine many uses of images from the web just like this where the region can be highlighted.

I can, too. But my contention is that this is best done by the service providing the image, and not by the service creating an index. I think this is consistent with the example that you've provided above.

davebarney commented 13 years ago

I think the contention likely stems from viewing the micro-format only as a tool for search engines. While that is the primary purpose, browser extensions/plug-ins and mash-ups are other uses. But let's just take search engines as the example. If an image thumbnail were provided in the search results without region information, the thumbnail of a census record would be useless. If, however, region information is provided, the thumbnail could be taken from the region of interest.
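To make the thumbnail case concrete, here is a minimal sketch of how a consumer might turn region values into a crop box. The function name, the `padding` parameter, and the sample dimensions are hypothetical, and pixel units with a top-left origin are assumed.

```python
def thumbnail_crop_box(x, y, width, height, image_width, image_height, padding=10):
    """Return (left, top, right, bottom) for a thumbnail around a region.

    The box is widened by `padding` pixels on every side and then clamped
    so that it never extends past the edges of the image.
    """
    left = max(0, x - padding)
    top = max(0, y - padding)
    right = min(image_width, x + width + padding)
    bottom = min(image_height, y + height + padding)
    return (left, top, right, bottom)


# Hypothetical 1200x1600 census scan with one row of interest.
box = thumbnail_crop_box(5, 340, 900, 40, image_width=1200, image_height=1600)
print(box)
```

The tuple uses the (left, upper, right, lower) order that imaging libraries such as Pillow accept for cropping, so it could be passed on to whatever produces the thumbnail.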

You can imagine with the example I provided that the highlighting is done by a browser extension/plug-in or a third-party site or even in search engine results.

Of course all micro-format fields are optional so if a content provider does not have region data, does not want to provide it publicly, or for whatever other reason is unable to provide it, it can be left out. With FS, Ancestry, and others indexing large collections of images, why not provide a standard for sharing that information? Leaving it out of the spec will do the entire community a great disservice.

stoicflame commented 13 years ago

Yes, I'm not trying to say that region information isn't important. It is important. I'm questioning the value of such complexities in a microformat specification. I think there are better ways to provide the features that you're describing by allowing image providers to make their own decisions about how to provide regions.

Let's say FamilySearch has a census image that they're describing with semantic markup. The image URL is:

http://familysearch.org/census/US-1880-Page-1.jpg

All the semantic markup providing info on the persons and relationships is included in the page. Let's say, for the sake of simplicity, that there is "Fred" and "George" on that image.

Now let's say that FamilySearch wants to divide that image up into "regions" (say "region 1" and "region 2") and they'd like to provide semantic markup associated with each region. Here's one way it could be done without any extra complexity to the existing spec:

Provide a separate URL for each region. Let's say the URL to "region 1" looks like: http://familysearch.org/census/US-1880-Page-1.jpg?x_pixels=123&y_pixels=456 and the URL to "region 2" looks like: http://familysearch.org/census/US-1880-Page-1.jpg?x_pixels=123&y_pixels=789
Adjust my server (the server at FamilySearch) to provide just the region when a request to http://familysearch.org/census/US-1880-Page-1.jpg?x_pixels=123&y_pixels=456 is made. Same with region 2.
Adjust my markup to associate "Fred" with "region 1" and "George" with "region 2".
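The three steps above can be sketched roughly as follows. The image URL, the parameter names, and the coordinates are the ones used in this comment; the `region_url` helper and the dictionary are hypothetical, and the server-side cropping in the second step is not shown.

```python
from urllib.parse import urlencode

IMAGE_URL = "http://familysearch.org/census/US-1880-Page-1.jpg"


def region_url(image_url, x_pixels, y_pixels):
    """Build a provider-specific URL that addresses one region of an image."""
    query = urlencode({"x_pixels": x_pixels, "y_pixels": y_pixels})
    return f"{image_url}?{query}"


# Step 3: associate each person with the URL of "their" region.
person_to_region_url = {
    "Fred": region_url(IMAGE_URL, 123, 456),    # region 1
    "George": region_url(IMAGE_URL, 123, 789),  # region 2
}

for person, url in person_to_region_url.items():
    print(person, url)
```

Under this approach the markup only ever carries a URL per record; what the query parameters mean stays private to the provider.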

So help me out: what features would I get from adding region information to the microformat spec that are not supported by the mechanism I outlined above?

davebarney commented 13 years ago

By adding it to this spec, we can create a common standard, which today does not exist. I do not believe it adds complexity to the micro-format.

You are proposing a standard for regions by extending the URL. That is in my opinion more complicated. First, it's still a proposed standard for regions, but now involves modifying the URL format. In addition, some web servers treat extra parameters in the URL as errors and therefore will create problems for some content providers who would like to use this approach. It also has the unfortunate side effect that an image no longer has a single, unique URL. For a census record with 50 records, I'd like to be able to see that they all reference the same document because the URLs are identical.
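As an illustration of the last point, if region values travel as separate fields next to an unmodified image URL, a consumer can group records by exact URL match. The record layout below (`person`, `image`, `region` keys) and the sample values are hypothetical; only the idea of identical URLs comes from the paragraph above.

```python
from collections import defaultdict

PAGE_URL = "http://familysearch.org/census/US-1880-Page-1.jpg"

# Hypothetical records: same image URL, different regions (x, y, width, height).
records = [
    {"person": "Fred", "image": PAGE_URL, "region": (123, 456, 900, 40)},
    {"person": "George", "image": PAGE_URL, "region": (123, 789, 900, 40)},
]


def group_by_image(records):
    """Group records that point at exactly the same image URL."""
    groups = defaultdict(list)
    for record in records:
        groups[record["image"]].append(record)
    return dict(groups)


for url, members in group_by_image(records).items():
    print(url, [member["person"] for member in members])
```

With per-region URLs, the same grouping would require the consumer to know which parts of each URL to ignore.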

We'd like to include this in the spec and I have not yet seen a good reason to exclude it. Perhaps we should have a quick chat on the phone to avoid further back-and-forth here. Are you available Wednesday?

stoicflame commented 13 years ago

Hmm... Yes, I must not be communicating very clearly. Sorry. I'd be happy to get on the phone with you. Do you have my number?

I do not believe it adds complexity to the micro-format.

Wait... I must not be understanding. Did you just say that adding region definition to a spec that doesn't have region definition does not make it more complex?

You are proposing a standard for regions by extending the URL. That is in my opinion more complicated. First, it's still a proposed standard for regions, but now involves modifying the URL format.

No, I'm not proposing a standard at all. I'm proposing leaving the implementation of region info out of the standard for the sake of simplicity. I'm asserting that providers can provide any relevant "region"-related feature without a standard. I'm asking for evidence to the contrary of that assertion. I'm stating that unless that assertion is proven false, there really isn't a need to add the extra complexity to the standard.

some web servers treat extra parameters in the URL as errors and therefore will create problems for some content providers who would like to use this approach.

This is where I start perceiving that I'm not communicating very clearly, because this statement doesn't make sense to me. Why would providers provide a URL with parameters in it if they treat those URL parameters as errors?

For a census record with 50 records, I'd like to be able to see that they all reference the same document because the URLs are identical.

But the provider knows that they all reference the same image, right? Are you saying that search engines/browser plugins/extensions/mash-ups need to know that they reference the same image? Why?

tfmorris commented 12 years ago

In image processing and image formats, these are called regions of interest (ROIs; see http://en.wikipedia.org/wiki/Region_of_interest). I'd suggest reviewing the work that's already been done by others before inventing something new.

I agree that giving an image reference the ability to address a location within the overall image is a good idea (ditto for audio and video clips, text documents, etc.).

historical-data / schema

Include region information on HistoricalDocuments #24