marklogic-community / Corona

Community REST API for MarkLogic

Normalize dates in extracted metadata from binaries #71

Closed ryangrimm closed 12 years ago

ryangrimm commented 12 years ago

Many binary formats include a creation date or a modification date, but the field names vary from format to format. To support queries and range indexes on this metadata, these dates need to be normalized into xs:dateTime values.

To do so, the current plan is to attempt to parse the value of any piece of metadata that has "date" or "time" in its name. The parsing can be done with the date parser that's already in use, and new formats can easily be added if need be.
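A minimal sketch of that plan, in Python rather than Corona's own code, purely for illustration. The format list, the function names, and the sample field names are all assumptions; Corona's actual date parser and the formats it supports are not shown in this thread.

```python
from datetime import datetime

# Hypothetical list of input formats; a new format is added by appending here.
# "D:%Y%m%d%H%M%S" mimics the PDF-style date prefix, without timezone handling.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "D:%Y%m%d%H%M%S",
    "%Y-%m-%d",
    "%m/%d/%Y",
]


def parse_date(value):
    """Return the value in xs:dateTime lexical form, or None if nothing matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).isoformat()
        except ValueError:
            continue
    return None


def normalize_dates(metadata):
    """Parse every field whose name contains "date" or "time" (case-insensitive)."""
    normalized = {}
    for name, value in metadata.items():
        lowered = name.lower()
        if "date" in lowered or "time" in lowered:
            normalized[name] = parse_date(value)
    return normalized
```

Fields whose values match none of the formats come back as `None`, so a caller can leave the original value untouched instead of failing the whole extraction.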

ScottConroy commented 12 years ago

Will want to normalize the element names as well. Content extracted from PDF ends up with corona:modDate, while Word ends up with corona:lastSavedDate (which I believe are conceptually the same thing). I did a quick inventory of a half dozen other formats, and that's the main one I saw.

hunterhacker commented 12 years ago

Discussion: http://markmail.org/thread/cenoooa7kmu7237n

ryangrimm commented 12 years ago

Now normalizing last-modification metadata to a corona:modDate element. Also running any piece of metadata that has "date" in the name through the date parser. If a date is extracted, it's stored in a normalized-date attribute.
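For readers who want to picture the result, here is a rough Python sketch of the behavior described above. It is not Corona's implementation: the namespace URI is a placeholder, and the alias table, format list, and function names are hypothetical. Only the corona:modDate and corona:lastSavedDate names and the normalized-date attribute come from this thread.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

CORONA_NS = "http://example.com/corona"  # placeholder, not the real namespace URI
ALIASES = {"lastSavedDate": "modDate"}   # hypothetical alias table
FORMATS = ("%Y-%m-%dT%H:%M:%S", "D:%Y%m%d%H%M%S", "%Y-%m-%d")  # assumed formats


def try_parse(value):
    """Return an xs:dateTime-style string, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).isoformat()
        except ValueError:
            continue
    return None


def build_metadata(raw):
    """Build one element per field, renaming aliases and tagging parsed dates."""
    root = ET.Element("metadata")
    for name, value in raw.items():
        name = ALIASES.get(name, name)
        element = ET.SubElement(root, f"{{{CORONA_NS}}}{name}")
        element.text = value  # the extracted value is kept as-is
        if "date" in name.lower():
            normalized = try_parse(value)
            if normalized is not None:
                element.set("normalized-date", normalized)
    return root
```

Keeping the extracted text and putting the parsed value in an attribute means an unparseable date loses nothing: the element simply has no normalized-date attribute.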