Closed ryangrimm closed 12 years ago
Will want to normalize the element names as well. Content extracted from PDF ends up with corona:modDate while Word ends up with corona:lastSavedDate (which I believe are conceptually the same thing). I did a quick inventory of a half dozen other formats and that's the main one I saw.
Discussion: http://markmail.org/thread/cenoooa7kmu7237n
Normalizing last modification metadata to a corona:modDate element. Also running any piece of metadata that has "date" in the name through the date parser. If a date is extracted it's stored in a normalized-date attribute.
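A rough sketch of what that normalization could look like, written in Python for illustration rather than in the project's own code. The namespace URI, the alias table, and the `parse_date` helper are all assumptions made up for this example; only the `modDate` / `lastSavedDate` names and the `normalized-date` attribute come from the comment above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Placeholder namespace URI, assumed for illustration only.
CORONA_NS = "http://example.com/corona"

# Hypothetical alias table: format-specific element name -> canonical name.
MOD_DATE_ALIASES = {"lastSavedDate": "modDate"}


def parse_date(text):
    """Stand-in for the real date parser: accepts ISO 8601 only."""
    try:
        return datetime.fromisoformat(text.strip())
    except ValueError:
        return None


def normalize_metadata(root):
    """Rename aliased elements and annotate parseable dates in place."""
    prefix = "{" + CORONA_NS + "}"
    for el in root.iter():
        if not el.tag.startswith(prefix):
            continue  # leave elements outside the metadata namespace alone
        local = el.tag[len(prefix):]
        local = MOD_DATE_ALIASES.get(local, local)
        el.tag = prefix + local
        # Only elements with "date" in the name go through the parser.
        if "date" in local.lower() and el.text:
            parsed = parse_date(el.text)
            if parsed is not None:
                el.set("normalized-date", parsed.isoformat())
    return root
```

The original text of each element is left untouched, so a value that fails to parse simply ends up without the attribute.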
Many binary formats include something along the lines of a creation date or a modification date. These dates appear under different names in different file formats. To support queries and range indexes on this metadata, the dates need to be normalized into xs:dateTime values.
To do so, the current plan is to attempt to parse the value of any piece of metadata that has "date" or "time" in its name. The parsing can be accomplished via the date parser that's already in use. New formats can easily be added if need be.
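As a minimal sketch of the plan above (Python, for illustration only): select metadata fields by name, then try a list of known formats until one parses. The format list and the function names are assumptions for this example and do not reflect the existing date parser's actual patterns.

```python
from datetime import datetime

# Illustrative formats only; supporting a new one means appending a pattern.
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%m/%d/%Y %H:%M:%S",
    "%Y-%m-%d",
]


def looks_like_date_field(name):
    """True if the metadata name contains "date" or "time" (case-insensitive)."""
    lowered = name.lower()
    return "date" in lowered or "time" in lowered


def to_xs_datetime(value):
    """Return an xs:dateTime-style string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).isoformat()
        except ValueError:
            continue
    return None


def normalize_dates(metadata):
    """Map each date-like field that parses to its normalized value."""
    result = {}
    for name, value in metadata.items():
        if looks_like_date_field(name):
            normalized = to_xs_datetime(value)
            if normalized is not None:
                result[name] = normalized
    return result
```

Matching on the name alone will also pick up fields that are not timestamps (a duration stored under a name containing "time", say). In this sketch such values fail every format and are skipped.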