piql / insight

Archival packages insight application
GNU General Public License v3.0
Decide which files to index based on dokumentobjekt.format too #12

Open petterreinholdtsen opened 4 years ago

petterreinholdtsen commented 4 years ago

At the moment the indexer decides which files to extract content from based on their file names. This assumes something about the content of dokumentobjekt.referanseDokumentfil that is not specified in Noark 5, and I have run into extractions where the file names did not include file extensions.

It would be better if the values in dokumentobjekt.format were consulted in addition to looking at file suffixes. According to Arkivverket, the values in this field are now standardized as PRONOM codes, so those values should at least be recognized.

oleliabo commented 4 years ago

Yes, this is a difficult topic... Consulting dokumentobjekt.format makes sense, but if I understand the above it cannot be trusted to always be present.

One possible "data driven" design would be like this:

For each file format id:

So the config file could look something like this (I don't remember the exact XPath syntax):

noark-5.conf:

    documentNode=/*/dokumentobjekt/referanseDokumentfil/value()
    documentFileFormatNode=/*/dokumentobjekt/format/value()
    documentNodeFileFormatTool=file %FILENAME%

fileformats.conf:

    fileFormatName=PDF
    fileFotmatPronom=pronom/...
    fileFormatIdTool=file %FILE%
    fileFormatExtractTool=pdf2text %FILENAME% %OUTPUTFILE%
    fileFormatViewer=evince %FILENAME%
    fileFormatExtension=*.pdf
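To illustrate how a tool template such as fileFormatExtractTool=pdf2text %FILENAME% %OUTPUTFILE% might be turned into a command, here is a minimal Python sketch. The function name expand_tool_command is hypothetical, and treating %FILE% as a synonym for %FILENAME% is an assumption based on the two spellings used in the proposal.

```python
import shlex

def expand_tool_command(template, filename, outputfile=None):
    """Turn a template like 'pdf2text %FILENAME% %OUTPUTFILE%' into an argv list."""
    # Assumption: %FILE% and %FILENAME% both refer to the input file.
    substitutions = {"%FILENAME%": filename, "%FILE%": filename}
    if outputfile is not None:
        substitutions["%OUTPUTFILE%"] = outputfile
    # Split the template first and substitute afterwards, so that a file
    # name containing spaces still ends up as a single argument.
    return [substitutions.get(token, token) for token in shlex.split(template)]

argv = expand_tool_command(
    "pdf2text %FILENAME% %OUTPUTFILE%", "my file.pdf", "out.txt"
)
print(argv)  # ['pdf2text', 'my file.pdf', 'out.txt']
```

The resulting list could be passed to subprocess.run without going through a shell, which avoids quoting problems with unusual file names.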

Note 1: Currently I don't use XPath for matching nodes and leaf nodes; I use regexps. I always wanted to use XPath, but because the XML files can be several GB in size I ended up using regexps as an optimisation. So this is a larger change, but it could also be solved using regexps.

Note 2: Currently I rely on Qt to view files. It uses the OS default viewer for the format, which has worked nicely so far, but at some point it would be good to be able to override it...
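For illustration, the regexp approach from Note 1 could also pick up the format field in a single streaming pass. The Python sketch below is not the insight implementation; it assumes that dokumentobjekt, format and referanseDokumentfil appear as plain tags without attributes or namespace prefixes, and the function name iter_document_objects is hypothetical.

```python
import re

OPEN_TAG = "<dokumentobjekt>"
BLOCK_RE = re.compile(r"<dokumentobjekt>(.*?)</dokumentobjekt>", re.DOTALL)
FORMAT_RE = re.compile(r"<format>(.*?)</format>", re.DOTALL)
FILE_RE = re.compile(
    r"<referanseDokumentfil>(.*?)</referanseDokumentfil>", re.DOTALL
)

def iter_document_objects(stream, chunk_size=1 << 20):
    """Yield (format, file reference) pairs from a text stream, chunk by chunk."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        last_end = 0
        for match in BLOCK_RE.finditer(buffer):
            body = match.group(1)
            fmt = FORMAT_RE.search(body)
            ref = FILE_RE.search(body)
            yield (
                fmt.group(1).strip() if fmt else None,
                ref.group(1).strip() if ref else None,
            )
            last_end = match.end()
        # Keep only what may still belong to an unfinished element, so the
        # buffer does not grow with the size of the whole file.
        tail = buffer[last_end:]
        start = tail.find(OPEN_TAG)
        if start != -1:
            buffer = tail[start:]
        else:
            # A partial opening tag may be split across two chunks.
            buffer = tail[-(len(OPEN_TAG) - 1):]
```

It could be used as for fmt, ref in iter_document_objects(open(path, encoding="utf-8")), where path points at the XML file. Memory use is then bounded by the chunk size plus the largest single dokumentobjekt element.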

petterreinholdtsen commented 4 years ago

[Ole Liabø]

Yes, this is a difficult topic... Consulting dokumentobjekt.format makes sense, but if I understand the above it cannot be trusted to always be present.

I believe the format field can be trusted to be present (it is required), but its value is not very consistent across systems, so one would have to accept many values for the same format.

fileformats.conf:

    fileFormatName=PDF
    fileFotmatPronom=pronom/...
    fileFormatIdTool=file %FILE%
    fileFormatExtractTool=pdf2text %FILENAME% %OUTPUTFILE%
    fileFormatViewer=evince %FILENAME%
    fileFormatExtension=*.pdf

I suspect a format entry would need to take a list of both format values and extensions, if not using "magic numbers" to identify a file format.

For PDF, one would for example accept "pdf/a", "pdf", "PDF", "RA-PDF", "fmt/95", "fmt/354" and probably a lot of others. :)
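As a rough sketch of such an entry, a lookup in Python could compare the format field case-insensitively against a list of accepted values and fall back to the file extension. The FORMATS table and the identify_format name are hypothetical, and the listed values are only the examples mentioned above, not a complete set.

```python
from pathlib import PurePosixPath

# Hypothetical format table: one entry per format, each with a set of
# accepted dokumentobjekt.format values (lower-cased) and file extensions.
FORMATS = {
    "PDF": {
        "values": {"pdf/a", "pdf", "ra-pdf", "fmt/95", "fmt/354"},
        "extensions": {".pdf"},
    },
}

def identify_format(format_value, file_name):
    """Return the format name matching the format field or suffix, else None."""
    value = (format_value or "").strip().lower()
    suffix = PurePosixPath(file_name or "").suffix.lower()
    for name, entry in FORMATS.items():
        if value in entry["values"] or suffix in entry["extensions"]:
            return name
    return None

print(identify_format("RA-PDF", "dokumenter/2"))  # PDF
```

A file without an extension is still recognized here through its format value, which is the case the file-name-only check misses.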

--
Happy hacking
Petter Reinholdtsen