Decide which files to index based on dokumentobjekt.format too

petterreinholdtsen commented 4 years ago

At the moment the indexer decides which files to extract content from based on their file names. This assumes something about the content of dokumentobjekt.referanseDokumentfil that is not specified in Noark 5, and I have run into extractions where the file names did not include file extensions.

It would be better if the values in dokumentobjekt.format were consulted in addition to the file suffixes. According to Arkivverket, the values in this field are now standardized as PRONOM codes, so those values should at least be recognized.

oleliabo commented 4 years ago

Yes, this is a difficult topic... Consulting dokumentobjekt.format makes sense, but if I understand the above it cannot be trusted to always be present.

One possible "data driven" design would be like this:

Xpath query to find document-nodes
Xpath query to find document-file-format-node for the document node
If document-file-format-node not defined or not found: file-format-id-tool to use instead

For each file format id:

text extraction tool
viewer tool
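
The lookup order described above can be sketched roughly as follows. This is only an illustration under assumptions: the names `FormatTools`, `TOOLS`, `resolve_format` and `tools_for` are hypothetical, the format id "pdf" is made up for the example, and the command templates are copied from the proposal rather than checked against real tools. The identification tool is passed in as a plain function so that the fallback step stays replaceable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class FormatTools:
    """Command templates for one file format (hypothetical structure)."""
    extract_cmd: str
    viewer_cmd: str


# Hypothetical table keyed by a format id; a real table would be read from config.
TOOLS: Dict[str, FormatTools] = {
    "pdf": FormatTools("pdf2text %FILENAME% %OUTPUTFILE%", "evince %FILENAME%"),
}


def resolve_format(declared: Optional[str], path: str,
                   identify: Callable[[str], Optional[str]]) -> Optional[str]:
    """Prefer the declared format value; fall back to the identification tool."""
    if declared and declared.strip():
        return declared.strip().lower()
    return identify(path)


def tools_for(declared: Optional[str], path: str,
              identify: Callable[[str], Optional[str]]) -> Optional[FormatTools]:
    """Return the tools for a document, or None if the format is unknown."""
    fmt = resolve_format(declared, path, identify)
    return TOOLS.get(fmt) if fmt else None
```

A caller would supply its own `identify` function, for example one that wraps the `file` command, and skip text extraction when `tools_for` returns None.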

So the config file could look something like (I don't remember exactly the Xpath syntax):

noark-5.conf:
    documentNode=/*/dokumentobjekt/referanseDokumentfil/value()
    documentFileFormatNode=/*/dokumentobjekt/format/value()
    documentNodeFileFormatTool=file %FILENAME%

fileformats.conf:
    fileFormatName=PDF
    fileFormatPronom=pronom/...
    fileFormatIdTool=file %FILE%
    fileFormatExtractTool=pdf2text %FILENAME% %OUTPUTFILE%
    fileFormatViewer=evince %FILENAME%
    fileFormatExtension=*.pdf
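
As a rough sketch of how such a key=value file might be read, assuming one key per line and assuming that each `fileFormatName` line starts a new format entry (neither is settled in this thread), something like the following could work. The function names `parse_fileformats` and `build_argv` are hypothetical.

```python
import shlex
from typing import Dict, List


def parse_fileformats(text: str) -> List[Dict[str, str]]:
    """Parse key=value lines; each fileFormatName line opens a new entry."""
    entries: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")  # split on the first '=' only
        if not sep:
            raise ValueError(f"expected key=value, got: {line!r}")
        key = key.strip()
        if key == "fileFormatName" or current is None:
            current = {}
            entries.append(current)
        current[key] = value.strip()
    return entries


def build_argv(template: str, **values: str) -> List[str]:
    """Split a command template into arguments, then fill %NAME% placeholders."""
    argv = []
    for arg in shlex.split(template):
        for name, value in values.items():
            arg = arg.replace(f"%{name}%", value)
        argv.append(arg)
    return argv
```

Splitting the template before substituting keeps a file name containing spaces as a single argument, so the result can be handed to a process launcher without going through a shell.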

Note 1: Currently I don't use XPath for node and leaf matching; I use regexps. I always wanted to use XPath, but because the XML files can be several GB in size, I ended up using regexps as an optimisation. So this is a larger change, but it could also be solved using regexps.

Note 2: Currently I rely on Qt to view files; it uses the OS default viewer for the format. This works nicely so far, but at some point it would be good to be able to override it...
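
For comparison with the regexp approach, a streaming XML parser is one possible way to read the format value and the file reference per dokumentobjekt without loading a multi-GB file as a full tree. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse`; it is not how the indexer works today, the function names are hypothetical, and it assumes that `format` and `referanseDokumentfil` are children of `dokumentobjekt`.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional, Tuple


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_dokumentobjekt(source) -> Iterator[Tuple[Optional[str], Optional[str]]]:
    """Yield (format, referanseDokumentfil) for each dokumentobjekt element."""
    current = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local_name(elem.tag)
        if event == "start":
            if name == "dokumentobjekt":
                current = {}
            continue
        if current is not None and name in ("format", "referanseDokumentfil"):
            current[name] = (elem.text or "").strip()
        elif name == "dokumentobjekt" and current is not None:
            yield (current.get("format") or None,
                   current.get("referanseDokumentfil") or None)
            current = None
        # Drop text and children of every finished element to limit memory use.
        # Emptied elements stay attached to their parents, so memory still
        # grows slowly with the number of elements.
        elem.clear()
```

Whether this is fast enough compared to regexps on real extractions would have to be measured.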

petterreinholdtsen commented 4 years ago

[Ole Liabø]

Yes, this is a difficult topic... Consulting dokumentobjekt.format makes sense, but if I understand the above it cannot be trusted to always be present.

I believe the format field can be trusted to be present (it is required), but its value is not very consistent across systems, so one would have to accept many values for the same format.

fileformats.conf:
    fileFormatName=PDF
    fileFormatPronom=pronom/...
    fileFormatIdTool=file %FILE%
    fileFormatExtractTool=pdf2text %FILENAME% %OUTPUTFILE%
    fileFormatViewer=evince %FILENAME%
    fileFormatExtension=*.pdf

I suspect a format entry would need to take a list of both format values and extensions, if not using "magic numbers" to identify a file format.

For PDF, one would for example accept "pdf/a", "pdf", "PDF", "RA-PDF", "fmt/95", "fmt/354" and probably many others. :)
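
A minimal sketch of such an entry, with a list of accepted format values and a list of extensions, could look like this. The alias set contains only the values mentioned above and is certainly incomplete; the names `FORMATS` and `canonical_format` are hypothetical, and matching is done case-insensitively as an assumption.

```python
import os
from typing import Dict, Optional, Set, Tuple

# Canonical format name -> (accepted dokumentobjekt.format values, extensions).
# All values are stored case-folded; the lists are illustrative, not complete.
FORMATS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "pdf": ({"pdf/a", "pdf", "ra-pdf", "fmt/95", "fmt/354"}, {".pdf"}),
}


def canonical_format(format_value: Optional[str],
                     file_name: Optional[str]) -> Optional[str]:
    """Map a format value to a canonical name, falling back to the extension."""
    value = (format_value or "").strip().casefold()
    for name, (aliases, _extensions) in FORMATS.items():
        if value in aliases:
            return name
    extension = os.path.splitext(file_name or "")[1].casefold()
    for name, (_aliases, extensions) in FORMATS.items():
        if extension in extensions:
            return name
    return None
```

With this table, `canonical_format("RA-PDF", "dokumenter/0001")` resolves to "pdf" even though the file name has no extension, and an unrecognized format value still falls back to the suffix check.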

-- Happy hacking Petter Reinholdtsen

piql / insight

Decide which files to index based on dokumentobjekt.format too #12