Open petterreinholdtsen opened 4 years ago
Yes, this is a difficult topic... Consulting dokumentobjekt.format makes sense, but if I understand the above it cannot be trusted to always be present.
One possible "data driven" design would be like this:
For each file format id:
So the config file could look something like this (I don't remember the exact XPath syntax):
noark-5.conf:
documentNode=/*/dokumentobjekt/referanseDokumentfil/value()
documentFileFormatNode=/*/dokumentobjekt/format/value()
documentNodeFileFormatTool=file %FILENAME%
fileformats.conf:
fileFormatName=PDF
fileFormatPronom=pronom/...
fileFormatIdTool=file %FILENAME%
fileFormatExtractTool=pdf2text %FILENAME% %OUTPUTFILE%
fileFormatViewer=evince %FILENAME%
fileFormatExtension=*.pdf
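A minimal sketch of how tool entries such as fileFormatExtractTool could be turned into a command to run. The function name expand_tool_command is hypothetical, and it assumes the placeholders are written exactly as %FILENAME% and %OUTPUTFILE%, each as its own word. The template is split before substitution so that a file name containing spaces stays a single argument.

```python
import shlex


def expand_tool_command(template, filename, outputfile=None):
    """Turn a template like 'pdf2text %FILENAME% %OUTPUTFILE%' into an argv list.

    Hypothetical helper: the placeholder names follow the config sketch above.
    """
    substitutions = {"%FILENAME%": filename}
    if outputfile is not None:
        substitutions["%OUTPUTFILE%"] = outputfile
    # Split first and substitute afterwards, so file names with spaces or
    # shell metacharacters are never re-parsed as several arguments.
    return [substitutions.get(token, token) for token in shlex.split(template)]


if __name__ == "__main__":
    print(expand_tool_command("pdf2text %FILENAME% %OUTPUTFILE%", "my file.pdf", "out.txt"))
```

The resulting list can be passed to subprocess.run without a shell, which avoids quoting problems with file names taken from the extraction.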
Note 1: Currently I don't use XPath for node and leaf matching; I use regular expressions. I always wanted to use XPath, but because the XML files can be several GB in size, I ended up using regexps as an optimization. So this is a larger change, but it could also be solved using regexps.
Note 2: Currently I rely on Qt to view files, which uses the OS default viewer for the format. This works nicely so far, but at some point it would be good to be able to override it...
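For comparison, a streaming XML parser is a middle ground between regexps and full XPath over a multi-GB file. The following is only a sketch in Python using the standard library's iterparse, not how the tool currently works. The sample document and its namespace are simplified placeholders, and the function name iter_document_objects is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

WANTED = ("referanseDokumentfil", "format")


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_document_objects(source):
    """Yield (referanseDokumentfil, format) for each dokumentobjekt element."""
    current = None  # fields of the dokumentobjekt being read, if any
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local_name(elem.tag)
        if event == "start":
            if name == "dokumentobjekt":
                current = {}
            continue
        if current is not None and name in WANTED:
            current[name] = (elem.text or "").strip()
        elif name == "dokumentobjekt":
            yield current.get("referanseDokumentfil"), current.get("format")
            current = None
        # Drop text and children of finished elements to limit memory use;
        # empty element shells still stay attached to their parents.
        elem.clear()


# Simplified placeholder input, not a complete Noark 5 extraction.
SAMPLE = b"""<arkiv xmlns="urn:example">
  <dokumentobjekt>
    <format>fmt/95</format>
    <referanseDokumentfil>dokumenter/0001</referanseDokumentfil>
  </dokumentobjekt>
</arkiv>"""

if __name__ == "__main__":
    for filename, fmt in iter_document_objects(io.BytesIO(SAMPLE)):
        print(filename, fmt)
```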
[Ole Liabø]
Yes, this is a difficult topic... Consulting dokumentobjekt.format makes sense, but if I understand the above it cannot be trusted to always be present.
I believe the format field can be trusted to be present (it is required), but its value is not very consistent across systems, so one would have to accept many values for the same format.
fileformats.conf:
fileFormatName=PDF
fileFormatPronom=pronom/...
fileFormatIdTool=file %FILENAME%
fileFormatExtractTool=pdf2text %FILENAME% %OUTPUTFILE%
fileFormatViewer=evince %FILENAME%
fileFormatExtension=*.pdf
I suspect a format entry would need to take a list of both format values and extensions, if not using "magic numbers" to identify a file format.
For PDF, one would for example accept "pdf/a", "pdf", "PDF", "RA-PDF", "fmt/95", "fmt/354" and probably many others. :)
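As a rough sketch of such an entry (in Python, with hypothetical names FormatEntry and identify), each format could carry a set of accepted format values and a set of extensions. The value from dokumentobjekt.format is tried first, compared case-insensitively, and the file suffix is only a fallback. The PDF values are the ones listed above; the list is not meant to be complete.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FormatEntry:
    name: str
    values: frozenset      # accepted dokumentobjekt.format values, lower case
    extensions: frozenset  # accepted file suffixes, lower case, with dot


PDF = FormatEntry(
    name="PDF",
    values=frozenset({"pdf/a", "pdf", "ra-pdf", "fmt/95", "fmt/354"}),
    extensions=frozenset({".pdf"}),
)


def identify(entries, format_value, filename):
    """Return the name of the matching entry, or None if nothing matches."""
    value = (format_value or "").strip().lower()
    suffix = PurePosixPath(filename or "").suffix.lower()
    # Prefer the declared format value over the file name.
    for entry in entries:
        if value and value in entry.values:
            return entry.name
    # Fall back to the file suffix when the value is missing or unknown.
    for entry in entries:
        if suffix and suffix in entry.extensions:
            return entry.name
    return None


if __name__ == "__main__":
    print(identify([PDF], "RA-PDF", "dokumenter/0001"))
    print(identify([PDF], None, "dokumenter/brev.PDF"))
```

A file name without a suffix, as in the first call, is still recognized through its format value.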
-- Happy hacking Petter Reinholdtsen
At the moment the indexer decides which files to extract content from based on their file names. This assumes something about the content of dokumentobjekt.referanseDokumentfil that is not specified in Noark 5, and I have run into extractions where the file names did not include file extensions.
It would be better if the values in dokumentobjekt.format were consulted in addition to looking at file suffixes. According to Arkivverket, the values in this field are now standardized as PRONOM codes, so those values should at least be recognized.