piql / insight

Archival packages insight application
GNU General Public License v3.0
8 stars 3 forks source link

Table of Contents

  1. Introduction
  2. About
  3. System Requirements
    1. OS-X / Linux
  4. Installation
  5. Usage
  6. Sphinx indexer and search engine
  7. Reports
  8. Journals
    1. Creating searchable-PDFs
  9. Log files
  10. Configuration
    1. Tip
  11. Batch mode
    1. Batch mode example
  12. How to report issues
  13. History
    1. insight-v1.3.0
    2. insight-v1.2.0
    3. insight-v1.2.0-beta3
      1. New features
    4. innsyn-v1.2.0-beta1
      1. New features
    5. 01.07.2020 innsyn-v1.1.0
      1. Fixes
    6. innsyn-v1.1.0-beta2
      1. Fixes
    7. innsyn-v1.1.0-beta1
      1. New features
      2. Fixes
    8. 2018.06.01 innsyn-v1.0.0
      1. Fixes
    9. 2018.04.13 innsyn-v1.0.0-rc1
    10. 2018.02.05 innsyn-v1.0.0-beta2
    11. 2018.01.18 innsyn-v1.0.0-beta1
  14. Development
    1. pdftotext
      1. Windows
      2. Other
    2. Ubuntu
    3. Windows

Introduction

Piql Insight is a fast and flexible archival package / information package (IP) inspection and dissemination tool.

The application does not validate the IPs apart from validating the existence of attachments. The application is data-driven and a design goal is that is should not contain any references to file formats, XML-tags or keywords. Instead are the views controlled by configuration files and optional regular expressions to transform the XML to user interface friendly keywords. This means that in principle the application can support any XML based IP based format by adding new config files.

For efficient dissemination the application supports batch mode exports with predefined transformations. It also supports creating searchable-PDFs where the PDF contains digitized page and embedded OCR metadata.

Insight ships with these formats:

About

Piql Insight was originally developed for the Kommunenes Digitale Ressursjsentral (KDRS) in Norway and released under the name KDRS Innsyn. Its job was to support dissemination of archival packages in the NORAK-5 format, used by the Norwegian legislation. It has later been extended to support multiple IP formats.

System Requirements

Released applications are tested on Windows 10 64bit.

OS-X / Linux

The text indexer tool Sphinx must be in the path.

Install:

osx$ brew install sphinx
centos-redhat$ sudo yum install sphinx

Check that tool is available:

computer$ indexer 
Sphinx 2.2.11-id64-release (95ae9a6)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)

Installation

  1. Extract insight-v1.3.0.zip.
  2. Indexing of attachments can require some disk space, consider changing data location REPORTS_DIR in insight.conf and ensure it points to a location with sufficient disk space.
  3. UI language can be changed by editing LANGUAGE in insight.conf. Current available languages are english, norwegian-bokmål and norwegian-nynorsk. Default language is english. Please note that changing language after launch can cause trouble since some of the paths are localized.

Usage

img

  1. Launch insight.exe (OS-X/Linux: insight) from innsyn-v1.3.0.
  2. The UI has four main elements:
    1. Node tree: Shows all the elements in the imported XML/package. Each node in the tree corresponds to an XML tag or a file. It also shows previously loaded packages. These can be reloaded or deleted by using the Import or Delete buttons in the node-info window.
    2. Node info: Shows detailed info for each node. Some elements have action buttons attached to them. For example has the element for attachments a Display button that allows the user to display the attachment with the viewer the operating system has associated with this file format.
    3. Search: Allows users to search in loaded packages.
    4. Search result: Only displayed after a search.
  3. Press Import to load a package. Sample data can be found under testdata.
  4. The loading will start. For large packages this can take some time, but the UI should instantly start to populate the node tree.
  5. When the import is complete, the user is asked if the application should index the attachments. The indexer runs in the background and indexing duration will depend on number of and size of attachments. Windows: At first launch the OS might ask for firewall approval for the searchd.exe process, please approve this.
  6. Import generates a report file, this can be opened from the root node of the package.
  7. The application has two main functions:
    1. Search: Free text search in the node tree. When the search starts, the list of search results will appear. Double-click an element in the list to navigate to the element in the node tree. Select a item using the Select button. For advances search options press the + button next to the Select button: a) Attachments: Only displayed if attachment indexing is complete. When selected attachments will be searched instead of the node tree. b) Case sensitive: If not checked the search string findme will match both FINDME, findMe and findme. If checked it will only match findme. c) Include nodes: Displays all unique node names in the node tree. Nodes not checked will be excluded from the search. To hide the advanced search options press the - button.
    2. Export: Generate a report of selected nodes in the node tree. The report can be saved to disk or sent as an attachment in e-mail. Please note security requirements for disseminated material before sending on e-mail.

Sphinx indexer and search engine

For efficient full text searches in the attachments referenced by the archival package the search engine Sphinx is used. First all attachments are converted to text, then Sphinx builds and index to facilitate efficient look-ups. The command line tool pdftotext is used to convert the PDF files to text. This tool must be avaliable in the path.

With insight running and after loading and indexing an archival package, it is possible to run SQL queries directly on the index with MySQL (version 5.6) client installed:

mysql -h0 -P9306
MySQL [(none)]> show tables;
+------------+-------+
| Index      | Type  |
+------------+-------+
| INDEX_NAME | local |
+------------+-------+
1 row in set (0.00 sec)

MySQL [(none)]> select i from INDEX_NAME where match('Drammen');

See Sphinx user manual for more information.

Reports

After import a PDF-report is generated it the report folder as configured by the REPORTS_DIR key in insight.conf. The reports are stored in a folder named REPORTS_DIR\yyyy\MM\DD\TTMMSS\. In the report folder the Sphinx index and similar data attached to the IP is stored.

Journals

img

For some XML based formats there can be a one to many relation between a node in the XML and files in the archival package. An example of this is the Norwegian Health Archive package where the avlxml.xml file can reference multiple digitized pages and corresponding OCR metadata. This relationship can be configured using the key INFO_VIEW_JOURNAL_TYPE_REGEXP in the import format file. Nodes matching this key will get a Journal button at the bottom of the node view. The Journal view allows users to select pages that should be exported.

Creating searchable-PDFs

The journal mode supports display and export of journals as searchable-PDF where each page consists of the digitized page (for example in JPEG format) an the recognized text (OCR) as an invisible layer.

Supported OCR formats are ALTO and HOCR. For more information how this mode works study the script pdf\create-pdf.cmd. To create PDFs several tools have to be installed and available in the system path:

Log files

Configuration

The config file is named insight.conf. The goal of the config file is that it is self documenting, so inspect it for further details. If the config file is changed the application has to be restarted. Each import format has its own config file stored under formats. All files ending with .conf in this folder will be loaded at startup and displayed in the file open dialog.

Tip

Batch mode

img

To support automated dissemination workflows Insight support batch mode using the command line parameters: –file, –file-format and –auto-export. The auto-export feature is useful if the format is configures to auto select nodes at import using the TREEVIEW_AUTO_SELECT_REGEXP key.

Batch mode example

insight --file nha-sip.tar --file-format "Norsk Helsearkiv SIP" \
  --auto-export out.pdf

This command will open the AIP file nha-sip.tar as a Norsk Helsearkiv SIP as defined by <./formats/nha-sip.conf> format file and export it to out.pdf, then exit the application. Full example here.

The import process is controlled by various keys in the format file:

; File patterns supported by the format, separated by '@'
IMPORT_FORMAT_PATTERNS=*.7z@*.tar
; Extraction tools, order must match pattern list above
EXTRACT_TOOL="^.*\.7z$@7z x -y \
  -o%DESTINATION% %FILENAME%@^.*\.tar$@tar \
  -C %DESTINATION% -xf %FILENAME%"

; Auto load nodes will be loaded into tree view when importing XML
INFOVIEW_AUTO_IMPORT_REGEXP_EN=filename[^.*avlxml\\.xml$]

Auto load nodes will use the first format matching the filename. For the example above this is <./formats/epjark.conf>. This format definition file uses the TREEVIEW_AUTO_SELECT_REGEXP key to auto-select nodes at import. Only selected nodes will be included in the export.

; Auto select nodes will be selected when importing XML
TREEVIEW_AUTO_SELECT_REGEXP=pasientjournal@diagnose

How to report issues

Please create a GitHub issue with a detailed as possible description of what happened. Attach log files and insight.dmp if it exists. Do not post sensitive material!

History

insight-v1.3.0

insight-v1.2.0

insight-v1.2.0-beta3

New features

innsyn-v1.2.0-beta1

New features

01.07.2020 innsyn-v1.1.0

Fixes

innsyn-v1.1.0-beta2

Fixes

innsyn-v1.1.0-beta1

New features

Fixes

2018.06.01 innsyn-v1.0.0

Fixes

2018.04.13 innsyn-v1.0.0-rc1

Pre release.

2018.02.05 innsyn-v1.0.0-beta2

Pre release.

2018.01.18 innsyn-v1.0.0-beta1

First beta test release.

Development

The application is created using the Qt framework. When this is installed the application can be build using:

# Linux / OS-X
(cd src/thirdparty ; \
 unzip quazip-1.4.zip ; \
 cd quazip-1.4 ; \
 cmake -S . -B ./out -D QUAZIP_QT_MAJOR_VERSION=6 ; \
 cmake --build ./out)
./update-translations.sh
qmake
make

# Windows
cd src/thirdparty
unzip quazip-1.4.zip
cd quazip-1.4
cmake -S . -B ./out -D QUAZIP_QT_MAJOR_VERSION=6 -D CMAKE_LIBRARY_PATH=c:\dev\Piql\zlib-win64\Release
cmake --build ./out --config release 
cmake --build ./out --config debug
./update-translations.sh
qmake
nmake

On some systems the MySQL driver has to be built and copied to distribution dir: https://doc.qt.io/qt-5/sql-driver.html#qmysql-for-mysql-5-and-higher

``` mkdir mysql cd mysql qt-cmake c:\Qt\6.5.0\Src\qtbase\src\plugins\sqldrivers \ -DCMAKE_INSTALL_PREFIX=c:\Qt\6.5.0\msvc2019_64 \ -DMySQL_INCLUDE_DIR="c:\Program Files\MySQL\MySQL Server 8.0\include" \ -DMySQL_LIBRARY="c:\Program Files\MySQL\MySQL Server 8.0\lib\libmysql.lib" ```

To create release packages use:

create-release-osx.sh
create-release.cmd

pdftotext

Windows

The tool pdftotext comes from the Xpdf command line tools pacakge downloaded from https://www.xpdfreader.com/download.html. The pdf2text.exe and config file xpdfrc must be in the path.

Other

Intall with you favourite package manager and ensure tool is avaliable in path.

Ubuntu

sudo apt install libpoppler*
sudo apt install libboost-all-dev
sudo apt install libquazip5-dev
sudo apt install qttools5-dev-tools

Windows

setup-win64.cmd
qmake -tp vc