keeps / dbptk-developer

DBPTK Developer - library and command-line tool for execution of database preservation actions
http://www.database-preservation.com
GNU Lesser General Public License v3.0
43 stars 19 forks source link

Extractiong from MS Access: # added to hyperling field #188

Closed BorisDomajnko closed 5 years ago

BorisDomajnko commented 8 years ago

Description: In the original MS Access the field type is hyperlink, example value is Slike\1956.jpg. In the siard-2 file I see wrong values with hashes added, for example

#Slike\u005c1956.jpg#

and in some cases the mdb file name is added:

Slike\u005c6734.jpg#Izkaznice_III_MNZ.mdb#

In the restored db in PostgreSQL the field type is text which is ok as the viewer handles the hyperlink. But the hashes appear.

Steps required to reproduce the bug:

  1. dbptk export MS Access file with filed type hyperlink
  2. restore to any (I tried PostgreSQL) (hyperlink field now becomes text, ok for me)
  3. additional # characters are strangely added (e.g. #Slike\1956.jpg#)

Attach the dbptk-app.log.txt file below. I believe this is the column. In MS Access the type is hyperlink:

DEBUG 2016-05-04 13:13:09,478 (JDBCImportModule) Data type: 12 Type name: VARCHAR Column size: 16777216 Decimal digits: 0 Radix: 0 Remarks: null Default value: null Index: 24 Is Nullable: true Is auto increment: false Calculated type: SimpleTypeString ColumnType hash: 1423142621

chalkos commented 8 years ago

Assuming the hyperlink points to a local file, it is very difficult to automatically save the file to the SIARD archive.

If that column really corresponds to the hyperlink type, then it means that DBPTK perceives the hyperlink as plain text, and the strange values (like the ones with # characters) are obtained like that from MS Access.

But even if the link is correct, if the files are not saved together with the SIARD, the viewer will not be able to find them.

The next beta version will provide more information about the database. It may make it easier to solve this problem.

BorisDomajnko commented 8 years ago

The external files are indeed stored separately and then they have to be restored in such a way that the access tool can display them. This works in my environment if I manually correct the generated links, which are now like this one: http://pluton.ars.sigov.si/dbdipview/files/MNZoranzna/#Slike/1956.jpg#

I will try with the next beta version. So far I assumed dbptk used hashes for some internal processing as I do not see them in MS Access. But if they have some internal meaning for MS Access we will have to understand and then consider removing them. I guess we do not need them for eternity...

BorisDomajnko commented 8 years ago

Here it is: https://msdn.microsoft.com/en-us/library/bb237972%28v=office.12%29.aspx For the scenario with db access tool we only need a link as a relative path (see example above) to be abl eto display pictures or open other external documents. This is a clean text column. With my producers I do no expect other scenarios because usually simple registers are archived and not some specific hi-tec applications. But someone could suggest we add ad db column with original hashed values.

chalkos commented 8 years ago

@BorisDomajnko could you try to reproduce this problem with the 2.0.0-beta4.2 release and then provide me the log file?

BorisDomajnko commented 8 years ago

dbptk-app-2.0.0-beta4.2.jar, Creation of siard-2 (input: MS Acces)

dbptk-report-20160519102902332.txt:

dbptk-app.log.txt: 2016-05-19 10:15:12,061 [main] DEBUG (c.d.m.Reporter) dataTypeChangedOnImport, invoker: com.databasepreservation.modules.msAccess.in.MsAccessUCanAccessDatatypeImporter; message: - Data type import: in schema PUBLIC and table Osebe, the column Slika has type VARCHAR which was perceived as CHARACTER VARYING(16777216) (SQL99) and CHARACTER VARYING(16777216) (SQL2008); and type: Type{description='null', originalTypeName='VARCHAR', sql99TypeName='CHARACTER VARYING(16777216)', sql2008TypeName='CHARACTER VARYING(16777216)'} 2016-05-19 10:15:12,061 [main] DEBUG (c.d.m.j.i.JDBCImportModule) Data type: 12 Type name: VARCHAR Column size: 16777216 Decimal digits: 0 Radix: 0 Remarks: null Default value: null Index: 24 Is Nullable: true Is auto increment: false Calculated type: SimpleTypeString ColumnType hash: -642213657

siard-2 table file - various examples found in XML:

Slike\u005c535.jpg Slike\u005c6918.jpg#Slike/6918.jpg# Slike\u005c7913.jpg#Izkaznice_III_MNZ.mdb# #Slike\u005c10273.jpg#

This is creation of SIARD2. Please inform me if you want me to restore the DB as well . Regards, B.

chalkos commented 8 years ago

It seems that the type of these fields as perceived by the driver is VARCHAR and I don't see an obvious method to distinguish hyperlink field from text field.

Even if we could distinguish between these types and somehow know that the text is a file path, there are many cases where the referred file could be lost in conversion. Examples:

The hyperlink could even be a link to a dynamic webpage, like http://www.google.com and it is not possible to consider all these cases in this tool.

If the hyperlink field refers a file, the database viewer will not be able to show that file unless the file is provided in a way that is expected by the viewer.

@luis100 do you have any comments on this?

luis100 commented 8 years ago

This is part of a more general problem of content that is referred by the database but not directly accessible by it. The accessing of the content (be it a binary or something else) is done by the application layer that, in this case, is the Microsoft Access itself.

My solution here would be a new feature (request) that allows content to be enriched by the user so application logic can be injected. This feature would allow overriding a column and all its values, passing them through a user provided a script / function that would would receive the column value as an argument and provide the transformed value (in this case a binary) as a result.

Taken this problem as an example, the following arguments could be passed to the application:

java -jar dpptk.jar \
-i ...  \
-transform PUBLIC.Osebe.Slika VARBINARY fetchSlika.sh \
-o ...

The fetchSlika.sh would be a batch file that is run for each cell of this column.

#! /bin/sh
VALUE=$1 # e.g. "Slike\u005c6918.jpg#Slike/6918.jpg#"
RELATIVE_PATH=$(echo $VALUE | cut -d# -f1)
PATH="/media/data/osebe/$RELATIVE_PATH"
dd if="$PATH"

Although the user provided behavior could be done with Java interface via plugins, maybe a script would be more flexible and also be able to use system tools like, e.g. wget or curl, if the value was an URL instead of a relative file path.

If you prefer this strategy, a new issue should be created with this feature request, to be considered to be integrated in the roadmap.

BorisDomajnko commented 8 years ago

In the archives we do not prefer to modify the records of the creators. In this case I would rather ask them to filter values in such a problematic column before they pack it into SIP. This is why we execute test ingest - we would find an issue and then reject the first SIP. Therefore I do not see a need for a change in DBPTK.

chalkos commented 8 years ago

This transformation would have to be applied by the producer, so it would in fact become the "records from the creator" and not a modified version.

Rejecting the SIP and forcing the producer to filter out some data could lead to losing (potentially) precious information.

Instead the producer could apply a transformer and make modifications to the data, so it is provided in a way that DBPTK can handle.

luis100 commented 5 years ago

This as been developed on #346 and by filter modules