kitodo / kitodo-production

Kitodo.Production is a workflow management tool for mass digitization and is part of the Kitodo Digital Library Suite.
http://www.kitodo.org/software/kitodoproduction/
GNU General Public License v3.0

Indexing issue: illegal character in path #3694

Open henning-gerhardt opened 4 years ago

henning-gerhardt commented 4 years ago

After migrating the existing metadata files to the new format with the provided transformation file and starting to index all the data, this error appears in the catalina.out file:

Exception in thread "Indexing 0 of type PROCESS" java.lang.IllegalArgumentException: Illegal character in path at index 9: file://./[alldeba_266928358_0001_tif/00000001.tif
        at org.kitodo.dataformat.access.FLocatXmlElementAccess.getAndRepairUri(FLocatXmlElementAccess.java:82)
        at org.kitodo.dataformat.access.FLocatXmlElementAccess.<init>(FLocatXmlElementAccess.java:65)
        at org.kitodo.dataformat.access.FileXmlElementAccess.<init>(FileXmlElementAccess.java:81)
        at org.kitodo.dataformat.access.MetsXmlElementAccess.readMeadiaUnitsTreeRecursive(MetsXmlElementAccess.java:157)
        at org.kitodo.dataformat.access.MetsXmlElementAccess.<init>(MetsXmlElementAccess.java:135)
        at org.kitodo.dataformat.access.MetsXmlElementAccess.read(MetsXmlElementAccess.java:194)
        at org.kitodo.production.services.dataformat.MetsService.loadWorkpiece(MetsService.java:105)
        at org.kitodo.production.services.dataformat.MetsService.getBaseType(MetsService.java:84)
        at org.kitodo.production.services.data.ProcessService.getBaseType(ProcessService.java:1702)
        at org.kitodo.production.services.data.ProcessService.addAllObjectsToIndex(ProcessService.java:246)
        at org.kitodo.production.helper.IndexWorker.indexObjects(IndexWorker.java:116)
        at org.kitodo.production.helper.IndexWorker.indexChunks(IndexWorker.java:110)
        at org.kitodo.production.helper.IndexWorker.run(IndexWorker.java:78)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.URISyntaxException: Illegal character in path at index 9: file://./[alldeba_266928358_0001_tif/00000001.tif
        at java.net.URI$Parser.fail(URI.java:2848)
        at java.net.URI$Parser.checkChars(URI.java:3021)
        at java.net.URI$Parser.parseHierarchical(URI.java:3105)
        at java.net.URI$Parser.parse(URI.java:3053)
        at java.net.URI.<init>(URI.java:588)
        at org.kitodo.dataformat.access.FLocatXmlElementAccess.getAndRepairUri(FLocatXmlElementAccess.java:71)
        ... 13 more
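
The exception can be reproduced outside the application with plain `java.net.URI`. The following is a minimal sketch, not Kitodo code: the class `IllegalPathCharDemo` and the method `firstIllegalIndex` are hypothetical names introduced for illustration, and only the href string is taken from the log above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of Kitodo.Production.
class IllegalPathCharDemo {

    /** Returns the index reported by the URI parser, or -1 if the string parses as a URI. */
    static int firstIllegalIndex(String href) {
        try {
            new URI(href);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // href copied from the stack trace above
        String href = "file://./[alldeba_266928358_0001_tif/00000001.tif";
        int index = firstIllegalIndex(href);
        System.out.println("reported index: " + index);
        if (index >= 0) {
            System.out.println("offending character: '" + href.charAt(index) + "'");
        }
    }
}
```

The reported index 9 is the position of the `[` directly after `file://./`, which matches the message in the log.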

An excerpt from the meta data file of this process:

...
  <mets:fileSec>
    <mets:fileGrp USE="LOCAL">
      <mets:file ID="FILE_0000" MIMETYPE="image/tiff">
        <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" LOCTYPE="URL" xlink:href="file://./[alldeba_266928358_0001_tif/00000001.tif"/>
      </mets:file>
      <mets:file ID="FILE_0001" MIMETYPE="image/tiff">
        <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" LOCTYPE="URL" xlink:href="file://./[alldeba_266928358_0001_tif/00000002.tif"/>
      </mets:file>
...

I don't know how this error affects the indexing operation. Should this be fixed outside of the application, or should the application handle it?

matthias-ronge commented 4 years ago

METS file cannot be read. This is another job for org.kitodo.dataformat.access.FLocatXmlElementAccess.getAndRepairUri(FileType file)
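
One conceivable repair strategy is to percent-encode every character that is not allowed in a URI before parsing. The sketch below only illustrates that idea and is not the actual `getAndRepairUri` implementation: the class `HrefEncodingSketch` and the method `percentEncodeIllegal` are hypothetical. It also assumes that encoding is the desired outcome at all; if the character does not belong in the path in the first place, encoding merely makes the URI parseable and still points at a directory name that contains `[`.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; NOT the real FLocatXmlElementAccess.getAndRepairUri.
class HrefEncodingSketch {

    // Punctuation left untouched. '%' is kept so that existing escapes survive;
    // a stray '%' would therefore still be rejected by the URI parser.
    private static final String ALLOWED_PUNCTUATION = "-._~:/?#@!$&'()*+,;=%";

    /** Percent-encodes every UTF-8 byte that is not an ASCII letter, digit or allowed punctuation. */
    static String percentEncodeIllegal(String href) {
        StringBuilder encoded = new StringBuilder();
        for (byte b : href.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            char c = (char) value;
            boolean asciiLetterOrDigit = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (asciiLetterOrDigit || ALLOWED_PUNCTUATION.indexOf(c) >= 0) {
                encoded.append(c);
            } else {
                encoded.append(String.format("%%%02X", value));
            }
        }
        return encoded.toString();
    }

    public static void main(String[] args) {
        // href copied from the stack trace in this issue
        String href = "file://./[alldeba_266928358_0001_tif/00000001.tif";
        URI uri = URI.create(percentEncodeIllegal(href));
        System.out.println(uri);           // encoded form
        System.out.println(uri.getPath()); // decoded path, still containing '['
    }
}
```

For the href from this issue, the encoded form is `file://./%5Balldeba_266928358_0001_tif/00000001.tif`, and a white space would become `%20`.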

henning-gerhardt commented 4 years ago

I don't know how the [ character was added at this position; the process title alldeba_266928358_0001 does not contain it. Could it be removed manually or during the metadata transformation?

henning-gerhardt commented 4 years ago

With your change in #3698 I can see even more illegal characters, such as ordinary white space.

matthias-ronge commented 4 years ago

I assume the mistake was there before, only now you can see it for the first time.

henning-gerhardt commented 4 years ago

Sure. I know neither the reason nor the time when these illegal characters were "added". Maybe they come from a former migration (1.5.x to 1.6.x or so). Maybe I can fix this for our data, but maybe the application should handle it as well.

matthias-ronge commented 3 years ago

> With your change in #3698 I can see even more illegal characters, such as ordinary white space.

@henning-gerhardt, could you make me a list of the illegal characters you found in paths, and of how the correct paths should look?

henning-gerhardt commented 3 years ago

There is no list, and which characters are illegal depends on many things, such as the operating system and file system in use and how you interact with such characters. I removed all illegal characters that I found ([, white space, ...) for our instance, but I won't know whether this change is correct until we have successfully migrated and checked the data.
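
For an installation that wants to collect its own list, a scan over the `FLocat` entries of a metadata file could look roughly like the following. This is a minimal sketch under assumptions: the class `FLocatHrefScan` and the method `findUnparsableHrefs` are hypothetical, the METS content is passed in as a string, and `FLocat` elements are matched by local name in any namespace. It only reports hrefs that `java.net.URI` rejects and does not decide what the correct path would be.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reporting helper; not part of Kitodo.Production.
class FLocatHrefScan {

    private static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Returns one report line per FLocat href that java.net.URI refuses to parse. */
    static List<String> findUnparsableHrefs(String metsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(metsXml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches FLocat elements regardless of their namespace
            NodeList fLocats = document.getElementsByTagNameNS("*", "FLocat");
            List<String> report = new ArrayList<>();
            for (int i = 0; i < fLocats.getLength(); i++) {
                String href = ((Element) fLocats.item(i)).getAttributeNS(XLINK_NS, "href");
                try {
                    new URI(href);
                } catch (URISyntaxException e) {
                    int index = e.getIndex();
                    String character = (index >= 0 && index < href.length())
                            ? "'" + href.charAt(index) + "'"
                            : "unknown";
                    report.add(href + " -> index " + index + ", character " + character);
                }
            }
            return report;
        } catch (Exception e) {
            throw new IllegalStateException("Could not scan METS content", e);
        }
    }
}
```

Running this over all metadata files and grouping the reported characters would give the kind of error pattern asked for above, without changing any data.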

matthias-ronge commented 3 years ago

> Should this be fixed outside of the application, or should the application handle it?

Since we don't have a clear error pattern, I would answer your initial question that such errors have to be corrected locally outside the application. Should we still be able to obtain a clear error pattern in the future, which affects several installations, then we can of course also incorporate a correction function here.