ulb-sachsen-anhalt / digital-derivans

Derive new digitals from existing ones
MIT License

NPE on getImagePath if LOCTYPE=URL #39

Open bertsky opened 1 year ago

bertsky commented 1 year ago

I have a METS where all FLocats are LOCTYPE=URL (as required by DFGViewer), but local directories FULLTEXT and MAX do exist as well.

Unfortunately, digital-derivans does not seem to like this representation:

Exception in thread "main" java.lang.NullPointerException
    at de.ulb.digital.derivans.model.DigitalPage.getImagePath(DigitalPage.java:96)
    at de.ulb.digital.derivans.DerivansPathResolver.enrichAbsoluteStartPath(DerivansPathResolver.java:289)
    at de.ulb.digital.derivans.Derivans.init(Derivans.java:120)
    at de.ulb.digital.derivans.Derivans.create(Derivans.java:169)
    at de.ulb.digital.derivans.App.main(App.java:43)

So do I have to convert the hrefs to local paths?

bertsky commented 1 year ago

I (downloaded and) inserted a secondary FLocat with LOCTYPE=OTHER OTHERLOCTYPE=FILE, but the NPE persists. Do I have to replace the remote FLocat completely?

bertsky commented 1 year ago

No, even after removing the remote URLs completely (so only the local FLocats would remain for MAX and FULLTEXT), it crashes the same.

bertsky commented 1 year ago

Ok, so perhaps the tool expects the directory name to be MAX, not just the fileGrp/@USE?

bertsky commented 1 year ago

No, not even that works.

How do you use this tool?

M3ssman commented 1 year ago

Thank you for trying it out!

Since our workflow does not use the OCR-D METS itself, I was not aware of these issues.

It has served well more than 250k times over the past 1.5 years, processing digital objects pulled via OAI from Visual Library Server (versions ranging from 2012 to 2022.06) and opendata (DSpace 6). VLS: Kurzer Bericht Von der Gegenwärtigen Verfassung Des Paedagogii Regii Zu Glaucha vor Halle Opendata: Concordantiæ Bibliorum Germanico-Hebraico-Græcæ

(The latter will hopefully be OCR'd by OCR-D.)

METS like these are both DDB-valid and processable by Derivans, if the image content of the fileGroup MAX and FULLTEXT are present in a local sub dir relative to the higher-level (work-)dir which holds the METS-file.
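
The condition above can be checked before calling Derivans. The following is a minimal sketch, not part of Derivans: the class MetsLocalFileCheck and its methods are hypothetical names, and it assumes that every xlink:href of the file group is meant to be a path relative to the directory holding the METS file. It lists all hrefs of a file group that are remote URLs or do not exist locally.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical pre-flight check; illustration only, not Derivans code. */
final class MetsLocalFileCheck {
    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Returns the hrefs of the fileGrp with the given USE that are not regular files below workDir. */
    static List<String> missingFiles(Document mets, Path workDir, String use) {
        List<String> missing = new ArrayList<>();
        NodeList groups = mets.getElementsByTagNameNS(METS_NS, "fileGrp");
        for (int i = 0; i < groups.getLength(); i++) {
            Element group = (Element) groups.item(i);
            if (!use.equals(group.getAttribute("USE"))) {
                continue;
            }
            NodeList locats = group.getElementsByTagNameNS(METS_NS, "FLocat");
            for (int j = 0; j < locats.getLength(); j++) {
                String href = ((Element) locats.item(j)).getAttributeNS(XLINK_NS, "href");
                // A remote URL can never be a file below the work dir
                boolean remote = href.contains("://");
                if (remote || !Files.isRegularFile(workDir.resolve(href))) {
                    missing.add(href);
                }
            }
        }
        return missing;
    }

    /** Parses a METS document from a string, namespace-aware. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

An empty result for both MAX and FULLTEXT would mean that the layout matches the description above; any entry in the list is a candidate for a download or a path fix.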

Another scenario just uses a flat tree without any METS but at least a MAX sub dir containing images (tif, jpg) and optional OCR-Data (FULLTEXT) to create a PDF.

One can add lots of processing steps in a configuration (which by default is expected at config/derivans.ini, but another location can be passed, too), like scaling, image quality and so on.

bertsky commented 1 year ago

METS like these are both DDB-valid and processable by Derivans, if the image content of the fileGroup MAX and FULLTEXT are present in a local sub dir relative to the higher-level (work-)dir which holds the METS-file.

I already tried that – see above.

Let me elaborate. This is the filesystem:

schütz-test/
├── FULLTEXT
│   ├── FILE_0005_FULLTEXT.xml
│   ├── FILE_0007_FULLTEXT.xml
│   ├── FILE_0008_FULLTEXT.xml
│   ├── FILE_0009_FULLTEXT.xml
│   ├── FILE_0068_FULLTEXT.xml
│   ├── FILE_0073_FULLTEXT.xml
│   ├── FILE_0075_FULLTEXT.xml
│   ├── FILE_0133_FULLTEXT.xml
│   ├── FILE_0139_FULLTEXT.xml
│   ├── FILE_0141_FULLTEXT.xml
│   ├── FILE_0204_FULLTEXT.xml
│   ├── FILE_0266_FULLTEXT.xml
│   ├── FILE_0324_FULLTEXT.xml
│   ├── FILE_0364_FULLTEXT.xml
│   ├── FILE_0416_FULLTEXT.xml
│   ├── FILE_0421_FULLTEXT.xml
│   ├── FILE_0470_FULLTEXT.xml
│   ├── FILE_0475_FULLTEXT.xml
│   ├── FILE_0477_FULLTEXT.xml
│   ├── FILE_0478_FULLTEXT.xml
│   └── FILE_0533_FULLTEXT.xml
├── MAX
│   ├── 00000001.tif.large.jpg
│   ├── 00000002.tif.large.jpg
│   ├── 00000003.tif.large.jpg
│   ├── 00000004.tif.large.jpg
│   ├── 00000005.tif.large.jpg
│   ├── 00000006.tif.large.jpg
│   ├── 00000007.tif.large.jpg
│   ├── 00000008.tif.large.jpg
│   ├── 00000009.tif.large.jpg
│   ├── 00000010.tif.large.jpg
│   ├── 00000011.tif.large.jpg
│   ├── 00000012.tif.large.jpg
│   ├── 00000013.tif.large.jpg
│   ├── 00000014.tif.large.jpg
│   ├── 00000015.tif.large.jpg
...
│   └── 00000536.tif.large.jpg
├── mets.xml
...

The METS references these as local hrefs:

    <mets:fileGrp USE="MAX">
      <mets:file ID="FILE_0001_MAX" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="MAX/00000001.tif.large.jpg"/>
      </mets:file>
      <mets:file ID="FILE_0002_MAX" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="MAX/00000002.tif.large.jpg"/>
      </mets:file>
...
      <mets:file ID="FILE_0536_MAX" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="MAX/00000536.tif.large.jpg"/>
      </mets:file>
    </mets:fileGrp>
   <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="FILE_0005_FULLTEXT" MIMETYPE="text/xml">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="FULLTEXT/FILE_0005_FULLTEXT.xml"/>
      </mets:file>
...
      <mets:file ID="FILE_0533_FULLTEXT" MIMETYPE="text/xml">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="FULLTEXT/FILE_0533_FULLTEXT.xml"/>
      </mets:file>
    </mets:fileGrp>

I am running digital-derivans like this:

java -jar /data/ocr-d/digital-derivans/target/digital-derivans-1.7.1.jar schütz-test/mets.xml

What else am I expected to do to fit your profile?

bertsky commented 1 year ago

Running with just the directory name does produce a PDF file. It is 535 MB, and all pages look like grainy rainbows:

Corrupt JPEG data: bad Huffman code

Also, I tried with non-OCR-D (straight out of Kitodo.Presentation / DFG-Viewer) already. Same problem!

bertsky commented 1 year ago

VLS: Kurzer Bericht Von der Gegenwärtigen Verfassung Des Paedagogii Regii Zu Glaucha vor Halle

Ok, so how do I make this work? This is an OAI-PMH link, so I did ocrd workspace -d paedagogii clone URL and ran on paedagogii/mets.xml. This yields:

...
ERROR de.ulb.digital.derivans.derivate.ImageDerivateer - render paedagogii/MAX/11477812.jpg: Can't read input file!
ERROR de.ulb.digital.derivans.Derivans - java.io.FileNotFoundException: paedagogii/FOOTER_80/11477799.jpg
        de.ulb.digital.derivans.DigitalDerivansException: java.io.FileNotFoundException: paedagogii/FOOTER_80/11477799.jpg
    at de.ulb.digital.derivans.derivate.PDFDerivateer.create(PDFDerivateer.java:401)
        ...

M3ssman commented 1 year ago

Regarding your first attempt: The setup and call look quite reasonable, with the difference that in our workflows the OCR-File and the Image match exactly by name. But IIRC (https://github.com/ulb-sachsen-anhalt/digital-derivans/blob/master/src/main/java/de/ulb/digital/derivans/data/MetadataStore.java#L86), they are picked by the physical links in each physical sub division, which I did not recognize in your example.

Regarding the second one: In a fresh venv with ocrd 2.49.0, I ran ocrd workspace -d paedagogii clone "https://digitale.bibliothek.uni-halle.de/vd18/oai?verb=GetRecord&metadataPrefix=mets&identifier=11448616" which downloaded only the METS and no data. Is this intended? Without the images, Derivans fails (first ERROR message). This leads to the second ERROR message, because it expects the first step to have produced a file at this location.

bertsky commented 1 year ago

with the difference that in our workflows the OCR-File and the Image match exactly by name.

You mean the base name, excluding the suffix? IMO that would be unrealistic and overly strict.

But IIRC, they are picked by the physical links in each physical sub division, which I did not recognize in your example.

I did not show the physical structMap. It looks like this:

 <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:fptr FILEID="FULLDOWNLOAD"/>
      <mets:div ID="PHYS_0001" ORDER="1" ORDERLABEL=" - " TYPE="page">
        <mets:fptr FILEID="FILE_0001_THUMBS"/>
        <mets:fptr FILEID="FILE_0001_DOWNLOAD"/>
        <mets:fptr FILEID="FILE_0001_MIN"/>
        <mets:fptr FILEID="FILE_0001_DEFAULT"/>
        <mets:fptr FILEID="FILE_0001_MAX"/>
        <mets:fptr FILEID="FILE_0001_ORIGINAL"/>
      </mets:div>
      <mets:div ID="PHYS_0002" ORDER="2" ORDERLABEL=" - " TYPE="page">
        <mets:fptr FILEID="FILE_0002_THUMBS"/>
        <mets:fptr FILEID="FILE_0002_DOWNLOAD"/>
        <mets:fptr FILEID="FILE_0002_MIN"/>
        <mets:fptr FILEID="FILE_0002_DEFAULT"/>
        <mets:fptr FILEID="FILE_0002_MAX"/>
        <mets:fptr FILEID="FILE_0002_ORIGINAL"/>
      </mets:div>
...

So, judging by the code, I guess the mets:fptr FILEID="FULLDOWNLOAD" might be a problem.

How do you run with debug logging?

Regarding the second one: Is this intended?

Like I said, I don't know what to expect. You said you have used this thousands of times on METS in your presentation. Presentation METS usually only have URLs. I already documented my odyssey trying various combinations of remote and local hrefs above.

If I do an additional …

ocrd workspace -d paedagogii find -G MAX --download
ocrd workspace -d paedagogii find -G FULLTEXT --download

… (which replaces URLs with local path refs), then it works (but without text layer).

M3ssman commented 1 year ago

To clarify, the processing used by OCR-D-ODEM is based on a list of OAI-Record-URNs and works as follows:

  1. Get URL of OAI-Record
  2. Download OAI-Record METS and strip any unrelated file groups (THUMBNAIL, DEFAULT, DOWNLOAD)
  3. Filter MAX images by (configurable) logical type and physical label
  4. For each image of this subset create a new ocr-d workspace and add only a single, local page. This is the first appearance of OCR-D (save the initLogging-call).
  5. In each OCR-D workspace, run the Makefile-based OCR-D workflow. This is run in parallel, depending on how many slots are configured (usually 8-12).
  6. Afterwards, convert all created PAGE files to ALTO and copy them up to the base working dir into the directory FULLTEXT
  7. Run Derivans to create the new PDF with text layer, outline and some more metadata
  8. Assemble all new created content (PDF, ALTO) into a new DSpace-SAF-file for presentation system opendata.uni-halle.de

Back to usage: Probably FULLTEXT section is missing.

I've tried to follow your approach like this, with a smaller print (1981185920/42053). Thank you for adding the calls to get the required contents.

ocrd workspace -d id02 clone "https://opendata.uni-halle.de/oai/dd?verb=GetRecord&metadataPrefix=mets&identifier=oai:opendata.uni-halle.de:1981185920/42053"
20:48:02.191 INFO ocrd_models.utils.is_oai_content - response data root.tag: '{http://www.openarchives.org/OAI/2.0/}OAI-PMH'
/home/m3ssman/Projekte/test-ocrd/id02
ocrd workspace -d id02 find -G FULLTEXT --download
20:48:08.790 INFO ocrd_models.utils.is_oai_content - response data root.tag: '{http://www.loc.gov/standards/alto/ns-v4#}alto'
...

ocrd workspace -d id02 find -G MAX --download
MAX/IMG_MAX_1315644.jpg
...

java -jar ../ulb-sachsen-anhalt-digital-derivans/target/digital-derivans-1.7.1.jar ~/Projekte/test-ocrd/id02/mets.xml

This in my case, creates a PDF with text layer, outline and metadata.

Probably you can get more information with a sample config. One is located at <derivans>/src/test/resources/config . Passing this, the final call would be something like java -jar ../ulb-sachsen-anhalt-digital-derivans/target/digital-derivans-1.7.1.jar -c /home/m3ssman/Projekte/ulb-sachsen-anhalt-digital-derivans/src/test/resources/config/derivans.ini ~/Projekte/test-ocrd/id02/mets.xml

In production environments it's ensured that the required configs are located in a sub directory config directly below the Derivans jar file, like this:

- bin/digital-derivans.jar
- bin/config/derivans.ini
- bin/config/derivans_logging.xml
- bin/config/template_footer.png

bertsky commented 1 year ago

So, judging by the code, I guess the mets:fptr FILEID="FULLDOWNLOAD" might be a problem.

It was the problem with my own dataset. After stripping the existing FULLDOWNLOAD fptr (together with all the other steps described above), derivans does process the METS. Unfortunately, I end up with the same broken result I get when just passing the directory: garbage rainbows without any text.
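
One way to strip such an fptr is plain DOM manipulation. The following sketch uses hypothetical names (FptrStripper, removeFptr) and is only an illustration: it removes every mets:fptr with the given FILEID, wherever it sits in the document, and leaves writing the result back to disk to the caller.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper; illustration only, not Derivans code. */
final class FptrStripper {
    static final String METS_NS = "http://www.loc.gov/METS/";

    /** Removes every mets:fptr whose FILEID equals fileId and returns how many were removed. */
    static int removeFptr(Document mets, String fileId) {
        NodeList fptrs = mets.getElementsByTagNameNS(METS_NS, "fptr");
        // Collect first: the NodeList is live and would shrink during removal
        List<Element> matches = new ArrayList<>();
        for (int i = 0; i < fptrs.getLength(); i++) {
            Element fptr = (Element) fptrs.item(i);
            if (fileId.equals(fptr.getAttribute("FILEID"))) {
                matches.add(fptr);
            }
        }
        for (Element fptr : matches) {
            fptr.getParentNode().removeChild(fptr);
        }
        return matches.size();
    }

    /** Parses a METS document from a string, namespace-aware. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

Calling removeFptr(doc, "FULLDOWNLOAD") on a structMap like the one shown earlier would drop the pointer on the physSequence div and keep the per-page pointers.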

To clear out, the processing used by OCR-D-ODEM is based on a list of OAI-Record-URNs and works as follows: [...]

Thanks for that explanation. So you are not using the METS yourself, only the directory. What I still find missing is at what point you download the MAX images. Do you just copy them over from the OCR-D workspaces (together with the FULLTEXT files)?

Back to usage: Probably FULLTEXT section is missing.

In my case? No. See above.

This in my case, creates a PDF with text layer, outline and metadata.

Like I said, I don't get a text layer, regardless of what dataset (yours or mine). Perhaps I need some configuration file?

So what setting there influences whether or not a text layer gets added?

M3ssman commented 1 year ago

There's no additional setting required. If the physical div has links to both MAX image and FULLTEXT OCR, it usually works on Ubuntu 18+20+22 OS built with OpenJDK 11 and Maven 3.6.
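
Whether a physical div links both kinds of files can be checked independently of Derivans. The sketch below uses hypothetical names (PageLinkCheck, pagesWithoutFulltext) and assumes that page divs carry TYPE="page" and that each mets:file sits directly inside its mets:fileGrp. It reports the pages that link a MAX file but no FULLTEXT file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical diagnostic; illustration only, not Derivans code. */
final class PageLinkCheck {
    static final String METS_NS = "http://www.loc.gov/METS/";

    /** Maps every mets:file ID to the USE attribute of its parent fileGrp. */
    static Map<String, String> useByFileId(Document mets) {
        Map<String, String> uses = new HashMap<>();
        NodeList files = mets.getElementsByTagNameNS(METS_NS, "file");
        for (int i = 0; i < files.getLength(); i++) {
            Element file = (Element) files.item(i);
            Element group = (Element) file.getParentNode();
            uses.put(file.getAttribute("ID"), group.getAttribute("USE"));
        }
        return uses;
    }

    /** Returns the IDs of page divs that link a MAX file but no FULLTEXT file. */
    static List<String> pagesWithoutFulltext(Document mets) {
        Map<String, String> uses = useByFileId(mets);
        List<String> pages = new ArrayList<>();
        NodeList divs = mets.getElementsByTagNameNS(METS_NS, "div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!"page".equals(div.getAttribute("TYPE"))) {
                continue;
            }
            // Collect the file group names reachable from this page
            Set<String> linked = new HashSet<>();
            NodeList fptrs = div.getElementsByTagNameNS(METS_NS, "fptr");
            for (int j = 0; j < fptrs.getLength(); j++) {
                linked.add(uses.get(((Element) fptrs.item(j)).getAttribute("FILEID")));
            }
            if (linked.contains("MAX") && !linked.contains("FULLTEXT")) {
                pages.add(div.getAttribute("ID"));
            }
        }
        return pages;
    }

    /** Parses a METS document from a string, namespace-aware. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

Pages in the returned list would not be expected to receive a text layer, since nothing in the structMap connects them to OCR data.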

If config (cf. above) and logging are in place, please inspect the log for messages like:

...
2023-04-16 12:16:35 [DEBUG] (MetadataStore:95) create digital page from [MAX=>00000004]
2023-04-16 12:16:35 [DEBUG] (MetadataStore:149) [phys320802] contentids 'urn:nbn:de:gbv:3:3-10504-p0004-5'
2023-04-16 12:16:35 [TRACE] (MetadataStore:106) search for FULLTEXT within [MAX=>00000004]
2023-04-16 12:16:35 [DEBUG] (MetadataStore:95) create digital page from [MAX=>00000005, FULLTEXT=>320805.xml]
2023-04-16 12:16:35 [DEBUG] (MetadataStore:149) [phys320805] contentids 'urn:nbn:de:gbv:3:3-10504-p0005-1'
2023-04-16 12:16:35 [TRACE] (MetadataStore:106) search for FULLTEXT within [MAX=>00000005, FULLTEXT=>320805.xml]
2023-04-16 12:16:35 [DEBUG] (MetadataStore:177) found ocr data file '/tmp/junit17151919262435161774/148811035/FULLTEXT/320805.xml'
2023-04-16 12:16:35 [DEBUG] (MetadataStore:160) [phys320805] enriched ocr data with '13' lines
...

This means that for page 4 no OCR data was present (nothing enriched), but starting with page 5 the metadata points to OCR files. In this test case it's OCR-D transformed ALTO.

At PDF creation time it is processed like this:

2023-04-16 11:48:45 [INFO ] (PDFDerivateer:472) created pdf '/tmp/junit2182737113542557775/pdf-image-0020.pdf' with 20 pages (outline:true)
2023-04-16 11:48:45 [INFO ] (PDFDerivateer:104) set dpi for image scaling 144
2023-04-16 11:48:45 [INFO ] (PDFDerivateer:99) debugRender: false
2023-04-16 11:48:45 [INFO ] (PDFDerivateer:395) PDF scale 0.5, dpi: 144 (orig.: 575.0x799.0)
2023-04-16 11:48:45 [INFO ] (PDFDerivateer:397) Firstpage: 575.0x799.0)
2023-04-16 11:48:45 [INFO ] (PDFDerivateer:399) Firstpage scaled: 287.5x399.5)
2023-04-16 11:48:45 [DEBUG] (PDFDerivateer:134) re-set document pageSize 287.5x399.5
2023-04-16 11:48:45 [INFO ] (PDFDerivateer:152) addPage rescale: 287.5x399.5
2023-04-16 11:48:45 [DEBUG] (PDFDerivateer:162) handle optional ocr for /tmp/junit11875524951958479536/MAX/0001.jpg
2023-04-16 11:48:45 [TRACE] (PDFDerivateer:181) scale ocr data for '/tmp/junit11875524951958479536/MAX/0001.jpg' by '0.5'
2023-04-16 11:48:45 [TRACE] (PDFDerivateer:194) render text at line-level
2023-04-16 11:48:45 [DEBUG] (PDFDerivateer:134) re-set document pageSize 287.5x399.5

These messages indicate that recognition and parsing took place and that the OCR data had to be scaled to match the scaled target image. This is because we scale images for the PDF to reduce its size for reading on screen.

Output like 2023-04-16 11:48:46 [INFO ] (PDFDerivateer:217) no ocr data present for '/tmp/junit16280214345017219749/MAX/0001.jpg'' means that no OCR data was found or enriched.

Maybe the OCR data isn't properly recognized? OCRReaderFactory knows ALTO3, ALTO4 and PAGE2019. See the test resources for ALTO and PAGE.

Is it possible to provide some test data for analytical purposes?

bertsky commented 1 year ago

Ok, we are getting there. I have copied src/test/resources/config/derivans.ini and src/test/resources/config/derivans_logging.xml to target/config/ – this should be in the README!

Now I can see log messages.

A new problem arose after the first successful run (with the garbled colours): when digital-derivans added the PDF to my METS, it created invalid identifiers!

Looks like the file name and file ID is based on whatever mods:recordInfo/mods:recordIdentifier it could find. In my case – only the OAI identifier: oai:de:slub-dresden:db:id-500088063. This obviously is not a valid XML identifier. So next time, digital-derivans crashes on that METS:

13:55:51.444 [main] ERROR org.mycore.mets.model.Mets - Error parsing and validating mets document
org.xml.sax.SAXException: Encountered a SAX exception processing the Document: 
    at org.jdom2.transform.JDOMSource$DocumentReader.parse(JDOMSource.java:565) ~[digital-derivans-1.7.1.jar:?]
    at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorHandlerImpl.validate(ValidatorHandlerImpl.java:731) ~[?:?]
    at com.sun.org.apache.xerces.internal.jaxp.validation.ValidatorImpl.validate(ValidatorImpl.java:101) ~[?:?]
    at javax.xml.validation.Validator.validate(Validator.java:124) ~[?:?]
    at org.mycore.mets.model.Mets.isValid(Mets.java:739) [digital-derivans-1.7.1.jar:?]

bertsky commented 1 year ago

Also, digital-derivans converted my METS from LF to CRLF convention for EOL. It's debatable whether this is still correct, but it's unexpected.

Another problem: the PDF gets referenced as fptr in the logical structMap. That's plain wrong according to DFG profile – it should be in the physical structMap.

bertsky commented 1 year ago

Looks like the file name and file ID is based on whatever mods:recordInfo/mods:recordIdentifier it could find. In my case – only the OAI identifier: oai:de:slub-dresden:db:id-500088063. This obviously is not a valid XML identifier.

The problem is in src/main/java/de/ulb/digital/derivans/data/DescriptiveDataBuilder.java. Worse even, it just blindly assumes that whatever resides under that path is a URN: optUrn = identifiers.stream().filter(sourceExists).findFirst();

IMO it should simply look for mods:identifier, preferably those of @type="urn".

But even then – to use this directly as PDF file name and XML identifier for it is just wrong. It should at least convert colons to underscores. See DerivansPathResolver.calculatePDFPath().
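
A minimal sketch of such a conversion, assuming a conservative ASCII-only subset of what XML allows in an ID (letters, digits, '.', '-' and '_', not starting with a digit, '.' or '-'). The names XmlIdSanitizer and toXmlId are made up for illustration and are not part of DerivansPathResolver.

```java
/** Hypothetical helper; illustration only, not Derivans code. */
final class XmlIdSanitizer {

    /**
     * Replaces every character outside [A-Za-z0-9._-] with '_' and
     * prefixes '_' when the result does not start with a letter or '_'.
     */
    static String toXmlId(String raw) {
        String cleaned = raw.replaceAll("[^A-Za-z0-9._-]", "_");
        if (!cleaned.matches("[A-Za-z_].*")) {
            // Covers the empty string as well as leading digits, dots and hyphens
            cleaned = "_" + cleaned;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(toXmlId("oai:de:slub-dresden:db:id-500088063"));
    }
}
```

For the OAI identifier above this yields oai_de_slub-dresden_db_id-500088063, which is usable both as a file name and as an xs:ID. Different inputs can collide after replacement, so a real implementation might need an additional uniqueness check.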

M3ssman commented 1 year ago

The decisions about which identifier to use for which purpose reflect our in-house workflows.

I have copied src/test/resources/config/derivans.ini and src/test/resources/config/derivans_logging.xml to target/config/ – this should be in the README!

I've tried to do this in the README, but it seems to be unclear. Please add critical remarks and open a PR to help this out.

Another problem: the PDF gets referenced as fptr in the logical structMap. That's plain wrong according to DFG profile – it should be in the physical structMap.

What DFG- profile do you mean?

This insertion is DDB-valid. Further, it is correctly displayed and linked in the DFG-Viewer. Try a digital object from Share_it or Share_DIGit, they are almost all done this way. Also, enterprise components like visual library or zeutschel do it like this.

bertsky commented 1 year ago

I've tried to do this in the README, but it seems to be unclear. Please add critical remarks and open a PR to help this out.

Oh sorry, I remember reading this now. It looked complicated... At least a reference to the example configs under src/test/resources/config/ would help. But perhaps we need better defaults. Also, for logging – at least some simple log level override – IMHO there should be a command line option.

Another problem: the PDF gets referenced as fptr in the logical structMap. That's plain wrong according to DFG profile – it should be in the physical structMap.

What DFG- profile do you mean?

I meant the DFG profile for METS. But now that I went looking, surprisingly I cannot find any specifics for PDF in there, except for the mention of the dedicated DOWNLOAD fileGrp.

It did enter the OCR-D spec on METS though. There it says to use fptr in the top-level div of the physical structMap.

Looking at the code base for DFG Viewer, Kitodo.Presentation, it appears like both are supported: fptr under physical and fptr under logical.

I am somewhat perplexed. How come this important detail never entered any official documentation?

This insertion is DDB-valid. Further, it is correctly displayed and linked in the DFG-Viewer.

Indeed.

Try a digital object from Share_it or Share_DIGit, they are almost all done this way. Also, enterprise components like visual library or zeutschel do it like this.

Ok, so at least SLUB (which also uses Zeutschel for OCR) puts it in the physical structMap. But since both options are allowed, I guess we can as well keep it as it is.

Now, coming back to my problem with generated images. This is what a page in MAX looks like: [image: 00000001 tif large]

And this is what digital-derivans generates under IMAGE_90 etc.: [image: 00000001]

They all look like this.

ImageMagick complains about them like so:

identify-im6.q16: Corrupt JPEG data: bad Huffman code `schütz-test/IMAGE_90/00000001.jpg' @ error/jpeg.c/JPEGErrorHandler/322.

There's nothing special in the logs.

If you want to reproduce, here is the presentation without full text, and here is a version compatible with DFG Viewer which contains full text on selected pages. (You have to do the preprocessing as described above to get it working with digital-derivans.)

So what could be causing these broken images?

M3ssman commented 1 year ago

I'll have a look at this and report back.

M3ssman commented 1 year ago

Okay, I could reproduce the effect with the OAI record data you provided. Have a look at the branch https://github.com/ulb-sachsen-anhalt/digital-derivans/tree/fix/jpg-render-baseline which contains my first guess to fix this behavior. You'll need a locally installed OpenJDK 11 and Maven 3.6 to execute mvn clean package, which will generate a new Derivans version in the target dir, suffixed as SNAPSHOT. (And please consider my remarks about the integration of the test resource data.)

bertsky commented 1 year ago

Just found out that mvn clean also removes my target/config ...

bertsky commented 1 year ago

Have a look at the branch https://github.com/ulb-sachsen-anhalt/digital-derivans/tree/fix/jpg-render-baseline which contains my first guess to fix this behavior.

That worked!

Now I can see correct JPEGs and I also get the text layer.

One more problem: the global setting default_quality is ignored in derivates that have no quality setting.

M3ssman commented 1 year ago

Great to hear!

Can you please transfer the quality setting to a new issue? But don't close this one, I want to integrate some more test cases regarding this before turning to the next topic.

bertsky commented 1 year ago

Can you please transfer the quality setting to a new issue?

Sure, I'll spawn a new issue for each problem I found along the way. For some of them, I already have fixes.

BTW, while compiling, I was surprised to see an exception with stacktrace – apparently, one of the METS files in the test set is not valid. Is that intentional? Instrumenting with a log message that shows the affected file name, and then validating against the METS schema externally, I found out this much:

xmllint --noout --schema ../mets.xsd src/test/resources/mets/vls/vd18-9989442.ulb.xml
src/test/resources/mets/vls/vd18-9989442.ulb.xml:323: element file: Schemas validity error : Element '{http://www.loc.gov/METS/}file', attribute 'ID': 'IMG_MAX_10000000' is not a valid value of the atomic type 'xs:ID'.
src/test/resources/mets/vls/vd18-9989442.ulb.xml:510: element div: Schemas validity error : Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'phys10000000' is not a valid value of the atomic type 'xs:ID'.
src/test/resources/mets/vls/vd18-9989442.ulb.xml fails to validate

The reason seems to be that these identifiers appear multiple times.
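
Duplicate IDs like these can also be listed directly. A small sketch with the plain JDK DOM API (DuplicateIdFinder is a hypothetical class name) that collects every ID attribute value used by more than one element:

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical diagnostic; illustration only, not Derivans code. */
final class DuplicateIdFinder {

    /** Returns all ID attribute values that occur on more than one element. */
    static Set<String> duplicateIds(Document doc) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (!element.hasAttribute("ID")) {
                continue;
            }
            String id = element.getAttribute("ID");
            if (!seen.add(id)) {
                duplicates.add(id); // add() returned false: value was seen before
            }
        }
        return duplicates;
    }

    /** Parses a document from a string, namespace-aware and without schema validation. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

Because the parser is not validating here, the document loads despite the duplicates, and the check only looks at attributes literally named ID.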

bertsky commented 1 year ago

Further, regarding the tests involved in mvn install: IIUC this will prefer any ./config/derivans.ini if present, overruling the repo's own src/test/java/de/ulb/digital/derivans/config, which is needed for correct tests.

IOW when I place my own config under target, it will get removed by the build. If I place it in the root, it will mask the configs needed for the build. :roll_eyes:

M3ssman commented 1 year ago

Nay, you may create something like <home>/Tools/derivans and place your configuration there, like Tools/derivans/config. After each new build, copy the new jar (this is the only thing you really need) into this Tools/derivans dir. This way any local configurations are completely independent from building.

The configs from the build are just test sample configs and not meant to be used in productive scenarios.

Don't worry about the error messages during the build; this is intended behavior. It is better to know how an application deals with unknown or corrupt data, since such data is quite common in massive workflows mixed with legacy material from the past 20 years.

Please note that one can completely turn off test execution when building with Maven via mvn clean package -DskipTests, which will also speed up the build.

M3ssman commented 1 year ago

As it turned out, enforcing progressive rendering, which is the actual workaround to avoid the pinked-up images, has a severe impact on performance. Test cases take 100% more time to finish, which is not acceptable if unnecessary. I'll try to get some more insights from the image data to trigger this only, if it's likely required.

In the long term there can be a config flag which controls this.

bertsky commented 1 year ago

As it turned out, enforcing progressive rendering, which is the actual workaround to avoid the pinked-up images, has a severe impact on performance. Test cases take 100% more time to finish, which is not acceptable if unnecessary. I'll try to get some more insights from the image data to trigger this only, if it's likely required.

Couldn't we have some initial conversion step (only on the input side, before any image derivates are generated) to get rid of these formats?

M3ssman commented 1 year ago

Couldn't we have some initial conversion step (only on the input side, before any image derivates are generated) to get rid of these formats?

Yes! That is exactly the way I've got in mind!
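
As a starting point for such an input-side check, here is a hedged sketch that reads the frame type of a JPEG from its markers. It assumes that the relevant difference between the input files is the encoding mode (baseline vs. progressive), which has not been confirmed above. JpegFrameType and its methods are hypothetical names, not Derivans code.

```java
/** Hypothetical probe; illustration only, not Derivans code. */
final class JpegFrameType {

    /**
     * Walks the marker segments of a JPEG stream and returns the first
     * start-of-frame marker byte (0xC0 = baseline, 0xC2 = progressive),
     * or -1 if the data is not a JPEG or no frame header is found.
     */
    static int frameMarker(byte[] data) {
        if (data.length < 4 || (data[0] & 0xFF) != 0xFF || (data[1] & 0xFF) != 0xD8) {
            return -1; // missing SOI marker
        }
        int pos = 2;
        while (pos + 3 < data.length) {
            if ((data[pos] & 0xFF) != 0xFF) {
                return -1; // lost marker alignment
            }
            int marker = data[pos + 1] & 0xFF;
            if (marker == 0xFF) {
                pos++; // fill byte before the actual marker
                continue;
            }
            if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD7)) {
                pos += 2; // standalone marker without length field
                continue;
            }
            // SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC)
            boolean startOfFrame = marker >= 0xC0 && marker <= 0xCF
                    && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
            if (startOfFrame) {
                return marker;
            }
            if (marker == 0xDA || marker == 0xD9) {
                return -1; // scan data or end of image reached first
            }
            // Segment length is big-endian and includes its own two bytes
            int length = ((data[pos + 2] & 0xFF) << 8) | (data[pos + 3] & 0xFF);
            pos += 2 + length;
        }
        return -1;
    }

    /** True if the first frame header marks the image as progressive (SOF2). */
    static boolean isProgressive(byte[] data) {
        return frameMarker(data) == 0xC2;
    }
}
```

With a probe like this, the slower rendering path or a prior conversion could be limited to the files that actually need it, instead of being applied to every input image.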