kitodo / kitodo-ugh

Kitodo.UGH Library

UGH Conversion unclear #82

Closed M3ssman closed 4 years ago

M3ssman commented 5 years ago

Hello,

I'm struggling with the UghConverter component. To be more precise, I'm trying to evaluate how a predefined ruleset affects the transformation of Pica data into Kitodo/Goobi metadata.

To do so, I tried to convert PicaPlus field data into dvmets, but the converter rejects the provided Pica data: Unable to parse input file '/home/hartwig/Projekte/ulb-dd-ktopro-ruleset-validation/target/test-classes/lhhal_response/lhhal-147764068-03.xml'

I modeled the file on what the Kitodo PicaPlugin delivers (or at least on what I expect it to deliver). The file lhhal-147764068-03.xml currently looks like this:

<record>
  <field tag="01A">
    <subfield code="0">0003:22-03-94</subfield>
  </field>
  <field tag="01B">
    <subfield code="0">0003:04-10-18</subfield>
    <subfield code="t">15:13:25.000</subfield>
  </field>
...
etc.

Could you please explain what's missing? How does Kitodo.Production transform the Pica response data into its own metadata format?

Thanks in advance!

henning-gerhardt commented 5 years ago

I was not, and still am not, familiar with this class. It is not really used in normal Kitodo.Production operation and has not been updated for years.

I got this "converter" class to run with the following parameters: --config <ruleset-file.xml> --input <input-pica-plus-file.xml> --read picaplus --output <outputfile> --write <output format>

The PicaPlus file must look like this:

<collection>
  <record>
    <field tag="001@">
      <subfield code="0">2006</subfield>
    </field>
....
  </record>
</collection>

So I think the enclosing collection tag is missing from your data. Somewhere in the Kitodo.Production 2.x code, this tag is added after the data has been retrieved from the PicaPlus catalogue.
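
For test data prepared by hand, the missing wrapper could be added programmatically. The following is only a minimal sketch that uses the JDK's DOM and Transformer APIs, not UGH or Kitodo.Production code; the class name PicaCollectionWrapper and the method wrapInCollection are hypothetical, and it assumes the input is a single well-formed XML document whose root is either record or collection.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of UGH: wraps a bare <record> in <collection>.
final class PicaCollectionWrapper {

    static String wrapInCollection(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            // Only wrap when the root is not already <collection>.
            if (!"collection".equals(root.getTagName())) {
                Element collection = doc.createElement("collection");
                doc.removeChild(root);          // detach the old root
                doc.appendChild(collection);    // make <collection> the new root
                collection.appendChild(root);   // re-attach the old root inside it
            }

            // Serialize back to a string without the XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Calling wrapInCollection on a document that already has a collection root leaves the structure unchanged, so the helper can be applied to mixed test fixtures without checking each file first.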

M3ssman commented 5 years ago

Thanks for your explanations!

After changing the XML as suggested, it still fails. I set up a simple test that follows the converter internals in UghConvert (lines 264 ff.), like this:

String absPathPicaPlus = getAbsolutePath("lhhal_response/lhhal-147764068-04.xml");
String absolutePathPreference = getAbsolutePath("rulesets/ruleset_uh.xml");

Prefs preferences = new Prefs();
preferences.loadPrefs(absolutePathPreference);
Fileformat fileFrom = new PicaPlus(preferences);
DigitalDocument digDoc = new DigitalDocument();
fileFrom.setDigitalDocument(digDoc);

assertTrue(fileFrom.read(absPathPicaPlus));
DigitalDocument myDocument = fileFrom.getDigitalDocument();
DocStruct log = myDocument.getLogicalDocStruct();
assertNotNull(log);

This time, an NPE is thrown in DigitalDocument#setLogicalDocStruct:180. It comes from PicaPlus:659, which tries to set a property on a DigitalDocument that is still null at this point.

It all seems to originate from PicaPlus#parsePicaPlusRecord, which itself returns null for the provided XML (see the XML in the attached ZIP). The ruleset itself is used in other test scenarios and on our test system, so it is supposed to work. Attachments: lhhal-147764068-04.zip ruleset_uh.zip

Is there a test specification for this somewhere? The code in PicaPlus#parsePicaPlusRecord and PicaPlus#parsePicaPlusField is rather long and hard to debug.

So sad! Creating and maintaining Kitodo rulesets is really hard. Any tool that could be used to validate this configuration would be very welcome ... but UghConvert doesn't look like the silver bullet.

I'm using kitodo-ugh V2.1.0, the version that is bundled with Kitodo.Production V2.1.0.

henning-gerhardt commented 5 years ago

I spent a little time on your data and found the following:

For example, the field with tag 02@, subfield code 0, contains the media type information, but UGH and your ruleset expect this information in field 002@, subfield code 0 (see lines 4274 ff. of your ruleset file).
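
A quick way to see under which tag a record actually carries a value is an XPath lookup. This is a minimal sketch using only the JDK's XPath API; it does not reproduce how UGH reads the record. The class name PicaFieldProbe and the method subfieldValue are hypothetical, and the tag and code arguments are assumed to be trusted test input because they are placed directly into the expression.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of UGH: looks up one subfield value by tag and code.
final class PicaFieldProbe {

    /** Returns the text of the first matching subfield, or "" if there is none. */
    static String subfieldValue(String xml, String tag, String code) {
        // Assumes tag and code contain no single quotes (trusted test input).
        String expression = String.format(
                "//field[@tag='%s']/subfield[@code='%s']", tag, code);
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }
}
```

With data shaped like the record discussed here, subfieldValue(xml, "002@", "0") would return an empty string while subfieldValue(xml, "02@", "0") would return the media type, which makes the tag mismatch visible before the file is handed to the converter.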

I suggest giving the field tags in your data more leading zeros (field tags should be 4 or more characters long) so that they also match the fields that are hard-coded inside the UGH library. You should also adapt your ruleset file to the media type values in your data (e.g. Aavs is not mapped to any of the defined docstruct types). In Kitodo.Production you can handle this with the beautifier rules in the goobi_opac.xml file, e.g. by mapping Aavs to Ob (maybe a bad example, but I don't know what Aavs means).

The UGH code does not contain a single line of test code, so there are no test scenarios and no tested code, nor was the code written with testing in mind :-(

Regarding tools: for 2.x there are no tools at all. In Kitodo.Production 3.x, data import is handled differently. Maybe @Kathrin-Huber can explain the import behaviour in 3.x in more detail?
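
The leading-zero adjustment can be sketched as a small normalization step. This is an illustration only: the class name PicaTagPadder and the method padTag are hypothetical, and the target width of four characters is an assumption taken from the suggestion above, not a rule confirmed by the UGH sources.

```java
// Hypothetical helper, not part of UGH: left-pads short field tags with zeros.
final class PicaTagPadder {

    // Assumed target width, based on tags such as 001@ and 002@.
    private static final int TAG_WIDTH = 4;

    /** Pads the tag with leading zeros up to TAG_WIDTH; longer tags are returned unchanged. */
    static String padTag(String tag) {
        StringBuilder padded = new StringBuilder(tag);
        while (padded.length() < TAG_WIDTH) {
            padded.insert(0, '0');
        }
        return padded.toString();
    }
}
```

Under that assumption, padTag("02@") yields "002@" and padTag("01A") yields "001A", while a tag that is already four characters long, such as "001@", is left as it is.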

M3ssman commented 5 years ago

Thanks for your efforts!

I've added the leading zero to the data fields, but the corresponding tests still crash.

Unfortunately, I'm about to leave for roughly 3 weeks of vacation. I'll forward your insights and comments to my colleague; maybe they will come up with more ideas to keep this issue going!

Greets-n-Thanks!

Kathrin-Huber commented 4 years ago

Is this still an issue?

M3ssman commented 4 years ago

@Kathrin-Huber No, sorry, I forgot to close it.