Closed samalloing closed 1 year ago
I took a look at this to try to retrace my decision to use Set over the other arities. In general I would expect Set to contain multiple values in a Java context. It's similar to an Array but does not support duplicate entries. According to this documentation, the Set arity maps to java.util.Set
, which matches my understanding - a collection without duplicates. I probably used a Set within a Set because there should not be duplicate font records listed and there should not be duplicate entries on the font records (the font names showing up twice, for example). That's not to say it was the best decision - now that I'm looking at the other modules I see Set isn't used much in this sort of arrangement - Array or List is used more often. Also, if using Array, we couldn't use String[] as there are two sub-properties nested in the font property - one for font name and the other a boolean to indicate whether the font is embedded.
Maybe someone could verify that I'm understanding the use of these arities correctly and confirm whether this should be changed? For extra context, the EPUB representation info is here - noting now there is an extra "s" on one of the "fonts" that I'm about to go fix! I also took a screenshot of the arrangement being referred to:
@samalloing do you believe that this is a bug as such and that there may be a reason to have repeat entries? I ask because if that's the case then Set is quite clearly the wrong construct. The nature of the issue isn't quite clear to me.
Hi Carl,
I understand the reasoning of Karen. I think there shouldn't be duplicated entries. But I don't know if Set is the correct construct. I see in PDF for example that it there is no Set used. Because in our software only the first entry is shown when a Set is processed. I was wondering if Set was correct here and if it shouldn't be different like other modules (but ePub could be right and the rest wrong). Karen also made this observation in her post
I hope made myself more clear now.
Sam
Hi, @samalloing. I'm doing a bit of triage and have come back to this. My feeling is that the processing software should treat Sets like any other collection type. The Set only protects against genuine duplicates, but it's still a collection and can have multiple entries. I've started to move to Sets for other purposes, again with the aim of eliminating duplicate entries without writing code to do so. If this is a real problem there might be a compromise solution by converting the set to an array but that shouldn't be necessary.
Hi Carl,
Thanks for getting back on this. I think you are right. Great to see that other parts of the code are getting changed to sets as well.
Thanks for thinking on this with me. This gives me more insight in the workings of jHove!
Sam
Not a problem Sam, closing this now.
The epub module reports a font list as a type of Set, when we automatically process the XML report for example Sets are processed as non repeatable fields. What happens is that only one entry of the fonts is stored in our system. I think the processing in our system is correct and that the font list should be of type Array (like Resources is).
https://github.com/openpreserve/jhove/blob/integration/jhove-ext-modules/src/main/java/org/ithaka/portico/jhove/module/EpubModule.java#L368 should be changed to type String[] so the output will be of Arity Array instead of Set. `
` cc @rosetta-format-library