hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Find testfiles for missing subject (Alma) #1222

Closed TobiasNx closed 3 years ago

TobiasNx commented 3 years ago

MARC field | Content

TobiasNx commented 3 years ago

647, 654, 657, 662

Found no test resources for these fields.

dr0i commented 3 years ago

In general, it is already possible to search for those fields in the XML data: just add the XML file to "src/test/resources/alma/almaMarcXmlTestFiles.xml.tar.bz2" and adapt the AlmaTest in https://github.com/hbz/lobid-resources/blob/a7794a371ddb0eee8425e1b0ff366e9c8d8ea5a5/src/test/java/org/lobid/resources/AlmaMarc21XmlToLobidJsonTest.java#L57 to something like this:

private static final String PATTERN_TO_IDENTIFY_XML_RECORDS = ".*662\"";

But oh, is this slow: while it is lightning fast when the field is at the beginning of an XML record (like 001), it takes about one second per document to filter out matches with 662". I did this nonetheless and stopped the program after 66k documents (20 h!); no hits so far.

Not sure if I have to think about a faster way to find those XMLs, or if we just wait for someone to point out that a field is missing by providing an example we can work with. WDYT @TobiasNx ?

dr0i commented 3 years ago

Ah, when using the pattern without the wildcard it's back to being ultra fast :) So, just use e.g. 662 as the pattern, not .*662.
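A minimal sketch of why dropping the wildcard should not change which records are found, assuming the pattern is applied with `Matcher.find()` (the thread does not show how AlmaMarc21XmlToLobidJsonTest actually applies it). The class `WildcardPatternSketch`, its method names, and the sample records are hypothetical and only serve as illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not code from lobid-resources.
class WildcardPatternSketch {
    // Pattern as first proposed in the thread, with a leading wildcard.
    static final Pattern WITH_WILDCARD = Pattern.compile(".*662\"");
    // Same pattern without the wildcard.
    static final Pattern WITHOUT_WILDCARD = Pattern.compile("662\"");

    // Assumption: records are tested with find(), i.e. a substring search.
    static boolean hitsWithWildcard(String xml) {
        return WITH_WILDCARD.matcher(xml).find();
    }

    static boolean hitsWithoutWildcard(String xml) {
        return WITHOUT_WILDCARD.matcher(xml).find();
    }

    public static void main(String[] args) {
        // Made-up sample records; real Alma MARC XML is much larger.
        String withField = "<record><controlfield tag=\"001\">1</controlfield>\n"
                + "<datafield tag=\"662\" ind1=\" \" ind2=\" \"/></record>";
        String withoutField = "<record><controlfield tag=\"001\">1</controlfield></record>";

        System.out.println(hitsWithWildcard(withField) == hitsWithoutWildcard(withField));
        System.out.println(hitsWithWildcard(withoutField) == hitsWithoutWildcard(withoutField));
    }
}
```

With `find()`, the leading `.*` adds nothing to the decision whether a record matches, but in a backtracking engine such as `java.util.regex` it can make the matcher consume and backtrack over the rest of the line from each start position, which would fit the slowdown reported above.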

dr0i commented 3 years ago

I also couldn't find any XML with the fields "647, 654, 657, 662" when filtering the update.xml.bgzf, which contains 100k XML records, using the OR pattern described in the wiki: How-to-find-particular-XML-records. I think we can close this issue then - WDYT @TobiasNx ?
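The exact OR pattern is documented on the wiki page and not reproduced in this thread. As a rough sketch of what such an alternation could look like, assuming the fields appear as MARC XML `tag="…"` attributes and the pattern is applied with `Matcher.find()`: the class `OrPatternSketch`, the method `containsAnyField`, and the regular expression itself are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not the pattern from the wiki.
class OrPatternSketch {
    // Assumed alternation over the four fields from this issue, no leading wildcard.
    static final Pattern ANY_MISSING_FIELD =
            Pattern.compile("tag=\"(647|654|657|662)\"");

    // Returns true if the record contains at least one of the four field tags.
    static boolean containsAnyField(String xml) {
        return ANY_MISSING_FIELD.matcher(xml).find();
    }

    public static void main(String[] args) {
        // Made-up sample records.
        String with654 = "<record><datafield tag=\"654\" ind1=\" \" ind2=\" \"/></record>";
        String with650 = "<record><datafield tag=\"650\" ind1=\" \" ind2=\" \"/></record>";

        System.out.println(containsAnyField(with654));
        System.out.println(containsAnyField(with650));
    }
}
```

Anchoring on `tag="` is a design choice of this sketch: it avoids hits on the same digits inside other attributes or text content, at the cost of assuming a fixed attribute spelling.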

TobiasNx commented 3 years ago

+1 For now we have our examples and corresponding tickets. So yes, and thanks for finding a new, faster way to find specific test files.