Closed by TobiasNx 3 years ago
647, 654, 657, 662
Found no test resources for these fields.
In general it is already possible to search for those fields in XML data: just place the XML file in "src/test/resources/alma/almaMarcXmlTestFiles.xml.tar.bz2" and adapt the AlmaTest in https://github.com/hbz/lobid-resources/blob/a7794a371ddb0eee8425e1b0ff366e9c8d8ea5a5/src/test/java/org/lobid/resources/AlmaMarc21XmlToLobidJsonTest.java#L57 to something like this:
private static final String PATTERN_TO_IDENTIFY_XML_RECORDS = ".*662\"";
But oh, is this slow: while it is lightning fast when the field is at the beginning of an XML record (like `001`), it takes about one second to filter out matches with `662"`. I did this nonetheless and stopped the program after 66k documents (20 h!), with no hits so far.
Not sure if I have to think about a faster way to find those XMLs, or if we just wait for someone to point out that it's missing by giving us an example we can work with. WDYT @TobiasNx ?
Ah, when using the pattern without the wildcard it's back to being ultra fast again :)
So, just use e.g. `662` as the pattern, not `.*662`.
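A minimal sketch of why the leading wildcard can be dropped, assuming the pattern is applied with `Matcher.find()` semantics (the thread does not show how the test applies `PATTERN_TO_IDENTIFY_XML_RECORDS`). The class, the helper `containsField`, and the sample record are hypothetical and not part of lobid-resources. With `find()`, both patterns give the same answer. The slowdown is likely backtracking: on a record without a match, `.*` is retried from every start position.

```java
import java.util.regex.Pattern;

public class FieldPatternSketch {

    // Hypothetical helper: true if the regex occurs anywhere in the record.
    static boolean containsField(String xmlRecord, String regex) {
        return Pattern.compile(regex).matcher(xmlRecord).find();
    }

    public static void main(String[] args) {
        // Made-up, single-line sample record for illustration only.
        String record = "<record><controlfield tag=\"001\">990001</controlfield>"
                + "<datafield tag=\"662\" ind1=\" \" ind2=\" \"/></record>";

        // Plain pattern: a simple substring-like scan.
        System.out.println(containsField(record, "662\""));
        // Wildcard pattern: same result, but more work for the regex engine.
        System.out.println(containsField(record, ".*662\""));
    }
}
```

Both calls report the same result for any given record, so the wildcard only adds cost when the pattern is used this way.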
I also couldn't find any XML with the fields "647, 654, 657, 662" when filtering `update.xml.bgzf`, which contains 100k XML records, using the OR pattern described in the wiki: How-to-find-particular-XML-records.
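For illustration, an OR pattern over the four fields could look like the sketch below. The exact pattern from the wiki page is not quoted in this thread, so the regex, the class name, and the `filter` helper are assumptions, as is the idea that each record is available as one string.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class OrPatternSketch {

    // Assumed pattern: matches a tag attribute for any of the four fields.
    static final Pattern ANY_FIELD =
            Pattern.compile("tag=\"(647|654|657|662)\"");

    // Hypothetical helper: keeps only records containing one of the fields.
    static List<String> filter(List<String> records) {
        return records.stream()
                .filter(r -> ANY_FIELD.matcher(r).find())
                .collect(Collectors.toList());
    }
}
```

The alternation carries no leading wildcard, in line with the finding above that the bare pattern is the fast one.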
I think we can close this issue then - WDYT @TobiasNx ?
+1. For now we have our examples and corresponding tickets. So yes, and thanks for finding a new, faster way of finding specific test files.
MARC field | Content