hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Enrich with RVK concordance #2024

Closed dr0i closed 3 months ago

dr0i commented 3 months ago

See #1058.

dr0i commented 3 months ago

This is not working and I don't know why. Can you help me, @TobiasNx?

TobiasNx commented 3 months ago

When running $ mvn clean install -DskipTests=false -DgenerateTestData=true I get an error message:

[INFO] Scanning for projects...
[INFO] 
[INFO] ---------------------< org.lobid:lobid-resources >----------------------
[INFO] Building lobid-resources 1.0.1-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ lobid-resources ---
[INFO] Deleting /home/tobias/git/lobid-resources/target
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ lobid-resources ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 37 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.3:compile (default-compile) @ lobid-resources ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 17 source files to /home/tobias/git/lobid-resources/target/classes
[INFO] /home/tobias/git/lobid-resources/src/main/java/org/lobid/resources/EtikettJson.java: Some input files use unchecked or unsafe operations.
[INFO] /home/tobias/git/lobid-resources/src/main/java/org/lobid/resources/EtikettJson.java: Recompile with -Xlint:unchecked for details.
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ lobid-resources ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 349 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.3:testCompile (default-testCompile) @ lobid-resources ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 5 source files to /home/tobias/git/lobid-resources/target/test-classes
[INFO] 
[INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ lobid-resources ---

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running UnitTests
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2024-06-04 16:41:25 INFO [org.elasticsearch.node.Node.<init>(Node.java:254)] - initializing ...
Tests run: 5, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 20.79 sec <<< FAILURE! - in UnitTests
org.lobid.resources.CulturegraphXmlFilterHbzToJsonTest  Time elapsed: 0.959 sec  <<< ERROR!
org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read [id:10, legacy:false, file:/home/tobias/git/lobid-resources/tmp/data/nodes/0/_state/node-10.st]
        at org.lobid.resources.CulturegraphXmlFilterHbzToJsonTest.setup(CulturegraphXmlFilterHbzToJsonTest.java:92)
Caused by: java.io.IOException: failed to read [id:10, legacy:false, file:/home/tobias/git/lobid-resources/tmp/data/nodes/0/_state/node-10.st]
        at org.lobid.resources.CulturegraphXmlFilterHbzToJsonTest.setup(CulturegraphXmlFilterHbzToJsonTest.java:92)
Caused by: org.elasticsearch.gateway.CorruptStateException: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-1124073472 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/home/tobias/git/lobid-resources/tmp/data/nodes/0/_state/node-10.st")))
        at org.lobid.resources.CulturegraphXmlFilterHbzToJsonTest.setup(CulturegraphXmlFilterHbzToJsonTest.java:92)
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-1124073472 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/home/tobias/git/lobid-resources/tmp/data/nodes/0/_state/node-10.st")))
        at org.lobid.resources.CulturegraphXmlFilterHbzToJsonTest.setup(CulturegraphXmlFilterHbzToJsonTest.java:92)

org.lobid.resources.CulturegraphXmlFilterHbzToJsonTest  Time elapsed: 0.959 sec  <<< ERROR!
java.lang.NullPointerException
        at org.lobid.resources.CulturegraphXmlFilterHbzToJsonTest.down(CulturegraphXmlFilterHbzToJsonTest.java:147)

Results :

Tests in error: 
org.lobid.resources.CulturegraphXmlFilterHbzToJsonTest.org.lobid.resources.CulturegraphXmlFilterHbzToJsonTest
  Run 1: CulturegraphXmlFilterHbzToJsonTest.setup:92 » Elasticsearch java.io.IOExceptio...
  Run 2: CulturegraphXmlFilterHbzToJsonTest.down:147 NullPointer

Tests run: 4, Failures: 0, Errors: 1, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  25.939 s
[INFO] Finished at: 2024-06-04T16:41:25+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project lobid-resources: There are test failures.
[ERROR] 
[ERROR] Please refer to /home/tobias/git/lobid-resources/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Somehow CulturegraphXmlFilterHbzToJsonTest.java produces errors.

This also happens with the current master.

TobiasNx commented 3 months ago

I also updated the fix to use every RVK notation separately.

dr0i commented 3 months ago

Re build error: delete /home/tobias/git/lobid-resources/tmp/. I don't know why this is a problem for you: the tmp folder is ignored by the editorconfig-maven-plugin and also in .gitignore. The build runs fine with GitHub Actions and on my side.
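The log above points at a truncated Elasticsearch state file under tmp/data, so removing the whole tmp directory before the next run lets the embedded node start from a clean state. A minimal sketch of such a cleanup in Java follows; the class and method names are hypothetical and not part of lobid-resources, and deleting the folder by hand has the same effect.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of lobid-resources. */
final class TmpDirCleaner {

    /**
     * Recursively deletes the given directory.
     * Returns true if the directory existed and was removed, false if there was nothing to delete.
     */
    static boolean deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
        return true;
    }
}
```

A test setup could call `TmpDirCleaner.deleteRecursively(Paths.get("tmp"))` before starting the node, assuming the tests are run from the repository root.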

TobiasNx commented 3 months ago

@dr0i Now it is running. The filenames still need to be updated.

TobiasNx commented 3 months ago

I adjusted the names. @dr0i, you can continue.

dr0i commented 3 months ago

Started a full ETL using the TSV from #1058. See e.g. http://stage.lobid.org/resources/990178010510206441, which shows the enrichment, as the data isn't part of the source. RAM usage looks fine. I don't expect a significant increase in the time needed (the last run took ~21h). It will be fully indexed tomorrow.

acka47 commented 3 months ago

Hi everybody, I haven't followed this closely, so I am chiming in with an aspect that is important to me. Before we deploy this to production:

  1. We should make sure that all enriched parts are marked as such. (This probably won't be easy without breaking the existing data structure.)
  2. We should find out/discuss whether there should be a way to filter out enriched subjects in queries. In other words, we will have to answer these questions: What are the risks in adding these, how will quality be impaired, and will some API users not want to use the enriched data?

Thus, we should probably schedule a meeting to discuss these questions.
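To make the two points above concrete, here is a minimal sketch of what marking and filtering could look like on the consumer side. Everything in it is an assumption for illustration: the `Subject` class, the `enriched` flag, and the sample notations do not come from the lobid-resources data model, and the thread does not decide how enriched parts will actually be marked.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: neither this class nor the "enriched" flag exists in lobid-resources. */
final class SubjectFilterSketch {

    /** A subject entry carrying a hypothetical provenance flag. */
    static final class Subject {
        final String notation;
        // true if the subject was added via the concordance, false if it came from the source record
        final boolean enriched;

        Subject(String notation, boolean enriched) {
            this.notation = notation;
            this.enriched = enriched;
        }
    }

    /** Returns only the subjects that were present in the source data. */
    static List<Subject> withoutEnriched(List<Subject> subjects) {
        return subjects.stream()
                .filter(subject -> !subject.enriched)
                .collect(Collectors.toList());
    }
}
```

With a flag like this, an API user who distrusts the enrichment could drop those entries, while the default output stays complete. Whether such a flag fits into the existing structure without breaking it is exactly the open question from point 1.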

dr0i commented 3 months ago

RVK enrichment from CG ready:

[edit: Just saw that I used the wrong base dump for this, so e.g. many items (see e.g. http://stage.lobid.org/resources/990092347550206441) and 2 M resources are missing].

dr0i commented 3 months ago

As decided offline last week, we can go ahead and merge this.

The TSV is generated by executing e.g. mvn exec:java -Dexec.mainClass="org.lobid.resources.run.CulturegraphXmlFilterHbzRvkToTsv" -Dexec.args="/data/other/cg/aggregate_20240507.marcxml.gz". The TSV is written to the root level of the repo. Don't forget to copy the lookup table to lookup-tables/data/rvk.tsv.
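For readers unfamiliar with lookup-table enrichment, the following sketch shows the general idea of reading such a table into memory and keeping every RVK notation as a separate entry. The column layout is an assumption, since the thread does not describe the file format: each line is taken to be a record id, a tab, and one or more notations separated by commas. The class name is hypothetical, and the actual transformation in lobid-resources may read the table quite differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; the TSV layout assumed here is not confirmed by the thread. */
final class RvkLookupSketch {

    /** Parses lines of the assumed form "<id>\t<notation>[,<notation>...]". */
    static Map<String, List<String>> parse(Reader in) throws IOException {
        Map<String, List<String>> table = new HashMap<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String[] columns = line.split("\t", 2);
            if (columns.length < 2) {
                continue; // skip lines without a tab-separated value column
            }
            List<String> notations = table.computeIfAbsent(columns[0], key -> new ArrayList<>());
            // Keep every notation as its own entry and drop duplicates.
            for (String raw : columns[1].split(",")) {
                String notation = raw.trim();
                if (!notation.isEmpty() && !notations.contains(notation)) {
                    notations.add(notation);
                }
            }
        }
        return table;
    }
}
```

During enrichment, a record whose id appears as a key would receive the listed notations, and records without a match stay unchanged.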

A blog post should be written.