Closed ogunoz closed 2 years ago
I created a project using the "entity linking (wikidata)" quick template, deleted the Wikidata KB, created an empty local OWL KB, saved it, imported your mock data, and then imported a toy text. I marked up 1-4 words and pressed "space". The dropdown takes a moment to appear, but not 16-30 seconds.
ConceptLinkingServiceImpl - Generated [367] candidates in 2615ms when pressing space on 5 words
It does seem to get very slow, though, when I enter a query term into the "identifier" field:
ConceptLinkingServiceImpl - Found [0] candidates exactly matching [mountain, Chicago and taught constitutional law]
ConceptLinkingServiceImpl - Found [79] candidates starting with [mountain]]
ConceptLinkingServiceImpl - Found [1000] candidates using matching [mountain, Chicago and taught constitutional law]
ConceptLinkingServiceImpl - Generated [1000] candidates in 52875ms
It seems that "mountain" is a very frequent term in the KB. When I use e.g. "hexachloride", it is faster. This needs closer investigation.
Thanks for the initial investigation. I wasn't sure that ConceptLinking was running. From the documentation, I assume there is no way to disable ConceptLinking and use a much simpler search instead. For example:
title.lower().startswith(query.lower())
or
any([word.lower().startswith(query.lower()) for word in title.split(" ")])
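To make the two matching rules above concrete, here is a minimal, self-contained sketch of such a prefix search over a list of titles. The function names (`title_starts_with`, `any_word_starts_with`, `search`) and the sample titles are made up for illustration; this is not how INCEpTION searches a KB.

```python
from typing import Iterable, List


def title_starts_with(title: str, query: str) -> bool:
    """Case-insensitive check that the whole title starts with the query."""
    return title.lower().startswith(query.lower())


def any_word_starts_with(title: str, query: str) -> bool:
    """Case-insensitive check that any whitespace-separated word starts with the query."""
    q = query.lower()
    return any(word.lower().startswith(q) for word in title.split())


def search(titles: Iterable[str], query: str, limit: int = 10) -> List[str]:
    """Return up to `limit` titles in which some word starts with the query."""
    hits = []
    for title in titles:
        if any_word_starts_with(title, query):
            hits.append(title)
            if len(hits) >= limit:
                break  # stop early instead of scanning all remaining titles
    return hits


# Hypothetical sample data, not taken from the KB in this issue.
titles = ["Mountain View", "Rocky Mountains", "Chicago", "Hexachloride"]
print(search(titles, "mount"))
```

A linear scan like this is simple, but its cost grows with the number of titles, so for hundreds of thousands of entries a sorted list or an index would likely be needed.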
With KBs, we do a SPARQL query internally. If you want a simpler approach, you could use a tagset instead of a KB.
Try going to the tagset pane in the project settings and exporting one of the tagsets as JSON. Once you see the format, you can probably write a script that converts your KB to that JSON and then import it. Then you can attach the tagset to a "string" feature.
That may work. I'll definitely try it. I am not sure, though, what happens with 500k tags in the tagset. I will write here once I have tried!
I don't think tagsets with 500k tags have been tried yet, but I do believe that people have used tagsets one order of magnitude smaller. So, let's see how you fare ;)
That worked very well for 10k entries, but for 100k entries, while importing the tagset JSON, I get
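A conversion script along these lines might look like the sketch below. The field names (`name`, `description`, `language`, `create_tag`, `tags`, `tag_name`, `tag_description`) are assumptions about the export format and should be checked against a tagset actually exported from your instance; `build_tagset` and the sample labels are hypothetical.

```python
import json
from typing import Dict, Iterable


def build_tagset(name: str, labels: Iterable[str], language: str = "en") -> Dict:
    """Build a tagset dictionary from KB labels, skipping blanks and duplicates.

    NOTE: every key used here is an assumption; compare with a real export.
    """
    seen = set()
    tags = []
    for label in labels:
        label = label.strip()
        if not label or label in seen:
            continue  # skip empty labels and repeated labels
        seen.add(label)
        tags.append({"tag_name": label, "tag_description": ""})
    return {
        "name": name,
        "description": "",
        "language": language,
        "create_tag": False,
        "tags": tags,
    }


# Hypothetical labels; in practice these would be read from the KB file.
labels = ["Mountain View", "Chicago", "Chicago", "  ", "Hexachloride"]
tagset = build_tagset("kb-titles", labels)

# Serialize to a JSON string that could be written to a file for import.
tagset_json = json.dumps(tagset, ensure_ascii=False, indent=2)
print(len(tagset["tags"]), "tags")
```

Deduplicating before export keeps the file smaller, which may matter when the tagset has several hundred thousand entries.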
org.apache.wicket.page.CouldNotLockPageException
unfortunately
There is also a simpler tab-separated format; maybe you will have more luck with that. That said, I think even if you get this exception, the tagset import should continue in the background, and eventually you should have all the tags in there.
If you run with MariaDB/MySQL, you could also look at the DB and possibly insert the tags directly into the tagset table.
Thank you for the suggestions. I also noticed that it keeps writing the tags into the DB in the background. If I can export a project after importing the tags and the project import time is reasonable, then the embedded DB setup is already fine for me :) Otherwise, I will look deeper into the MariaDB setup.
Describe the bug Searching a knowledge base with a large number of entities is slower than expected.
To Reproduce Steps to reproduce the behavior: To be able to import a large KB like this, you may need to set JAVA_OPTS e.g., "-Dspring.jpa.properties.hibernate.dialect.storage_engine=innodb -Xmx2g"
Expected behavior A search on the title field should finish in under 3 seconds.
Additional context I tried different full text search modes, but they did not help. As it is a local KB, the most fitting one should be Lucene Sail.
If there is a configuration setting that improves performance, it could be added to the documentation, e.g., increase X, decrease Y.