inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.
https://inception-project.github.io
Apache License 2.0
599 stars, 152 forks

Search on large local knowledge base is much slower than expected #3038

Closed (ogunoz closed this issue 2 years ago)

ogunoz commented 2 years ago

Describe the bug: Searching a knowledge base with a large number of entities is slower than expected.

To Reproduce: To import a large KB like this one, you may need to set JAVA_OPTS, e.g., "-Dspring.jpa.properties.hibernate.dialect.storage_engine=innodb -Xmx2g". Steps to reproduce the behavior:

  1. Create a project
  2. Go to settings to create a Knowledge Base
  3. Import the file in https://www.swisstransfer.com/d/b63e81c3-bed3-4b60-9adf-155b55bcc2c0
  4. Pick OWL as IRI Schema and save the KB
  5. Go to dashboard, select the project and select Knowledge Base on the left bar
  6. Search for an entry, e.g., Achieve Theatre
  7. The search is quite slow, taking around 30 seconds
  8. The same search performance can also be seen in the Annotation Editor

Expected behavior: A search on the title field should finish in under 3 seconds.

Screenshots: [image: Screen Shot 2022-05-10 at 11 36 14]


Additional context: I tried different full text search modes, but they did not help. As it is a local KB, the most fitting one should be Lucene Sail. If there is a config to increase the performance, it could be added to the documentation, e.g., increase X, decrease Y.

reckart commented 2 years ago

I created a project using the "entity linking (wikidata)" quick template, deleted the wikidata KB, created an empty local OWL KB, saved it, and imported your mock data. Then I imported a toy text, marked up 1-4 words, and pressed "space". The dropdown takes a moment to appear, but not 16-30 seconds.

ConceptLinkingServiceImpl - Generated [367] candidates in 2615ms when pressing space on 5 words

It seems to get very slow, though, when I enter a query term into the "identifier" field:

ConceptLinkingServiceImpl - Found [0] candidates exactly matching [mountain, Chicago and taught constitutional law]
ConceptLinkingServiceImpl - Found [79] candidates starting with [mountain]]
ConceptLinkingServiceImpl - Found [1000] candidates using matching [mountain, Chicago and taught constitutional law]
ConceptLinkingServiceImpl - Generated [1000] candidates in 52875ms

It seems that "mountain" is a very frequent term in the KB: when I use e.g. "hexachloride", it is faster. This needs closer investigation.

ogunoz commented 2 years ago

Thanks for the initial investigation. I wasn't aware that ConceptLinking was running there. From the documentation, I assume there is no way to disable ConceptLinking and use a much simpler search instead. For example:

title.lower().startswith(query.lower()) or any([word.lower().startswith(query.lower()) for word in title.split(" ")])
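Spelled out as a runnable sketch, the expression above amounts to a case-insensitive prefix match on the whole title or on any word in it. The function names `matches` and `simple_search`, the `limit` parameter, and the sample titles are made up for illustration; none of this is INCEpTION code.

```python
def matches(title: str, query: str) -> bool:
    """True if the title, or any space-separated word in it, starts with the query."""
    q = query.lower()
    t = title.lower()
    return t.startswith(q) or any(word.startswith(q) for word in t.split(" "))


def simple_search(titles, query, limit=10):
    """Linear scan over the titles; returns up to `limit` matches in input order."""
    results = []
    for title in titles:
        if matches(title, query):
            results.append(title)
            if len(results) >= limit:
                break
    return results


# Hypothetical sample data, not taken from the KB in this issue.
titles = ["Achieve Theatre", "Mountain View", "Blue Mountain Lodge", "Hexachloride"]
hits = simple_search(titles, "mount")
```

This is a plain linear scan, so the cost grows with the number of titles; for a few hundred thousand short strings that may still be acceptable, but I have not measured it.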

reckart commented 2 years ago

With KBs, we do a SPARQL query internally. If you want a simpler approach, you could use a tagset instead of a KB.

reckart commented 2 years ago

Try going to the tagset pane in the project settings and export one of the tagsets as JSON. Once you see the format, you can probably write a script converting your KB to that JSON and then import it. Then you can attach the tagset to a "string" feature.
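A conversion script along those lines could look roughly like the sketch below. The JSON field names (`name`, `description`, `language`, `tags`, `tag_name`, `tag_description`, `create_tag`) are an assumption about the export layout and should be checked against a tagset actually exported from your instance. `build_tagset` and the sample entries are hypothetical.

```python
import json


def build_tagset(name, entries, language="en"):
    """Build a tagset dict from (label, description) pairs.

    The key names below are ASSUMED; compare them with a real export
    before importing the result.
    """
    seen = set()
    tags = []
    for label, description in entries:
        if label in seen:  # keep only the first occurrence of each label
            continue
        seen.add(label)
        tags.append({"tag_name": label, "tag_description": description})
    return {
        "name": name,
        "description": "Generated from KB labels",
        "language": language,
        "tags": tags,
        "create_tag": False,
    }


# Hypothetical entries; in practice these would come from the KB file.
entries = [
    ("Achieve Theatre", "theatre"),
    ("Mountain View", "city"),
    ("Achieve Theatre", "duplicate label"),
]
tagset = build_tagset("KB labels", entries)
tagset_json = json.dumps(tagset, indent=2, ensure_ascii=False)
```

Writing `tagset_json` to a file gives something to try in the tagset import dialog. The duplicate-label filter is there on the guess that tag names within one tagset should be unique.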

ogunoz commented 2 years ago

That may work. I'll definitely try it. I am not sure, though, what happens with 500k tags in the tagset. I will write here once I have tried!

reckart commented 2 years ago

I think tagsets with 500k tags have not been used yet, but I do believe that people have used tagsets one order of magnitude smaller. So, let's see how you fare ;)

ogunoz commented 2 years ago

That worked really well for 10k entries, but for 100k entries I unfortunately get org.apache.wicket.page.CouldNotLockPageException while importing the tagset JSON.

reckart commented 2 years ago

There is also a simpler tab-separated format; maybe you will have more luck with that. That said, I think even if you get this exception, the tagset import should continue in the background and eventually you should have all the tags in there.

reckart commented 2 years ago

If you run with MariaDB/MySQL, you could also look at the DB and possibly shove the tags directly into the tagset table.

ogunoz commented 2 years ago

Thank you for the suggestions. I also noticed that it keeps writing the tags into the DB in the background. If I can export the project after importing the tags, and the project import time is then reasonable, the embedded DB setup is already fine for me :) Otherwise, I will look deeper into the MariaDB setup.