Indexing Speed - Githubissues

antonyscerri commented 3 years ago

This isnt a bug just an enquiry at this point. Getting in to trying out more of the features for a workflow i've run into an issue with the corpus/document searching. I hadn't made changes for a while and when i went to run a search it told me there was indexing in progress. After some time i was able to run a search and try out the batch create/delete functionality. Of course this made a change to a document (i constrained the search to the current document so it should only have modified one), which then immediately causes indexing to take place again, rendering the search unusable.

My question is are there any basic requirements for achieving some level of performance as it seems to be taking a long time to index. The corpus has 10 full length articles, with many existing span annotations (roughly a few thousand across all 10 documents), some with KB identifiers. I would expect this to not take very long to index even if it was reindexing the whole corpus. Is it reindexing all documents, or only the one with changes? I assume it indexes per user because each user may have different annotations, i didnt find anything in the user/admin guides describing this.

The indexing in this scenario is taking over ten minutes during which the java process is taking very little CPU (1-2%).

Thanks for any info/advice.

antonyscerri commented 3 years ago

Also just poking around in the code:

https://github.com/inception-project/inception/blob/f1e59f31e21e6704dcd1f84fdb607d4884b3bbb0/inception/inception-ui-search/src/main/java/de/tudarmstadt/ukp/inception/app/ui/search/sidebar/SearchAnnotationSidebar.java#L430

It doesn't seem like its checking if the indexing is happening for the current user or not. I noted the indexing task appears to take note of the user, so assuming indexing is per user for a given corpus/document can it check this as well to limit any impact of changes to annotations?

reckart commented 3 years ago

There is one index per project and it contains

an entry for each document in the state it was imported in (source document)
an entry for each document that a user has opened at some point and potentially made modifications in (annotation document) (there might be cases a document is indexed even if a user has not formally opened it/modified it yet).

When documents are imported first, they are indexed (source document).

When a change is made in an annotation document, only this document for this user is re-index (annotation document).

I think that indexing should take more than 1-2% CPU when it is running and actually doing something. Maybe the indexing crashed?

antonyscerri commented 3 years ago

Thanks for confirming the document index, i've been going through more code checking what went where. The indexing is finishing as we can eventually search and there are no errors logged and we appear to be getting the correct results from searches.

We definitely are seeing searches for one user being blocked when a different user updates a document. From what i can tell it should only be reindexing the modified documents. Mtas parsing may add some additional overhead dealing with the more complex index structure but I still wouldnt expect it to take as long as we are seeing. The machine still has free memory, and plenty of disk space.

reckart commented 3 years ago

You can turn up logging on the indexing to see more:

System property setting

-Dlog4j.configurationFile="/path/to/your/custom/log4j2.xml"

log4j2.xml

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN" monitorInterval="5">
  <Appenders>
    <Console name="ConsoleAppender" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss} [%thread] %level{length=5} [$${ctx:username:-SYSTEM}] %logger{1} - %msg%n" />
    </Console>

    <Routing name="ProjectAppender">
      <Routes pattern="$${ctx:projectId}">
        <!--
          This route is used when we do *not* have a project ID set, i.e. the variable reference
          "${ctx:projectId}" is not resolved and remains as the plain string "${ctx:projectId}"
        -->
        <Route key="${ctx:projectId}">
          <!-- 
            Do not log anywhere if there is no project ID. In this case logging is going only to
            the console.
           -->
          <Null name="null"/>
        </Route>

        <!-- 
          This route is used when we have a project ID set (i.e. in case the key is not
          "${ctx:projectId}")
        -->
        <Route>
          <File name="Project-${ctx:projectId}"
            fileName="${ctx:repositoryPath}/project/project-${ctx:projectId}.log">
            <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss,SSS} [${ctx:username}] %msg%n" />
          </File>
        </Route>
      </Routes>
    </Routing>
  </Appenders>

  <Loggers>
    <Logger name="INCEpTION" level="INFO"/>

    <Logger name="uima" level="ERROR"/>
    <Logger name="liquibase" level="ERROR"/>

    <Logger name="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer" level="ERROR"/>
    <Logger name="org.springframework.beans.factory.config.PropertiesFactoryBean" level="ERROR"/>

    <Logger name="de.tudarmstadt" level="INFO"/>

    <Logger name="uima" level="ERROR"/>
    <!-- Avoid warning on cookies with an invalid expiration date - cf. issue #739 -->
    <Logger name="org.apache.http.client.protocol.ResponseProcessCookies" level="ERROR"/>
    <!-- Avoid warning when components is not rerendered because it has been removed from page -->
    <Logger name="org.apache.wicket.page.PartialPageUpdate" level="ERROR"/>

    <!-- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
      - SEARCH 
      -->
    <Logger name="de.tudarmstadt.ukp.inception.search" level="DEBUG"/>

    <Root level="WARN">
      <AppenderRef ref="ConsoleAppender" />
      <AppenderRef ref="ProjectAppender" />
    </Root>
  </Loggers>
</Configuration>

You can also go to TRACE if you are read to see all the beefy details.

antonyscerri commented 3 years ago

Thanks for the logging config, that was going to be my next question. I'm still in the process of running some test and collecting logs, but I just had a thought based on one message i saw earlier which got me thinking. If we have lots of annotations of concepts from a KB, during indexing is it going to go off and query for a label for each one to add to the index? It's just I noted an error in the log which came from the index task complaining it couldn't get a label for a feature value (related to my other bug about the different SPARQL queries being used to get concepts and their labels, which given I loaded documents with existing annotations there are a few causing this error it seems).

But if it is going off to fetch a label that could well explain the slow indexing times. If so an obvious next question is whether we have an issue with any caching, or can we adjust any retention periods to avoid it making calls to the remote KB so often.

antonyscerri commented 3 years ago

So the significant time appears to be reported in the logging in lines like this:

2021-06-30 23:23:51 [Index task consumer] DEBUG [SYSTEM] MtasUimaParser - Created token collection in 181579 ms

Between the line above and the earlier line:

2021-06-30 23:20:50 [Index task consumer] DEBUG [SYSTEM] MtasUimaParser - Starting creation of token collection

I see lines like these for those concepts it cannot get labels:

2021-06-30 23:22:50 [Index task consumer] ERROR [SYSTEM] ConceptFeatureSupport - No label for feature value [http://www.wikidata.org/entity/Q17142171] 2021-06-30 23:23:02 [Index task consumer] ERROR [SYSTEM] ConceptFeatureSupport - No label for feature value [http://www.wikidata.org/entity/Q2833068]

There are occasional lines like this also:

2021-06-30 23:23:06 [Index task consumer] WARN [SYSTEM] SharedHttpClientSessionManager$RetryHandlerStale - Closing stale connection

The odd thing is i've observed this when doing a search across the corpus (and i think also a single document) then either creating or deleting a set of matching annotations, and also simply logging in and browsing through the set of documents (10 in this case) in the collection (i've seen a log message about upgrading the CAS of a document even though i've been working with these documents for several days now). If however I delete or add one span (no concept information) the whole indexing takes maybe 0.5s. So it seems it may well be gathering labels causing the problem but its unclear why sometimes it very quick.

reckart commented 3 years ago

Are you using your own Virtuoso loaded with a Wikidata dump or the official Wikidata SPARQL endpoint? In the latter case, slowness, varied timings and even eventual rate limiting and banning can be expected.

INCEpTION supports Virtuoso's FTS scheme and also Fuseki's FTS scheme. For Wikidata, we have used Virtuoso already - for smaller KBs, we were also happy with Fuseki (didn't try it for large KBs).

There is a cache in INCEpTION's KB service which caches query results. It caches up to 100.000 results by default (not that a query can return many results, by default capped at 1.000). Depending on your usage patterns, the cache might be effective in some cases but not in others. You can find the description of the cache settings in the KB section of the admin guide. If you use a large KB heavily and have a beefy machine, it is likely going to help increasing the cache size. Mind that the cache is effective only for read-only KBs (all remote SPARQL KBs).

reckart commented 3 years ago

i've seen a log message about upgrading the CAS of a document even though i've been working with these documents for several days now

A CAS upgrade is performed every time you open a document in the editor in read/write mode (i.e. also when switching between documents). When paging within a document, no upgrade is performed. This might be optimized in the future, but for the moment it is like this.

antonyscerri commented 3 years ago

Ok so the upgrade is expected each time. The wikidata in this instance is their live end point, we have not loaded it up locally. I'll check the cache settings and see if we can get things running any smoother. Depending on what constitutes a results, is that each item or a requests entire result set because some of those SPARQL queries are returning unique label, description, item triples I believe which were returning a full set of results (mainly during the lookup of an identifier). So that cache may become full relatively easy.

reckart commented 3 years ago

You can yank up the logging for the KB service (de.tudarmstadt.ukp.inception.kb.KnowledgeBaseServiceImpl) to trace level to see cache stats.

antonyscerri commented 3 years ago

I bumped up the cache settings (these are likely excessive high, particularly the size but should'nt hurt me too much):

knowledge-base.cacheSize=10000000
knowledge-base.cacheExpireDelay=24h
knowledge-base.cacheRefreshDelay=12h

Given these are not changing frequently it can have a long expiry. I'll check on the stats but just having done it without the logger change it didn't appear to use any additional memory . With stats enabled i went through the same sequence and the last cache stats reported:

2021-07-01 11:17:30 [ForkJoinPool.commonPool-worker-3] TRACE [SYSTEM] KnowledgeBaseServiceImpl - KB cache stats: CacheStats{hitCount=12565, missCount=3052, loadSuccessCount=3052, loadFailureCount=0, totalLoadTime=1227678126940, evictionCount=0, evictionWeight=0}

Looking at the code its setting the cache size based on weight which treats each item in the result list towards the count, so that could have reached the limit given the number of unique entries reported above if the average labels per item averaged more than 5. I also noted it's getting the parent concepts for each annotated concept to index, which is an additional query (and potentially multiple results) to cache.

I think given the indexing times i was having and the tests i was doing its more likely the timeout was most significant because to index all the documents was well over 15 minutes, so likely it was cycling through things.

antonyscerri commented 3 years ago

The contention with indexing blocking user searching can also have an impact across project it seems, because documents cannot be indexed in one project if there are indexing tasks from another project in the queue.

reckart commented 3 years ago

The contention with indexing blocking user searching can also have an impact across project it seems, because documents cannot be indexed in one project if there are indexing tasks from another project in the queue.

For 0.19.x, this is particularly true. For the upcoming version, it is still true but mitigated because there is a thread-pool and not just a single indexing thread. But then indexing and recommenders and other background tasks share this thread pool. But since the thread-pool size is configurable, users with beefy machines can reduce contention.

reckart commented 3 years ago

I assume the configuring the cache could address this problem. If not, please comment, then we can re-open.

antonyscerri commented 3 years ago

I think the cache settings can help mitigate the issue, as would a more performant local KB.

I think there maybe a consideration for adding entries without labels to the cache to prevent them being retried repeatedly impacting the performance, does a concept without a label stand any more change of suddenly getting a label vs an existing label changing?

inception-project / inception

Indexing Speed #2428