Importing corpus in Annis 4

gcelano commented 1 year ago

Describe the bug

I cannot import (via the command 'import') a corpus in Annis 4 server. After executing the command "import directory", nothing happens, and when Annis server is run (java -jar annis-4.10.2-server.jar), I get the error in the screenshot.

System:

Operating System: Ubuntu 22.04
Browser: Chrome
Java Version: openjdk 11.0.19
ANNIS Version: 4, desktop version

Screenshot from 2023-07-07 17-22-49

thomaskrause commented 12 months ago

It looks like there is another (graph)ANNIS process still running. Can you check if restarting your computer fixes the issue? It can happen that you close the ANNIS Desktop interface but the graphANNIS background service keeps running and blocks the database. Restarting the computer is the easiest way of making sure all processes are terminated.

If there is a pattern to non-stopped graphANNIS background services (it always fails to stop when X), then we might need to open a separate issue. If restarting the computer fixes the issue, please close the issue.

Sorry, I missed the part about the corpus import failing. What does not work is having ANNIS Desktop open and trying to import on the command line at the same time. You can either import a ZIP file with the corpus, or stop ANNIS Desktop before running the CLI (http://korpling.github.io/ANNIS/4.10/user-guide/import-and-config/import.html#importing-a-corpus-using-the-command-line).

gcelano commented 12 months ago

If I use annis-desktop, everything works, but a query for my ~35M token corpus takes minutes. My problem is getting the corpus available for annis-server. The corpus is in ~/.annis/v4 (previously imported via annis-desktop), but when I run annis-server, I do not see the corpus in the GUI. How can I load it there? I can also use graphAnnis, and query it there (fast!), but it is not clear how to make the corpus available for search in the GUI of the server instance.

thomaskrause commented 11 months ago

If the corpus is in ~/.annis/v4, the issue is probably that it is not visible to users who are not logged in. You can set a property in the service configuration to make all corpora available without login: http://korpling.github.io/ANNIS/4.10/user-guide/configuration/user.html#allowing-anonymous-access-to-all-corpora

gcelano commented 8 months ago

I have Annis 4.10.3 running on a server, but I cannot import any dataset using graphAnnis-python:

>>> from graphannis.cs import CorpusStorageManager
>>> from graphannis.graph import GraphUpdate
>>> with CorpusStorageManager() as cs:
...     # import relANNIS corpus with automatic name
...     corpus_name = cs.import_from_fs("relannis/GUM")
...     print(corpus_name)

The code does not return any error, but I cannot see any corpus in the web interface at :5712 (or in .annis/v4). Any idea why?

MartinKl commented 8 months ago

You are initializing cs with the default directory setting:

Init signature: CorpusStorageManager(db_dir='data/', use_parallel=True)

If your goal is to import the corpus to the data directory that the web service uses, you'd need to set db_dir to your .annis/v4 folder. Make sure the directory is not accessed by a web service instance simultaneously (you'd find a db.lock in the directory).
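
The advice above could be sketched roughly as follows. The helper names lock_file_present and import_into_storage are hypothetical and chosen only for illustration; the db_dir parameter and import_from_fs call are the ones shown earlier in this thread. The presence of db.lock is treated only as a hint, since this sketch does not check whether another process still holds the lock.

```python
from pathlib import Path


def lock_file_present(db_dir: Path) -> bool:
    """Return True if a db.lock file exists in the corpus storage folder."""
    return (db_dir / "db.lock").exists()


def import_into_storage(corpus_path: str, db_dir: Path) -> str:
    """Import a corpus into db_dir and return the name it was stored under."""
    if lock_file_present(db_dir):
        # Only a hint: this sketch does not check whether the lock is still held.
        print(f"Warning: {db_dir / 'db.lock'} exists; is an ANNIS service running?")
    # Imported lazily so the helper above also works without graphannis installed.
    from graphannis.cs import CorpusStorageManager

    # db_dir replaces the default 'data/' so the corpus lands in the shared folder.
    with CorpusStorageManager(db_dir=str(db_dir)) as cs:
        return cs.import_from_fs(corpus_path)
```

Assuming the web service really reads from that folder, a call could look like import_into_storage("relannis/GUM", Path.home() / ".annis" / "v4").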

gcelano commented 8 months ago

When importing, I get the following files, but no corpus is then available in the GUI (moreover, db.lock always appears when importing, even if no Annis instance is running):

drwxr-xr-x 2 gcelano gcelano  4096 Nov 13 16:18 .
drwxr-xr-x 3 gcelano gcelano  4096 Nov  8 14:00 ..
-rw-r--r-- 1 gcelano gcelano 20480 Nov 13 16:18 frontend_data.h2.mv.db
-rw-r--r-- 1 gcelano gcelano  6159 Nov 12 20:55 frontend_data.h2.trace.db
-rw-r--r-- 1 gcelano gcelano 28672 Nov 12 20:55 service_data.sqlite3

thomaskrause commented 7 months ago

@gcelano There might be an issue that the ANNIS server service uses a different corpus storage folder. By default, ANNIS uses the ${user.home}/.annis/service.toml service configuration file. There are fields to configure where the corpus storage folder is and where it expects the service sqlite file to be located.

[database]
graphannis = "/home/thomas/.annis/v4"
sqlite = "/home/thomas/.annis/v4/service_data.sqlite3"

The service configuration file can be changed as well, by using a setting in the application.properties file (http://korpling.github.io/ANNIS/4.10/user-guide/configuration/index.html).

So I would check:

1. Is there a frontend configuration in the form of an application.properties file in the current working directory or in a config sub-directory of the working directory of the ANNIS server, or is the service started with a --spring.config.location argument?
   - If yes, is the setting annis.webservice-config configured? Use this as the backend configuration file to check.
   - If no, assume that the service configuration file is ${user.home}/.annis/service.toml.
2. Check the backend configuration file for the field
```
[database]
graphannis="..."
```
This is where the db_dir argument of the CorpusStorageManager Python Script should point to. I also updated the graphANNIS Python library to the newest graphANNIS version in case there is some weird incompatibility (which should not be the case).
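
The checklist above can be approximated in a few lines of Python. This is a minimal sketch under several assumptions: the function names find_service_config and read_graphannis_dir are hypothetical, only the ${user.home} placeholder is expanded, the --spring.config.location case is not handled, and on Python versions without tomllib a crude pattern match stands in for a real TOML parser.

```python
import re
from pathlib import Path
from typing import Optional

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    tomllib = None


def find_service_config(workdir: Path, home: Path) -> Path:
    """Look for annis.webservice-config, else fall back to the default file."""
    candidates = (
        workdir / "application.properties",
        workdir / "config" / "application.properties",
    )
    for candidate in candidates:
        if not candidate.is_file():
            continue
        for line in candidate.read_text(encoding="utf-8").splitlines():
            key, sep, value = line.partition("=")
            if sep and key.strip() == "annis.webservice-config":
                # Assumption: only the ${user.home} placeholder needs expanding.
                return Path(value.strip().replace("${user.home}", str(home)))
    return home / ".annis" / "service.toml"


def read_graphannis_dir(service_config: Path) -> Optional[str]:
    """Return the graphannis field of the [database] section, if present."""
    text = service_config.read_text(encoding="utf-8")
    if tomllib is not None:
        return tomllib.loads(text).get("database", {}).get("graphannis")
    # Fallback for older Python versions: a pattern match, not a TOML parser.
    match = re.search(r'^\s*graphannis\s*=\s*"([^"]*)"', text, flags=re.MULTILINE)
    return match.group(1) if match else None
```

The value returned by read_graphannis_dir is the folder that db_dir should be set to when importing from Python.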

BTW, if the only thing the script does is to import the corpus, you can also achieve that as a one-liner by using the graphANNIS CLI (https://korpling.github.io/graphANNIS/docs/v2/cli.html):

annis ~/.annis/v4 -c 'import relannis/GUM'

gcelano commented 7 months ago

@thomaskrause, thanks, it works now. However, I am experiencing a few issues with corpus query because my corpus is huge (about 34M tokens: https://zenodo.org/records/8158675). On my local machine (with 28 vCPUs), the corpus is about 14GB once imported in the v4 directory, and when I try to query it, it works, but there is some noticeable query latency. On my server, which at the moment has far fewer resources (5 vCPU), I have not yet been able to import the corpus, probably because the computer is too slow: I am therefore wondering whether you might have a suggestion about a good server configuration that can cope with the size of my corpus.

thomaskrause commented 6 months ago

@gcelano I think a server with at least 32GB RAM is a good target for your configuration. It is important that the 16GB+ of memory is also actually configured to be used: the conservative default configuration is to only use 25% of the free memory available (https://korpling.github.io/graphANNIS/docs/v2/rest/configuration.html#database-section)
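
To see why the default matters, here is a small arithmetic sketch. The function default_cache_budget_gb is hypothetical, and the 14 GB figure is the on-disk size reported earlier in the thread, used here only as a rough stand-in for the memory the corpus would need, which is an assumption.

```python
def default_cache_budget_gb(free_memory_gb: float, percent: float = 25.0) -> float:
    """Memory the cache may use when limited to a percentage of free memory."""
    return free_memory_gb * percent / 100.0


corpus_gb = 14  # on-disk size from the thread; in-memory needs may differ
for free_gb in (16, 32, 64):
    budget = default_cache_budget_gb(free_gb)
    fits = budget >= corpus_gb
    print(f"{free_gb} GB free: {budget:.1f} GB budget, holds {corpus_gb} GB: {fits}")
```

Under these assumptions, a machine with 32 GB of free memory would get a budget of only 8 GB with the 25% default, which is why the limit has to be raised explicitly.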

Parallel execution should speed up queries when there is at least one operator, but for "simple" ones like tok it does not help. We are also trying to improve the situation for larger corpora by optimizing more for 100 M words "flat" corpora (we use 1/10th of the DEWAC corpus as goal) and publishing performance fixes every month or so. The changes are normally in the graphANNIS changelog (https://github.com/korpling/graphANNIS/blob/main/CHANGELOG.md) and not the ANNIS one.

Please also note that there has been an issue with memory consumption before graphANNIS 3.0.0 (or ANNIS 4.10.5) which was especially problematic for larger corpora and when the query results are large. This has been fixed in the latest release and should also help with performance.

But I also see an issue with the current version: just executing find on the corpus takes around 10 seconds. While you could disable sorting of the results (in the search options of the ANNIS UI), which would speed it up, there should be improvements in this area. I opened a separate issue to work on the find performance: https://github.com/korpling/graphANNIS/issues/276

korpling / ANNIS

Importing corpus in Annis 4 #832