Closed gcelano closed 6 months ago
It looks like there is another (graph)ANNIS process still running. Can you check whether restarting your computer fixes the issue? It can happen that you close the ANNIS Desktop interface while the graphANNIS background service keeps running and blocks the database. Restarting the computer is the easiest way to make sure all processes are terminated.
If there is a pattern to non-stopped graphANNIS background services (it always fails to stop when X), then we might need to open a separate issue. If restarting the computer fixes the issue, please close the issue.
Sorry, I missed the part about the corpus import failing. What does not work is having ANNIS Desktop open and importing on the command line at the same time. You can either import a ZIP file with the corpus in ANNIS Desktop, or stop ANNIS Desktop before running the CLI (http://korpling.github.io/ANNIS/4.10/user-guide/import-and-config/import.html#importing-a-corpus-using-the-command-line).
If I use annis-desktop, everything works, but a query for my ~35M token corpus takes minutes. My problem is getting the corpus available for annis-server. The corpus is in ~/.annis/v4 (previously imported via annis-desktop), but when I run annis-server, I do not see the corpus in the GUI. How can I load it there? I can also use graphAnnis and query it there (fast!), but it is not clear how to make the corpus available for search in the GUI of the server instance.
If the corpus is in ~/.annis/v4, the issue is probably that it is not visible to users who are not logged in. You can set a property in the service configuration to make all corpora available without login: http://korpling.github.io/ANNIS/4.10/user-guide/configuration/user.html#allowing-anonymous-access-to-all-corpora
I have Annis 4.10.3 running on a server, but I cannot import any dataset using graphAnnis-python:
>>> from graphannis.cs import CorpusStorageManager
>>> from graphannis.graph import GraphUpdate
>>> with CorpusStorageManager() as cs:
... # import relANNIS corpus with automatic name
... corpus_name = cs.import_from_fs("relannis/GUM")
... print(corpus_name)
The code does not return any error, but I cannot see any corpus in the web interface at :5712 (or in .annis/v4). Any idea why?
You are initializing cs with the default directory setting:
Init signature: CorpusStorageManager(db_dir='data/', use_parallel=True)
If your goal is to import the corpus to the data directory that the web service uses, you'd need to set db_dir to your .annis/v4 folder. Make sure the directory is not accessed by a web service instance simultaneously (you'd find a db.lock in the directory).
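A minimal sketch of that advice, assuming the web service reads its corpora from ~/.annis/v4 and that the graphannis Python package is installed. The helper names (resolve_db_dir, is_locked, import_into_storage) are made up for illustration and are not part of graphANNIS; only CorpusStorageManager, its db_dir argument and import_from_fs come from the posts above. The presence of db.lock is treated as a hint only, so the sketch warns instead of aborting.

```python
import warnings
from pathlib import Path


def resolve_db_dir(db_dir: str = "~/.annis/v4") -> Path:
    """Expand '~' and return the absolute corpus storage folder."""
    return Path(db_dir).expanduser().resolve()


def is_locked(db_dir: Path) -> bool:
    """Return True if a db.lock file is present in the storage folder."""
    return (db_dir / "db.lock").exists()


def import_into_storage(corpus_path: str, db_dir: str = "~/.annis/v4") -> str:
    """Import a relANNIS corpus into the given storage folder (hypothetical helper)."""
    # Imported lazily so the helpers above also work without graphannis installed.
    from graphannis.cs import CorpusStorageManager

    target = resolve_db_dir(db_dir)
    if is_locked(target):
        # Heuristic only: the file may indicate a running service instance.
        warnings.warn(f"{target} contains db.lock; is an ANNIS service still running?")
    with CorpusStorageManager(db_dir=str(target)) as cs:
        # Same call as in the question, but against the service's folder.
        return cs.import_from_fs(corpus_path)
```

With this in place, calling import_into_storage("relannis/GUM") would write the corpus into the folder the web service uses rather than into the default data/ directory.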
When importing, I get the following files, but no corpus is then available in the GUI (moreover, db.lock always appears when importing, even if no Annis instance is running):
drwxr-xr-x 2 gcelano gcelano 4096 Nov 13 16:18 .
drwxr-xr-x 3 gcelano gcelano 4096 Nov 8 14:00 ..
-rw-r--r-- 1 gcelano gcelano 20480 Nov 13 16:18 frontend_data.h2.mv.db
-rw-r--r-- 1 gcelano gcelano 6159 Nov 12 20:55 frontend_data.h2.trace.db
-rw-r--r-- 1 gcelano gcelano 28672 Nov 12 20:55 service_data.sqlite3
@gcelano The issue might be that the ANNIS server service uses a different corpus storage folder. By default, ANNIS uses the ${user.home}/.annis/service.toml service configuration file. It has fields that configure where the corpus storage folder is and where the service expects its sqlite file to be located.
[database]
graphannis = "/home/thomas/.annis/v4"
sqlite = "/home/thomas/.annis/v4/service_data.sqlite3"
The service configuration file can be changed as well, by using a setting in the application.properties file (http://korpling.github.io/ANNIS/4.10/user-guide/configuration/index.html).
So I would check:
Is there an application.properties file in the current working directory or in a config sub-directory of the working directory of the ANNIS server, or is the service started with a --spring.config.location argument?
If annis.webservice-config is configured there, use the file it points to as the backend configuration file to check.
Otherwise, check ${user.home}/.annis/service.toml.
The relevant setting in the backend configuration file is:
[database]
graphannis="..."
This is where the db_dir argument of the CorpusStorageManager in the Python script should point to. I also updated the graphANNIS Python library to the newest graphANNIS version in case there is some weird incompatibility (which should not be the case).
BTW, if the only thing the script does is import the corpus, you can also achieve that as a one-liner by using the graphANNIS CLI (https://korpling.github.io/graphANNIS/docs/v2/cli.html):
annis ~/.annis/v4 -c 'import relannis/GUM'
@thomaskrause, thanks, it works now. However, I am experiencing a few issues with corpus query because my corpus is huge (about 34M tokens: https://zenodo.org/records/8158675). On my local machine (with 28 vCPUs), the corpus is about 14GB once imported in the v4 directory, and when I try to query it, it works, but there is some noticeable query latency. On my server, which at the moment has far fewer resources (5 vCPU), I have not yet been able to import the corpus, probably because the computer is too slow: I am therefore wondering whether you might have a suggestion about a good server configuration that can cope with the size of my corpus.
@gcelano I think a server with at least 32GB RAM is a good target for your configuration. It is important that the 16GB+ memory is also actually configured to be used: the conservative default configuration is to only use 25% of the free memory available (https://korpling.github.io/graphANNIS/docs/v2/rest/configuration.html#database-section).
Parallel execution should speed up queries when there is at least one operator, but for "simple" ones like tok it does not help. We are also trying to improve the situation for larger corpora by optimizing more for 100M-word "flat" corpora (we use 1/10th of the DEWAC corpus as the goal) and publishing performance fixes every month or so. The changes are normally in the graphANNIS changelog (https://github.com/korpling/graphANNIS/blob/main/CHANGELOG.md) and not the ANNIS one.
Please also note that there has been an issue with memory consumption before graphANNIS 3.0.0 (or ANNIS 4.10.5) which was especially problematic for larger corpora and when the query results are large. This has been fixed in the latest release and should also help with performance.
But I also see an issue with the current version: just executing find on the corpus takes around 10 seconds. You could disable sorting of the results (in the search options of the ANNIS UI), which would speed it up, but there should still be improvements in this area. I opened a separate issue to work on the find performance: https://github.com/korpling/graphANNIS/issues/276
Describe the bug
I cannot import (via the command 'import') a corpus in Annis 4 server. After executing the command "import directory", nothing happens, and when Annis server is run (java -jar annis-4.10.2-server.jar), I get the error in the screenshot.
System:
Operating System: Ubuntu 22.04
Browser: Chrome
Java Version: openjdk 11.0.19
ANNIS Version: 4, desktop version