Open seliverstov-de opened 9 months ago
What does the end of the scan log file say?
Did you check to see if the file is actually present? For example, run docker exec -it sist2-admin bash, then cd /sist2-admin.
What does the end of the scan log file say?
At the moment, indexing looks like this:
The log looks like this (no longer updated):
On the Test job page, the status is failed:
For this reason, I cannot create a frontend.
Did you check to see if the file is actually present? For example, run docker exec -it sist2-admin bash, then cd /sist2-admin.
If you mean the file /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2, it wasn't there when I started indexing with OCR.
The file was created only after indexing without OCR but, as I wrote earlier, that run also failed.
Next, I deleted the sist2-admin-data folder and recreated the container. I then created the Test job, selected the sqlite search engine with OCR, and started indexing. Indexing completed without problems.
After that, I changed the search engine to elasticsearch in the Test job and started indexing again. It also completed without problems.
This line seems interesting:
2024-01-24 06:06:46 [DEBUG database.c] Closing database /dev/shm/sist2-ipc-31.sqlite (0x565134449bb8)
Elasticsearch can't work without sqlite indexing?
To confirm the result, I repeated the same steps and got the error again:
[ADMIN ] Starting sist2 command with args ['/root/sist2', 'sqlite-index', '/sist2-admin/scan-Test-2024-01-24 06:24:40.549224.sist2', '--search-index', '/sist2-admin/search-index-sqlite.sist2', '--json-logs', '--very-verbose']
2024-01-24 06:25:51 [FATAL cli.c] Invalid PATH argument. File not found: /sist2-admin/scan-Test-2024-01-24 06:24:40.549224.sist2
[ADMIN ] Sist2Scan task finished return_code=-10, duration=datetime.timedelta(microseconds=3477)
I don't understand what's wrong.
When a job is run, it creates two tasks: (1) scan and (2) index.
During the scan task, sist2 goes through the files, extracts the requested data, and stores it in a .sist2 SQLite database.
During the index task, if the scan was successful, sist2 loads the data from the .sist2 database into the index of your choice, Elasticsearch or SQLite, whichever you have set up as the search backend.
To me, it looks like the scan may be failing without being marked as failed for some reason, so the .sist2 database is never created by the scan. Thus, when the index task runs, it fails because the .sist2 database is not there.
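The check described above (is the scan output present before the index task runs?) can be sketched in Python. This is only an illustration, not part of sist2: the function names find_scan_files and scan_file_ready are hypothetical, and the scan-<job>-<timestamp>.sist2 naming pattern is an assumption inferred from the paths shown in the logs in this thread.

```python
from pathlib import Path


def find_scan_files(admin_dir, job_name):
    """Return the scan files found for a job, sorted by file name.

    Assumes the naming pattern seen in the logs above:
    scan-<job name>-<timestamp>.sist2
    """
    return sorted(Path(admin_dir).glob(f"scan-{job_name}-*.sist2"))


def scan_file_ready(path):
    """Return True if the scan file exists and is not empty."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0


if __name__ == "__main__":
    # /sist2-admin is the directory used in this thread; adjust as needed.
    for scan_file in find_scan_files("/sist2-admin", "Test"):
        print(scan_file, "ready" if scan_file_ready(scan_file) else "missing or empty")
```

Note that the scan file names contain a space (between the date and the time), so the path has to be quoted when it is checked by hand in a shell.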
Elasticsearch can't work without sqlite indexing?
That isn't from sqlite indexing. During scanning, the metadata and content from the files are stored in an SQLite database. During Elasticsearch indexing, an index is created or updated, and the file metadata and content are stored in Elasticsearch. The .sist2 file also stores metadata about the scan that was performed, as well as information such as embeddings, the stats page aggregations and treemaps, the tags, the thumbnails, version info, etc.
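Since the .sist2 file is described here as an SQLite database, one way to confirm that a scan actually produced something is to open it read-only and list its tables. The sketch below is a generic SQLite inspection, not a sist2 tool: list_tables is a hypothetical helper, and no assumption is made about which tables sist2 creates.

```python
import sqlite3
from pathlib import Path
from urllib.parse import quote


def list_tables(sist2_path):
    """Open an SQLite file read-only and return its table names."""
    path = Path(sist2_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(path)
    # Percent-encode the path because scan file names contain spaces and colons.
    uri = "file:" + quote(path.as_posix()) + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Opening the file with mode=ro avoids creating an empty database by accident when the path is wrong, which would otherwise hide the "file not found" condition reported in the log.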
Device Information (please complete the following information):
OS: Debian 12
Deployment: Docker
Browser: Chrome
sist2 version: 3.3.4
Elasticsearch version: 7.17.9
Hi, I'm using the following docker-compose.yml:
I have verified that my files are present in the container at the following path: /host-files. I created a job called Test in sist2-admin, selected the elasticsearch search engine (test result: Elasticsearch version 7.17.9), set the files path to /host-files2/, and clicked "Enable image file recognition" in the Tesseract settings with the eng and rus languages. When I start indexing, I get the following error:
Running indexing with the sqlite search engine produces the same result:
Indexing without OCR, however, starts without problems, although the log contains many errors of the following type:
Initially, I had about 3,000 .doc, .docx and .pdf files, and I only managed to index them after setting the ownership and permissions of the /host-files2 folder to root:root 755. I have since uploaded several times as many files of various types (including image files), but I could not index them with the new job (with OCR).
Steps To Reproduce
Please be specific!
1. Open sist2-admin
2. Select [job name] and click Index now
3. Open Tasks and see the indexing error
Expected behavior
Indexing of files using OCR and the selected search engine should begin.
Actual Behavior
I get an error related to the missing scan file /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2
Screenshots
The indexing process without OCR:
Recurring errors with such indexing:
Error when starting indexing with OCR:
Unfortunately, I have no more ideas on how to fix this error. I experimented with the permissions on the files folder, deleted sist2-admin-data, and recreated the containers, but nothing helped. I really like your product, and I would like it to continue to develop. I hope for your help, thank you!