simon987 / sist2

Lightning-fast file system indexer and search tool
GNU General Public License v3.0

Indexing error: Invalid PATH argument. File not found #463

Open seliverstov-de opened 7 months ago

seliverstov-de commented 7 months ago

Device Information (please complete the following information):

Hi, I'm using the following docker-compose.yml:

version: "3"

services:
  elasticsearch:
    image: elasticsearch:7.17.9
    restart: unless-stopped
    environment:
      - "discovery.type=single-node"
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
  sist2-admin:
    image: simon987/sist2:3.3.4-x64-linux
    restart: unless-stopped
    volumes:
      - ./container-data/sist2-admin-data/:/sist2-admin/
      - ./container-data/files:/host-files2
    ports:
      - 4090:4090 # sist2
      - 8080:8080 # sist2-admin
    working_dir: /root/sist2-admin/
    entrypoint: python3 /root/sist2-admin/sist2_admin/app.py

I have verified that my files are present in the container at the following path: /host-files2. I created a job called Test in sist2-admin, selected elasticsearch as the search engine (test result: Elasticsearch version 7.17.9), pointed it at the files in /host-files2/ and turned on "Enable image file recognition" with the Tesseract languages eng and rus. When I start indexing, I get the following error:

 [ADMIN ] Starting sist2 command with args ['/root/sist2', 'index', '/sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2', '--threads=1', '--es-url=http://elasticsearch:9200', '--es-index=sist2', '--batch-size=70', '--incremental-index', '--json-logs', '--very-verbose']
2024-01-23 18:36:06 [FATAL cli.c] Invalid PATH argument. File not found: /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2
 [ADMIN ] Sist2Scan task finished return_code=-10, duration=datetime.timedelta(microseconds=3736)

Running indexing with the search engine sqlite produces the same result:

 [ADMIN ] Starting sist2 command with args ['/root/sist2', 'sqlite-index', '/sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2', '--search-index', '/sist2-admin/search-index-sqlite.sist2', '--json-logs', '--very-verbose']
2024-01-23 18:36:31 [FATAL cli.c] Invalid PATH argument. File not found: /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2
 [ADMIN ] Sist2Scan task finished return_code=-10, duration=datetime.timedelta(microseconds=3716)

Indexing without OCR, however, starts without problems, although the log contains many errors like these:

2024-01-23 19:01:37 [ERROR ooxml.c] Got fatal XML error while parsing document: Start tag expected, '<' not found
2024-01-23 19:01:37 [ERROR ooxml.c] Got fatal XML error while parsing document: Start tag expected, '<' not found
2024-01-23 19:01:39 [ERROR ooxml.c] Got fatal XML error while parsing document: Start tag expected, '<' not found

Initially I had about 3,000 .doc, .docx and .pdf files, and I managed to index them after setting the ownership and permissions of the /host-files2 folder to root:root 755. I have since uploaded several times as many files of various types (including image files), but I cannot index them with the new job (with OCR).

Steps To Reproduce

  1. Go to sist2-admin
  2. Click on [job name] and Index now
  3. Click on Tasks and see the indexing error

Expected behavior: Indexing of files using OCR and the selected search engine should begin.

Actual behavior: I get an error about the missing scan file /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2

Screenshots

The indexing process without OCR: image

Recurring errors with such indexing: image

Error when starting indexing with OCR: image

Unfortunately, I have run out of ideas for fixing this error. I tried changing the permissions on the files folder, and I deleted sist2-admin-data and recreated the containers; nothing helped. I really like your product and would like it to keep developing. I hope you can help, thank you!

dpieski commented 7 months ago

What does the end of the scan log file say? Did you check to see if the File is actually present? Like use docker exec -it sist2-admin bash then cd /sist2-admin
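If a shell in the container is awkward, the same check can be scripted. This is only a minimal sketch: `list_scan_files` is a hypothetical helper (not part of sist2), and it assumes the scan files sit directly in the directory you pass in and end in `.sist2`, as in the log above.

```python
from datetime import datetime
from pathlib import Path


def list_scan_files(directory):
    """Return (name, size_bytes, modified) for each *.sist2 entry in directory.

    Hypothetical helper for this thread, not part of sist2.
    Returns an empty list if the directory does not exist.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    rows = []
    for path in sorted(root.glob("*.sist2")):
        info = path.stat()
        rows.append((path.name, info.st_size, datetime.fromtimestamp(info.st_mtime)))
    return rows
```

Calling `list_scan_files("/sist2-admin")` inside the container and comparing the names with the path in the FATAL line should show whether the scan file was ever written.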

seliverstov-de commented 7 months ago

What does the end of the scan log file say?

At the moment, indexing looks like this: image

The log looks like this (it is no longer updating): image

On the Test job page, the status is failed: image

For this reason, I cannot create a frontend.

Did you check to see if the File is actually present? Like use docker exec -it sist2-admin bash then cd /sist2-admin

If you mean the file /sist2-admin/scan-Test-2024-01-23 18:36:05.735720.sist2, it wasn't there when I started indexing with OCR. The file was created only after indexing without OCR, but, as I wrote earlier, that also failed.

image

seliverstov-de commented 7 months ago

I have now deleted the sist2-admin-data folder and recreated the container. I then created the Test job, selected the sqlite search engine with OCR and started indexing; it completed without problems. image

After that, I changed the search engine of the Test job to elasticsearch and started indexing again; it also completed without problems. image

This line seems interesting:

2024-01-24 06:06:46 [DEBUG database.c] Closing database /dev/shm/sist2-ipc-31.sqlite (0x565134449bb8)

Elasticsearch can't work without sqlite indexing?

seliverstov-de commented 7 months ago

To double-check, I repeated the same steps, and I get the error again:

 [ADMIN ] Starting sist2 command with args ['/root/sist2', 'sqlite-index', '/sist2-admin/scan-Test-2024-01-24 06:24:40.549224.sist2', '--search-index', '/sist2-admin/search-index-sqlite.sist2', '--json-logs', '--very-verbose']
2024-01-24 06:25:51 [FATAL cli.c] Invalid PATH argument. File not found: /sist2-admin/scan-Test-2024-01-24 06:24:40.549224.sist2
 [ADMIN ] Sist2Scan task finished return_code=-10, duration=datetime.timedelta(microseconds=3477)

I don't understand what's wrong.

dpieski commented 7 months ago

When a Job is run, it creates two tasks: (1) Scan and (2) Index. During the scan task, SIST2 goes through the files, pulls out the requested data, and stores it in a .sist2 sqlite db. During the index task, if the scan was successful, SIST2 provides the data from the .sist2 db to the index of your choice, ES or SQLite, whichever you have set up as the search backend.

To me, it looks like the scan may be failing without being marked as failed for some reason, so the sist2 database is never created by the scan. Thus, when the index task runs, it fails because the sist2 database is not there.
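To make the failure mode concrete, here is a rough sketch of a two-step job that refuses to index when the scan left no output. It is not sist2-admin's actual code: `run_job` and its stage labels are hypothetical, and the commands are whatever argument lists you would otherwise run by hand.

```python
import subprocess
from pathlib import Path


def run_job(scan_cmd, index_cmd, scan_file):
    """Run the scan, then index only if the scan produced its output file.

    Hypothetical helper, not sist2-admin's implementation.
    scan_cmd and index_cmd are argument lists; scan_file is the path the
    scan is expected to create. Returns (stage, return_code).
    """
    scan = subprocess.run(scan_cmd)
    if scan.returncode != 0:
        return ("scan", scan.returncode)
    if not Path(scan_file).is_file():
        # The scan reported success but left nothing to index.
        return ("scan-output-missing", scan.returncode)
    index = subprocess.run(index_cmd)
    return ("index", index.returncode)
```

A result of `("scan-output-missing", 0)` would correspond to what I suspect is happening here: the scan exits cleanly, yet the .sist2 file the index step needs is not on disk.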

Elasticsearch can't work without sqlite indexing?

That isn't from sqlite indexing. During scanning, the metadata and content from the files are stored in an sqlite db. During ES indexing, an index is made/updated and the file metadata and content are stored in ES. The sist2 file also stores metadata about the scan that was performed, as well as information such as embeddings, the stats page aggregations and treemaps, the tags, the thumbnails, version info, etc.
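Since the .sist2 file is an sqlite db, you can peek inside one with Python's standard sqlite3 module to confirm the scan wrote something. A minimal sketch: `list_tables` is a hypothetical helper, and it makes no assumption about which tables sist2 creates; it only reports what is there.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, or None if the file is missing.

    Hypothetical helper for this thread, not part of sist2.
    """
    path = Path(db_path)
    if not path.is_file():
        return None
    # Open read-only via a file: URI so inspecting the file cannot modify it.
    # as_uri() percent-encodes spaces and colons in the file name.
    conn = sqlite3.connect(path.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

`None` means the file is not there at all, which would match the "File not found" error; an empty list means the file exists but holds no tables.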