simon987 / sist2

Lightning-fast file system indexer and search tool
GNU General Public License v3.0

Sist2 grabs more and more memory then gets killed #486

Closed MparkG closed 2 months ago

MparkG commented 2 months ago

I am running a scan on a folder of spidered websites. The sist2 DB was 150 GB after indexing. Then sist2 kept running, got killed, and now the DB is only 15 GB. Going by the sporadic folders shown in the database, I conclude that search data has been lost. I reran the sist2 scan several times and it was killed for memory consumption every time. It also did not complete the search database again; it is still 15 GB and incomplete. I shall delete it and restart the week-long scan. The machine has 16 GB of RAM, but of course other programs need some RAM, too.

[16375.040801] Out of memory: Killed process 5561 (sist2) total-vm:17649280kB, anon-rss:9787700kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:34464kB oom_score_adj:0
[16378.354342] oom_reaper: reaped process 5561 (sist2), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[25517.347452] sist2 invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
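
For readers who want the kernel's kB figures in more familiar units, here is a minimal sketch that parses the first log line above and converts each kB field to GiB. The helper name `parse_oom_line` is made up for illustration; the log line itself is copied from this report.

```python
import re

# Line copied verbatim from the kernel log quoted above.
OOM_LINE = (
    "[16375.040801] Out of memory: Killed process 5561 (sist2) "
    "total-vm:17649280kB, anon-rss:9787700kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:1001 pgtables:34464kB oom_score_adj:0"
)


def parse_oom_line(line):
    """Hypothetical helper: map each 'name:<n>kB' field to its size in GiB."""
    fields = re.findall(r"([\w-]+):(\d+)kB", line)
    return {name: int(kb) / (1024 * 1024) for name, kb in fields}


sizes = parse_oom_line(OOM_LINE)
for name, gib in sizes.items():
    print(f"{name:>10}: {gib:6.2f} GiB")
```

The `anon-rss` field is the one to watch here: it works out to roughly 9.3 GiB of resident anonymous memory held by sist2 at the moment it was killed, on a machine with 16 GB of RAM.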

sist2 works fine and even displays "Done" at the end, but it does not stop there. It continues to run, doing something with the DB, and that is when it crashes.

./sist2 scan '/folder' -o ./folder.sist2 --incremental -t 2 --optimize-index --name="Folder" --mem-buffer=2000 --verbose

It may be the optimize index part, idk.


dpieski commented 2 months ago

What version are you running?

After "Done!", there are two more lines. That "Done!" refers to the "Generating stats" step.

2024-06-19 05:25:42 [INFO database.c] Generating stats
2024-06-19 06:22:16 [INFO database.c] Treemap merge iteration (2783798 rows changed)
2024-06-19 06:22:21 [INFO database.c] Treemap merge iteration (66257 rows changed)
2024-06-19 06:22:22 [INFO database.c] Treemap merge iteration (6799 rows changed)
2024-06-19 06:22:22 [INFO database.c] Treemap merge iteration (1969 rows changed)
2024-06-19 06:22:22 [INFO database.c] Treemap merge iteration (499 rows changed)
2024-06-19 06:22:22 [INFO database.c] Treemap merge iteration (203 rows changed)
2024-06-19 06:22:22 [INFO database.c] Done!
2024-06-19 06:22:22 [DEBUG database.c] Closing database /sist2-admin/scan-general-2024-06-05 15:15:01.088421.sist2 (0x60fe588ce3a8)
 [ADMIN ] Save last_index_date=2024-06-19 06:22:22.989776

And if you have "optimize db" selected in the scan settings, you will have a line like this after the "Closing database ..." line:

2024-06-19 01:31:59 [DEBUG database.c] Optimizing database

MparkG commented 2 months ago

Hi! I see it did not complete with "Done".

The version is 3.3.6. I see there is at least one newer one, 3.4. Should I use that, or go by what is currently on git?

server@mgp:~/search$ ./sist2 -v
3.3.6server@mgp:~/search$ 

I watched the journal file change in size. Other than that, I don't know what exactly it is doing. It may be rewriting the DB to file? I checked the results with the web app, and it seems many folders are missing from the database. I believe it wrote the DB as far as the RAM allowed. I have another 8 GB of swap, which likely allowed that: given that other programs usually use around 9 GB, the resulting 15 GB makes sense (16 GB RAM - 9 GB used + 8 GB swap = 15 GB).

Is the DB not written in sequences or batches of entries, if necessary with a temporary database holding modifications to be applied to the actual DB (idk if that is what the journal file is for)?

What I don't understand is this: while it is scanning the folder, it adds to the DB until it reaches 150 GB, no problem. After that it does something else, which attempts to load the DB into RAM or performs similarly memory-intensive operations. That must be the step I jump to when I restart the scan command on the already scanned folder. When the scan first completed, I actually noticed the PC being unresponsive and forcibly restarted it, not knowing what was going on. The DB size was still 150 GB after that.

dpieski commented 2 months ago

I have been using Docker instead of the linux binaries. The current version in Docker is v3.4.2. You may try using the Docker version.

The VM I run it on only has 16GB ram and 16 cores. One of the folders I scan is about 4.7 million files and 4.2 TB - I have the bytes extracted set to 250k and thumbnail creation turned off.

If you are adamant about running the binaries, I think you would have to build the latest version. You may want to use the --very-verbose flag to get more logging. Additionally, if you build it, I would use the debug version. That will give more information about any memory leaks.

MparkG commented 2 months ago

Thanks for letting me know all of your settings and your use case! I will try it on the folder without graphics, or at least with much smaller thumbnails. What does the option "bytes extracted" mean?

--content-size=<int> Number of bytes to be extracted from text documents. Set to 0 to disable. DEFAULT: 32768

What does it set? Does it limit the information gathered from a file? It wouldn't make sense to limit the information if what I want to do is search throughout all of it.
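
To make the question concrete, here is a rough sketch of what a byte limit like the one in the help text would mean for searching. It assumes the limit keeps the leading bytes of each document, which the help text does not spell out, and `extract_text` is a made-up helper, not sist2 code.

```python
DEFAULT_CONTENT_SIZE = 32768  # default value quoted in the help text above


def extract_text(data: bytes, content_size: int = DEFAULT_CONTENT_SIZE) -> str:
    """Hypothetical helper: keep at most content_size leading bytes of text.

    A content_size of 0 disables extraction, mirroring the help text.
    Keeping the *leading* bytes is an assumption made for this sketch.
    """
    if content_size <= 0:
        return ""
    # errors="ignore" drops a multi-byte character cut in half by the limit.
    return data[:content_size].decode("utf-8", errors="ignore")


# A document larger than the limit, with a marker at each end.
doc = b"needle-at-start " + b"x" * 40000 + b" needle-at-end"
indexed = extract_text(doc)

print("needle-at-start" in indexed)
print("needle-at-end" in indexed)
```

Under that assumption, text beyond the limit never reaches the index, so a larger value trades index size for coverage of long documents.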

I will not touch Docker; it's worthless and overcomplicated, imo. I will compile it later.

Btw, do I get this option right?

--mem-buffer=<int> Maximum memory buffer size per thread in MiB for files inside archives (see USAGE.md).

Does it mean a file of at most 2 GB can be opened within an archive of any size, i.e. more than 2 GB?

MparkG commented 2 months ago

In scripts/before_build.sh, "cd .." needs to be commented out for it to compile.

#!/usr/bin/env bash
(
  #cd ..
  rm -rf index.sist2

  python3 scripts/mime.py > src/parsing/mime_generated.c
  python3 scripts/serve_static.py > src/web/static_generated.c
  python3 scripts/index_static.py > src/index/static_generated.c
  python3 scripts/magic_static.py > src/magic_generated.c

  printf "static const char *const Sist2CommitHash = \"%s\";\n" $(git rev-parse HEAD) > src/git_hash.h
)

Then it stops at this:

[ 76%] Building C object CMakeFiles/sist2.dir/src/web/serve.c.o
/home/server/search/git_sist2/src/web/serve.c: In function ‘get_embedding’:
/home/server/search/git_sist2/src/web/serve.c:53:72: error: ‘struct mg_str’ has no member named ‘ptr’
   53 |     if (hm->uri.len != SIST_SID_LEN + 2 + 4 || !parse_sid(&sid, hm->uri.ptr + 3)) {
      |                                                                        ^
In file included from /home/server/search/git_sist2/src/sist.h:33,
                 from /home/server/search/git_sist2/src/web/serve.h:4,
                 from /home/server/search/git_sist2/src/web/serve.c:1:
/home/server/search/git_sist2/src/web/serve.c:54:89: error: ‘struct mg_str’ has no member named ‘ptr’
   54 |         LOG_DEBUGF("serve.c", "Invalid embedding path: %.*s", (int) hm->uri.len, hm->uri.ptr);
      |                                                                                         ^
/home/server/search/git_sist2/src/log.h:16:72: note: in definition of macro ‘LOG_DEBUGF’
   16 |     if (LogCtx.very_verbose) {sist_logf(filepath, LOG_SIST_DEBUG, fmt, __VA_ARGS__);}}while(0)
      |                                                                        ^~~~~~~~~~~
/home/server/search/git_sist2/src/web/serve.c:59:40: error: ‘struct mg_str’ has no member named ‘ptr’
   59 |     int model_id = (int) strtol(hm->uri.ptr + SIST_SID_LEN + 3, NULL, 10);
      |                                        ^
/home/server/search/git_sist2/src/web/serve.c: In function ‘stats_files’:
/home/server/search/git_sist2/src/web/serve.c:89:33: error: ‘struct mg_str’ has no member named ‘ptr’
   89 |     memcpy(index_id_str, hm->uri.ptr + 3, 8);
      |                                 ^
/home/server/search/git_sist2/src/web/serve.c:93:34: error: ‘struct mg_str’ has no member named ‘ptr’
   93 |     memcpy(arg_stat_type, hm->uri.ptr + 3 + 9, 4);
      |                                  ^
/home/server/search/git_sist2/src/web/serve.c: In function ‘thumbnail_with_num’:
/home/server/search/git_sist2/src/web/serve.c:182:72: error: ‘struct mg_str’ has no member named ‘ptr’
  182 |     if (hm->uri.len != SIST_SID_LEN + 2 + 4 || !parse_sid(&sid, hm->uri.ptr + 3)) {
      |                                                                        ^
In file included from /home/server/search/git_sist2/src/sist.h:33,
                 from /home/server/search/git_sist2/src/web/serve.h:4,
                 from /home/server/search/git_sist2/src/web/serve.c:1:
/home/server/search/git_sist2/src/web/serve.c:183:89: error: ‘struct mg_str’ has no member named ‘ptr’
  183 |         LOG_DEBUGF("serve.c", "Invalid thumbnail path: %.*s", (int) hm->uri.len, hm->uri.ptr);
      |                                                                                         ^
/home/server/search/git_sist2/src/log.h:16:72: note: in definition of macro ‘LOG_DEBUGF’
   16 |     if (LogCtx.very_verbose) {sist_logf(filepath, LOG_SIST_DEBUG, fmt, __VA_ARGS__);}}while(0)
      |                                                                        ^~~~~~~~~~~
/home/server/search/git_sist2/src/web/serve.c:188:35: error: ‘struct mg_str’ has no member named ‘ptr’
  188 |     int num = (int) strtol(hm->uri.ptr + SIST_SID_LEN + 3, NULL, 10);
      |                                   ^
/home/server/search/git_sist2/src/web/serve.c: In function ‘thumbnail’:
/home/server/search/git_sist2/src/web/serve.c:196:54: error: ‘struct mg_str’ has no member named ‘ptr’
  196 |     if (hm->uri.len != 20 || !parse_sid(&sid, hm->uri.ptr + 3)) {
      |                                                      ^
In file included from /home/server/search/git_sist2/src/sist.h:33,
                 from /home/server/search/git_sist2/src/web/serve.h:4,
                 from /home/server/search/git_sist2/src/web/serve.c:1:
/home/server/search/git_sist2/src/web/serve.c:197:89: error: ‘struct mg_str’ has no member named ‘ptr’
  197 |         LOG_DEBUGF("serve.c", "Invalid thumbnail path: %.*s", (int) hm->uri.len, hm->uri.ptr);
      |                                                                                         ^
/home/server/search/git_sist2/src/log.h:16:72: note: in definition of macro ‘LOG_DEBUGF’
   16 |     if (LogCtx.very_verbose) {sist_logf(filepath, LOG_SIST_DEBUG, fmt, __VA_ARGS__);}}while(0)
      |                                                                        ^~~~~~~~~~~
/home/server/search/git_sist2/src/web/serve.c: In function ‘search’:
/home/server/search/git_sist2/src/web/serve.c:213:26: error: ‘struct mg_str’ has no member named ‘ptr’
  213 |     memcpy(body, hm->body.ptr, hm->body.len);
      |                          ^
/home/server/search/git_sist2/src/web/serve.c: In function ‘file’:
/home/server/search/git_sist2/src/web/serve.c:419:54: error: ‘struct mg_str’ has no member named ‘ptr’
  419 |     if (hm->uri.len != 20 || !parse_sid(&sid, hm->uri.ptr + 3)) {
      |                                                      ^
In file included from /home/server/search/git_sist2/src/sist.h:33,
                 from /home/server/search/git_sist2/src/web/serve.h:4,
                 from /home/server/search/git_sist2/src/web/serve.c:1:
/home/server/search/git_sist2/src/web/serve.c:420:84: error: ‘struct mg_str’ has no member named ‘ptr’
  420 |         LOG_DEBUGF("serve.c", "Invalid file path: %.*s", (int) hm->uri.len, hm->uri.ptr);
      |                                                                                    ^
/home/server/search/git_sist2/src/log.h:16:72: note: in definition of macro ‘LOG_DEBUGF’
   16 |     if (LogCtx.very_verbose) {sist_logf(filepath, LOG_SIST_DEBUG, fmt, __VA_ARGS__);}}while(0)
      |                                                                        ^~~~~~~~~~~
/home/server/search/git_sist2/src/web/serve.c: In function ‘tag’:
/home/server/search/git_sist2/src/web/serve.c:531:54: error: ‘struct mg_str’ has no member named ‘ptr’
  531 |     if (hm->uri.len != 22 || !parse_sid(&sid, hm->uri.ptr + 5)) {
      |                                                      ^
In file included from /home/server/search/git_sist2/src/sist.h:33,
                 from /home/server/search/git_sist2/src/web/serve.h:4,
                 from /home/server/search/git_sist2/src/web/serve.c:1:
/home/server/search/git_sist2/src/web/serve.c:532:83: error: ‘struct mg_str’ has no member named ‘ptr’
  532 |         LOG_DEBUGF("serve.c", "Invalid tag path: %.*s", (int) hm->uri.len, hm->uri.ptr);
      |                                                                                   ^
/home/server/search/git_sist2/src/log.h:16:72: note: in definition of macro ‘LOG_DEBUGF’
   16 |     if (LogCtx.very_verbose) {sist_logf(filepath, LOG_SIST_DEBUG, fmt, __VA_ARGS__);}}while(0)
      |                                                                        ^~~~~~~~~~~
/home/server/search/git_sist2/src/web/serve.c:538:26: error: ‘struct mg_str’ has no member named ‘ptr’
  538 |     memcpy(body, hm->body.ptr, hm->body.len);
      |                          ^
/home/server/search/git_sist2/src/web/serve.c: In function ‘check_auth0’:
/home/server/search/git_sist2/src/web/serve.c:615:29: error: ‘struct mg_str’ has no member named ‘ptr’
  615 |     strncpy(token_str, token.ptr, token.len);
      |                             ^
/home/server/search/git_sist2/src/web/serve.c: In function ‘ev_router’:
/home/server/search/git_sist2/src/web/serve.c:645:28: error: ‘struct mg_str’ has no member named ‘ptr’
  645 |         memcpy(uri, hm->uri.ptr, hm->uri.len);
      |                            ^
make[2]: *** [CMakeFiles/sist2.dir/build.make:216: CMakeFiles/sist2.dir/src/web/serve.c.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:207: CMakeFiles/sist2.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

I'll just use the released binaries (which would be v3.3.6).

MparkG commented 2 months ago

Well then, I added 180 GB of swap so sist2 can load the whole DB into memory. Now it works. After the completed scan, it shrinks the 160 GB down to 10 GB and then saves it from memory to disk. It would make more sense to perform whatever these actions are solely on disk, with only a fraction of the DB placed in memory. Loading the whole database into memory is insane.

After running index to send the database to Elasticsearch on the same configuration (because it, too, loads the 10 GB database into RAM!), it all seems to work now, without the 180 GB of swap, on a small server.
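
As an illustration of the "only a fraction of the DB in memory" idea raised above, here is a minimal sketch that streams rows from an on-disk SQLite file in fixed-size batches. It is not how sist2 works internally; it assumes an SQLite-style index, and the table `document`, its columns, and the function name are all invented for the example.

```python
import os
import sqlite3
import tempfile

BATCH_SIZE = 1000  # maximum number of rows held in memory at once


def total_size_in_batches(db_path, batch_size=BATCH_SIZE):
    """Sum the hypothetical 'size' column while holding one batch at a time."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("SELECT size FROM document")
        total = 0
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break  # cursor exhausted
            total += sum(size for (size,) in rows)
        return total
    finally:
        conn.close()


# Build a small throwaway database on disk to exercise the function.
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "demo.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE document (path TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO document VALUES (?, ?)",
    ((f"/folder/file{i}.html", i) for i in range(5000)),
)
conn.commit()
conn.close()

total = total_size_in_batches(db_path)
print(total)
```

Memory use in the loop is bounded by `BATCH_SIZE` rather than by the size of the table, which is the property asked for above. Whether the post-scan steps in sist2 could be restructured this way is a question for the maintainer.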