simon987 / sist2

Lightning-fast file system indexer and search tool
GNU General Public License v3.0

Decouple scanning and web service #421

Closed (caesay closed this issue 1 year ago)

caesay commented 1 year ago

Which SIST2 component is your Feature Request related to?

All

Is your feature request related to a problem? Please describe.

At the moment, I run scan and index on "server1", and elasticsearch is on "server2". This is no problem: server1 can index the files into elasticsearch on another server. Using kibana, I can see that all the file contents, metadata, and thumbnails have been uploaded to elasticsearch on server2. I would also like to launch sist2 web on server2 and connect it to the same elasticsearch, but it seems the web server will not start without a path to a local myfile.sist2.

I don't understand why the web server would need the path to the sist2 file if all of the files have already been indexed in elasticsearch. Am I missing something vital about how this works? As it stands, I'll now have to copy the sist2 files from all of my remote servers to my web server.

What would you like to see happen?

The sist2 web command could be run without the path to the sist2 file, and just use what is available in elasticsearch.

dpieski commented 1 year ago

Are you running from the Docker container?

From my understanding: Sist2-Admin ensures that the indexes are successful before allowing the frontend to use a particular index. To do that, sist2-admin keeps a database with the settings and statuses for the frontend and the information linking it to the backend. So, when setting up the frontend, sist2-admin knows about the sist2 indexes from the jobs stored in that db, and it can determine whether a particular frontend, connected to a particular ES Server, can add particular Jobs (sist2 indexes) to search.

So, Sist2-admin on the ES Server would not know about the jobs, since its db wouldn't have that information, and it would not be able to determine on which backend certain jobs are located; only the Scan server would.

If you want the search frontend to be running on the ES server, you should be able to update the docker-compose.yml file on the ES Server and change the command for sist2 based on the parameters in USAGE. So it would be something along the lines of:

    - /root/sist2 web --es-url=[ES_URL] --es-index=[ES_INDEX] --bind=0.0.0.0:80

That would keep the ES Server from running sist2-admin and should just execute the web portion of sist2.

caesay commented 1 year ago

Yes I am using docker, currently x64-linux since it's the latest tag. No, I am not using sist2-admin. I tried running that command you suggested, and it seems to have failed because the web command is asking for a .sist2 file to be provided on the command line.

I wish to run the standalone commands scan and index on remote file servers, and run web on the same server on which elasticsearch is running (i.e. not the server the scan/index was run on). As far as I can tell, the --es-index parameter refers to an elasticsearch index (always sist2 by default), but the "sist2" index (not the elasticsearch index) is actually retrieved from the .sist2 sqlite database and refers to a separate elasticsearch property. It's a bit confusing. In elasticsearch, the documents seem to be stored with _index=sist2 and index={sist2-scan-id}.
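To make the distinction concrete, here is a minimal sketch of a search request body that restricts results to one scan by filtering on the document-level index property, as opposed to the elasticsearch index named in the URL. It only builds the JSON body and does not contact a server. The function name, the content field name, and the assumption that the index property can be matched with a term filter are illustrative guesses, not something confirmed by sist2's source or documentation.

```python
import json


def build_scan_filter_query(scan_id: str, text: str) -> dict:
    """Build a search body limited to documents from one sist2 scan.

    Assumptions (not confirmed by the thread): the full-text field is
    called "content", and the per-document "index" property holds the
    scan id and can be matched exactly with a term filter.
    """
    return {
        "query": {
            "bool": {
                # Full-text match on the (assumed) content field.
                "must": [{"match": {"content": text}}],
                # Exact match on the document-level "index" property,
                # which is distinct from the elasticsearch index name.
                "filter": [{"term": {"index": scan_id}}],
            }
        }
    }


if __name__ == "__main__":
    # "<scan-id>" is a placeholder for the id of one scan.
    body = build_scan_filter_query("<scan-id>", "invoice")
    # This body would be sent to the _search endpoint of the "sist2"
    # elasticsearch index (the _index value observed above).
    print(json.dumps(body, indent=2))
```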

Specifically, I believe that the sist2 web command loads the .sist2 sqlite database to read the index id from the descriptor table.

And that this 'id' (in this case, 1694422458) is then used to filter elasticsearch results using the index property (but not to be confused with the elasticsearch index which is sist2). What I don't understand is why this local .sist2 sqlite database is even needed at all to run the web command. Surely, the descriptor table from this database could have been easily uploaded into elasticsearch, and all the other data then seems to be present?
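The lookup described above can be sketched with Python's standard sqlite3 module. This is only an illustration of the idea: the table name descriptor and the column name id come from the description in this comment, while the rest of the schema is assumed, and make_demo_db is a hypothetical helper that builds a stand-in file rather than a real .sist2 database.

```python
import os
import sqlite3
import tempfile


def read_index_id(db_path: str) -> str:
    """Return the index id stored in the descriptor table.

    Assumes a table named "descriptor" with an "id" column, as described
    in the thread; the real .sist2 schema may differ.
    """
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute("SELECT id FROM descriptor LIMIT 1").fetchone()
    finally:
        conn.close()
    if row is None:
        raise ValueError("descriptor table is empty")
    return str(row[0])


def make_demo_db(db_path: str, index_id: str) -> None:
    """Hypothetical helper: create a stand-in database for the demo."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE descriptor (id TEXT)")
        conn.execute("INSERT INTO descriptor (id) VALUES (?)", (index_id,))
        conn.commit()
    finally:
        conn.close()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.sist2")
    make_demo_db(demo_path, "1694422458")
    print(read_index_id(demo_path))
```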

So this issue is asking for that to be implemented, or if this is already possible, for the documentation to be updated. Thank you for your consideration.

simon987 commented 1 year ago

Hi @caesay the web service needs the .sist2 database for:

So unfortunately it would be very difficult to run the web service without the .sist2 file. If you are running the frontend on a separate server you will unfortunately need to sync the index files over after every update for now.

caesay commented 1 year ago

@simon987 Out of all the things you mentioned, the only one that I care about is thumbnails, and even those I could do without in a pinch. But they could also be uploaded to ES as base64 if they are not already. Being able to search the file contents in archives of random remote servers (which are all in elasticsearch) is really a killer feature, but at the moment we're limited to only searching on the server which was scanned itself. I suppose adding other (non-sqlite) backends could also work, like PostgreSQL. If the results of a scan and index were stored on a remote PostgreSQL server instead of in a local sqlite file, this may also work.

simon987 commented 1 year ago

Sorry, I'm not planning to add this feature for now.

"we're limited to only searching on the server which was scanned itself"

For now you can rsync the .sist2 index to the search server; I do it all the time.
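As a rough sketch of that workaround, the snippet below assembles an rsync command that copies a .sist2 index to the search server. The host name, paths, and function name are placeholders invented for illustration, and the flag choice (-a for archive mode, -z for compression, --partial to keep partially transferred files) is one reasonable option rather than the maintainer's actual setup.

```python
import shlex


def build_rsync_command(index_path: str, remote_host: str, remote_dir: str) -> list:
    """Build an rsync argv that copies a .sist2 index to another server.

    All argument values are supplied by the caller; nothing here is
    specific to sist2 beyond the file being synced.
    """
    return [
        "rsync",
        "-az",        # archive mode plus compression in transit
        "--partial",  # keep partial files so interrupted copies can resume
        index_path,
        f"{remote_host}:{remote_dir}",
    ]


if __name__ == "__main__":
    # Placeholder values; replace with the real index path and server.
    cmd = build_rsync_command("/data/myfile.sist2", "server2", "/srv/sist2/")
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

The command would need to be re-run after every scan so that the copy on the search server stays in step with what was indexed.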