Refactor connector - Githubissues

jvillafanez commented 1 year ago

New things coming in this PR:

Possible usage of multiple connectors when indexing data. This is intended to help with data migration from one index to another without downtime for the users
New connectors:
- "Legacy" -> default one, matching the current behavior
- "RelevanceV2" -> returns results based on the relevance of the document (to be detailed below)
Old commands are expected to affect to all the "write" connectors configured.
New "fillSecondary" command to fill the secondary index (the one we want to migrate to)

About the new "RelevanceV2" connector:

It will return results based on the score provided by elasticsearch:
- Matching the whole word in the filename will score the whole points.
- ~Matching the starts of the word in the filename will score half of the points~ (No longer applies)
- Matching a part of the word in the filename will score a half of the points
- If we're searching in the file content, we only match the whole word for the whole points.
The modification time will affect the end score of the document the following way.
- Last modification older than 1 year -> score * 0.25
- Last modification older than 1 month -> score * 0.5
- Last modification older than 1 week -> score
- Last modification older than 1 day -> score * 2
- Modified in the last 24 hours -> score * 4

Note that, while the modification time affects the scoring, it doesn't mean that recent files will always be the first ones to appear. Old files might still have a higher score even after those boosts.

The "RelevanceV2" connector also introduces new ways to search for files based on the indexed fields. Note that the following info only applies to the "RelevanceV2" connector.

By default, the "RelevanceV2" connector will search in the name field, and in the file content if possible. Old limitations for the app to index the file content are still in place. Also note that, due to those limitations, big files might not get its contents indexed.

Additional searches you can do with the "RelevanceV2" connector:

Search by extension -> ext:pdf , ext:docx , ext:gif , ext:mp4 , ext:tar.gz , ext:gz , etc, any extension is possible
Search by size, only in bytes or megabytes:
- Search by byte size -> size.b:<8092 , size.b:>102400, size.b:[8092 TO 16184]
- Search by megabyte size -> size.mb:<3, size.mb:>9, size.mb:[3 TO 9]
Search by type: only "file" or "folder" -> type:file , type:folder
Search by date:
- Search by timestamp -> mtime:<1678960862 , mtime:>1678960862 , mtime:[1608111372 TO 1678960862]
- Search by date -> mtime:<2021-08-25 , mtime:>2023-01-18, mtime:[2022-01-01 TO 2022-12-31]
Search by mimetype -> mime:image , mime:gif , mime:text NOTE: To search for the whole mimetype such as "image/gif" use mime.key:image\/gif

~By default, each search term will be joined with an "OR" operator. For example brown ext:pdf will be interpreted as "name or content containing brown OR extension = pdf", so "brown.txt" file and "tito.pdf" will appear in the results. You can use brown AND ext:pdf to match pdf files containing brown in the name or contents.~ (It doesn't works like this any longer) Each search term will narrow the search. For example brown ext:pdf will be interpreted as "name or content containing brown AND extension = pdf", so "brown.pdf" and "a brown paper.pdf" will appear, but not "brown.txt" or "tito.pdf"

Some example of complex searches:

Files containing "confidential" updated since 2023 whose size is less than 10MB: confidential mtime:>2023-01-01 size.mb:<10
Folders containing more than 1GB: type:folder size.mb:>1024
Images between March and June 2020: mime:image mtime:[2020-03-01 TO 2020-06-30]
~pdf or txt files containing "oxygen" or "helium": (oxygen OR helium) AND (ext:pdf OR ext:txt)~ (no lnoger applies)

Note that matching by name is pretty lax, so expect a bunch of unexpected results. Anyway, good results are expected to be on top.

Migrating to the "RelevanceV2" connector:

If you haven't indexed anything yet, you're encouraged to setup the connectors you want to use as part of the app configuration. The recommended one is "RelevanceV2" for write and search.

If you have indexed data, these are the steps to migrate to the new index.

Let's assume you have the "Legacy" connector setup for write and search.
Add the "RelevanceV2" connector to the list of write connectors. The list should have both "Legacy" and "RelevanceV2"
Run the occ search:index:fillSecondary RelevanceV2 <user> command. The command needs to be run for all the users (or at least the ones using the search feature), and it's expected to take a lot of time.
Once the indexed data have been migrated for all the users, you can switch the search connector to use the "RelevanceV2" one. From this point you can use the new features coming from the new connector.
After checking everything is good, you can remove the old "Legacy" connector from the list of write connectors.
Then you can completely remove the index from elasticsearch.

With step 2 you'll be writing in both indexes at the same time. This is expected to be slower. Note that step 2 just takes care of new files. Files indexed previously won't be present in the new index. This is why step 3 is there. Step 4 is important and you should stop at that point for a while. If something goes wrong, you can still revert things, in particular, you can switch back to the "Legacy" connector. From step 5 the actions are irreversible. If you want to go back, you'll have to start a new migration.

It's important to notice there isn't any expected downtime while the migration happens. Until step 4, the "Legacy" connector will keep updating the index normally. When the switch happens in the search connector, the new "RelevanceV2" connector will access to the new index, which should have been fully updated.

NOTE: This might not be the final version. This is mostly the state of the PR and things might change without notice. The official documentation is expected to contain the final information once this PR is completely finished. @mmattel FYI this will need documentation.

mmattel commented 1 year ago

@jvillafanez great 👍 pls ping me again when close to merge so I can prepare the docs part.

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
134 Code Smells

0.0% Coverage
5.5% Duplication

JRundfeldt commented 1 year ago

these features seem to be too good to not merge them... @hodyroff I was just cleaning up and updating projects, when I found this change, which we apparently never merged. I do think our customers would benefit - maybe this is something for the 10.13?

mmattel commented 1 year ago

@jvillafanez occ search:index:fillSecondary RelevanceV2 <user> with this PR, are there also core change(s) needed?

jvillafanez commented 1 year ago

No core changes are needed. Everything is part of the app.

mmattel commented 1 year ago

No core changes are needed. Everything is part of the app.

In this case, just to be noted, an app upgrade could be made to be regulary downloaded from the marketplace making the change availabe asap plus adding it to the default for the 10.13 release. There is no doc restriction going that path.

cdamken commented 1 year ago

@jnweiger could we do some QA here?

jnweiger commented 1 year ago

could we do some QA here?

@jnweiger could we do some QA here?

Please review and merge into release-2.4.0 branch, I''ll build a 2.4.0-rc.1 from there and then start QA.

jvillafanez commented 1 year ago

I don't see a release-2.4.0 branch, so I guess it will be created after this PR is merged.

jnweiger commented 1 year ago

It wasn't pushed. Sorry. Pushed now, and retargetted this PR.

jnweiger commented 1 year ago

For testing: search_elastic-2.3.0+refactor_connector.tar.gz

jnweiger commented 1 year ago

Screenshot for @mmattel Admin -> Settings -> Search grafik

jvillafanez commented 1 year ago

Latest commit has changed the behavior of the search. The initial post has been updated to reflect those changes.

jvillafanez commented 1 year ago

I'd suggest to make all new keywords upper case. SIZE, EXT, MTIME, TYPE, MIME

This will need some voting... I'd rather keep them lowercase :smile:

size.b and size.mb is supported. size.kb is missing.

We'd need to send that info because as far as I know, elasticsearch doesn't make any calculations (or it's complex to setup). As far as elasticsearch is concerned, you can send size.b = 50 and size.mb = 70 and those values will be stored so that's a potential problem we have. That's why I'd rather store the minimum information possible. Furthermore, the size.gb is also missing for the same reason.

Migration step 6 is unclear: "Then you can completely remove the index from elasticsearch." -> search:index:reset ? or some 'drop table' SQL in the elastic server?

It's mainly for elasticsearch maintenance, to remove data that won't be used any longer. It's possible to skip that step without any problem on our side, but you have to live with junk data in elasticsearch.

jnweiger commented 1 year ago

Migration step 6 is unclear: "Then you can completely remove the index from elasticsearch." -> search:index:reset ? or some 'drop table' SQL in the elastic server?

It's mainly for elasticsearch maintenance, to remove data that won't be used any longer. It's possible to skip that step without any problem on our side, but you have to live with junk data in elasticsearch.

Understood. Instructions, how to exactly remove this junk data is missing.

jnweiger commented 1 year ago

Most of the problems are fixed in the current state of #319 but I believe there is one regression now https://github.com/owncloud/search_elastic/issues/331#issuecomment-1654061258

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
19 Code Smells

0.0% Coverage
5.5% Duplication

owncloud / search_elastic

Refactor connector #319