skohub-io / skohub-vocabs

A lightweight tool to publish SKOS Vocabularies
https://skohub.io/
Apache License 2.0

Use "full" tokenizer for FlexSearch indexing #153

Closed acka47 closed 2 years ago

acka47 commented 3 years ago

From https://github.com/skohub-io/skohub-vocabs/pull/152#issuecomment-913483180:

(However, I noticed that results will only show up if a string matches the beginning of a word, e.g. searching for "spiel" won't give me "Lernspiel" back. That might be a limitation of FlexSearch or – if FlexSearch can handle this – we should create another issue to address this shortcoming.)

I think this might be possible. We might have to change the tokenizer: https://github.com/nextapps-de/flexsearch/tree/0.6.32#tokenizer
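
For illustration, here is a minimal sketch against FlexSearch 0.6.x (the version linked above) showing why a prefix-based tokenizer misses infix matches; the sample label is made up:

```js
const FlexSearch = require("flexsearch")

// "forward" only indexes word prefixes ("l", "le", ..., "lernspiel"),
// so an infix query like "spiel" finds nothing.
const index = new FlexSearch({ encode: "icase", tokenize: "forward" })
index.add(1, "Lernspiel")

console.log(index.search("lern"))  // [1] – prefix match
console.log(index.search("spiel")) // [] – infix, not indexed

// According to the FlexSearch README, tokenize can also be set to "reverse"
// (prefixes and suffixes) or "full" (every partial substring), at the cost
// of a larger index.
```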

acka47 commented 2 years ago

Actually, I think the "forward" tokenizer must be the one we already use and we should switch to the "reverse" tokenizer. Updating the issue title accordingly.

acka47 commented 2 years ago

@sroertgen @literarymachine If you can point me to the place where the tokenizer is configured, I would try to make a pull request.

sroertgen commented 2 years ago

@acka47 I will have a look. Though I think we need the "full" tokenizer (https://github.com/nextapps-de/flexsearch/tree/0.6.32#tokenizer)

acka47 commented 2 years ago

c588301d9d4c3b03b572cd6369e5664781c517a1 seems to do the trick. Tested it with the example from https://github.com/dini-ag-kim/hochschulfaechersystematik/issues/20 and it looks great:

[screenshot of the search results]

The index for https://github.com/dini-ag-kim/hochschulfaechersystematik is now 3.3 MB.

acka47 commented 2 years ago

Resolved with #180. However, this renders SkoHub Vocabs search unusable for bigger vocabs. See https://test.skohub.io/acka47/testing-skohub-vocabs/heads/master/lod.nal.usda.gov/nalt/14209.en.html which has nearly 28k concepts and currently supports two languages. The index is >80 MB and the search doesn't work seamlessly.

Ideally, people should be able to configure the tokenizer themselves in a conf file... What do you think, @sroertgen?
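
For illustration, a hypothetical sketch of what such an option could look like (the environment variable and the place where it is read are made up; this is not the actual skohub-vocabs configuration):

```js
const FlexSearch = require("flexsearch")

// Hypothetical: let publishers choose the tokenizer per instance,
// e.g. "full" for small vocabularies, "forward" for very large ones
// where a full index would grow too big.
const tokenize = process.env.FLEXSEARCH_TOKENIZE || "full"

const index = new FlexSearch({ encode: "icase", tokenize })
```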

sroertgen commented 2 years ago

Definitely. Do we want to handle this in this issue or do we want to open a generic conf-file issue?

acka47 commented 2 years ago

Do we want to handle this in this issue or do we want to open a generic conf-file issue?

I opened #185 for this.

acka47 commented 2 years ago

However, this renders SkoHub Vocabs search unusable for bigger vocabs. See https://test.skohub.io/acka47/testing-skohub-vocabs/heads/master/lod.nal.usda.gov/nalt/14209.en.html which has nearly 28k concepts and currently supports two languages. The index is >80 MB and the search doesn't work seamlessly.

During a stroll in Köpenick with @literarymachine at the end of May, he pointed out that we could offer the index file as a zipped file that is unzipped in the browser. This would decrease the file/traffic size significantly. @sroertgen, if you have some time, it would be great if you implemented it.
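
Not the actual implementation, just a rough sketch of that idea using the browser's built-in DecompressionStream; the file name and the FlexSearch wiring are assumptions:

```js
import FlexSearch from "flexsearch"

// Fetch a gzipped search index and inflate it client-side, then hand it
// to FlexSearch. Assumes the index was written with FlexSearch's export()
// (0.6.x) and gzipped at build time.
async function loadIndex(url) {
  const response = await fetch(url) // e.g. "search/index.json.gz"
  const inflated = response.body.pipeThrough(new DecompressionStream("gzip"))
  const serialized = await new Response(inflated).text()

  const index = new FlexSearch({ encode: "icase", tokenize: "full" })
  index.import(serialized)
  return index
}
```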

sroertgen commented 2 years ago

Meanwhile, we can additionally make the tokenizer configurable.

@acka47 if you have a look at #185 and are ok with the approach, we can do this quickly I guess

acka47 commented 2 years ago

I just merged #194. Unfortunately, I cannot test it, as it looks like builds are currently not working on test.skohub.io (although already built static files are served without problems):

[screenshot of a 503 error]

I guess we will have to wait until @dr0i is back in August to fix the build server.

dr0i commented 2 years ago

I don't see this 503 here: https://github.com/skohub-io/skohub-vocabs/settings/hooks/316821715?tab=deliveries Maybe GitHub removes failed deliveries from that list after some time? But yeah, the webhook triggering and deploying of dev could not have worked in the first place: the script that the webhook calls lacked a cd $repo where it would fetch and deploy everything. Fixed that. Also, monit had a race condition, as the timeout of 300 seconds is too low (maybe that worked before, but not at the moment; starting skohub-vocabs just takes longer now). Increased the timeout to 900. Should work now, please have a look @acka47.

acka47 commented 2 years ago

I don't see this 503 here: https://github.com/skohub-io/skohub-vocabs/settings/hooks/316821715?tab=deliveries Maybe GitHub removes failed deliveries from that list after some time?

I use this repo for testing: https://github.com/acka47/testing-skohub-vocabs

By now I was able to trigger a build again, but it doesn't seem to finish: https://test.skohub.io/build/?id=e0772272-c6f9-430e-b21a-e0e2f3bbad2e (Note that quite a big vocab with a 15 MB ttl file is included in the repo, which might cause a problem.)

dr0i commented 2 years ago

Ah, OK, you use your own repo, so I cannot see the 503. Looking again at https://test.skohub.io/build/?id=e0772272-c6f9-430e-b21a-e0e2f3bbad2e#53, it seems to finish, it just takes more than an hour. Is this correct?

sroertgen commented 2 years ago

Using your vocab on my test server, the build also crashes with a "JavaScript heap out of memory" error.

EDIT: I set up a very small machine, so that might be the reason if your build finally goes through.

acka47 commented 2 years ago

However, this renders SkoHub Vocabs search unusable for bigger vocabs. See https://test.skohub.io/acka47/testing-skohub-vocabs/heads/master/lod.nal.usda.gov/nalt/14209.en.html which has nearly 28k concepts and currently supports two languages. The index is >80 MB and the search doesn't work seamlessly.

During a stroll in Köpenick with @literarymachine at the end of May, he pointed out that we could offer the index file as a zipped file that is unzipped in the browser. This would decrease the file/traffic size significantly. @sroertgen, if you have some time, it would be great if you implemented it.

The compressed index is now around 5.4 MB, which is great. However, the problem obviously (and not surprisingly) wasn't the download size but something else. Search still doesn't work well because of immense loading times and no visual feedback when typing. Nonetheless, it is an improvement, so +1. (We can finally close this then and focus on configurability and/or on improving performance for bigger vocabs.)