paolobenve / myphotoshare

MOVED TO GITLAB! --- A Web 2.0 Photo Gallery Done Right via Static JSON, Dynamic Javascript and a bit of php for sharing

managing stop words in js? #90

Closed paolobenve closed 6 years ago

paolobenve commented 6 years ago

I see that the scanner uses stop words when preparing the word lists for media and albums. Only the stop words for one language are used, so if a media item or album has a name or description in another language, the result is undefined.

What about dropping stop words from the Python scanner and handling them in JS instead? An option to disable them could then be implemented easily.

pmetras commented 6 years ago

Yes. My assumption was that media file names were in the language defined in the MyPhotoShare config file. If that's not the case, results are undefined, since a stopword in one language can be meaningful in another...

I did it in the scanner because it doesn't depend on user input. Stopwords are, by definition, non-meaningful words, so there's no benefit in creating indexes for them. Filtering them out in the scanner phase also reduces the size on disk.

Disabling them is more of a workaround for supporting multilingual content. Let's find a real multilingual solution that works without requiring tweaks from the user browsing the gallery. If my old mother is searching for a picture, she won't understand that she has to disable an option in a menu to find it. So the decision whether a media item is findable has to be made by the content owner, not the end user. And it must be available to all users, keeping MyPhotoShare simple to use.

One possible way is to specify the language at the album or media level. If we define a language value in the album.ini file, the scanner can switch the language it uses. For instance:

[DEFAULT]
# If not specified, all media in this album is in Italian.
language=it

[album]
# Exception for this album name that uses French words
language=fr

[On the beach.jpg]
# Another exception. This picture is in English.
# As "on" and "the" are stopwords in English, only "beach" will be indexed.
language=en

[Je bois le thé.jpg]
language=multi
description=I'm drinking tea with my wife in a coffee.

If, in that example, MyPhotoShare were configured for Spanish, this particular album could still have content in Italian, French and English, and a correct index would be built for the media in each of these languages. The JavaScript application does not have to understand which language is used: it only checks whether an index exists for the word entered by the user. If it does, whether the word is a stopword or not doesn't matter, and results can be displayed.
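As a rough sketch of how the scanner could resolve the language for each entry, here is the example above read with Python's standard configparser. The helper name language_for, the extra tramonto.jpg entry and the "es" fallback are made up for illustration; this is not the actual scanner code.

```python
import configparser

# The album.ini example from above, plus one hypothetical entry
# (tramonto.jpg) that sets no language of its own.
ALBUM_INI = """\
[DEFAULT]
language=it

[album]
language=fr

[On the beach.jpg]
language=en

[Je bois le thé.jpg]
language=multi
description=I'm drinking tea with my wife in a coffee.

[tramonto.jpg]
description=Tramonto sul mare
"""


def language_for(parser, section, fallback):
    """Language for a section, else the [DEFAULT] value, else the fallback."""
    if parser.has_section(section):
        # Sections inherit [DEFAULT] values automatically in configparser.
        return parser.get(section, "language", fallback=fallback)
    # Media not listed in album.ini: use [DEFAULT], then the global setting.
    return parser.defaults().get("language", fallback)


parser = configparser.ConfigParser()
parser.read_string(ALBUM_INI)

# "es" stands in for the language from the global MyPhotoShare config.
for name in ("album", "On the beach.jpg", "Je bois le thé.jpg", "tramonto.jpg"):
    print(name, "->", language_for(parser, name, "es"))
```

The global language only comes into play when album.ini has no language in [DEFAULT] either.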

It still works if the user searches with multiple words, even stopwords in various languages, as long as they are correctly indexed...

A problem occurs when a single media item uses multiple languages, like the last one, Je bois le thé.jpg, in the example. The file name and the description use two different languages. In that case, thé in French (tea in English) is treated as the English stopword the. If we still want to make this photo findable, the solution is to disable stopwords for that media item. That's what language=multi does: it means that the metadata uses multiple languages and that stopwords must be disabled.

The album.ini file should be seen as a per-album config file that can change the scanner's behaviour. I intended to add a noindex directive to prevent a photo or a whole album from being indexed by the scanner and copied to the cache, but I haven't had time to work on it. It could also be used to specify parameters for OpenCV, or to control thumbnail creation with something like thumbnail=crop or thumbnail=center...

The scanner code has to be adapted, as it currently caches only the stopwords for the default language. It should now cache multiple languages and switch between them based on context. The stopwords JSON file has data for 50+ languages.
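To make the thé/the collision concrete, here is a minimal sketch of the word filtering. It assumes that words are lowercased and stripped of accents before being compared with the stopword list, which is what makes thé collide with the. The tiny STOPWORDS sets and the function names are invented for the example, and the real scanner may normalize differently.

```python
import re
import unicodedata

# Tiny illustrative stopword sets (assumption: the real lists are far longer).
STOPWORDS = {
    "en": {"on", "the"},
    "fr": {"je", "le"},
}


def normalize(text):
    """Lowercase and strip accents, so that 'thé' becomes 'the'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def index_words(text, language):
    """Words to index for a text; language 'multi' disables stopword filtering."""
    words = re.findall(r"\w+", normalize(text))
    if language == "multi":
        return words
    stop = STOPWORDS.get(language, set())
    return [w for w in words if w not in stop]


print(index_words("On the beach", "en"))       # ['beach']
print(index_words("Je bois le thé", "en"))     # ['je', 'bois', 'le']
print(index_words("Je bois le thé", "multi"))  # ['je', 'bois', 'le', 'the']
```

With English stopwords the word for tea is dropped from the index; with language=multi every word is kept, at the cost of also indexing je and le.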
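One way the per-language caching could look, as a sketch only. It assumes the stopwords JSON maps language codes to word lists; the inline STOPWORDS_JSON string stands in for the real file, and stopwords_for is a hypothetical name.

```python
import json
from functools import lru_cache

# Stand-in for the stopwords JSON file (assumed layout: code -> list of words).
STOPWORDS_JSON = '{"en": ["on", "the"], "fr": ["je", "le"], "it": ["il", "la"]}'


@lru_cache(maxsize=1)
def _all_stopwords():
    """Parse the JSON data once, on first use."""
    return json.loads(STOPWORDS_JSON)


@lru_cache(maxsize=None)
def stopwords_for(language):
    """Cached stopword set for a language; empty for 'multi' or unknown codes."""
    if language == "multi":
        return frozenset()
    return frozenset(_all_stopwords().get(language, []))


# The scanner can now switch languages per album or per media item.
for lang in ("it", "fr", "en", "multi"):
    print(lang, sorted(stopwords_for(lang)))
```

Each language's set is built the first time it is requested and reused afterwards, so switching languages between media items stays cheap.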

Does it make sense?

paolobenve commented 6 years ago

Your analysis makes sense; the language=(xx|multi) lines in album.ini seem a good solution for multilingual albums.

I'd still consider adding an option to disable the stop words check; it may be a good solution for small album trees.

The noindex functionality is already there, see the exclude_(files|tree)_marker options. It seems an easy task to modify the checks on those files so that they also take into account noindex directives in album.ini files.