Detect and merge similar tags

shaarli / Shaarli

The personal, minimalist, super-fast, database free, bookmarking service - community repo

https://shaarli.readthedocs.io/

Detect and merge similar tags #1024

Open Phyks opened 6 years ago

Phyks commented 6 years ago

Hi,

Thanks for the work you are putting into maintaining this fork of Shaarli! I have been using it for a couple of years now and have realized that I have been using slightly different tags for the same category over the years.

For instance, sometimes I have "Photo", sometimes I have "Photos", resulting in two different tags with very close spelling.

So, I've just had the idea of adding a new "merge close tags" feature which would try to detect close tags and list them with an option to merge them together. What do you think about it?

For the close tags detection, I think something as simple as stemming might actually be enough.
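As a rough illustration of the stemming idea, here is a minimal sketch in Python. It is not Shaarli code: `naive_stem` and `group_by_stem` are hypothetical helpers, and the "stemmer" only lowercases tags and strips a couple of common English plural endings, which is far cruder than a real stemming algorithm.

```python
from collections import defaultdict


def naive_stem(tag: str) -> str:
    """Lowercase a tag and strip a few common English plural endings.

    Deliberately simplistic: an assumption for illustration, not a real stemmer.
    """
    stem = tag.lower()
    if stem.endswith("ies") and len(stem) > 4:
        return stem[:-3] + "y"  # "categories" -> "category"
    if stem.endswith("s") and not stem.endswith("ss") and len(stem) > 3:
        return stem[:-1]  # "photos" -> "photo"
    return stem


def group_by_stem(tags):
    """Return groups of tags that share a stem (merge candidates)."""
    groups = defaultdict(list)
    for tag in tags:
        groups[naive_stem(tag)].append(tag)
    # Only groups with more than one tag are worth showing to the user.
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(group_by_stem(["Photo", "Photos", "linux", "Categories", "category"]))
```

Each returned group could then be listed with a "merge" option, as proposed above.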

nodiscc commented 6 years ago

Something similar was suggested in ~~#96~~ (equivalency terms). I'm not sure this falls within the scope of Shaarli - but maybe there's a KISS way to implement it. Edit: sorry #968

In the meantime (and because I often have the same problem with singular/plural tags), you can go to Tag cloud > Alphabetical or Most used, which helps with:

- finding similar tags (for example, Photo and Photos will be next to each other)
- finding typos or rarely used tags (for example, a gmaes tag with 1 item...)
- renaming and deleting tags quickly

Hope this helps, let me know if it should be better documented

Phyks commented 6 years ago

Oh, indeed, thanks for the tip!

Sorry, I completely missed #96 when searching for similar issues :/

virtualtam commented 6 years ago

This could be achieved with language-specific dictionaries plus stemming, and/or a Natural Language Processing approach using a lexical database.

I've worked with WordNet and Python libraries like NLTK and spaCy to address similar needs, but I'm not sure there are such tools available for (easy) integration in a PHP application.

The most straightforward approach might be to implement this (as a first step) as a command-line utility in python-shaarli-client.

ArthurHoaro commented 6 years ago

@nodiscc I think you made a mistake, #96 doesn't seem to be related.

@virtualtam Without resorting to language-specific lexical databases, PHP provides built-in functions to calculate the similarity or distance between strings, such as similar_text and levenshtein. We could easily write a very basic function to detect similar strings.

However, I'm not sure how it should work in the UI. Maybe another block on the ?do=changetag page?
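To make the string-distance idea concrete, here is a minimal sketch written in Python rather than PHP. It implements the classic Levenshtein edit distance by hand (the same metric PHP's levenshtein computes) and flags tag pairs under a threshold. `similar_tag_pairs` and its `max_distance` parameter are hypothetical names chosen for illustration, not part of Shaarli.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similar_tag_pairs(tags, max_distance=1):
    """Return pairs of distinct tags whose case-insensitive distance is small."""
    pairs = []
    for first, second in combinations(sorted(set(tags)), 2):
        if levenshtein(first.lower(), second.lower()) <= max_distance:
            pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    tags = ["Photo", "Photos", "games", "gmaes", "linux"]
    print(similar_tag_pairs(tags))                  # singular/plural pairs
    print(similar_tag_pairs(tags, max_distance=2))  # also catches swapped letters
```

Comparing every pair is quadratic in the number of tags, which is probably acceptable for a personal tag list; the threshold would likely need tuning, since a distance of 1 on very short tags produces false positives.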

ArthurHoaro commented 5 years ago

This 3rd party API seems to be pretty straightforward for synonyms: https://www.datamuse.com/api/

e.g. https://api.datamuse.com/words?rel_syn=love
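A minimal Python sketch of how such a lookup might be wired up, using only the standard library. The endpoint and the `rel_syn` parameter are taken from the example URL above; the helper names are hypothetical, and the response shape (a JSON list of objects with a `word` key) is an assumption that should be checked against the Datamuse documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL in this thread.
DATAMUSE_WORDS_URL = "https://api.datamuse.com/words"


def synonym_query_url(word: str) -> str:
    """Build the synonym query URL for a single word."""
    return DATAMUSE_WORDS_URL + "?" + urlencode({"rel_syn": word})


def extract_words(payload) -> list:
    """Pull the words out of a decoded response.

    Assumes the payload is a list of objects carrying a "word" key.
    """
    return [entry["word"] for entry in payload if "word" in entry]


def fetch_synonyms(word: str, timeout: float = 10.0) -> list:
    """Query the API over the network and return the suggested synonyms."""
    with urlopen(synonym_query_url(word), timeout=timeout) as response:
        return extract_words(json.load(response))


if __name__ == "__main__":
    print(synonym_query_url("love"))
```

Relying on a third-party service means every lookup leaves the Shaarli instance, so results would probably need caching and a graceful fallback when the API is unreachable.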

nodiscc commented 5 years ago

Moved from #1310

When adding a new link and assigning tags to it, I often wonder whether I already have a different tag conveying the same idea. And I'm always bothered by the fact that I could create several different tags for the same purpose.

Could we imagine a way to define synonyms for tags, so that when we type a synonym in the tag field of the add link or edit link page, the corresponding existing tag appears and I can choose it instead?

Another way would be to have a mapping engine that uses an external source to display those suggestions automatically, based on the meaning of words. But I guess we're talking about a much bigger effort in that case.
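One way to picture the user-defined synonyms idea is a plain alias-to-tag mapping consulted while typing. The Python sketch below is purely illustrative: `suggest_tags`, the prefix matching and the dictionary format are all assumptions, not an existing Shaarli feature.

```python
def suggest_tags(typed, existing_tags, synonyms):
    """Suggest existing tags for a partially typed tag.

    `synonyms` maps an alias (e.g. "picture") to the existing tag it stands for
    (e.g. "photo"). Matching is a case-insensitive prefix match.
    """
    needle = typed.strip().lower()
    if not needle:
        return []
    # Direct matches on existing tags come first.
    suggestions = [tag for tag in existing_tags if tag.lower().startswith(needle)]
    # Then tags reached through one of their aliases.
    for alias, canonical in synonyms.items():
        if (
            alias.lower().startswith(needle)
            and canonical in existing_tags
            and canonical not in suggestions
        ):
            suggestions.append(canonical)
    return suggestions


if __name__ == "__main__":
    tags = ["photo", "linux", "programming"]
    aliases = {"picture": "photo", "coding": "programming"}
    print(suggest_tags("pic", tags, aliases))
    print(suggest_tags("cod", tags, aliases))
```

The mapping itself would have to be maintained by the user, which is the maintenance cost raised elsewhere in this thread.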

I don't know. I understand the use case, but maintaining a synonyms database within Shaarli might be a bit overkill.