kartevonmorgen / openfairdb

Open Fair DB is the Creative Commons backend of Kartevonmorgen.org
http://www.openfairdb.org
GNU Affero General Public License v3.0

Clean-up invalid/ unused/ wrong tags #175

Closed uklotzde closed 4 years ago

uklotzde commented 5 years ago

Unused tags

Unused tags can be deleted periodically.

Unused tags: select * from tags where id not in (select tag_id from entry_tag_relations)

Invalid tags

Invalid tags should be replaced by their valid counterpart.

Leading space: select * from tags where id like ' %'

Trailing space: select * from tags where id like '% '

Leading hash symbol: select * from tags where id like '#%'

Contains any hash symbol: select * from tags where id like '%#%'
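
The four checks listed above (leading space, trailing space, leading hash, any hash) can also be applied to a single tag string, for example before it is written to the database. The following is only a minimal sketch in Rust; `TagIssue` and `tag_issues` are hypothetical names chosen for illustration and are not part of openfairdb.

```rust
/// Kinds of problems the cleanup queries look for.
/// (Hypothetical type, not taken from the openfairdb code base.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TagIssue {
    LeadingSpace,
    TrailingSpace,
    LeadingHash,
    ContainsHash,
}

/// Returns every issue found in `tag`, always in the order of the enum above.
/// A tag with a leading '#' is reported as both `LeadingHash` and `ContainsHash`,
/// just as it would match both SQL patterns.
pub fn tag_issues(tag: &str) -> Vec<TagIssue> {
    let mut issues = Vec::new();
    if tag.starts_with(' ') {
        issues.push(TagIssue::LeadingSpace);
    }
    if tag.ends_with(' ') {
        issues.push(TagIssue::TrailingSpace);
    }
    if tag.starts_with('#') {
        issues.push(TagIssue::LeadingHash);
    }
    if tag.contains('#') {
        issues.push(TagIssue::ContainsHash);
    }
    issues
}
```

A tag for which `tag_issues` returns an empty list passes all four checks; it may still be invalid for other reasons that these queries do not cover.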

Related to #188, which also involves a database cleanup task.

navid-kalaei commented 4 years ago

Tags should go through a normalization pipeline before being stored in the database, as follows:

  1. convert to lowercase (an alternative would be to use methods that detect word similarity, in order to keep the original meaning of the tags)
  2. split on the '#' sign
  3. strip whitespace from each tag
  4. remove empty strings from the array of tags

Finally, store each tag of the array separately.
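
The four steps above can be sketched as a single function. This is a minimal illustration in Rust, not the actual openfairdb implementation; the function name `normalize_tags` is an assumption, and the word-similarity alternative from step 1 is not covered.

```rust
/// Turns raw user input such as "#Bio # Vegan" into a list of clean tags.
/// (Hypothetical helper; the name and signature are assumptions.)
pub fn normalize_tags(raw: &str) -> Vec<String> {
    // Step 1: convert to lowercase.
    let lowered = raw.to_lowercase();
    lowered
        // Step 2: split on the '#' sign.
        .split('#')
        // Step 3: strip surrounding whitespace from each piece.
        .map(str::trim)
        // Step 4: drop empty pieces (e.g. from "##" or a leading '#').
        .filter(|tag| !tag.is_empty())
        .map(str::to_owned)
        .collect()
}
```

The returned vector holds one element per tag, so each element can be stored as a separate row. Duplicates are not removed here, because the list above does not ask for that.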

uklotzde commented 4 years ago

Welcome @navid-kalaei

Most of the improvements you proposed should already be in place, except conversion to lowercase.

uklotzde commented 4 years ago

The server-side conversion to lowercase is fixed in v0.8.11. I wasn't aware that this was still missing.

navid-kalaei commented 4 years ago

Thank you for your warm welcome, @uklotzde. I can positively confirm that most of the suggestions are already implemented; however, there still exist some tags like:

Perhaps that's because of previous versions of the backend.

Also, I made a playground worth checking: if a string contains a '#' sign, the whole string would be a single tag.

wellemut commented 4 years ago

Funny thing, @navid-kalaei, that you built it in play.rust-lang.org...

I think we should also do a kind of manual replacement of tags.

foodsharing-.... is such a word, where you get endless stupid suggestions.

@uklotzde if you could send me a CSV with all tags in our DB, I could write back to you which ones can be deleted and which ones should be replaced by others. You can choose the format in which I should give it back to you.

wellemut commented 4 years ago

inklusiv = inklusion

navid-kalaei commented 4 years ago

@wellemut Yes, I have learned that Rust has a playground similar to Golang's for quickly testing and sharing results. :+1: If there is any other convention or standard for code sharing in openfairdb, it would be a great pleasure for me to use it as my default environment :ok_hand:

flosse commented 4 years ago

> @wellemut Yes, I have learned Rust has a similar playground like Golang to quick test and share results. :+1: If there is any other convenience or standard for code sharing for openfairdb, it would be a massive pleasure for me to have it as my default environment :ok_hand:

Nice! So our blog post series might be interesting for you: https://slowtec.de/posts/2019-12-20-porting-javascript-to-rust-part-1.html

We'll publish the third post within the next few days :)

wellemut commented 4 years ago

@navid-kalaei I just had a call with @flosse and we thought about an easy hack to solve it. Wrong hashtags are mainly annoying in the suggestions list. That's why we can clean it up like this:

  1. Can you figure out how the tag-suggestions list is generated? (It was done by Botho https://github.com/elbotho as a frontend issue: https://github.com/kartevonmorgen/kartevonmorgen/issues/435)
  2. Can you buffer/cache the tag list in the frontend?
  3. Can you filter only for tags that are used more than 3 times?
  4. Can you sort the suggested tags by how often they are used and suggest the most-used tags first?
  5. Can you implement the same suggestion function for the search field? (https://github.com/kartevonmorgen/kartevonmorgen/issues/468)
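
Points 3 and 4 (keep only tags used more than 3 times, most-used first) amount to a filter followed by a sort. The sketch below shows that logic in Rust purely as an illustration; the `TagCount` type and the `suggest_tags` function are hypothetical, and the real frontend may obtain and represent usage counts differently.

```rust
/// A tag together with the number of entries that use it.
/// (Hypothetical shape; not taken from the openfairdb API.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TagCount {
    pub tag: String,
    pub count: u64,
}

impl TagCount {
    pub fn new(tag: &str, count: u64) -> Self {
        TagCount { tag: tag.to_owned(), count }
    }
}

/// Keeps only tags used more than `min_uses` times and orders them by
/// descending usage count. Ties are broken alphabetically so that the
/// result is deterministic.
pub fn suggest_tags(mut tags: Vec<TagCount>, min_uses: u64) -> Vec<TagCount> {
    tags.retain(|t| t.count > min_uses);
    tags.sort_by(|a, b| b.count.cmp(&a.count).then_with(|| a.tag.cmp(&b.tag)));
    tags
}
```

Calling `suggest_tags(tags, 3)` corresponds to the threshold named in point 3. A tag used exactly 3 times is dropped, because the comparison is strictly greater than.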

Thank you very much

flosse commented 4 years ago

off-topic: https://slowtec.de/posts/2020-02-28-porting-javascript-to-rust-part-3.html @navid-kalaei

uklotzde commented 4 years ago

Cleaning up tags manually is pointless and not recommended.

Instead, we need a set of API functions that allow deleting or renaming (which includes merging) tags for all entries, i.e. both places and events. Then the admin UI could be extended with a view for managing tags. This feature is needed even if more sophisticated algorithms are applied to normalize tags when entries are created or updated.

  1. Add or extend operations for retrieving tags and tag statistics -> tags/most-popular
  2. Add operations for deleting and renaming tags for all entries at once. These batch operations should require special permissions, i.e. admin only! Do we need to generate a new version for each affected entry??
  3. Improve tag normalization algorithm
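
To make point 2 more concrete, here is a minimal in-memory sketch of batch rename and delete operations in Rust. Everything in it is an assumption for illustration: `Entry`, `rename_tag` and `delete_tag` are hypothetical and do not reflect the openfairdb data model, and the admin-only permission check is left out. Both functions return the ids of the affected entries, which is the information one would need if a new version had to be generated for each of them.

```rust
use std::collections::BTreeSet;

/// Hypothetical, stripped-down entry (place or event) with only an id and tags.
pub struct Entry {
    pub id: u32,
    pub tags: BTreeSet<String>,
}

impl Entry {
    pub fn new(id: u32, tags: &[&str]) -> Self {
        Entry { id, tags: tags.iter().map(|t| t.to_string()).collect() }
    }
}

/// Renames `from` to `to` in all entries and returns the ids of changed entries.
/// If an entry already carries `to`, the two tags are merged, because the
/// tags are kept in a set.
pub fn rename_tag(entries: &mut [Entry], from: &str, to: &str) -> Vec<u32> {
    let mut changed = Vec::new();
    if from == to {
        return changed; // nothing to do
    }
    for entry in entries.iter_mut() {
        if entry.tags.remove(from) {
            entry.tags.insert(to.to_owned());
            changed.push(entry.id);
        }
    }
    changed
}

/// Deletes `tag` from all entries and returns the ids of changed entries.
pub fn delete_tag(entries: &mut [Entry], tag: &str) -> Vec<u32> {
    let mut changed = Vec::new();
    for entry in entries.iter_mut() {
        if entry.tags.remove(tag) {
            changed.push(entry.id);
        }
    }
    changed
}
```

For example, `rename_tag(&mut entries, "inklusiv", "inklusion")` would apply the replacement suggested earlier in this thread to every entry at once.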

navid-kalaei commented 4 years ago

I carefully went through the questions above, and my findings are as follows:

  1. Can you figure out, how the tag-suggestions-List is generated? (It was done by Botho https://github.com/elbotho as a frontend issue: kartevonmorgen/kartevonmorgen#435) Thanks to your pointer to issue #435, I was able to find the source of the changes that introduced the tag-suggestion functionality. The corresponding pull request is #462.

  2. Can you buffer/ cache the tag-list in the frontend? Yes, that is possible; however, there are some caveats about caching that I would like to mention:

    1. A caching mechanism can consume a lot of network bandwidth, as each API call may transfer a sizeable payload from the backend to the frontend. The resulting network delays may frustrate clients and may also cost them money, as some mobile operators bill by the amount of data transferred.
    2. A policy for updating/invalidating the cache is needed. Either the frontend pulls from the backend or the backend pushes the new cache contents. Low-rate pulling results in inaccurate suggestions. High-rate pulling raises the risk of effectively brute-forcing our own backend.
  3. Can you filter only for tags that are used more often than 3 times?

  4. Can you sort the suggested tags by times how often they are used and suggest the most used tags first? The answer to both questions is yes, as we already have issue #495. However, the remaining questions are:

    1. As the number of tags increases, the data that needs to be transferred to the client side decreases, but the performance on the frontend drops since the size of the cache grows, while the backend can handle these performance and network issues properly. Is that acceptable from a business-logic perspective?
    2. Also, what would the cache update event be? Time-based intervals, or user events such as the user stopping typing?
  5. Can you implement the same suggestion function to the search-field? The feasibility of the implementation depends on the details. For example, if we had a search query like: #meet us at the #coffee should the dropdown get updated when the cursor moves from coffee to meet, or is a tag considered done once it has been added? Could I kindly ask for a practical example, if you don't mind?

Considering the above, I propose implementing event-based requests driven by the input the frontend receives. More specifically, when a user stops typing, the frontend waits for a short safety interval and then requests new recommendations from the backend (debounce). The backend filters the tags and sends the response back to be presented to the client. The norm is to use WebSockets to speed up the network transfers. However, for simplicity, we could use a traditional REST endpoint.
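
The debounce idea can be sketched independently of any UI framework. The Rust snippet below is only an illustration of the timing logic, under the assumption that the caller reports keystrokes and periodically asks whether a request is due; the `Debouncer` type and its methods are hypothetical, and the actual frontend would implement this in its own language and event model. Timestamps are passed in explicitly so that the behaviour is easy to test.

```rust
use std::time::{Duration, Instant};

/// Decides when a suggestion request should be sent after typing stops.
/// (Hypothetical helper for illustration only.)
pub struct Debouncer {
    quiet_period: Duration,
    last_input: Option<Instant>,
}

impl Debouncer {
    pub fn new(quiet_period: Duration) -> Self {
        Debouncer { quiet_period, last_input: None }
    }

    /// Records a keystroke at `now`; every keystroke restarts the quiet period.
    pub fn on_input(&mut self, now: Instant) {
        self.last_input = Some(now);
    }

    /// Returns `true` once when at least `quiet_period` has passed since the
    /// last keystroke, and `false` until new input arrives afterwards.
    pub fn should_fire(&mut self, now: Instant) -> bool {
        match self.last_input {
            Some(last) if now.duration_since(last) >= self.quiet_period => {
                self.last_input = None; // consume the pending input
                true
            }
            _ => false,
        }
    }
}
```

When `should_fire` returns `true`, the caller would send one request to the backend (REST or WebSocket) with the current input. Keystrokes that arrive in quick succession therefore result in a single request rather than one per character.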

wellemut commented 4 years ago

It's solved.