hoarder-app / hoarder

A self-hostable bookmark-everything app (links, notes and images) with AI-based automatic tagging and full text search
https://hoarder.app
GNU Affero General Public License v3.0

[Feature request] Force AI to use existing tags (instead of creating them) #111

Open · mowsat opened this issue 4 months ago

mowsat commented 4 months ago

An option in the settings to force the AI to use pre-existing tags would allow for more fine-grained organization.

MikeKMiller commented 4 months ago

One option: pass the existing tags along with the content, and have the AI API return any existing tags that could apply, plus new ones only if none apply. That way it doesn't always invent a new tag even when an equivalent one already exists. For example, mine has these two tags that mean exactly the same thing: 'AI' and 'Artificial Intelligence'.

If we had passed in the existing 'Artificial Intelligence' tag, the model would have chosen it instead of creating 'AI'.

MohamedBassem commented 4 months ago

This seems to be a popular request, so I'll probably have to implement it at some point. The main problem is that the naive implementation gets expensive if you have a lot of tags. Basically, the naive implementation is that you pass all of the user's tags to OpenAI/Ollama on every request and ask it to select only from those tags. While this is easy to implement, every word added to the AI request costs more money. If you have 1000 tags, for example, and every article you add is around 1000 words, you'll end up paying roughly twice as much per inference request. I'm happy to add this as a feature with a big warning about this limitation, but I'm not sure I like it.

The more advanced approach, which I'm planning to implement, is more complex but will achieve the best result. At a high level, we'd have a mechanism to find the potentially relevant tags among all the existing tags and pass only those to OpenAI, making the request much cheaper. It will take a bit more time to implement, but it's on my radar.

Does that make sense?

1d618 commented 1 month ago

> While this is easy to implement, every word you add to the AI request basically costs more money. So if you have 1000 tags for example, and every article you add is around 1000 words, you'll end up paying twice as much per inference request. I'm happy to add this as a feature with a big warning about this limitation but I'm not sure I like it.

With gpt-4o-mini, 3-5 thousand input tokens are extremely cheap, and it seems to me the price will only go down in the future. And that's not even counting local open-source models.

By the way, can I ask you a question? Are there plans to add a function that summarises the content of an added page and uses that summary in search?

ant1fr commented 3 weeks ago

To address these near-duplicate tags, I suggest a few potential solutions:

Additionally, a complementary approach could be periodic tag review and standardization: run a dedicated prompt that gives ChatGPT all the AI-generated tags and asks it to suggest merges, clean up, and standardize them.