polyrabbit / hacker-news-digest

:newspaper: Let ChatGPT Summarize Hacker News for You
http://hackernews.betacat.io/
GNU Lesser General Public License v3.0

Gemma "▁viciss" token appearing randomly on summary #36

Open brunodoamaral opened 4 months ago

brunodoamaral commented 4 months ago

Hi!

I noticed that Gemma-generated summaries sometimes go wrong when the model "hallucinates" the specific token "▁viciss" (id 200507, according to the tokenizer file). Here are a few examples from today's news:

LogiCola, a software for learning logic, has been redesigned and released as version 3.0 vicissolar definitions and propositional translations are now available in a quiz mode. Malik Piara aims to continuously improve and maintain the open-source software.

Scientists have found evidence that giant blobs of material left behind by a cosmic collision 4 vicissitation 4 Kün 4 vicissitation billion years ago may be responsible for modern plate tectonics. Their computer models suggest the blobs caused subduction and surface sinking, leading to the formation of early tectonic boundaries.

I haven't had time to look at this repo's code, but I'm a regular user of https://hackernews.betacat.io/ and I remember seeing the same issue yesterday.

QINGCHARLES commented 4 months ago

I see it in pretty much every summary that has a number now. It's been like that for about a week or so. It makes the summaries very hard to read. I wonder what code change caused this? It is something to do with numerical parsing AFAICT.

polyrabbit commented 4 months ago

@brunodoamaral Thanks for reporting. @thiswillbeyourgithub also mentioned the same issue and suggested using logit_bias to suppress those words. I gave it a quick try, but it didn't work as expected. Now I know the reason: the token id I used was wrong, because I didn't know about the prefix trick (why?).

I just added those words to the bias list and it works perfectly now. I suppose there are more words like this, so I'll keep an eye on it. Thanks for the knowledge! https://github.com/polyrabbit/hacker-news-digest/blob/8167ef6ac832307921349e07873ec988d0ba101f/hacker_news/llm/openai.py#L78-L80
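For readers wondering about the "prefix trick": SentencePiece-style tokenizers such as Gemma's mark a word-initial token with a leading "▁" (U+2581) in place of the space, so "viciss" and "▁viciss" are separate vocabulary entries with different ids. Below is a minimal sketch of building a `logit_bias` map that covers both forms. It assumes a token-to-id dict loaded from the tokenizer file; `TOY_VOCAB`, `build_logit_bias`, and the ids 12345 and 573 are hypothetical, and only 200507 comes from the report above. It is not the code at the linked lines.

```python
# Hypothetical slice of a tokenizer vocabulary (token -> id). Only the id
# 200507 for "▁viciss" comes from the issue report; the others are made up.
TOY_VOCAB = {"\u2581viciss": 200507, "viciss": 12345, "\u2581the": 573}

SPACE_MARKER = "\u2581"  # "▁", the SentencePiece word-boundary marker


def build_logit_bias(words, vocab, bias=-100):
    """Map every known form of each word (bare and "▁"-prefixed) to `bias`."""
    logit_bias = {}
    for word in words:
        bare = word.lstrip(SPACE_MARKER)
        for form in (bare, SPACE_MARKER + bare):
            token_id = vocab.get(form)
            if token_id is not None:
                # JSON object keys are strings, so the ids are stringified.
                logit_bias[str(token_id)] = bias
    return logit_bias


print(build_logit_bias(["viciss"], TOY_VOCAB))
```

The resulting dict can be passed as the `logit_bias` parameter of an OpenAI-style chat completion request, provided the backend honours that parameter for the chosen model.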

@QINGCHARLES Yes, there was a recent change. I used to use the GPT model from OpenAI, but it is too expensive for a free, long-running project like this. So I switched to the free Gemma model on OpenRouter, and that is where the issue comes from.

polyrabbit commented 4 months ago

Oops, I now see lots of weird "196" strings interspersed in the summaries. I need to look for another model...

thiswillbeyourgithub commented 4 months ago

Have you tried playing with the frequency and repetition penalty?

https://platform.openai.com/docs/guides/text-generation/parameter-details
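For intuition, the linked guide describes the frequency and presence penalties as additive adjustments to a token's logit based on how often that token has already been sampled. Here is a rough sketch of that formula; `apply_penalties` and the sample values are hypothetical, and a provider-specific repetition penalty may well be computed differently.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of `logits` (token id -> logit) with penalties applied.

    Additive form from the OpenAI parameter guide:
    logit[j] -= count[j] * frequency_penalty + (count[j] > 0) * presence_penalty
    """
    counts = Counter(generated_ids)
    adjusted = dict(logits)
    for token_id, count in counts.items():
        if token_id in adjusted:
            # Every token in `counts` has count >= 1, so the presence term applies.
            adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted


# Token 7 was already sampled twice, token 9 never.
print(apply_penalties({7: 2.0, 9: 2.0}, [7, 7], frequency_penalty=0.5, presence_penalty=0.25))
```

Note that these penalties only discourage tokens that have already appeared in the output, so they would not prevent the first occurrence of a glitch token; `logit_bias` is the tool for that.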

polyrabbit commented 4 months ago

I haven't tried other values; both are currently set to 1. I suspect we cannot get rid of those magic words completely with Gemma, so we need to find a better model.

You can find parameters here: https://github.com/polyrabbit/hacker-news-digest/blob/master/hacker_news/llm/openai.py#L69-L77

thiswillbeyourgithub commented 4 months ago

Alright. I do think a good rule of thumb is not to give up before banning about 10 tokens; right now you have banned 5. I had to do this kind of thing a while ago, and after banning a few more tokens the model worked as expected (not Gemma, though).

Also, I don't know what you use to parse the webpage, but you might be interested in this: https://github.com/jina-ai/reader

It's a very simple URL parser that makes pages LLM-friendly; it even turns images into captions! It's quite new, though, and they had scaling issues at some point, so maybe use a timeout when querying it.

I'm bringing this up because good web parsing can greatly help LLMs summarize, especially smaller models.
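A minimal sketch of querying the reader with a timeout, assuming the usage described in the jina-ai/reader README (prepend `https://r.jina.ai/` to the page URL). The helper names `reader_url` and `fetch_readable` and the 10-second default are hypothetical.

```python
import urllib.request

READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url):
    """Build the reader URL by prepending the service prefix to the page URL."""
    return READER_PREFIX + page_url


def fetch_readable(page_url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch LLM-friendly text for `page_url`, or None on a network error.

    `opener` is injectable so the function can be exercised without network access.
    """
    try:
        with opener(reader_url(page_url), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return None
```

Returning `None` instead of raising lets the caller decide whether to skip the summary or fall back to another parser when the service is slow.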

polyrabbit commented 4 months ago

> not stop before banning like 10 tokens

I tweaked some code and have switched to llama3. I'll use it for a while and see how it goes. I hope I won't have to spend time working around another model's tokenizer issues again.

> Also, I don't know what you use to parse the webpage, but you might be interested in this: https://github.com/jina-ai/reader

It's a handwritten Python library that is small and easy to maintain. It's been used for more than 10 years, since the very beginning of this project.

The jina parser looks very helpful. I'm considering using it as a fallback for dynamic web pages. Thanks!
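The fallback idea can be sketched as a chain of extractors tried in order. This is only an illustration under assumptions: `extract_with_fallback`, the extractor callables, and the `min_chars` threshold are all hypothetical and not taken from the project's code.

```python
def extract_with_fallback(url, extractors, min_chars=200):
    """Try each extractor in order; return the first result that looks substantial.

    Each extractor is a callable taking a URL and returning text (or None).
    A result shorter than `min_chars` is treated as a failed extraction,
    which is a crude stand-in for detecting dynamically rendered pages.
    """
    for extract in extractors:
        try:
            text = extract(url)
        except Exception:
            # A failing extractor should not abort the whole chain.
            continue
        if text and len(text.strip()) >= min_chars:
            return text
    return None
```

A caller would pass the local parser first and a remote reader second, for example `extract_with_fallback(url, [local_parse, remote_reader])`, where both names are placeholders.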

QINGCHARLES commented 4 months ago

@polyrabbit I just want to say thank you for this app. The amount of time it saves me each day is literally life-changing, since I no longer have to click into articles on HN to see whether they are worth exploring.

polyrabbit commented 4 months ago

> I'm considering using it as a fallback for dynamic web pages.

Done. We now have summaries for Substack etc.