polyrabbit / hacker-news-digest

:newspaper: Let ChatGPT Summarize Hacker News for You
http://hackernews.betacat.io/
GNU Lesser General Public License v3.0
668 stars 87 forks

some articles are not summarized #21

Closed thiswillbeyourgithub closed 1 year ago

thiswillbeyourgithub commented 1 year ago

For example: https://www.smithsonianmag.com/history/the-photographer-who-forced-the-us-to-confront-its-child-labor-problem-180982355/

The summary is extremely long and unformatted. I've seen this several times, so I think there is a bug in your code for some websites or input file types?

Thanks for the website btw

polyrabbit commented 1 year ago

Hi, thanks for the feedback. This article is already summarized by OpenAI, as denoted by the model icon. But the returned summary is longer than 400 characters, so I have to truncate and append an ellipsis to the end.

Quote from the official doc:

Note however that instructing the model to generate a specific number of words does not work with high precision.
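The truncation described above could look roughly like the sketch below. The 400-character limit comes from this comment; the function and constant names are hypothetical and are not taken from the project's code.

```python
MAX_SUMMARY_CHARS = 400  # limit mentioned in the comment; the name is an assumption


def truncate_summary(text: str, limit: int = MAX_SUMMARY_CHARS) -> str:
    """Return text unchanged if it fits, otherwise cut it and append an ellipsis."""
    if len(text) <= limit:
        return text
    # Cut at the limit, drop trailing whitespace, then mark the cut.
    return text[:limit].rstrip() + "..."
```

A summary of 400 characters or fewer passes through untouched; anything longer is cut to the limit and gets a trailing ellipsis.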

image

thiswillbeyourgithub commented 1 year ago

Hi, thanks for the quick answer but I'm confident there is a bug.

I am using your website via RSS and when I click on most links I see that the content is indeed a summary, but for the article I linked the summary is pages and pages long, a wall of text.

Here's a picture:

Normal article summary: image

Suspicious article: image image

I only supplied the first and last page but the whole wall of text is about 5 times the content of a single screenshot.

polyrabbit commented 1 year ago

Oh, I see the problem, here is what happened:

  1. When a new post appears on the HN front page, OpenAI is not used to summarize it until its score reaches a certain threshold (to reduce OpenAI cost, as this website has no income yet).
  2. Your RSS reader scrapes the content, and I DO have a bug when rendering the RSS feed: the content is not truncated when the summary model is not OpenAI, so you see pages and pages of text.
  3. When the post is upvoted to a certain score, I submit the raw content to OpenAI for summarization. That's the result you see on the website now. But it seems your RSS reader never updates its content to the new summary.
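To illustrate the rendering bug from step 2, here is a minimal sketch, assuming a post record that stores which model produced its summary. The `Post` fields, the `"openai"` / `"none"` labels, and the function names are all hypothetical; the real code may be structured differently.

```python
from dataclasses import dataclass

SUMMARY_LIMIT = 400  # assumed limit, matching the 400 characters mentioned earlier


@dataclass
class Post:
    title: str
    summary: str        # model summary, or the raw article text if not summarized yet
    summarized_by: str  # hypothetical labels: "openai" or "none"


def _truncate(text: str, limit: int = SUMMARY_LIMIT) -> str:
    return text if len(text) <= limit else text[:limit].rstrip() + "..."


def feed_description_buggy(post: Post) -> str:
    # Only OpenAI output is truncated, so raw article text goes out in full.
    if post.summarized_by == "openai":
        return _truncate(post.summary)
    return post.summary


def feed_description_fixed(post: Post) -> str:
    # Truncate regardless of which model (if any) produced the text.
    return _truncate(post.summary)
```

With a 5000-character raw article and `summarized_by="none"`, the buggy variant emits all 5000 characters into the feed, while the fixed variant emits 400 characters plus the ellipsis.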

The rendering issue can be fixed quickly. As for the RSS reader not updating the summary, I'm planning to stop outputting posts that are below a certain score and only output posts that have been summarized by OpenAI, so that we have a consistent view from the RSS reader's perspective. What do you think?
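The proposed feed filter might be sketched as follows, again with hypothetical field names and labels. It keeps only posts that already carry an OpenAI summary, so a reader never caches an unsummarized entry.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Post:
    title: str
    score: int
    summarized_by: str  # hypothetical labels: "openai" or "none"


def posts_for_feed(posts: Iterable[Post]) -> List[Post]:
    """Keep only posts that have already been summarized by OpenAI."""
    return [p for p in posts if p.summarized_by == "openai"]
```

Filtering on the summary status rather than on the score itself means the feed stays consistent even if the score threshold changes later.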

Thanks for reporting.

thiswillbeyourgithub commented 1 year ago

Nice.

I think an RSS reader updating a feed that is supposed to contain summarized articles should not find non-summarized articles, so the best fix IMO is to not output posts that have not been summarized.

polyrabbit commented 1 year ago

Fix committed. New feed entries should now have correct summaries. Let me know if you have any other issues.

thiswillbeyourgithub commented 1 year ago

Thanks!