inhumantsar / slurp

Slurps webpages and saves them as clean, uncluttered Markdown. Think Pocket, but better.
https://inhumantsar.github.io/slurp/
MIT License
127 stars, 2 forks

Tags are not deduplicated before saving the page #23

Closed Truncated closed 1 month ago

Truncated commented 2 months ago

Summary

Some websites put two or three sources of keywords in the page header. Sometimes they are identical, sometimes not. Could a de-dupe routine be added that runs over the final list of found tags, maybe using a.filter(onlyUnique); as described here, or the distinct method?
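To make the suggestion concrete, here is a minimal sketch of the filter(onlyUnique) idea next to a Set-based alternative. The names dedupeTags and dedupeTagsWithSet are hypothetical and not taken from slurp's code; the only assumption is that the found tags end up in a plain string array.

```typescript
// Callback for Array.prototype.filter: keeps a value only at the index
// where it first appears, so later repeats are dropped.
function onlyUnique<T>(value: T, index: number, array: T[]): boolean {
  return array.indexOf(value) === index;
}

// Hypothetical helper (not slurp's actual API): exact-match de-dupe that
// preserves first-seen order.
function dedupeTags(tags: string[]): string[] {
  return tags.filter(onlyUnique);
}

// Alternative with the same result: a Set keeps insertion order and
// avoids the repeated indexOf scans.
function dedupeTagsWithSet(tags: string[]): string[] {
  return Array.from(new Set(tags));
}

const found = ["Generative AI", "AI", "Generative AI", "AI", "automation"];
console.log(dedupeTags(found));        // ["Generative AI", "AI", "automation"]
console.log(dedupeTagsWithSet(found)); // ["Generative AI", "AI", "automation"]
```

Both versions compare strings exactly, so "AI" and "ai" would still count as different tags.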

Details:

Their header metadata appears to take a "throw it against the wall" approach:

For example: https://www.forbes.com/sites/jiawertz/2024/05/07/ai-can-boost-solopreneurs-productivity-by-40/?sh=234850ea4cd1

<meta name="keywords" itemprop="keywords" content="Generative AI,AI,artificial intelligence,solopreneur,automation,chatGPT,content creation,personalization">
<meta name="news_keywords" itemprop="keywords" content="Generative AI,AI,artificial intelligence,solopreneur,automation,chatGPT,content creation,personalization">
.... further down...
<meta name="news_keywords" itemprop="keywords" content="Generative AI,AI,artificial intelligence,solopreneur,automation,chatGPT,content creation,personalization">

At least Forbes is consistent... economictimes mixes it up. For example: https://m.economictimes.com/small-biz/sme-sector/how-software-and-it-jobs-are-disappearing-in-favour-of-ai-and-what-is-going-to-fill-that-vacuum/amp_articleshow/109640608.cms

<meta name="news_keywords" content="startups,Small Business,AI,technology,IT,skills,workforce,jobs,future">
<meta content="AI,technology,IT,skills,workforce,jobs,future" name="keywords">
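For overlapping lists like these two, a rough sketch of merging the content attributes into one tag list might look like the following. mergeKeywordTags is a hypothetical name, not slurp's implementation, and the case-insensitive comparison is an assumption that a real fix may not want.

```typescript
// Hypothetical helper (not slurp's actual code): merges the comma-separated
// content strings of several keyword meta tags into one de-duplicated list.
function mergeKeywordTags(contents: string[]): string[] {
  // Key: lower-cased tag (assumption: case-insensitive matching is wanted).
  // Value: the spelling that was seen first.
  const seen = new Map<string, string>();
  for (const content of contents) {
    for (const raw of content.split(",")) {
      const tag = raw.trim();
      if (tag === "") continue; // skip empties from stray commas
      const key = tag.toLowerCase();
      if (!seen.has(key)) seen.set(key, tag);
    }
  }
  // Map iteration follows insertion order, so first-seen order is kept.
  return Array.from(seen.values());
}

const merged = mergeKeywordTags([
  "startups,Small Business,AI,technology,IT,skills,workforce,jobs,future",
  "AI,technology,IT,skills,workforce,jobs,future",
]);
console.log(merged);
// ["startups", "Small Business", "AI", "technology", "IT", "skills",
//  "workforce", "jobs", "future"]
```

Lower-casing the key also merges tags like "IT" and "it", which may or may not be desirable; dropping the toLowerCase call gives exact matching instead.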

inhumantsar commented 1 month ago

should be fixed in 0.1.12!