inhumantsar / slurp

Slurps webpages and saves them as clean, uncluttered Markdown. Think Pocket, but better.
https://inhumantsar.github.io/slurp/
MIT License
164 stars 6 forks

Getting a fair number of "Invalid dates" for the date field. #25

Closed Truncated closed 4 months ago

Truncated commented 4 months ago

I'm finding a fair number of Invalid date errors. Here's a report I made with source links for examples of pages that have all given that problem for me:

Slurp Problem Analysis.pdf

inhumantsar commented 4 months ago

Unfortunately, not every site is going to have parsable data for every field. For either Slurp or Readability to pick up on a date, it has to be contained in a well-structured HTML element, ideally a meta tag. Right now, Slurp supports OpenGraph metadata and a handful of others, since Readability doesn't have great metadata support at the moment, though I'm working on expanding that.

KD Nuggets seems to publish some OpenGraph metadata, but they don't use OpenGraph's article:published_time:

<meta property="og:url" content="https://www.kdnuggets.com/5-free-stanford-ai-courses">
<meta property="og:site_name" content="KDnuggets">
<meta property="og:locale" content="en_US">
<meta property="og:type" content="article">
<meta property="article:author" content="https://www.facebook.com/kdnuggets">
<meta property="article:publisher" content="https://www.facebook.com/kdnuggets">
<meta property="article:section" content="Originals">
<meta property="article:tag" content="Artificial Intelligence">
<meta property="og:title" content="5 Free Stanford AI Courses - KDnuggets">
<meta property="og:description" content="Want to learn more about Artificial Intelligence? These five courses from Stanford will help you kickstart that journey.">
<meta property="og:image" content="https://www.kdnuggets.com/wp-content/uploads/Wijaya_5_Free_Stanford_AI_Courses_1.png">
<meta property="og:image:secure_url" content="https://www.kdnuggets.com/wp-content/uploads/Wijaya_5_Free_Stanford_AI_Courses_1.png">
<meta property="og:image:width" content="1280">
<meta property="og:image:height" content="720">
<meta property="og:image:alt" content="5 Free Stanford AI Courses">
<meta name="twitter:card" content="summary">
<meta name="twitter:site" content="@kdnuggets">
<meta name="twitter:creator" content="@kdnuggets">
<meta name="twitter:title" content="5 Free Stanford AI Courses - KDnuggets">
<meta name="twitter:description" content="Want to learn more about Artificial Intelligence? These five courses from Stanford will help you kickstart that journey.">
<meta name="twitter:image" content="https://www.kdnuggets.com/wp-content/uploads/Wijaya_5_Free_Stanford_AI_Courses_1-1024x576.png">

On the other hand, Nature includes publication dates in their dc and prism metadata, but not in the citation_* metadata:

    <meta name="dc.date" content="2024-04-09"/>
...
    <meta name="prism.publicationDate" content="2024-04-09"/>
...
    <meta name="citation_author" content="Jackson, Joshua Conrad"/>
    <meta name="citation_author_institution" content="Booth School of Business, University of Chicago, Chicago, USA"/>
    <meta name="citation_author" content="Medvedev, Danila"/>
    <meta name="citation_author_institution" content="Booth School of Business, University of Chicago, Chicago, USA"/>

By contrast, Our World in Data doesn't use dc or prism, and they don't use OpenGraph's article:published_time either, but they do use the citation_publication_date field, which Nature doesn't.

Even when sites use standardized fields, they don't always use them properly. So while I can add dc, prism, and citation_publication_date to Slurp or contribute upstream to Readability (and wait for someone from Mozilla to review the PR), you can probably see how quickly this snowballs.
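To illustrate why this snowballs, here is a rough sketch of a fallback lookup across the fields mentioned above. It is not Slurp's or Readability's actual code: the names `DATE_FIELDS`, `collectMeta`, and `findPublishedDate` are hypothetical, the priority order is an assumption, and the regex-based tag scan is a simplification that would miss single-quoted or unquoted attributes.

```typescript
// Hypothetical sketch; not how Slurp or Readability actually extract dates.

// Candidate fields in an assumed priority order, taken from the examples in this thread.
const DATE_FIELDS: string[] = [
  "article:published_time",    // OpenGraph article namespace
  "dc.date",                   // Dublin Core (seen on Nature)
  "prism.publicationDate",     // PRISM (seen on Nature)
  "citation_publication_date", // seen on Our World in Data
];

// Collect <meta> tags into a lookup keyed by the lower-cased property/name attribute.
// Only double-quoted attributes are handled; the first occurrence of a key wins.
function collectMeta(html: string): Record<string, string> {
  const meta: Record<string, string> = Object.create(null);
  for (const tag of html.match(/<meta\b[^>]*>/gi) ?? []) {
    const key = /\b(?:property|name)\s*=\s*"([^"]*)"/i.exec(tag)?.[1];
    const content = /\bcontent\s*=\s*"([^"]*)"/i.exec(tag)?.[1];
    if (key && content && meta[key.toLowerCase()] === undefined) {
      meta[key.toLowerCase()] = content;
    }
  }
  return meta;
}

// Return the first candidate field whose value parses as a date, or null if none does.
function findPublishedDate(html: string): Date | null {
  const meta = collectMeta(html);
  for (const field of DATE_FIELDS) {
    const raw = meta[field.toLowerCase()];
    if (!raw) continue;
    const parsed = new Date(raw);
    // An unparsable string yields a Date whose time value is NaN; skip it.
    if (!isNaN(parsed.getTime())) return parsed;
  }
  return null;
}
```

Returning `null` instead of an unparsable value lets the caller decide whether to leave the field blank. Every new site convention means another entry in the list, plus whatever special handling its date format needs.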

At some point I'd like to spend some time getting Phi-3 to run on-device directly in Obsidian and extracting content and metadata that way rather than relying on Readability and specific bits of structured data, but that will come with a whole host of caveats of its own.

In the meantime though, I'll go ahead and create a new issue for the Nature and OWID fields I mentioned above.

If there are other sites you'd really like to start getting the date (or any other missing field) from, check the page source for <meta> tags. If there's a tag carrying the value you want, feel free to open an issue for that field. Depending on how it would have to be implemented, I will either contribute it upstream or add it to Slurp.
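For anyone who would rather not read page source by eye, a small helper along these lines could list the candidates. This is only a sketch: `listDateLikeMeta` is a hypothetical name, not part of Slurp, and the keyword filter (date, time, published, modified) is a guess at what date-bearing fields tend to be called.

```typescript
// Hypothetical helper, not part of Slurp: list <meta> tags whose key hints at a date.
function listDateLikeMeta(html: string): Array<{ key: string; content: string }> {
  const hits: Array<{ key: string; content: string }> = [];
  for (const tag of html.match(/<meta\b[^>]*>/gi) ?? []) {
    // Only double-quoted attributes are handled in this sketch.
    const key = /\b(?:property|name|itemprop)\s*=\s*"([^"]*)"/i.exec(tag)?.[1];
    const content = /\bcontent\s*=\s*"([^"]*)"/i.exec(tag)?.[1];
    // Keyword filter is an assumption about how date fields are usually named.
    if (key && content && /date|time|published|modified/i.test(key)) {
      hits.push({ key, content });
    }
  }
  return hits;
}
```

In a browser console, the page's markup can be passed in as `document.documentElement.outerHTML`. An empty result suggests the site publishes no date in its meta tags, in which case there is nothing for a new field to pick up.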

Truncated commented 4 months ago

Thank you for explaining your position in such detail. I figured, if nothing else, that you may find a list of examples helpful if it was something you were looking to expand on.

The invalid date coming in is mildly annoying, but certainly not a show-stopper for my purposes, so I'm fine as-is.

Thanks!

inhumantsar commented 4 months ago

I definitely appreciate the examples. I'll be going through them at some point to check out the bigger sites.