inhumantsar / slurp

Slurps webpages and saves them as clean, uncluttered Markdown. Think Pocket, but better.
https://inhumantsar.github.io/slurp/
MIT License

Threw an error on ingestion #21

Closed: Truncated closed this issue 4 months ago

Truncated commented 4 months ago

<Edited to focus on the relevant error rather than the entire log, which also included slurps that worked fine even though the title was missing>

1715205555277 | DEBUG | onValidate called
[
  {
    "enabled": true,
    "custom": false,
    "_key": "link",
    "_idx": 0,
    "id": "link",
    "metaFields": [
      "url",
      "og:url",
      "parsely-link",
      "twitter:url"
    ],
    "defaultIdx": 0,
    "defaultKey": "link",
    "description": "Page URL provided or a permalink discovered in metadata."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "byline",
    "_idx": 1,
    "id": "byline",
    "metaFields": [
      "author",
      "article:author",
      "parsely-author",
      "cXenseParse:author"
    ],
    "defaultIdx": 1,
    "defaultKey": "byline",
    "description": "Name of the primary author or the first author detected."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "site",
    "_idx": 2,
    "id": "siteName",
    "metaFields": [
      "og:site_name",
      "page.content.source",
      "application-name",
      "apple-mobile-web-app-title",
      "twitter:site"
    ],
    "defaultIdx": 2,
    "defaultKey": "site",
    "description": "Website or publication name."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "date",
    "_idx": 3,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "publishedTime",
    "metaFields": [
      "article:published_time",
      "parsely-pub-date",
      "datePublished",
      "article.published"
    ],
    "defaultIdx": 3,
    "defaultKey": "date",
    "description": "Date/time that the page was initially published.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "updated",
    "_idx": 4,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "modifiedTime",
    "metaFields": [
      "article:modified_time",
      "dateModified",
      "dateLastPubbed"
    ],
    "defaultIdx": 4,
    "defaultKey": "updated",
    "description": "Date/time that the page was last modified, if available.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "type",
    "_idx": 5,
    "id": "type",
    "metaFields": [
      "og:type",
      "parsely-type",
      "medium",
      "page.content.type"
    ],
    "defaultIdx": 5,
    "defaultKey": "type",
    "description": "Type of publication, eg: \"page\", \"post\", \"article\"."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "excerpt",
    "_idx": 6,
    "id": "excerpt",
    "metaFields": [
      "description",
      "og:description",
      "twitter:description"
    ],
    "defaultIdx": 6,
    "defaultKey": "excerpt",
    "description": "Often used for subtitles, excerpts, descriptions, and abstracts."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "twitter",
    "_idx": 7,
    "_format": "s|https://twitter.com/{s}",
    "id": "twitter",
    "metaFields": [
      "twitter:creator",
      "twitter:site"
    ],
    "defaultIdx": 7,
    "defaultKey": "twitter",
    "description": "Twitter/X link for the author or site.",
    "defaultFormat": "s|https://twitter.com/{s}"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "tags",
    "_idx": 8,
    "_format": "S|{prefix}/{tag}",
    "id": "tags",
    "metaFields": [
      "tags",
      "keywords",
      "article:tag",
      "parsely-tags",
      "news_keywords"
    ],
    "defaultIdx": 8,
    "defaultKey": "tags",
    "description": "Tags and keywords present in the page's metadata.",
    "defaultFormat": "S|{prefix}/{tag}"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "onion",
    "_idx": 9,
    "id": "onion",
    "metaFields": [
      "onion-location"
    ],
    "defaultIdx": 9,
    "defaultKey": "onion",
    "description": "Link to a mirror of the content on Tor."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "slurped",
    "_idx": 10,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "slurped",
    "defaultIdx": 10,
    "defaultKey": "slurped",
    "description": "Date/time that the page was accessed by Slurp.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "title",
    "_idx": 11,
    "id": "title",
    "metaFields": [
      "og:title",
      "twitter:title"
    ],
    "defaultIdx": 11,
    "defaultKey": "title",
    "description": "Page title as seen in the browser, falling back to the title presented in metadata."
  }
]
inhumantsar commented 4 months ago

hey thanks for the report! looks like neither readability nor slurp were able to find a title for this page. i'll probably have to submit a patch upstream of slurp for this one.

can you share the URL? i'm not seeing it in the logs

Truncated commented 4 months ago

These were from www.fastcompany.com. Any link from that site does this. If I'm reading the log above correctly, it covers multiple different links, but I honestly didn't think to record the URLs alongside the auto-generated bug log. I will in the future (I've got a few more to submit).

inhumantsar commented 4 months ago

The logs were quoting a product page. I looked it up and found this: https://sparksoftcorp.com/dev-sec-ops-delivery

The site doesn't have any meta tags or even a title tag so there's not much that Slurp can do on its own. Filenames are sourced from the title. I could set it up to just call it Untitled Page or something but this feels like a pretty rare edge case.
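For illustration, the "Untitled Page" fallback could look roughly like the sketch below. This is not Slurp's actual code: `resolveNoteFilename` and `FALLBACK_TITLE` are hypothetical names, and the character replacement is just an assumption about what a safe filename needs.

```typescript
// Hypothetical helper: none of these names come from Slurp's source.
const FALLBACK_TITLE = "Untitled Page";

// Build a note filename from a page title, falling back to a placeholder
// when the page has no usable title (no <title> tag, no title meta tags).
function resolveNoteFilename(title: string | null | undefined): string {
  const trimmed = (title ?? "").trim();
  const base = trimmed.length > 0 ? trimmed : FALLBACK_TITLE;
  // Assumption: replace characters that are invalid in filenames on common platforms.
  const safe = base.replace(/[\\/:*?"<>|]/g, "-");
  return `${safe}.md`;
}
```

With this, a page with no title at all would land in `Untitled Page.md` rather than failing to produce a filename.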

I will be adding more options to the Slurp New Note dialog soon though. That will be the best place to manually give it a title to use.

Truncated commented 4 months ago

That's a red herring - the Sparksoft pages were ones I had ingested earlier. Yes, there wasn't much to pull, but I was mostly concerned with the text and didn't care about the metadata.

It's the links from the Fast Company site that throw the error. The log output in settings didn't give me a reliable way to isolate just the error message, so you got both ingestions.

Literally any link from fastcompany.com throws an error. Here's a clean example from https://www.fastcompany.com/91122708/heres-how-california-state-agencies-plan-use-generative-ai

1715349697499 | DEBUG | onValidate called
[
  {
    "enabled": true,
    "custom": false,
    "_key": "Source",
    "_idx": 0,
    "id": "link",
    "metaFields": [
      "url",
      "og:url",
      "parsely-link",
      "twitter:url"
    ],
    "defaultIdx": 0,
    "defaultKey": "link",
    "description": "Page URL provided or a permalink discovered in metadata."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "byline",
    "_idx": 1,
    "id": "byline",
    "metaFields": [
      "author",
      "article:author",
      "parsely-author",
      "cXenseParse:author"
    ],
    "defaultIdx": 1,
    "defaultKey": "byline",
    "description": "Name of the primary author or the first author detected."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "site",
    "_idx": 2,
    "id": "siteName",
    "metaFields": [
      "og:site_name",
      "page.content.source",
      "application-name",
      "apple-mobile-web-app-title",
      "twitter:site"
    ],
    "defaultIdx": 2,
    "defaultKey": "site",
    "description": "Website or publication name."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "date",
    "_idx": 3,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "publishedTime",
    "metaFields": [
      "article:published_time",
      "parsely-pub-date",
      "datePublished",
      "article.published"
    ],
    "defaultIdx": 3,
    "defaultKey": "date",
    "description": "Date/time that the page was initially published.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "updated",
    "_idx": 4,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "modifiedTime",
    "metaFields": [
      "article:modified_time",
      "dateModified",
      "dateLastPubbed"
    ],
    "defaultIdx": 4,
    "defaultKey": "updated",
    "description": "Date/time that the page was last modified, if available.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "type",
    "_idx": 5,
    "id": "type",
    "metaFields": [
      "og:type",
      "parsely-type",
      "medium",
      "page.content.type"
    ],
    "defaultIdx": 5,
    "defaultKey": "type",
    "description": "Type of publication, eg: \"page\", \"post\", \"article\"."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "excerpt",
    "_idx": 6,
    "id": "excerpt",
    "metaFields": [
      "description",
      "og:description",
      "twitter:description"
    ],
    "defaultIdx": 6,
    "defaultKey": "excerpt",
    "description": "Often used for subtitles, excerpts, descriptions, and abstracts."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "twitter",
    "_idx": 7,
    "_format": "s|https://twitter.com/{s}",
    "id": "twitter",
    "metaFields": [
      "twitter:creator",
      "twitter:site"
    ],
    "defaultIdx": 7,
    "defaultKey": "twitter",
    "description": "Twitter/X link for the author or site.",
    "defaultFormat": "s|https://twitter.com/{s}"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "tags",
    "_idx": 8,
    "_format": "S|{prefix}/{tag}",
    "id": "tags",
    "metaFields": [
      "tags",
      "keywords",
      "article:tag",
      "parsely-tags",
      "news_keywords"
    ],
    "defaultIdx": 8,
    "defaultKey": "tags",
    "description": "Tags and keywords present in the page's metadata.",
    "defaultFormat": "S|{prefix}/{tag}"
  },
  {
    "enabled": false,
    "custom": false,
    "_key": "onion",
    "_idx": 9,
    "id": "onion",
    "metaFields": [
      "onion-location"
    ],
    "defaultIdx": 9,
    "defaultKey": "onion",
    "description": "Link to a mirror of the content on Tor."
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "slurped",
    "_idx": 10,
    "_format": "d|YYYY-MM-DDTHH:mm",
    "id": "slurped",
    "defaultIdx": 10,
    "defaultKey": "slurped",
    "description": "Date/time that the page was accessed by Slurp.",
    "defaultFormat": "d|YYYY-MM-DDTHH:mm"
  },
  {
    "enabled": true,
    "custom": false,
    "_key": "title",
    "_idx": 11,
    "id": "title",
    "metaFields": [
      "og:title",
      "twitter:title"
    ],
    "defaultIdx": 11,
    "defaultKey": "title",
    "description": "Page title as seen in the browser, falling back to the title presented in metadata."
  }
]
inhumantsar commented 4 months ago

ah ok, yeah the error message slurp displays says that it got a 403 back from fast company, so I'm guessing that they block non-browsers from accessing their pages. I'll have a look but there's likely not much we can do about that

inhumantsar commented 4 months ago

fast company does seem to block application access entirely, so i've added a validation step to new note creation which will complain if a fast company link is used. did the same for that product site too.
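As a rough idea of what a validation step like that can look like, here is a minimal sketch. It is not the code that was actually added to Slurp: `BLOCKED_HOSTS` and `validateSlurpUrl` are hypothetical names, and the list only contains the two sites mentioned in this thread.

```typescript
// Hypothetical blocklist; the two hosts are the ones named in this thread.
const BLOCKED_HOSTS: readonly string[] = ["fastcompany.com", "sparksoftcorp.com"];

// Returns an error message if the URL can't be used, or null if it looks fine.
function validateSlurpUrl(input: string): string | null {
  let host: string;
  try {
    host = new URL(input).hostname.toLowerCase();
  } catch {
    // new URL() throws on strings that are not absolute URLs.
    return "Not a valid URL.";
  }
  // Match the bare domain and any subdomain such as www.
  const blocked = BLOCKED_HOSTS.find((b) => host === b || host.endsWith(`.${b}`));
  return blocked ? `${blocked} cannot be slurped.` : null;
}
```

Matching on the hostname suffix rather than the whole string means `www.fastcompany.com` is caught while an unrelated domain that merely ends in the same letters is not.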

let me know if you find any other sites which just refuse to be slurped!