inhumantsar / slurp

Slurps webpages and saves them as clean, uncluttered Markdown. Think Pocket, but better.
https://inhumantsar.github.io/slurp/
MIT License

Customizable template for Slurp notes #3

Closed chrisgrieser closed 2 months ago

chrisgrieser commented 3 months ago

Thanks for this plugin. Looking forward to replacing Advanced URI + MarkDownload with it.

One of the main features missing would be the ability to customize the properties (frontmatter) of the slurped files. Adding the date and the site name is good (already an improvement on MarkDownload!), but things like author, other names, and manually added properties would be really useful.

inhumantsar commented 3 months ago

Good news, author is already supported! It will appear when it can be found during parsing. It currently leaves out properties it doesn't get a value for though.

Would you want to be able to set those properties when you're loading a URL or after the page is saved?

I haven't tried MarkDownload, sounds like I should check it out and look for friction.

chrisgrieser commented 3 months ago

> It will appear when it can be found during parsing. It currently leaves out properties it doesn't get a value for though.

Ah, I see. I would pretty much prefer it to always add the property, so that I can manually add them when they cannot be parsed.

> Would you want to be able to set those properties when you're loading a URL or after the page is saved?

I think just saving everything in a file would be the most straightforward solution, since adding values to properties works fine in a regular Obsidian note already

inhumantsar commented 3 months ago

I've added this as a settings toggle in v0.1.3. it won't affect existing notes though!
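
For readers following along, here is a minimal sketch of what such a toggle could do when the frontmatter is written. The names `ParsedProps`, `buildFrontmatter`, and `showEmpty` are made up for illustration and are not taken from slurp's code.

```typescript
// Hypothetical shape: a parsed value may be missing when the page lacks metadata.
type ParsedProps = { [key: string]: string | undefined };

// Build the frontmatter block. When showEmpty is true, properties without a
// value are still written (with an empty value) so they can be filled in by hand.
function buildFrontmatter(props: ParsedProps, showEmpty: boolean): string {
  const lines: string[] = ["---"];
  for (const key of Object.keys(props)) {
    const value = props[key];
    if (value === undefined || value === "") {
      if (showEmpty) lines.push(`${key}:`);
      continue;
    }
    // JSON.stringify yields a double-quoted string, which is also valid YAML.
    lines.push(`${key}: ${JSON.stringify(value)}`);
  }
  lines.push("---");
  return lines.join("\n");
}
```

With the toggle off, a page without an author produces no `author` line at all; with it on, an empty `author:` line is written and can be edited later in Obsidian.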

chrisgrieser commented 3 months ago

Thanks for the quick implementation!

However, I do not think this issue should be closed (yet), since it has only been partially addressed: it is not possible to assign custom keys or to define which properties are included, in what order, etc.

inhumantsar commented 3 months ago

ah i see! yeah that goes a bit deeper than what i understood from your initial comment. i'll be honest, i'm a bit wary of going deep on customization in these early versions.

automatically populating properties is high on my priority list, but adding new properties after people have been using slurp for a while could introduce a fair bit of friction. key conflicts in particular worry me. maybe it makes sense to establish a relatively comprehensive set of "reserved" keys and data types first, then offer customization options later.

right now i've been looking at adding fields to handle multiple authors, reference IDs (eg: arXiv:1234.56789), named entities (people, places, things), tags, and internal links. are there other fields that come to mind? is there an application using a frontmatter convention you would like to see in use here?

chrisgrieser commented 3 months ago

tbh, for me personally, I'd need full customization, since I tend to change up things sometimes.

here is for example what MarkDownload does, with {baseURI} etc. being populated with the respective values.

[Image: Pasted image 2024-04-09 at 20 55 43@2x]

inhumantsar commented 3 months ago

yeah that makes sense. i was looking at replacing the hardcoded format with a template at some point anyway.

i see that markdownload extracts all <meta> tags from websites, is that something you've found useful in the past?

chrisgrieser commented 3 months ago

> i see that markdownload extracts all <meta> tags from websites, is that something you've found useful in the past?

Kinda. It does include some information, but as I mentioned, MarkDownload does not include the publication date or the site name, for instance (though it does include the host, which is mostly similar). It also lacks a few quality-of-life features such as removing the "by" in the author-byline, e.g. by Jane Doe.
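
The byline cleanup mentioned here is small enough to sketch. This is only an illustration of the idea, not slurp's implementation, and `cleanByline` is a hypothetical name.

```typescript
// Strip a leading "by" (any casing) from an author byline.
// Requiring whitespace after "by" keeps names such as "Byron" intact.
function cleanByline(byline: string): string {
  return byline.trim().replace(/^by\s+/i, "").trim();
}
```

For example, `cleanByline("by Jane Doe")` gives `"Jane Doe"`, while `cleanByline("Byron Smith")` is returned unchanged.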

inhumantsar commented 3 months ago

seems odd because according to their GitHub, they use the same library as slurp under the hood. I'll look a little more deeply into their codebase and see if they're doing things that I should avoid.

do you have a couple links to pages where that's happened? would be good for testing

chrisgrieser commented 3 months ago

the publication year is missing everywhere, it's simply not available in MarkDownload as a token. Author is missing a lot of places, a simple example could be this article

inhumantsar commented 3 months ago

thanks for the link! that's a great one to know about. what's interesting is that slurp didn't get the author out of that either, even though there is a meta tag for author.
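
A rough sketch of the kind of fallback this implies: if the main parser finds no author, look for a `<meta name="author">` tag. This is an assumption about how it could be done, not a description of slurp or its parsing library. The helper names `attr` and `findMetaAuthor` are hypothetical, and the regex scan only keeps the sketch free of dependencies; a real plugin would more likely query the parsed DOM.

```typescript
// Read one attribute value from a single tag, handling both quote styles.
function attr(tag: string, attrName: string): string | undefined {
  const re = new RegExp("\\s" + attrName + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')", "i");
  const m = re.exec(tag);
  if (!m) return undefined;
  return m[1] !== undefined ? m[1] : m[2];
}

// Return the content of the first <meta name="author"> tag, if any.
function findMetaAuthor(html: string): string | undefined {
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    const metaName = (attr(tag, "name") || "").toLowerCase();
    if (metaName !== "author") continue;
    const content = (attr(tag, "content") || "").trim();
    if (content) return content;
  }
  return undefined;
}
```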

i'm going to open a few issues for supported properties that don't get picked up when they should. if you could throw more links like that verge one into those it would be a huge help.

inhumantsar commented 3 months ago

i may have got a bit sidetracked adding new metadata fields while fixing the old ones...

[image]

inhumantsar commented 2 months ago

decided against the raw template since i can't rule out breaking changes to the available properties in the near future. plus, this way it'll be easier to find out about new properties.

https://github.com/inhumantsar/slurp/assets/494253/609d4e40-bd9f-4607-9526-ef0ba05c1a4c

chrisgrieser commented 2 months ago

looks awesome, makes for a much better UI as well!

A small suggestion: could you add the possibility to add custom properties as well? One use case is to add a property field read: false to all scraped articles, and to toggle the checkbox to true when I get time to actually read them.

inhumantsar commented 2 months ago

yep, that's on the agenda! wanted to nail the existing properties down first.

i've pushed up the initial version of this for testing. i had to abandon the fancy drag and drop functionality as it wouldn't play nice with the rest of the components, but the functionality is all there.

[image]

it would be awesome if you could help test it out by setting up BRAT.

Edit: Forgot to mention that the ordering won't work yet. it will save the ordering you configure but won't actually write new notes with that ordering.
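
As an illustration of what applying the saved ordering could look like, here is a small sketch. The field names (`id`, `key`, `idx`, `enabled`) follow the data.json excerpt posted later in this thread; the function `orderedKeys` and the exact types are assumptions.

```typescript
// Shape of one property entry, modelled on the data.json excerpt in this thread.
interface PropSetting {
  id: string;
  key: string;
  idx: number;
  enabled: boolean;
  format?: string;
}

// Return the frontmatter keys of enabled properties, sorted by their saved idx.
function orderedKeys(settings: { [id: string]: PropSetting }): string[] {
  return Object.keys(settings)
    .map((id) => settings[id])
    .filter((prop) => prop.enabled)
    .sort((a, b) => a.idx - b.idx)
    .map((prop) => prop.key);
}
```

Writing a note would then iterate over `orderedKeys(settings)` instead of over the settings object's own key order.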

chrisgrieser commented 2 months ago

> it would be awesome if you could help test it out by setting up BRAT.

Would love to, but you haven't created a new beta release for BRAT, so it still installs 0.1.4

inhumantsar commented 2 months ago

oops, sorry about that. should be fixed now

chrisgrieser commented 2 months ago

0.1.5b1 throws an error:

Received URL action {url: 'https://www.theverge.com/2024/4/10/24125572/fcc-broadband-nutrition-labels-isp-deadline-today', action: 'slurp'}
plugin:slurp:3915 Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'get')
    at maybe (plugin:slurp:3915:40)
    at SlurpPlugin.createContent (plugin:slurp:3935:18)
    at SlurpPlugin.slurpNewNoteCallback (plugin:slurp:3952:26)
    at async SlurpPlugin.slurp (plugin:slurp:3895:5)
maybe @ plugin:slurp:3915
createContent @ plugin:slurp:3935
slurpNewNoteCallback @ plugin:slurp:3952
await in slurpNewNoteCallback (async)
eval @ plugin:slurp:3786
t @ app.js:1
(anonymous) @ VM283:1
(anonymous) @ node:electron/js2c/renderer_init:2
(anonymous) @ node:electron/js2c/renderer_init:2
emit @ node:events:517
onMessage @ node:electron/js2c/renderer_init:2

inhumantsar commented 2 months ago

ok so i had to shave a lot of yaks along the way but it should be working now.

i made a lot of changes to the settings file format, so you might run into an issue there. a quick sanity check would be to check for slurped in the properties settings. if it's there and things seem to work, then it's probably fine. if slurped isn't there though:

  1. close obsidian
  2. send me a copy of <obsidian dir>/plugins/slurp/data.json
  3. delete that file and re-open obsidian

chrisgrieser commented 2 months ago

Thanks, b2 seems to work now.

With the default settings, the article gets downloaded correctly. However, with some custom settings, the metadata creation seems to fail:


some other issues I noticed:

inhumantsar commented 2 months ago

> the properties do not seem to accept keys with a hyphen in them (e.g. data-type does not work)

hmm yes it seems my validation function is a bit overzealous. i'll rework it. the blocked characters are meant to trigger a validation error only if they're at the start or end of the string.
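
A minimal sketch of the intended rule, for illustration only. The actual list of blocked characters is not given in this thread, so `EDGE_BLOCKED` below is a placeholder set, and `isValidKey` is a hypothetical name.

```typescript
// Placeholder set of characters that may not start or end a property key.
// The real list used by slurp is not stated in this thread.
const EDGE_BLOCKED = "-_:. ";

// A key is valid when it is non-empty and neither its first nor its last
// character is blocked. Blocked characters in the middle are allowed.
function isValidKey(key: string): boolean {
  if (key.length === 0) return false;
  const first = key.charAt(0);
  const last = key.charAt(key.length - 1);
  return EDGE_BLOCKED.indexOf(first) === -1 && EDGE_BLOCKED.indexOf(last) === -1;
}
```

Under this rule `data-type` passes, while `-data` and `data-` are rejected.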

> the dates always print full dates, there is no possibility of formatting them

i've implemented the necessary functions for formatting but it hasn't been added as an option yet. i'm still working out how best to expose that through the settings UI.

so while i haven't tested the formatter thoroughly yet, you can try it out by modifying the data.json file directly for now:

    "publishedTime": {
      "id": "publishedTime",
      "key": "year",
      "idx": 2,
      "format": "d|YYYY-MM-DDTHH:mm",
      "enabled": true
    },
    "modifiedTime": {
      "id": "modifiedTime",
      "key": "updated",
      "idx": 3,
      "format": "d|YYYY-MM-DDTHH:mm",
      "enabled": false
    },

the d| tells slurp to pass it through the date formatter, and the format after is the usual date format syntax used in moment. eg: if you wanted the date to show up as "Wednesday, April 10th 2024 at 1:55pm", you could change d|YYYY-MM-DDTHH:mm to d|dddd, MMMM Do YYYY [at] h:mma (i just made up that format string off the top of my head, it may not work as-is).
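
To make the `d|` convention concrete, here is a dependency-free sketch of how such a prefix could be dispatched. It is a guess at the shape, not slurp's code: `splitFormat`, `applyDateFormat`, and the `DateFormatter` type are hypothetical, and the date library is injected as a function so that the sketch runs without moment installed.

```typescript
// Whatever does the real formatting work (moment, per the comment above).
type DateFormatter = (date: Date, pattern: string) => string;

// Split "d|YYYY-MM-DD" into its prefix ("d") and pattern ("YYYY-MM-DD").
function splitFormat(format: string): { kind: string; pattern: string } {
  const sep = format.indexOf("|");
  if (sep === -1) return { kind: "", pattern: format };
  return { kind: format.slice(0, sep), pattern: format.slice(sep + 1) };
}

// Apply a "d|..." format to a raw date string; anything else is left untouched.
function applyDateFormat(value: string, format: string, formatDate: DateFormatter): string {
  const { kind, pattern } = splitFormat(format);
  if (kind !== "d") return value;
  const date = new Date(value);
  if (isNaN(date.getTime())) return value; // unparseable input: keep the raw value
  return formatDate(date, pattern);
}
```

With moment available, the injected function could be `(d, p) => moment(d).format(p)`.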

worth noting that all string properties can be formatted using the same syntax as well. it's already being used for tags. the format string just needs to start with s| and slurp will replace all instances of {s} in the format string with the property value. this is being used at the moment to convert Twitter usernames into links: s|https://twitter.com/{s}

S| can also be used to replace multiple different placeholders. eg: for tags, S|{prefix}/{tag} is fed into the formatter along with an object {prefix: <tag prefix setting>, tag: <a keyword pulled from site metadata>}. this likely won't be exposed as an option though since it would be very difficult to handle those placeholders automatically and predictably, but it will get used for supported properties.
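
The two string prefixes described above can be sketched as a single function. This is a reading of the description, not the plugin's actual formatter; `formatString` and `Placeholders` are hypothetical names, and leaving unknown placeholders untouched is an assumption.

```typescript
type Placeholders = { [name: string]: string };

// "s|template" replaces every {s} with a single value.
// "S|template" replaces each {name} with the matching entry of an object.
function formatString(format: string, value: string | Placeholders): string {
  const sep = format.indexOf("|");
  const kind = sep === -1 ? "" : format.slice(0, sep);
  const template = format.slice(sep + 1);

  if (kind === "s" && typeof value === "string") {
    const text: string = value;
    // A replacer function avoids special handling of "$" in the value.
    return template.replace(/\{s\}/g, () => text);
  }
  if (kind === "S" && typeof value !== "string") {
    const values: Placeholders = value;
    return template.replace(/\{(\w+)\}/g, (whole: string, name: string) =>
      Object.prototype.hasOwnProperty.call(values, name) ? String(values[name]) : whole
    );
  }
  // Unknown prefix or mismatched value type: return the input unformatted.
  return typeof value === "string" ? value : JSON.stringify(value);
}
```

For example, `formatString("s|https://twitter.com/{s}", "verge")` gives `https://twitter.com/verge`, and `formatString("S|{prefix}/{tag}", { prefix: "slurp", tag: "broadband" })` gives `slurp/broadband`.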

> There is no title property

the page title is the parsed title anyway, so i didn't see much of a point to duplicating it in the note properties. is there a use-case you have in mind for that?

> there is no option to add a custom property ... though I assume you simply haven't implemented that yet

that is indeed the case :) most of the time i put in this weekend was to ensure that the core properties and custom properties would play nicely together.

there's still more work to do on that front. in particular i need to ensure that settings will be gracefully migrated between plugin updates. right now slurp will just try to slam whatever is there together with the core options without checking for incompatibilities first.

this will likely get done before i do the custom date format options.
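
One possible shape for that kind of graceful merge, sketched under assumptions: core defaults are overlaid with saved values, and saved custom properties are only added when their key does not collide with one already in use. `mergeSettings` and the `PropSetting` type are hypothetical, with field names borrowed from the data.json excerpt above; this sketch only detects collisions between custom and already-merged properties.

```typescript
interface PropSetting {
  id: string;
  key: string;
  idx: number;
  enabled: boolean;
  format?: string;
}
type PropSettings = { [id: string]: PropSetting };

function mergeSettings(
  defaults: PropSettings,
  saved: PropSettings
): { merged: PropSettings; conflicts: string[] } {
  const merged: PropSettings = {};
  const conflicts: string[] = [];
  const claimedKeys = new Set<string>();

  // Core properties: start from the defaults and let saved values override them.
  for (const id of Object.keys(defaults)) {
    merged[id] = { ...defaults[id], ...(saved[id] || {}) };
    claimedKeys.add(merged[id].key);
  }
  // Custom properties: keep them unless their key is already taken.
  for (const id of Object.keys(saved)) {
    if (Object.prototype.hasOwnProperty.call(merged, id)) continue;
    if (claimedKeys.has(saved[id].key)) {
      conflicts.push(id); // report instead of silently overwriting
      continue;
    }
    merged[id] = { ...saved[id] };
    claimedKeys.add(saved[id].key);
  }
  return { merged, conflicts };
}
```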

> the small animation when moving properties up/down is cute, but I feel like they might annoy some users? (I personally have no issue with them though :) )

anyone who doesn't like it can deal 😉 it's a pretty minor thing and shouldn't interfere with anything else. i found that, without an animation of some kind, i sometimes didn't notice that the items changed order so i'd click the button again only to accidentally flip the ordering back. the animation does seem to be pretty inefficient compute-wise though, so i might simplify it in the future.

inhumantsar commented 2 months ago

also, regarding the undefined you're seeing. would you be willing to reproduce and post your console output? Ctrl+Shift+I will open the console on Windows. as you might have noticed, i started adding a "debug" option in the settings but it's not wired up to anything yet, so unfortunately it has to be a manual copy paste job for now.

i'll try to reproduce on my end as well later tonight, but it would be good to have your logs too for comparison.

edit: i couldn't help myself and ended up reproducing it just now. looks like when there are no tags, slurp doesn't get rid of the set object it uses to store them and js-yaml doesn't like that. should be an easy fix. i'll push that tonight and let you know when it's ready.
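
For illustration, the kind of cleanup described here could look like the sketch below: convert Set values to plain arrays before handing the object to the YAML serializer, and drop the empty ones. `normalizeForYaml` is a hypothetical name, and dropping empty sets (rather than writing an empty list) is an assumption, not the fix that was actually pushed.

```typescript
// Replace Set values with arrays and omit empty sets, so the result contains
// only plain values that a YAML serializer can handle.
function normalizeForYaml(props: { [key: string]: unknown }): { [key: string]: unknown } {
  const out: { [key: string]: unknown } = {};
  for (const key of Object.keys(props)) {
    const value = props[key];
    if (value instanceof Set) {
      if (value.size > 0) out[key] = Array.from(value);
      continue; // empty sets are dropped entirely
    }
    out[key] = value;
  }
  return out;
}
```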

thanks again for your help testing this by the way! i promise i'll add some automated tests soon so the easy stuff won't have to be caught by users like you 😄

inhumantsar commented 2 months ago

hah ok so don't worry about reproducing it!

i fixed the issue. it was actually two issues: one was the empty set breaking the YAML parser, and the other was that disabled properties were forcing an early exit from the metadata parsing function entirely. in testing the fix i also noticed a third issue: disabled properties were re-enabling themselves.

i've pushed those fixes up. a new v0.1.5b3 release will be available shortly.

chrisgrieser commented 2 months ago

Can confirm, b3 fixes the undefined issue. With the other information / fixes, that leaves only these issues / todos:


> the page title is the parsed title anyway, so i didn't see much of a point to duplicating it in the note properties. is there a use-case you have in mind for that?

There are multiple reasons for a title property:

  1. Filenames have various restrictions when it comes to special characters (:, /), and also an os-dependent maximum length. Thus, long titles and titles with special characters cannot be correctly reflected in the file name.
  2. You might want to change filenames for various reasons, while preserving the title information
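
A short sketch of the first point: the filename has to be a lossy version of the title, so the untouched title is only preserved if it is also stored as a property. `toFileName` and the length limit of 40 are illustrative choices, not slurp's behaviour.

```typescript
// Make a title safe to use as a filename: replace characters that common
// filesystems reject, collapse whitespace, and cut to a maximum length.
function toFileName(title: string, maxLength: number): string {
  const cleaned = title
    .replace(/[\\\/:*?"<>|]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  return cleaned.slice(0, maxLength).trim();
}

const originalTitle = 'Why "x/y" matters: a review';
const note = {
  fileName: toFileName(originalTitle, 40), // lossy
  title: originalTitle,                    // kept verbatim as a property
};
```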

inhumantsar commented 2 months ago

alright so i went through and added all of those.

in classic form, i broke some of it while working on state management at the same time. 🙃

enabling/disabling and deletion seem to be affected. adding new fields, adjusting formats, and changing keys should all still be working though.

overhauled the UI to be more obsidian-like as well. new beta should find its way to your machine shortly.

chrisgrieser commented 2 months ago

BRAT complains, since there is only a tagged commit for b4, but no pre-release yet

inhumantsar commented 2 months ago

just added it. obsidian's build workflow was complaining and it was too late at night to troubleshoot it

chrisgrieser commented 2 months ago

Thanks! the UI is a nice idea. Some issues I've noticed:

inhumantsar commented 2 months ago

I have a fix for the issue with Enabled already, should be able to push it up along with a couple other changes later today.

The data type thing is annoying. I switched to using a yaml parser as part of this and I'm not really a fan of how it wraps everything in quotes like that. YAML doesn't require that and it makes handling things like boolean values particularly tedious.

It does offer a lot of options at least, so I'm hoping I'll be able to configure away that behaviour.
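
Independent of how the YAML library is configured, one way to keep booleans from being quoted is to convert them to real boolean values before serialization. The sketch below assumes a `b|` prefix on custom property defaults, which is mentioned in the next comment of this thread; its exact semantics and the name `parseDefault` are assumptions.

```typescript
// Turn "b|true" / "b|false" into a real boolean so a YAML serializer writes
// it unquoted. Any other input is returned as a plain string.
function parseDefault(raw: string): string | boolean {
  if (raw.slice(0, 2) === "b|") {
    return raw.slice(2).trim().toLowerCase() === "true";
  }
  return raw;
}
```

With this, a custom property such as `read` with default `b|false` would be stored as the boolean `false` rather than the string `"false"`.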

inhumantsar commented 2 months ago

alright, things are looking better now. keep your eyes peeled for a new release.

chrisgrieser commented 2 months ago

Okay, checked out b6, custom properties & b|false work nicely. New issues I noticed:

inhumantsar commented 2 months ago

ok! those should be fixed in 0.1.5b7.

thanks for all your help on this!

chrisgrieser commented 2 months ago

Thank you! At long last, it seems everything works now – cannot find any issues anymore 🥳

Will migrate my setup to Slurp now. Maybe there will be some minor leftover issues I'll find in daily use, but I guess that would be a new GitHub issue.