inhumantsar / slurp

Slurps webpages and saves them as clean, uncluttered Markdown. Think Pocket, but better.
https://inhumantsar.github.io/slurp/
MIT License

Wrong tables, links and image import from Wikipedia article #34

Closed · neoromantic closed this 1 month ago

neoromantic commented 1 month ago

I've tried Slurp on a random Wikipedia article I'm interested in.

It mostly worked, and the result is somewhat usable, but it immediately showed a number of problems.

Since Wikipedia is probably one of the more prominent and important use cases, I guess it would be nice to fix these.

All examples are based on this article: https://ru.wikipedia.org/wiki/%D0%A2%D0%B5%D1%81%D1%82_%D0%9A%D1%83%D0%BF%D0%B5%D1%80%D0%B0

Tables

Tables are broken. Here's the original, how it looks in Obsidian, and the code I got.

(screenshots: the original table on Wikipedia vs. how it renders in Obsidian)
```
|     |     |     |     |     |     |     |
| --- | --- | --- | --- | --- | --- | --- |
|     |     |     |     |     |     |     |
12-минутный тест езды на велосипеде(из книги Купера «Аэробика для хорошего самочувствия»)

|Возраст|Пол|Отлично|Хорошо|Средне|Плохо|Очень плохо|
|13—19|M|>9200 м|7600—9200 м|6000—7500 м|4200—6000 м|<4200 м|
|Ж|>7600 м|6000—7600 м|4200—6000 м|2800—4200 м|<2800 м|
|20—29|M|>8800 м|7200—8800 м|5600—7100 м|4000—5500 м|<4000 м|
|Ж|>7200 м|5600—7200 м|4000—5500 м|2400—4000 м|<2400 м|
|30—39|M|>8400 м|6800—8400 м|5200—6700 м|3600—5100 м|<3600 м|
|Ж|>6800м|5200—6800 м|3600—5200 м|2000—3500 м|<2000 м|
|40—49|M|>8000 м|6400—8000 м|4800—6400 м|3200—4800 м|<3200 м|
|Ж|>6400 м|4800—6400 м|3200—4800 м|1600—3200 м|<1600 м|
|50—59|M|>7200 м|5500—7200 м|4000—5500 м|2800—4000 м|<2800 м|
|Ж|>5600 м|4000—5600 м|2400—4000 м|1200—2400 м|<1200 м|
|60+|M|>6400 м|4800—6400 м|3600—4700 м|2800—3500 м|<2800 м|
|Ж|>4800 м|3200—4800 м|2000—3200 м|1200—200 м|<1200 м|
```

I guess the problem here is the table caption in Wikipedia, which breaks the parsing.

Also, this table has cells that span two rows, which is probably complicated to convert too, but that's not the reason the table is broken.
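
If that guess is right, the breakage should reproduce with the standalone Turndown packages, assuming they behave like the copy Obsidian bundles (which we can't verify). A minimal, unverified sketch:

```ts
// Hypothetical reproduction, not Slurp's code: run a captioned table
// (the structure Wikipedia uses) through standalone Turndown + the GFM
// plugin and see what the table conversion does with the <caption>.
import TurndownService from 'turndown';
import { gfm } from 'turndown-plugin-gfm';

const td = new TurndownService();
td.use(gfm);

const html = `
<table>
  <caption>12-minute cycling test</caption>
  <tr><th>Age</th><th>Sex</th><th>Excellent</th></tr>
  <tr><td>13-19</td><td>M</td><td>&gt;9200 m</td></tr>
</table>`;

console.log(td.turndown(html));
```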

Links with footnotes

Links that contain footnotes (very common in Wikipedia) look bad:

(screenshots: the original paragraph on Wikipedia vs. how it renders in Obsidian)
```
Кеннет Купер создал более 30 подобных тестов, однако именно этот широко используется в профессиональном спорте (например, [футболе](https://ru.wikipedia.org/wiki/%D0%A4%D1%83%D1%82%D0%B1%D0%BE%D0%BB "Футбол")[[1]](https://ru.wikipedia.org/wiki/%D0%A2%D0%B5%D1%81%D1%82_%D0%9A%D1%83%D0%BF%D0%B5%D1%80%D0%B0#cite_note-ReferenceA-1)).
```

Formulas look bad

Not sure if it's a local issue on my end, but the LaTeX formula looks bad (too big, black on black):

(screenshots: the formula on Wikipedia vs. how it renders in Obsidian)
```
![{\displaystyle \mathrm {VO_{2}\;max} ={d_{12}-504.9 \over 44.73}}](https://wikimedia.org/api/rest_v1/media/math/render/svg/f1bbad17b44d494864f74e02217ced4562645be6)
```
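
For reference, the formula recoverable from that image's alt text is the Cooper test VO2 max estimate, where $d_{12}$ is the distance in metres covered in 12 minutes:

$$\mathrm{VO_{2}\,max} = \frac{d_{12} - 504.9}{44.73}$$
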
inhumantsar commented 1 month ago

thanks for the report! i did some troubleshooting and i think i know where the issue is. unfortunately it's not really a Slurp issue specifically, so it won't be a quick fix.

for background, Slurp uses Mozilla's Readability library to simplify pages and then relies on Obsidian's HTML to Markdown function to convert the simplified page. Obsidian in turn relies on Turndown for the conversion process.
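
roughly, that pipeline looks like the sketch below (illustrative, not Slurp's actual code; `slurpPage` is a made-up name):

```ts
// illustrative sketch of the pipeline described above, not Slurp's code.
// Readability simplifies the page; Obsidian's htmlToMarkdown (Turndown
// under the hood) converts the simplified HTML to Markdown.
import { Readability } from '@mozilla/readability';
import { htmlToMarkdown, requestUrl } from 'obsidian';

async function slurpPage(url: string): Promise<string> {
  const html = (await requestUrl({ url })).text;
  const doc = new DOMParser().parseFromString(html, 'text/html');
  const article = new Readability(doc).parse();
  if (!article || !article.content) {
    throw new Error('Readability could not parse the page');
  }
  return htmlToMarkdown(article.content); // this is where Turndown runs
}
```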

when i run that article through Readability, the tables, links, and images all come out fine (aside from the black-image-on-dark-mode issue).

(screenshots: Readability's output, with the article's tables, links, and images all rendering correctly)

so the issue lies in the way Obsidian+Turndown handles the HTML to Markdown conversion. Turndown does see some active development, though not a lot, relatively speaking. on top of that, Obsidian isn't open source, so there's no way to know which version of Turndown it uses or when the company might update it. it's also possible that they apply some of their own internal magic on top of Turndown, which could complicate things further.

that said, i have spent some time thinking about workarounds, as there have been a few issues like these that i've noticed as well. i'll create a few new enhancement issues to track those separately. they'll show up here as references in case you'd like to subscribe to them and track my progress. it might be a while before they show up in a release though. they'll be pretty hacky, and that means they're likely to come with knock-on effects that would also have to be worked around.
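
for a sense of what "hacky" could mean here, one possible shape (purely hypothetical; `preprocess` is a made-up helper and nothing like this is committed): massage Readability's output before Obsidian/Turndown ever sees it.

```ts
// hypothetical pre-processing pass over Readability's output
function preprocess(doc: Document): Document {
  // hoist table captions out of the <table> so the table conversion
  // can't trip over them (Wikipedia puts titles in <caption> elements)
  doc.querySelectorAll('table > caption').forEach((cap) => {
    const p = doc.createElement('p');
    p.textContent = cap.textContent?.trim() ?? '';
    cap.parentElement?.before(p); // insert the title before the table
    cap.remove();
  });
  // drop Wikipedia's footnote markers ([1], [2], ...) so links don't
  // come out wrapped in awkward nested reference links
  doc.querySelectorAll('sup.reference').forEach((sup) => sup.remove());
  return doc;
}
```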

neoromantic commented 1 month ago

@inhumantsar thank you for taking such a deep dive! I hope it'll get better eventually.

Have you considered using LLMs for Slurp? I mean, they excel at converting semantically rich documents between formats.

inhumantsar commented 1 month ago

I have indeed! I'm planning to spend some time experimenting with getting Phi-3 to run on-device.

neoromantic commented 1 month ago

> I have indeed! I'm planning to spend some time experimenting with getting Phi-3 to run on-device.

I guess the go-to solution (as many do) would be to support some kind of standard: Ollama, LM Studio, and remote APIs as well.

So you don't have to think about the model itself; it's either an Ollama/LM Studio server running locally (which everyone should have by now) or an OpenAI-compatible external API.
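
For concreteness, "OpenAI-compatible" in practice means the same chat-completions request works against OpenAI, Ollama (http://localhost:11434/v1), or LM Studio (http://localhost:1234/v1) just by swapping the base URL. A sketch, with a made-up function name and prompt:

```ts
// illustrative only: one request shape, many backends
async function markdownify(baseUrl: string, model: string, html: string): Promise<string> {
  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      messages: [
        { role: 'user', content: `Convert this HTML to clean Markdown:\n\n${html}` },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// e.g. markdownify('http://localhost:11434/v1', 'llama3', html) for Ollama
```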

inhumantsar commented 1 month ago

i won't be supporting external LLM APIs here for a few reasons.

inconsistency and support are big ones. i have no desire to spend my limited available time dealing with GitHub issues like "slurp isn't talking to my local ollama server" or "i'm not getting good results in slurp from `<model>`". just because models might be served by an API which uses the same spec as OpenAI's doesn't mean they're going to respond in the same way. the process of simplifying and markdownifying is inconsistent enough as it is; if i introduce LLM support, it will need to be consistent and testable.

similarly...

> an Ollama/LM Studio server running locally (which everyone should have by now)

unless they're using obsidian on their phones, or they're not meganerds who self-host everything. hell, i am a meganerd who self-hosts everything and i still don't bother running a local LLM.

> an OpenAI-compatible external API

these aren't free. considering the sheer volume of tokens that any given article contains (10,000+ words is not uncommon, particularly for academic papers), it would be an easy way to rack up some pretty obscene costs pretty quickly. again, not the sort of thing i want to spend my limited time on.
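
(back-of-envelope, using the rough 1.3-tokens-per-word rule of thumb for English: a 10,000-word article is ~13,000 input tokens, and the converted Markdown comes back as a similar number of output tokens, so a single slurp can easily burn 25k+ tokens.)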

knowing what model is running, what it's capable of, and what kinds of limitations it has are all absolutely the sort of things i want to be thinking about. it's the best way to ensure that the largest number of people are getting the most reliable and consistent user experience possible.

besides, it'd be more fun that way.