inhumantsar / slurp

Slurps webpages and saves them as clean, uncluttered Markdown. Think Pocket, but better.
https://inhumantsar.github.io/slurp/
MIT License

Unsupported language/site (vk.com) #56

Open Dzhuks opened 3 months ago

Dzhuks commented 3 months ago

I encountered an issue while trying to extract text from articles on a Russian social media site, VK. The articles on VK were not processed correctly: the Russian text came out garbled and unrecognizable. You can see an example of this issue in this article: Escape from Google Translate.

[Screenshots: Slurp output | Original page]

Initially, I suspected that the problem was due to the Russian language itself. However, I tested the extraction process on an article from a Russian news site, and it worked perfectly. Here's an example article that was processed correctly: How to Transfer Money to Kazakhstan from Russia in 2023-2024.

[Screenshots: Slurp output | Original page]

This indicates that the issue is specific to the VK platform rather than the Russian language as a whole.

inhumantsar commented 3 months ago

interesting! thanks for digging into whether it was a language or site issue. it is strange that VK produces that kind of garbled text, since that would usually indicate an unsupported encoding. it's unlikely they would be using an older standard, like ISO or Windows-1251.
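To illustrate what an encoding mismatch of this kind looks like, here is a minimal Python sketch. It is not Slurp's code and the sample string is hypothetical rather than taken from the VK article. It assumes a page served as Windows-1251 and shows how the same bytes read when decoded with the wrong charset versus the right one.

```python
# Hypothetical sample text; not taken from the VK article in the report.
text = "Привет"

# The bytes a server would send if the page were encoded as Windows-1251.
raw = text.encode("cp1251")

# Wrong charset, strict format: these bytes are not valid UTF-8, so the
# undecodable bytes are swapped for U+FFFD (the replacement character).
wrong_utf8 = raw.decode("utf-8", errors="replace")

# Wrong charset, permissive format: Latin-1 maps every byte to a character,
# so decoding "succeeds" but produces accented Latin letters.
wrong_latin1 = raw.decode("latin-1")

# Correct charset: the original text is recovered.
right = raw.decode("cp1251")

print(wrong_utf8)
print(wrong_latin1)
print(right)
```

If the garbled output in a report resembles either of the first two lines, a charset mismatch is a plausible suspect; if it looks different, the cause is probably elsewhere.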

I've got a few bugs like this queued up and will hopefully be writing fixes for them in the next week or two. I'll look for a cause this morning tho and will comment if I find it.

thanks for the report!

inhumantsar commented 3 months ago

no obvious cause but Firefox's reader view displays the page correctly, so it is likely something to do with Slurp or Obsidian

inhumantsar commented 3 months ago

interestingly, it slurped fine on my android device.

Screenshot_20240830-093457.png

can you provide some detail on your setup? OS version and Obsidian version especially

Dzhuks commented 2 months ago

OS version: Windows 11
Obsidian: 1.67
Slurp: 0.1.12