hpc-social / blog

Community syndicated blog for hpc.social! 🗒️
https://hpc.social/blog
MIT License

<a name="more"> appears to stop markdownification when feed parsing #7

Closed glennklockwood closed 1 year ago

glennklockwood commented 1 year ago

Blogger uses <a name="more"></a> to separate the part of blog posts that should be shown "above the fold" of the landing page. When I try to run my blog's RSS through the blog syndicator though, everything past the <a name="more"> is no longer converted into markdown and is spit out as escaped HTML.

I tried to visually inspect the RSS feed coming out of blogger and its contents look the same above and below this <a name="more"> divider, so I think something is going wrong upstream of generate_posts.py (like feedparser?). Any ideas? I couldn't find any obvious causes.
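For anyone hitting the same thing: one possible workaround is to drop the marker from the feed HTML before it is converted to markdown. The root cause is not established at this point in the thread, so this is only a minimal sketch. The `strip_more_anchor` helper is a hypothetical name, not something in generate_posts.py, and the assumption is that the marker always appears as an empty `<a name="more">` anchor.

```python
import re

# Matches Blogger's jump-break marker, e.g. <a name="more"></a> or a bare
# <a name="more">. The closing tag is treated as optional (an assumption).
MORE_ANCHOR = re.compile(r"<a\s+name=[\"']more[\"']\s*>(\s*</a>)?", re.IGNORECASE)


def strip_more_anchor(html: str) -> str:
    """Return the HTML with any Blogger jump-break anchors removed."""
    return MORE_ANCHOR.sub("", html)


sample = '<p>Above the fold.</p><a name="more"></a><p>Below the fold.</p>'
print(strip_more_anchor(sample))
```

A regular expression is enough here because the marker is a fixed, attribute-only tag; anything more general would be better handled by a real HTML parser.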

vsoch commented 1 year ago

Let me do a quick manual debug to see what's going on - back in a bit!

vsoch commented 1 year ago

Looks like we will also want some cleanup of the markdown name - blogger produces an interesting path!

blog/_posts/glennklockwood/2022-11-24-tag:blogger.com,1999:blog-4307061427721284246.post-2068110509046297403.md

vsoch commented 1 year ago

Okay, does this reproduce what you're seeing? E.g., it looks OK at first, but then there's a big block of HTML?

image

vsoch commented 1 year ago

I'm going to also do a nice little refactor to quickly show the author tag image

vsoch commented 1 year ago

Don't worry, working on your bug now!

vsoch commented 1 year ago

Okay, got that fixed. New bug! It looks like Blogger prevents other sites from embedding the images it hosts:

image

Need to think about a way around this.

vsoch commented 1 year ago

This looks like a known (intentional) restriction that Google is aware of: https://support.google.com/blogger/thread/133238986/image-url-from-blogger-googleusercontent-com-is-not-accepted-by-other-websites-if-i-want-to-insert-m?hl=en

I think for now I'm going to try to filter out these images, and perhaps with a later update we can find an efficient way to fetch and store them. I don't want to start with that because it will take up space very quickly.
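A rough sketch of what filtering those images could look like, assuming the blocked ones can be recognized by a `googleusercontent.com` host in the `src` attribute. The `drop_blocked_images` name is hypothetical, and this is not necessarily how the syndicator does it.

```python
import re

# Matches an <img> tag whose src attribute points at a googleusercontent.com
# host. Regex-based HTML handling is fragile; this is illustration only.
BLOCKED_IMG = re.compile(
    r"<img\b[^>]*\bsrc=[\"'][^\"']*googleusercontent\.com[^\"']*[\"'][^>]*>",
    re.IGNORECASE,
)


def drop_blocked_images(html: str) -> str:
    """Remove <img> tags that reference Blogger-hosted images."""
    return BLOCKED_IMG.sub("", html)


sample = (
    '<p>Text</p>'
    '<img src="https://blogger.googleusercontent.com/img/a.png" alt="x"/>'
    '<img src="https://example.com/b.png">'
)
print(drop_blocked_images(sample))
```

Images hosted elsewhere are left alone, so only the ones that would render as broken embeds are dropped.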

vsoch commented 1 year ago

Okay, I've pushed a fix to get you on the map! It includes (for the time being) removing these images that we aren't allowed to embed. If you want to discuss a different parsing strategy, please open an issue! The PR also added the nice tags, and better handles the markdown file name and the resulting permalink URL.
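To illustrate the file name cleanup, here is a minimal sketch that builds the name from the post date and a slug of the title rather than from the Blogger entry id. The thread does not show the PR's actual approach, so `slugify` and `post_filename` are hypothetical helpers, and the title used below is an assumed example.

```python
import re


def slugify(text: str) -> str:
    """Lowercase text and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def post_filename(date: str, title: str) -> str:
    """Build a Jekyll-style post file name: YYYY-MM-DD-slug.md."""
    return f"{date}-{slugify(title)}.md"


# The title is an assumption made for illustration.
print(post_filename("2022-11-24", "SC'22 Recap"))
```

Slugging the title keeps characters like `:` and `,` from the Blogger id out of the path.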

glennklockwood commented 1 year ago

Thanks for getting this fixed so quickly! Doesn't seem like many people use Blogger anymore so I appreciate you getting this to work.

It looks like some residual Python crept into the rendered output of https://hpc.social/blog/2022/sc-22-recap/:

Screenshot 2022-11-28 at 21 33 02

Not the end of the world; just fyi.

vsoch commented 1 year ago

haha no you are spot on, I caught that too (just pushed a fix!) I needed to stringify the soup instead of returning renderedContent.

vsoch commented 1 year ago

should be less terrible now 😆

image