snarfed / bridgy

📣 Connects your web site to social media. Likes, retweets, mentions, cross-posting, and more...
https://brid.gy
Creative Commons Zero v1.0 Universal

extracting non-content links from wordpress.com blog posts #207

Closed snarfed closed 10 years ago

snarfed commented 10 years ago

looks like either we're overly aggressive or superfeedr is including non-content links (e.g. tags, prev/next, feeds, etc) in its content and summary fields, and we extract every link from those fields.
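
To make the failure mode concrete, here is a minimal sketch of "extract every link from the content HTML" using only the Python standard library. It is not Bridgy's actual code: `LinkExtractor`, `extract_links`, and `SAMPLE_CONTENT` are hypothetical names, and the sample is a shortened version of the feed entry quoted in this issue (first two links only).

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag; it cannot tell post content from page chrome."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all <a href> values found in an HTML fragment, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Shortened sample modeled on the "content" field quoted in this issue.
SAMPLE_CONTENT = (
    "<br />Filed under: "
    "<a href='http://likeiwassayingblog.com/category/blurts/'>Blurts</a> "
    '<a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/'
    'likeiwassayingblog.wordpress.com/1495/"><img alt="" border="0" '
    'src="http://feeds.wordpress.com/1.0/comments/'
    'likeiwassayingblog.wordpress.com/1495/" /></a>'
)

links = extract_links(SAMPLE_CONTENT)
```

Both links come back even though neither is part of the post body: one is a category link and the other is a share/comment widget.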

example for http://likeiwassayingblog.com/2014/06/27/im-making-a-list-of-things-i-need-to-do-before-julie-gets-home-its-extraordinary-how-many-items-contain-the-words-clean-and-cat/ (from https://www.brid.gy/wordpress/likeiwassayingblog.wordpress.com#blogposts ):

"content" : "<br />Filed under: <a href='http://likeiwassayingblog.com/category/blurts/'>Blurts</a> <a rel=\"nofollow\" href=\"http://feeds.wordpress.com/1.0/gocomments/likeiwassayingblog.wordpress.com/1495/\"><img alt=\"\" border=\"0\" src=\"http://feeds.wordpress.com/1.0/comments/likeiwassayingblog.wordpress.com/1495/\" /></a> <a rel=\"nofollow\" href=\"http://feeds.wordpress.com/1.0/godelicious/likeiwassayingblog.wordpress.com/1495/\"><img alt=\"\" border=\"0\" src=\"http://feeds.wordpress.com/1.0/delicious/likeiwassayingblog.wordpress.com/1495/\" /></a> <a rel=\"nofollow\" href=\"http://feeds.wordpress.com/1.0/gofacebook/likeiwassayingblog.wordpress.com/1495/\"><img alt=\"\" border=\"0\" src=\"http://feeds.wordpress.com/1.0/facebook/likeiwassayingblog.wordpress.com/1495/\" /></a> <a rel=\"nofollow\" href=\"http://feeds.wordpress.com/1.0/gotwitter/likeiwassayingblog.wordpress.com/1495/\"><img alt=\"\" border=\"0\" src=\"http://feeds.wordpress.com/1.0/twitter/likeiwassayingblog.wordpress.com/1495/\" /></a> <a rel=\"nofollow\" href=\"http://feeds.wordpress.com/1.0/gostumble/likeiwassayingblog.wordpress.com/1495/\"><img alt=\"\" border=\"0\" src=\"http://feeds.wordpress.com/1.0/stumble/likeiwassayingblog.wordpress.com/1495/\" /></a> <a rel=\"nofollow\" href=\"http://feeds.wordpress.com/1.0/godigg/likeiwassayingblog.wordpress.com/1495/\"><img alt=\"\" border=\"0\" src=\"http://feeds.wordpress.com/1.0/digg/likeiwassayingblog.wordpress.com/1495/\" /></a> <a rel=\"nofollow\" href=\"http://feeds.wordpress.com/1.0/goreddit/likeiwassayingblog.wordpress.com/1495/\"><img alt=\"\" border=\"0\" src=\"http://feeds.wordpress.com/1.0/reddit/likeiwassayingblog.wordpress.com/1495/\" /></a> <img alt=\"\" border=\"0\" src=\"http://stats.wordpress.com/b.gif?host=likeiwassayingblog.com&#038;blog=67615833&#038;post=1495&#038;subd=likeiwassayingblog&#038;ref=&#038;feed=1\" width=\"1\" height=\"1\" />",
"summary" : "Filed under: Blurts<img alt=\"\" border=\"0\" src=\"http://stats.wordpress.com/b.gif?host=likeiwassayingblog.com&#038;blog=67615833&#038;post=1495&#038;subd=likeiwassayingblog&#038;ref=&#038;feed=1\" width=\"1\" height=\"1\" />",
kylewm commented 10 years ago

This is a "known issue" with mf2py backcompat and WordPress.com blogs. They mark up the whole center part of the page with entry-content... When that gets mapped to e-content, it includes way more than e-content is supposed to include.

Another example of this is when I tried to reply to my test wordpress.com blog: https://kylewm.com/reply/2014/05/05/1

snarfed commented 10 years ago

interesting parallel! the content and summary fields are coming from superfeedr here, but it sounds like wordpress.com's overly broad entry-content may be causing the same problem for both superfeedr and mf2py.

snarfed commented 10 years ago

reopening due to #213. the current fix assumes categories and tags are at the bottom, but they can be at the top too. e.g. http://eflnotes.wordpress.com/2014/01/17/corpus-linguistics-community-news-2/ for user https://www.brid.gy/wordpress/eflnotes.wordpress.com

instead of filtering the content HTML, i'm going to just discard all webmentions for "self links" to the user's domain.
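
A rough sketch of that "discard self links" idea, assuming the user's domains are known up front. This is not the actual Bridgy implementation: `normalize_host`, `is_self_link`, `webmention_targets`, and the treatment of `www.` and subdomains are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def normalize_host(host):
    """Lowercase a hostname and drop a leading 'www.' (assumed normalization)."""
    host = (host or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_self_link(link, user_domains):
    """True if the link points at one of the user's own domains or a subdomain of one."""
    host = normalize_host(urlparse(link).hostname)
    return any(
        host == domain or host.endswith("." + domain)
        for domain in map(normalize_host, user_domains)
    )


def webmention_targets(links, user_domains):
    """Keep only the links that are not self links, preserving order."""
    return [link for link in links if not is_self_link(link, user_domains)]


# Hypothetical inputs based on the first example in this issue.
domains = ["likeiwassayingblog.com", "likeiwassayingblog.wordpress.com"]
candidates = [
    "http://likeiwassayingblog.com/category/blurts/",
    "http://feeds.wordpress.com/1.0/gocomments/likeiwassayingblog.wordpress.com/1495/",
    "http://example.com/some-post",
]
targets = webmention_targets(candidates, domains)
```

With these inputs the category link is dropped regardless of whether it appears at the top or bottom of the post. The `feeds.wordpress.com` share link is on a different host, so a domain check like this one would not remove it by itself.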

snarfed commented 10 years ago

seeing this again for https://www.brid.gy/wordpress/peterccook.com . reopening.

snarfed commented 10 years ago

false alarm. his posts just tend to have a lot of links.