pfefferle / wordpress-webmention

A Webmention plugin for WordPress
https://wordpress.org/plugins/webmention/
MIT License

Are source URLs getting incorrectly URL-decoded? #359

Open snarfed opened 1 year ago

snarfed commented 1 year ago

Hi @dshanske @pfefferle! I'm seeing an odd issue with source URLs with URL-encoded # characters, eg https://fed.brid.gy/render?id=https%3A%2F%2Findieweb.social%2Fusers%2Fsnarfed%23likes%2F709275 . That page has a u-like-of with a full p-author h-card, with name and photo, but when WordPress receives it as a webmention source, Semantic-Linkbacks doesn't find that author at all.

However, if I double-URL-encode the # character, ie https://fed.brid.gy/render?id=https%3A%2F%2Findieweb.social%2Fusers%2Fsnarfed%2523likes%2F709275 , the webmention works fine and correctly shows the author name and image.

I know URLs with #s are awkward, even when URL-encoded, but the first source URL is working ok with other wm receivers, eg https://www.jvt.me/week-notes/2023/09/ (scroll down and expand Interactions with this post), so I suspect this is a bug in this plugin or Semantic-Linkbacks?

Thanks in advance!

pfefferle commented 1 year ago

@snarfed this might be an issue with the Mf2 parser, because it supports fragment parsing.

pfefferle commented 1 year ago

@snarfed is the author outside of the fragment?

snarfed commented 1 year ago

The source URL doesn't contain a fragment, it contains %23, which happens to be an encoded # character. I think the plugin(s) are decoding that part of the URL, but shouldn't be, since the form-encoded POST body shouldn't be URL-decoded. (I think?)

Ideally the plugins/parser would leave that %23 in the URL alone when fetching it and parsing mf2.
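To illustrate the distinction between a real fragment and an encoded `%23`, here is a minimal Python sketch using only the standard library. It is not the plugin's code (the plugin is PHP); the variable names are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Source URL from the original report. It contains %23, not a literal '#'.
source = (
    "https://fed.brid.gy/render"
    "?id=https%3A%2F%2Findieweb.social%2Fusers%2Fsnarfed%23likes%2F709275"
)

parts = urlsplit(source)

# No literal '#' in the URL, so the fragment is empty and %23 stays in the query.
print(repr(parts.fragment))  # ''
print(parts.query)

# Only when the query string itself is parsed does %23 become '#',
# and then it belongs to the inner URL carried in the id parameter.
inner = parse_qs(parts.query)["id"][0]
print(inner)  # https://indieweb.social/users/snarfed#likes/709275
```

A receiver that fetches `source` as-is never sees a fragment. One appears only if the whole URL is percent-decoded before it is split.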

pfefferle commented 1 year ago

This is a really good question!

pfefferle commented 1 year ago

I would assume that they have to be URL-encoded, because otherwise an = might be misinterpreted as a param of the form.

pfefferle commented 1 year ago

And the content type is application/x-www-form-urlencoded, so it literally mentions "urlencoded", but I will have a look at the spec.

snarfed commented 1 year ago

From @sknebel in chat:

for keys and values, percent-encode everything "except the ASCII alphanumeric, U+002A (*), U+002D (-), U+002E (.), and U+005F (_)." URL spec: https://url.spec.whatwg.org/#concept-urlencoded-serializer (and the quote specifically from https://url.spec.whatwg.org/#application-x-www-form-urlencoded-percent-encode-set )

I've confirmed that browsers URL-encode, so a form-encoded POST with key url and value http://test/url%23fragment results in the raw request body url=http%3A%2F%2Ftest%2Furl%2523fragment.
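The same serialization can be reproduced outside a browser. This is a small Python sketch using `urllib.parse.urlencode`. Python's safe-character set differs slightly from the WHATWG one (for example around `~` and `*`), but that does not affect this value.

```python
from urllib.parse import urlencode

# The value already contains a percent-encoded '#'.
value = "http://test/url%23fragment"

# Form serialization encodes the '%' itself, so %23 becomes %2523.
body = urlencode({"url": value})
print(body)  # url=http%3A%2F%2Ftest%2Furl%2523fragment
```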

snarfed commented 1 year ago

I've also confirmed that my code is doing the same thing, ie the # is double-URL-encoded to %2523, so the raw webmention POST body looks like:

source=https%3A%2F%2Ffed.brid.gy%2Frender%3Fid%3Dhttps%253A%252F%252Findieweb.social%252Fusers%252Fsnarfed%2523likes%252F709275&target=https%3A%2F%2Fsnarfed.org%2F2023-03-28_49662

Note the %2523 in the source value. So @pfefferle you're absolutely right, the Webmention/Semantic Linkbacks plugins should URL-decode it once to get %23, but I think not twice, which they seem to be doing right now?
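As a rough sketch of "decode once, not twice", here is what the raw body above yields in Python. The names `once` and `twice` are only for illustration, and this is not how the plugins are implemented.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Raw webmention POST body, as quoted above.
raw_body = (
    "source=https%3A%2F%2Ffed.brid.gy%2Frender%3Fid%3Dhttps%253A%252F%252F"
    "indieweb.social%252Fusers%252Fsnarfed%2523likes%252F709275"
    "&target=https%3A%2F%2Fsnarfed.org%2F2023-03-28_49662"
)

# Decoding the form body once gives the intended source URL, with %23 intact.
once = parse_qs(raw_body)["source"][0]
print(once)
print(repr(urlsplit(once).fragment))  # ''

# Decoding a second time turns %23 into a literal '#', which now splits
# the URL and moves "likes/709275" into the fragment.
twice = unquote(once)
print(twice)
print(urlsplit(twice).fragment)  # likes/709275
```

If something in the receiving stack behaves like the second step, the fetched page would no longer be the one the sender meant.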

pfefferle commented 1 year ago

OK, that might be possible because of the interaction of both (Webmention & SL) plugins, I will re-check the latest version of the Webmention plugin.

snarfed commented 1 year ago

Looks like this isn't about the # character at all. I added custom encoding for #s, I'm now replacing them with ^^, and I'm still hitting this problem. Here's an example source URL:

https://fed.brid.gy/render?id=https%3A%2F%2Ftechhub.social%2Fusers%2Fdiazona^^likes%2F979471

If I send a webmention with this source, I get:

{"code":"resource_not_found","message":"Resource not found","data":{"status":400}}

Same if I %-encode the ^^, ie:

https://fed.brid.gy/render?id=https%3A%2F%2Ftechhub.social%2Fusers%2Fdiazona%5E%5Elikes%2F979471

However, if I double-encode those chars to %255E, giving the source URL below, it works.

https://fed.brid.gy/render?id=https%3A%2F%2Ftechhub.social%2Fusers%2Fdiazona%255E%255Elikes%2F979471

snarfed commented 1 year ago

Here are example WP debug logs I see for a failed webmention with a source URL with ^^ in it:

[25-May-2023 02:21:48 UTC] REST request: /webmention/1.0/endpoint: {"source":"https:\/\/fed.brid.gy\/convert\/activitypub\/webmention\/https:\/mastodon.social\/users\/notblanklikes\/88327162","target":"https:\/\/snarfed.org\/2023-05-24_50288"}(Header Present)
[25-May-2023 02:21:48 UTC] REST result: /webmention/1.0/endpoint: {"code":"source_error","message":"Bad Gateway","data":{"status":400}}(400) - [](User ID: 0)

The full source URL was https://fed.brid.gy/convert/activitypub/webmention/https:/mastodon.social/users/notblank^^likes/88327162. Note that the logged source URL is missing the ^^. I get the same logs if I URL-encode the ^^ to %5E%5E.

Btw this is on pre-merge plugins, ie Webmention 4.0.9 and Semantic-Linkbacks 3.12.0.

pfefferle commented 1 year ago

Why do people put everything in URLs...??? (and please do not answer with: because they can ☺️ )

snarfed commented 1 year ago

Hah, fair point, maybe I'm being a bit difficult here. Sorry! This bug does seem unrelated to any individual characters though, since it happens when they're URL-encoded too, eg the examples here with both %23 and %5E%5E still break the plugin.

I'm open to other ideas! I need to be able to include arbitrary URLs, including ones with # fragments, but I can encode them however works best for you all.

pfefferle commented 1 year ago

esc_url, esc_url_raw and sanitize_url seem to remove the ^^ special chars. That is not really good, because these functions are highly recommended when dealing with URLs.
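For illustration only, here is a Python approximation of a character allow-list filter of the kind these functions appear to apply. The pattern and the name `strip_disallowed` are assumptions, not WordPress's actual code, so treat this as a sketch of the behavior seen in the logs above.

```python
import re

# Hypothetical allow-list, assumed to resemble what the sanitizer keeps.
# '^' is not in it, so it gets dropped; '%', digits and letters are kept.
DISALLOWED = re.compile(r"[^a-z0-9\-~+_.?#=!&;,/:%@$|*'()\[\]\x80-\xff]", re.I)

def strip_disallowed(url: str) -> str:
    """Remove every character that is not on the allow-list."""
    return DISALLOWED.sub("", url)

with_carets = (
    "https://fed.brid.gy/convert/activitypub/webmention/"
    "https:/mastodon.social/users/notblank^^likes/88327162"
)
print(strip_disallowed(with_carets))
# https://fed.brid.gy/convert/activitypub/webmention/https:/mastodon.social/users/notblanklikes/88327162

still_encoded = with_carets.replace("^^", "%5E%5E")
print(strip_disallowed(still_encoded) == still_encoded)  # True
```

Under this assumption, a `%5E%5E` that is still encoded when it reaches the filter would survive. The fact that it fails anyway is consistent with the form body being decoded to `^^` first, which would also explain why only the double-encoded `%255E` gets through.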

pfefferle commented 1 year ago

At least it is not double encoding or something similar.

snarfed commented 1 year ago

Odd: I switched back from ^^ to %23 recently, and now I'm seeing some of these source URLs work after all. Example: https://ap.brid.gy/convert/web/https:/bayes.club/users/zerology%23likes/32983 on https://snarfed.org/2023-07-10_50589

pfefferle commented 1 year ago

@snarfed that makes sense, because if you check the HTML of the fed.brid.gy links (vs the AP links), then you find only an h-card without any context... that's why the plugin ignores them, it does not know how to handle them...

snarfed commented 1 year ago

Hmm! You're right about the top source URL in the original description, https://fed.brid.gy/render?id=https%3A%2F%2Findieweb.social%2Fusers%2Fsnarfed%23likes%2F709275 . Not sure what's going on there.

The rest of the source URLs here are valid u-like-ofs though, including the second one in the description, https://fed.brid.gy/render?id=https%3A%2F%2Findieweb.social%2Fusers%2Fsnarfed%2523likes%2F709275 .