python / cpython

The Python programming language
https://www.python.org
Other
63.72k stars 30.53k forks source link

3.14.0a1 urljoin regression wrt // #125974

Open hroncok opened 1 month ago

hroncok commented 1 month ago

Bug report

Bug description:

The tests of lxml do (simplified):

from urllib.parse import urljoin
urljoin('file:', '/foo/bar')

This change of behavior was introduced in fc897fcc01964649f023e0baa4c95d142e4e8a10 https://github.com/python/cpython/pull/123273 cc @serhiy-storchaka

The NEWS entry in that change does not seem to indicate this change was intentional.

CPython versions tested on:

3.14, CPython main branch

Operating systems tested on:

Linux

ZeroIntensity commented 1 month ago

Duplicate of #125926

hroncok commented 1 month ago

That seemed like a different problem to me, but perhaps the cause is the same, yes.

serhiy-storchaka commented 1 month ago

This is not a duplicate of #125926.

But file:///foo/bar and file:/foo/bar are equivalent URIs. And we can argue that since file: and /foo/bar have undefined authority, the result should have undefined authority as well. This is a result of the algorithm in RFC 3986, Section 5.2.2, although it need an absolute base URI.

ZeroIntensity commented 1 month ago

I marked this as duplicate because it seemed like there was some underlying issue introduced into urljoin. I was assuming they will both be fixed by the same PR, but I guess not, sorry!

serhiy-storchaka commented 3 weeks ago

I suggest to close this issue as "not a bug".

hroncok commented 3 weeks ago

In that case, we should at least note the change in What's New in Python 3.14

serhiy-storchaka commented 2 weeks ago

Maybe. But what should we write? This is not the only change in urljoin(). I'm planning to do more changes (see #69589, #84774).

I hesitated between backporting these changes or not. This is a long standing behavior, so user code can have workarounds or even been depending on it. AFAIK the last large changes were in 3.5, and they were not backported.

If not backport these changes, they should be documented as a new feature. Note that urljoin() still is not completely RFC 3986 conforming -- some differences I left for backward compatibility. But we perhaps should change these parts too.

asvetlov commented 2 weeks ago

It's a hard choice, but I prefer the conformance to the RFC even if the change breaks backward compatibility.