python / cpython

The Python programming language
https://www.python.org/

Suffix in pathlib is not behaving like a file extension #121347

Open bbilly1 opened 1 week ago

bbilly1 commented 1 week ago

Bug report

Bug description:

According to the docs here a suffix is defined as:

The file extension of the final component, if any

But pathlib doesn't behave as expected. To illustrate, on Python 3.12.4:

>>> from pathlib import Path
>>> Path("Mr. Smith resume for review").suffix
'. Smith resume for review'

This is particularly problematic for methods like with_suffix. According to the docs here that should:

Return a new path with the suffix changed. If the original path doesn’t have a suffix, the new suffix is appended instead. If the suffix is an empty string, the original suffix is removed

But as established above, that results in:

>>> Path("Mr. Smith resume for review").with_suffix(".pdf")
PosixPath('Mr.pdf')

I've filed this as a bug because it doesn't work as described, but it could also be treated as a documentation issue.
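For anyone hitting this today, a minimal workaround sketch: append the extension to the final component with with_name() instead of calling with_suffix(). The helper name append_suffix is hypothetical and not part of pathlib.

```python
from pathlib import PurePath


def append_suffix(path: PurePath, suffix: str) -> PurePath:
    """Append `suffix` to the final component without replacing anything.

    Hypothetical helper, not a pathlib method.
    """
    return path.with_name(path.name + suffix)


p = PurePath("Mr. Smith resume for review")
print(append_suffix(p, ".pdf"))  # Mr. Smith resume for review.pdf
```

Note that this always appends, so a path that already ends in a real extension keeps it ("a.txt" becomes "a.txt.pdf").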

I see a few ways forward:

  1. Implement add_suffix, or an argument to with_suffix, but I know that has been discussed before and was ultimately decided against. Also this is a matter of how suffix is defined, not how it's processed.
  2. Sanity check what a suffix can be, e.g. a suffix can't contain white space. But that is ultimately difficult without making assumptions.
  3. Adjust the documentation: clarify that a suffix is not the same as a file extension, but something like "the last segment separated by a dot". Maybe even add a warning that this can lead to unexpected behavior, as showcased above.
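Option 2 could be approximated in user code roughly like this. It is only a sketch of the idea: the function name extension_or_empty and the "no white space" rule are assumptions for illustration, not current pathlib behavior.

```python
from pathlib import PurePath


def extension_or_empty(path: PurePath) -> str:
    """Return path.suffix, unless it contains white space.

    Hypothetical rule: a "suffix" containing white space is treated as
    part of the file name, so the function reports no extension.
    """
    suffix = path.suffix
    if any(ch.isspace() for ch in suffix):
        return ""
    return suffix


print(repr(extension_or_empty(PurePath("Mr. Smith resume for review"))))  # ''
print(repr(extension_or_empty(PurePath("report.pdf"))))  # '.pdf'
```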

CPython versions tested on:

3.12

Operating systems tested on:

Linux, Windows

barneygale commented 1 week ago

How are you defining "file extension"?

bbilly1 commented 1 week ago

Common denominator I see is: "The file extension defines what kind of file it is."

Some definitions I have found:

barneygale commented 1 week ago

> Common denominator I see is: "The file extension defines what kind of file it is."

That's only true on Windows. On other operating systems, file extensions are an indicator and nothing more. I can rename an .mp3 file to .jpg on Linux and play it just fine.

Microsoft's definition ("three- or four-character extension") excludes some valid extensions like .a, .so and .patch, but fails to exclude file extensions with spaces such as . Smith resume for review

barneygale commented 1 week ago

The point I'm getting at is that there is no standard definition of a file extension!

Since #82805 was solved, pathlib's suffix splitting works exactly like os.path.splitext(). A non-empty suffix starts with a dot and contains at most one dot, and a non-empty suffix must be preceded by a stem that contains at least one non-dot character.
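The splitting rules described here can be checked directly against os.path.splitext(). The sample names below are chosen for illustration; for these particular inputs PurePath.suffix returns the same extension.

```python
import os.path
from pathlib import PurePath

# Sample file names (chosen for illustration only).
names = [
    "Mr. Smith resume for review",  # everything after the last dot counts
    "archive.tar.gz",               # only the last dot-separated segment
    ".bashrc",                      # leading dot: no stem before it, so no suffix
    "README",                       # no dot at all
]

for name in names:
    stem, ext = os.path.splitext(name)
    print(f"{name!r} -> stem={stem!r}, ext={ext!r}")
    # For these inputs pathlib agrees with os.path.splitext().
    assert PurePath(name).suffix == ext

# 'Mr. Smith resume for review' -> stem='Mr', ext='. Smith resume for review'
# 'archive.tar.gz' -> stem='archive.tar', ext='.gz'
# '.bashrc' -> stem='.bashrc', ext=''
# 'README' -> stem='README', ext=''
```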

bbilly1 commented 1 week ago

I get what you are saying. But even on Linux, that will depend on the implementation. E.g. xdg-open will happily ignore the file extension. But common GUI file browsers will not (tested on Thunar).

As there is no standard definition, I'd suggest either defining it, or avoiding the term altogether and using the implementation as the definition, e.g. "the last segment separated by a dot".

Even though it's not authoritatively defined, I'd argue that the with_suffix example above is unexpected behavior.

eryksun commented 1 week ago

> Microsoft's definition ("three- or four-character extension") excludes some valid extensions like .a, .so and .patch, but fails to exclude file extensions with spaces such as . Smith resume for review

The Windows shell API currently supports permanently associating a programmatic identifier (ProgID) with any file extension that does not include white space characters and that has a length from 1 to 198 characters (not including the dot). Thus the API supports ".a", ".so", and ".patch" as normal file extensions. If a file has no extension, or if the extension is longer than 198 characters or contains white space characters, then the API displays an open-with dialog that allows opening the file with an application just once instead of setting a permanent association.
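As a rough sketch, the rule described above could be written as a predicate like the following. This only restates the description in code and does not call any Windows API; the function name is hypothetical, and using str.isspace() as the definition of "white space" is an assumption.

```python
MAX_EXTENSION_LENGTH = 198  # limit quoted above, not counting the dot


def is_associable_extension(suffix: str) -> bool:
    """Return True if `suffix` (including its leading dot) fits the rule above.

    Hypothetical helper; str.isspace() stands in for whatever the shell
    API actually considers white space.
    """
    if not suffix.startswith("."):
        return False
    ext = suffix[1:]
    if not 1 <= len(ext) <= MAX_EXTENSION_LENGTH:
        return False
    return not any(ch.isspace() for ch in ext)


for candidate in (".a", ".so", ".patch", ". Smith resume for review", ""):
    print(repr(candidate), is_associable_extension(candidate))
```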

barneygale commented 1 week ago

Hum, that does give some legitimacy to the idea of forbidding whitespace in file extensions in pathlib.

bbilly1 commented 1 week ago

I think the ideal behavior would be: if you try to create a suffix containing whitespace, raise an error. If you try to access a suffix and it contains whitespace, it's not a suffix but just a regular part of the filename.
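A minimal sketch of how the write side of that proposal might look as a wrapper around today's pathlib. The name with_suffix_strict is hypothetical, and raising ValueError is an assumption about which error would be used.

```python
from pathlib import PurePath


def _has_whitespace(text: str) -> bool:
    return any(ch.isspace() for ch in text)


def with_suffix_strict(path: PurePath, suffix: str) -> PurePath:
    """Like PurePath.with_suffix(), with the proposed whitespace rules.

    Hypothetical helper, not part of pathlib.
    """
    if _has_whitespace(suffix):
        # Proposed: refuse to create a suffix containing whitespace.
        raise ValueError(f"suffix {suffix!r} contains whitespace")
    if _has_whitespace(path.suffix):
        # Proposed: the existing "suffix" is just part of the name,
        # so the new suffix is appended rather than substituted.
        return path.with_name(path.name + suffix)
    return path.with_suffix(suffix)


print(with_suffix_strict(PurePath("Mr. Smith resume for review"), ".pdf"))
# Mr. Smith resume for review.pdf
print(with_suffix_strict(PurePath("report.txt"), ".pdf"))
# report.pdf
```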