python / cpython

The Python programming language
https://www.python.org
Other
63.75k stars 30.54k forks source link

Docs and docstrings for io.*.seek are inconsistent #107801

Open MatthiasWiesmann opened 1 year ago

MatthiasWiesmann commented 1 year ago

Documentation

The documentation on text handles is misleading, in particular for the seek method.

Steps to reproduce

handle = open('/tmp/lines.txt')
help(handle.seek)

This produces the following documentation:

Help on built-in function seek:

seek(cookie, whence=0, /) method of _io.TextIOWrapper instance
    Change stream position.

    Change the stream position to the given byte offset. The offset is
    interpreted relative to the position indicated by whence.  Values
    for whence are:

    * 0 -- start of stream (the default); offset should be zero or positive
    * 1 -- current stream position; offset may be negative
    * 2 -- end of stream; offset is usually negative

    Return the new absolute position.

Issues

Generally, the behaviour is very inconsistent, seek relative to the end fail, seek relative to the start works, but might yield a situation when read fails.

Full investigation of the issue

Linked PRs

sunmy2019 commented 1 year ago

Note: If anything is changed here, the relevant doc should also be changed. https://docs.python.org/3/library/io.html#io.IOBase.seek

AA-Turner commented 1 year ago

@MatthiasWiesmann would you have time to write a PR with suggestied revisions of the help text?

To note, Python 3.12 will include the constants in os (see #104418), though there is quite a bit of inconsistency between the various seek() methods' docstrings:

cc @erlend-aasland as the author of #104418 if you have any thoughts.

A

erlend-aasland commented 1 year ago

I agree that it is confusing that the parameter names for the low-level API seek() aren't used consistently throughout the docs/docstrings. Luckily, these are all positional-only, so we can easily normalise them without backwards compatibility concerns.

Suggested improvements:

serhiy-storchaka commented 1 year ago
  1. The confusion is caused by the fact that TextIOWrapper.seek() has no docstring, so pydoc shows a docstring from a parent class, which is inadequate due to significant difference in semantic.
  2. The first argument of TextIOWrapper.seek() is not a byte offset, and not an offset at all. It is a "cookie", i.e. zero or an opaque number returned by tell(). Passing anything other than zero and the value returned by tell() when whence=SEEK_SET has undefined behavior.
  3. SEEK_CUR and SEEK_END only support zero as the first argument.

Neither forward seeking nor backward seeking are supported. The only supported are rewinding to the start and restoring the previous position saved by tell(). Confusion could be avoided if TextIOWrapper.tell() returned a special type instead of integer, but for now it would be a large breaking change (because the cookie can be saved in a file as an integer and used in other process).

erlend-aasland commented 1 year ago

As a data point: on macOS, Linux, and FreeBSD, the lseek and fseek manpages uses offset and whence as parameter names.

serhiy-storchaka commented 1 year ago

Because they only work with binary files. They do not work with TextIOWrapper which support multibyte stateful codecs.

erlend-aasland commented 1 year ago

There are a lot of tasks here. Suggesting to proceed in this order:

  1. Use whence instead of how as the name of the third position-only parameter of os.lseek. Rationale: the SEEK_* constants are also used for this function. If the SEEK_* constants are to talk about the whence parameter, we should use it for os.lseek as well.
  2. Improve the documentation of os.SEEK_SET, os.SEEK_CUR and os.SEEK_END; explain that they are used for the whence parameter of os.lseek and of the seek() method of file objects; add a very short description to each constant.
  3. Document os.SEEK_HOLE and os.SEEK_DATA (exposed in os though posix)
  4. Consistently name _io.*.seek() params where it makes sense.
  5. Improve the docs of the seek() methods (Doc/library/io.rst)
  6. Improve the docstrings of the seek() methods
erlend-aasland commented 1 year ago

Because they only work with binary files. They do not work with TextIOWrapper which support multibyte stateful codecs.

Yes, that is fine; TextIOWrapper is a special case. For the other cases, it may make sense to use a consistent naming.

erlend-aasland commented 1 year ago

__text_signature__ for _io types with seek:

3.11 ``` BufferedRandom.__text_signature__: ($self, target, whence=0, /) BufferedReader.__text_signature__: ($self, target, whence=0, /) BufferedWriter.__text_signature__: ($self, target, whence=0, /) BytesIO.__text_signature__: ($self, pos, whence=0, /) FileIO.__text_signature__: ($self, pos, whence=0, /) StringIO.__text_signature__: ($self, pos, whence=0, /) TextIOWrapper.__text_signature__: ($self, cookie, whence=0, /) ```
3.12 ``` BufferedRWPair.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) BufferedRandom.__text_signature__: ($self, target, whence=0, /) BufferedReader.__text_signature__: ($self, target, whence=0, /) BufferedWriter.__text_signature__: ($self, target, whence=0, /) BytesIO.__text_signature__: ($self, pos, whence=0, /) FileIO.__text_signature__: ($self, pos, whence=0, /) StringIO.__text_signature__: ($self, pos, whence=0, /) TextIOWrapper.__text_signature__: ($self, cookie, whence=0, /) _BufferedIOBase.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) _IOBase.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) _RawIOBase.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) _TextIOBase.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) ```
3.13 ``` BufferedRWPair.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) BufferedRandom.__text_signature__: ($self, target, whence=0, /) BufferedReader.__text_signature__: ($self, target, whence=0, /) BufferedWriter.__text_signature__: ($self, target, whence=0, /) BytesIO.__text_signature__: ($self, pos, whence=0, /) FileIO.__text_signature__: ($self, pos, whence=0, /) StringIO.__text_signature__: ($self, pos, whence=0, /) TextIOWrapper.__text_signature__: ($self, cookie, whence=0, /) _BufferedIOBase.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) _IOBase.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) _RawIOBase.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) _TextIOBase.__text_signature__: ($self, offset, whence=os.SEEK_SET, /) ```
serhiy-storchaka commented 1 year ago

Even in C, fseek/ftell and lseek and not interoperable. You will have issues when pass the result of lseek to fseek on Windows (because newline translation and buffering). There were such issues in the CPython interpreter. So "offset" in fseek/ftell is not just an offset.

fgetpos and fsetpos are more modern analogs of fseek/ftell, and they work with position which is not a number.

erlend-aasland commented 1 year ago

Even in C, fseek/ftell and lseek and not interoperable. You will have issues when pass the result of lseek to fseek on Windows (because newline translation and buffering). There were such issues in the CPython interpreter. So "offset" in fseek/ftell is not just an offset.

fgetpos and fsetpos are more modern analogs of fseek/ftell, and they work with position which is not a number.

Are you responding to a particular suggested change of https://github.com/python/cpython/issues/107801#issuecomment-1676940817? Are you suggesting another change? I'm not sure how to interpret your post.

serhiy-storchaka commented 1 year ago

No, it was response to https://github.com/python/cpython/issues/107801#issuecomment-1676868944.

As for your proposition, it all looks reasonable. The main entries are 6 and 5. Perhaps there should be only two separate docstrings: for binary and text streams, all other classes perhaps can inherit them from parent. The module documentation can be more verbose and provide more specific details. Other entries can go to other issues.

erlend-aasland commented 1 year ago

No, it was response to #107801 (comment).

Thanks for clarifying.

As for your proposition, it all looks reasonable. The main entries are 6 and 5. Perhaps there should be only two separate docstrings: for binary and text streams, all other classes perhaps can inherit them from parent. The module documentation can be more verbose and provide more specific details. Other entries can go to other issues.

That sounds reasonable. And of course, the TextIOWrapper docstring is a special case.

hugovk commented 1 year ago

Lots of merged PRs here. 🚀

Anything still to do, or can the issue be closed?

vadmium commented 4 months ago

Issue #82969 is open about IOBase.seek/tell vs TextIOWrapper and the offset/cookie parameter Issue #122299 is about the byte vs character seek offsets