python / cpython

The Python programming language
https://www.python.org

Add support for reading records with arbitrary separators to the standard IO stack #41622

Closed ncoghlan closed 3 years ago

ncoghlan commented 19 years ago
BPO 1152248
Nosy @birkenfeld, @rhettinger, @facundobatista, @amauryfa, @ncoghlan, @pitrou, @benjaminp, @merwok, @bitdancer, @4kir4, @vadmium, @phmc, @wm75, @maggyero
Files
  • pep-newline.txt: draft PEP for expanding the newline argument
  • pep-peek.txt: draft PEP for adding IOBase.peek, making this easier for end users to solve
  • io-newline-issue1152248.patch: Added support for alternative newlines in _pyio.TextIOWrapper. Updated documentation. Added more io tests. No C implementation. No implementation for binary files.
  • io-newline-issue1152248-2.patch: Reuploaded the patch so that it applies cleanly on the current tip
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = None
    closed_at =
    created_at =
    labels = ['type-feature', 'library', 'expert-IO']
    title = 'Add support for reading records with arbitrary separators to the standard IO stack'
    updated_at =
    user = 'https://github.com/ncoghlan'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'calestyo'
    assignee = 'none'
    closed = True
    closed_date =
    closer = 'pitrou'
    components = ['Library (Lib)', 'IO']
    creation =
    creator = 'ncoghlan'
    dependencies = []
    files = ['36008', '36009', '36098', '36114']
    hgrepos = []
    issue_num = 1152248
    keywords = ['patch']
    message_count = 43.0
    messages = ['61179', '61180', '61181', '61182', '61183', '61184', '63060', '63067', '63068', '63134', '64084', '82188', '87801', '87802', '87803', '87805', '87806', '87807', '87808', '87817', '87823', '109038', '109098', '109117', '111152', '111168', '111177', '111189', '111202', '111220', '111453', '223490', '223491', '223492', '224016', '224077', '224149', '224155', '226397', '387491', '387512', '387515', '387552']
    nosy_count = 21.0
    nosy_names = ['georg.brandl', 'rhettinger', 'facundobatista', 'amaury.forgeotdarc', 'ncoghlan', 'pitrou', 'benjamin.peterson', 'nessus42', 'eric.araujo', 'ralph.corderoy', 'r.david.murray', 'ysj.ray', 'akira', 'Douglas.Alan', 'jcon', 'martin.panter', 'pconnell', 'wolma', 'abarnert', 'maggyero', 'calestyo']
    pr_nums = []
    priority = 'normal'
    resolution = 'later'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1152248'
    versions = ['Python 3.4']
    ```

    ncoghlan commented 19 years ago

    There is no canonical way to iterate through a file in chunks *other* than whole lines without reading the whole file into memory.

    Allowing the separator to be specified as an argument to file.readlines and file.xreadlines would greatly simplify the task.

    See here for an example interface of the useful options: http://mail.python.org/pipermail/python-list/2005-February/268482.html

    birkenfeld commented 19 years ago

    I don't know whether (x)readlines is the right place, since you are _not_ operating on lines.

    What about (x)readchunks?

    38cb7cc8-ba19-40f3-bd8a-e99b83058e93 commented 19 years ago

    In reply to birkenfeld, I'm not sure why you don't want to call lines separated with an alternate line-separation string "lines", but if you want to call them something else, I should think they should be called "records" rather than "chunks".

    |>oug

    rhettinger commented 19 years ago

    The OP's request is not a non-starter. There is a proven precedent in AWK, which allows programmer-specifiable record separators.

    ncoghlan commented 19 years ago

    As Douglas Alan's sample implementation (and his second attempt [1]) show, getting this right (and reasonably efficient) is actually a non-trivial exercise. Leveraging the existing readlines infrastructure is an idea worth considering.

    [1] http://mail.python.org/pipermail/python-list/2005-February/268547.html

    smontanaro commented 19 years ago

    Seems the most likely place you'd want to use this is to select a non-native line ending in a situation where you didn't want to use universal newlines (select \r as a line ending on Unix, for example, and allow \n to just be another character). In that case they'd clearly still be lines, so embellishing the normal line reading machinery without adding a new method would be most appropriate.
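For later readers: Python 3's `newline` argument ended up covering exactly this case. A minimal sketch of Skip's scenario, using `io.TextIOWrapper` over an in-memory buffer:

```python
import io

# Treat '\r' as the line ending and let '\n' be just another character:
# with newline='\r', input lines are terminated only by the given string,
# and no newline translation is performed on read.
raw = io.BytesIO(b"one\rtwo\nstill two\rthree")
f = io.TextIOWrapper(raw, encoding="ascii", newline="\r")
assert f.readlines() == ["one\r", "two\nstill two\r", "three"]
```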

    facundobatista commented 16 years ago

    Raymond disapproved it, Skip discouraged it, and Nick didn't push it any more, all more than two years ago.

    Nick, please, if you feel this is worthwhile, raise the discussion in python-dev.

    rhettinger commented 16 years ago

    For the record, I thought it was a reasonable request.

    AWK has a similar feature. The AWK book shows a number of example uses. Google's codesearch shows that the feature does get used in the field: http://www.google.com/codesearch?q=lang%3Aawk+RS&hl=en

    I think this request should probably be kept open.

    facundobatista commented 16 years ago

    Sorry, I misunderstood you. I assign this to myself to give it a try.

    ncoghlan commented 16 years ago

    The mail.python.org link I posted previously is broken. Here's an updated link to the relevant c.l.p. thread: http://mail.python.org/pipermail/python-list/2005-February/310020.html

    From my point of view, I still think it's an excellent idea and would be happy to review a patch, but I'm unlikely to get around to implementing it myself.

    Also keep in mind that we now have the option of doing this only for the new io module in Python 3.0 - it may be easier to do that and implement something in pure Python rather than having to deal with the 2.x file implementation.

    (P.S. I found the double negative in Raymond's original comment a little tricky to parse even as a native English speaker. I would also take Skip's comment as merely discouraging adding a completely new method rather than the original idea)

    facundobatista commented 16 years ago

    I took a look at it...

    It's not as uncomplicated as I originally thought.

    The way would be to adapt the Py_UniversalNewlineFread() function to *not* process the normal separators (like \n or \r), but the passed one.

    A critical point would be to handle more-than-1-byte characters... I concur with Nick that this would be better suited for Py3k.

    So, I'm stepping down from this, and flagging it for that version.
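The more-than-1-byte concern above can be made concrete: with a wide encoding, a byte-level scan for the separator can match bytes belonging to an unrelated character. A small illustration (editor's sketch, not from the original thread):

```python
# U+2424 (SYMBOL FOR NEWLINE) encodes in UTF-16-LE as the two bytes
# 0x24 0x24 -- exactly the ASCII bytes for "$$".  A byte-level search
# for a '$' separator would falsely match inside this single character,
# which is why separator matching has to happen after decoding.
data = "\u2424".encode("utf-16-le")
assert data == b"$$"
assert b"$" in data          # false positive at the byte level
assert "$" not in "\u2424"   # no match at the character level
```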

    ncoghlan commented 15 years ago

    Any further work on this should wait until the io-in-c branch has landed (or at least be based on that branch).

    bitdancer commented 15 years ago
    > cat temp
    this is$#a weird$#file$#
    > ./python
    Python 3.1b1+ (py3k:72632:72633M, May 15 2009, 05:11:27)
    [GCC 4.3.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> f = open('temp', newline='$#')
    [50354 refs]
    >>> f.readlines()
    ['this is$#', 'a weird$#', 'file$#', '\n']

    All I did was comment out the 'newline' argument validity check in textio.c.

    ncoghlan commented 15 years ago

    While RDM's quick test is encouraging, I think one of the key things is going to be developing tests for the various cases:

    The text mode tests would need to cover a variety of encodings (e.g. ASCII, latin-1, UTF-8, UTF-16, UTF-32 and maybe something like koi8-r and/or some of the CJK codecs).

    *if applicable to codec under test

    pitrou commented 15 years ago

    -1 on this idea. readlines() exists precisely because line endings are special when it comes to text IO (because of the various platform differences).

    If you want to split on any character, you can just use read() followed by split(). No need to graft additional complexity on the file IO classes.
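The read()-plus-split() alternative is indeed trivial to write; its cost is that the whole stream sits in memory at once. An illustrative sketch using the `$#`-separated sample data from this thread:

```python
import io

# The simple alternative: read everything, then split on the separator.
# Fine for small inputs, but the entire file is buffered in memory.
f = io.StringIO("this is$#a weird$#file$#\n")
records = f.read().split("$#")
assert records == ["this is", "a weird", "file", "\n"]
```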

    pitrou commented 15 years ago

    And it's certainly not easy to do correctly :)

    pitrou commented 15 years ago

    Uh, trying again to remove the keyword :-(

    pitrou commented 15 years ago

    Ok, let me qualify my position a bit:

    Please bear in mind the latter should involve, for each of the C and Python implementations:

    However, it is certainly an interesting task for someone wanting to play with C code, optimizations, etc.

    ncoghlan commented 15 years ago

    I agree with Antoine - given that the newlines parameter now deals with Skip's alternate line separator use case, a new method "readrecords" that takes a mandatory record separator makes more sense than using readlines to read things that are not lines. (of course, taking the alternate line ending use case away also reduces the total number of use cases for the new method).

    Note that the problem with the read()+split() approach is that you either have to read the whole file into memory (which this RFE is trying to avoid) or you have to do your own buffering and so forth to split records as you go. Since the latter is both difficult to get right and very similar to what the IO module already has to do for readlines(), it makes sense to include the extra complexity there.

    pitrou commented 15 years ago

    > Note that the problem with the read()+split() approach is that you either have to read the whole file into memory (which this RFE is trying to avoid) or you have to do your own buffering and so forth to split records as you go. Since the latter is both difficult to get right and very similar to what the IO module already has to do for readlines(), it makes sense to include the extra complexity there.

    I wonder how often this use case happens though. Usually you first split on lines, and only then you split on another character or string (think CSV files, HTTP headers, etc.).

    When you don't split on lines, conversely, you probably have a binary format, and binary formats have more efficient ways of chunking (for example, a couple of bytes at the beginning indicating the length of the chunk).
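The point about binary formats can be made concrete: a length prefix lets a reader fetch each record without scanning for a separator at all. An editor's sketch (names and the 4-byte big-endian header are illustrative choices):

```python
import io
import struct

def iter_prefixed(stream):
    """Yield records from a stream where each record is preceded by a
    4-byte big-endian length -- no separator scanning needed."""
    while True:
        header = stream.read(4)
        if not header:
            return
        (length,) = struct.unpack(">I", header)
        yield stream.read(length)

buf = io.BytesIO(struct.pack(">I", 3) + b"foo" + struct.pack(">I", 5) + b"hello")
assert list(iter_prefixed(buf)) == [b"foo", b"hello"]
```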

    38cb7cc8-ba19-40f3-bd8a-e99b83058e93 commented 15 years ago

    Antoine Pitrou <report@bugs.python.org> wrote:

    > Nick Coghlan <ncoghlan@gmail.com> added the comment:
    >
    > > Note that the problem with the read()+split() approach is that you
    > > either have to read the whole file into memory (which this RFE is trying
    > > to avoid) or you have to do your own buffering and so forth to split
    > > records as you go. Since the latter is both difficult to get right and
    > > very similar to what the IO module already has to do for readlines(), it
    > > makes sense to include the extra complexity there.
    >
    > I wonder how often this use case happens though.

    Every day for me. The reason that I originally brought up this request some years back on comp.lang.python was that I wanted to be able to use Python easily like I use the xargs program.

    E.g.,

    find -type f -regex 'myFancyRegex' -print0 | stuff-to-do-on-each-file.py

    With "-print0" the line separator is changed to null, so that you can deal with filenames that have newlines in them.

    ("find" and "xargs" traditionally have used newline to separate files, but that fails in the face of filenames that have newlines in them, so the -print0 argument to find and the "-0" argument to xargs were thankfully eventually added as a fix for this issue. Nulls are not allowed in filenames. At least not on Unix.)

    > When you don't split on lines, conversely, you probably have a binary format,

    That's not true for the daily use case I just mentioned.

    |>ouglas

    P.S. I wrote my own version of readlines, of course, as the archives of comp.lang.python will show. I just don't feel that everyone should be required to do the same, when this is the sort of thing that sysadmins and other Unix-savvy folks are wont to do on a daily basis.

    P.P.S. Another use case is that I often end up with files that have been transferred back and forth between Unix and Windows and god-knows-what-else, and the newlines end up being some weird mixture of carriage returns and line feeds (and sometimes some other stray characters such as "=20" or somesuch) that many programs seem to have a hard time recognizing as newlines.

    54bf8cd3-fa5f-46be-85f2-1de8e23f9bb5 commented 14 years ago

    Google has led me here because I'm trying to see how to process find(1)'s -print0 output with Python. Perl's -0 option and $/ variable make this trivial.

    find -name '*.orig' -print0 | perl -n0e unlink

    awk(1) has its RS, record separator, variable too. There's a clear need, and it should also be possible to modify or re-open sys.stdin to change the existing separator.
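Pending stdlib support, the `find -print0` case can be handled with a small buffered splitter over `sys.stdin.buffer`. An editor's sketch (function name and buffer size are illustrative):

```python
import io

def iter_nul_records(stream, bufsize=8192):
    """Yield NUL-separated records (e.g. `find -print0` output) without
    reading the whole stream into memory."""
    partial = b""
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        partial += chunk
        records = partial.split(b"\0")
        partial = records.pop()   # carry any unfinished record forward
        yield from records
    if partial:
        yield partial

# Simulated `find -print0` output, with a filename containing a newline:
fake = io.BytesIO(b"./a.orig\0./weird\nname.orig\0")
assert list(iter_nul_records(fake)) == [b"./a.orig", b"./weird\nname.orig"]
```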

    merwok commented 14 years ago

    Ralph, core developers have not rejected this idea. It needs a patch now (even rough) to get the discussion further.

    6cd43691-7480-42fe-8afa-0105badefc53 commented 14 years ago

    Until this feature gets built into Python, you can use a Python-coded generator such as this one to accomplish the same effect:

    def fileLineIter(inputFile,
                     inputNewline="\n",
                     outputNewline=None,
                     readSize=8192):
       """Like the normal file iter but you can set what string indicates newline.
    
       The newline string can be arbitrarily long; it need not be restricted to a
       single character. You can also set the read size and control whether or not
       the newline string is left on the end of the iterated lines.  Setting
       newline to '\0' is particularly good for use with an input file created with
       something like "os.popen('find -print0')".
       """
       if outputNewline is None: outputNewline = inputNewline
       partialLine = ''
       while True:
           charsJustRead = inputFile.read(readSize)
           if not charsJustRead: break
           partialLine += charsJustRead
           lines = partialLine.split(inputNewline)
           partialLine = lines.pop()
           for line in lines: yield line + outputNewline
       if partialLine: yield partialLine
    amauryfa commented 14 years ago

    This fileLineIter function looks like a good recipe to me. Can we close the issue then?

    ncoghlan commented 14 years ago

    A recipe in the comments on a tracker item isn't enough reason to close the RFE, no.

    An entry on the cookbook with a pointer from the docs might be sufficient, although I'm still not averse to the idea of an actual readrecords method (with appropriate tests).

    09aa2b3e-3f5a-4c34-93c0-4a96363de691 commented 14 years ago

    I think it's a good idea adding a keyword argument to specify the separator of readlines().

    I believe most people can accept the universal meaning of "line", which has a meaning similar to "record", that is, a chunk of data, maybe from using line separators other than '\n' in perl, or awk, or the find command. Maybe doing this doesn't pollute the meaning of "readlines". Splitting the file contents with a special character is really a common usage. Besides, I feel using a line separator other than '\n' doesn't mean we're dealing with a binary format; in fact, I often deal with text format with the record separator '\t'.

    6cd43691-7480-42fe-8afa-0105badefc53 commented 14 years ago

    Personally, I think that this functionality should be built into Python's readlines. That's where a typical person would expect it to be, and this is something that is supported by most other scripting languages I've used. E.g., awk has the RS variable which lets you set the "input record separator", which defaults to newline. And as I previously pointed out, xargs and find provide the option to use null as their line separator.

    pitrou commented 14 years ago

    > Personally, I think that this functionality should be built into Python's readlines. That's where a typical person would expect it to be, and this is something that is supported by most other scripting languages I've used.

    Adding it to readline() and/or readlines() would modify the standard IO Abstract Base Classes, and would therefore probably need discussion on python-dev.

    ncoghlan commented 14 years ago

    On Fri, Jul 23, 2010 at 3:54 AM, Antoine Pitrou <report@bugs.python.org> wrote:

    > Antoine Pitrou <pitrou@free.fr> added the comment:
    >
    > > Personally, I think that this functionality should be built into
    > > Python's readlines. That's where a typical person would expect it to
    > > be, and this is something that is supported by most other scripting
    > > languages I've used.
    >
    > Adding it to readline() and/or readlines() would modify the standard IO
    > Abstract Base Classes, and would therefore probably need discussion on
    > python-dev.

    That's also the reason why I'm suggesting a separate readrecords() method - the appropriate ABC should be able to implement it as a concrete method based on something like the recipe above.
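The concrete-method idea can be sketched as a mixin. Everything here, including the `readrecords` name, is hypothetical, mirroring the suggestion rather than any accepted API:

```python
import io

class RecordReadMixin:
    """Hypothetical concrete readrecords() built only on read() --
    roughly the shape a text IO ABC could provide."""
    def readrecords(self, separator, chunksize=8192):
        partial = ""
        while True:
            chunk = self.read(chunksize)
            if not chunk:
                break
            partial += chunk
            records = partial.split(separator)
            partial = records.pop()   # keep the unfinished tail
            yield from records
        if partial:
            yield partial

class RecordStringIO(RecordReadMixin, io.StringIO):
    pass

f = RecordStringIO("this is$#a weird$#file$#\n")
assert list(f.readrecords("$#")) == ["this is", "a weird", "file", "\n"]
```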

    54bf8cd3-fa5f-46be-85f2-1de8e23f9bb5 commented 14 years ago

    fileLineIter() is not a solution that allows this bug to be closed, no. readline() needs modifying and if that means python-dev discussion then that's what it needs. Things to consider include changing the record separator as the file is read.

        $ printf 'a b c\nd e f ' |
        > awk '{print "<" $0 ">"} NR == 1 {RS = " "}'
        <a b c>
        <d>
        <e>
        <f>
        $
    4037176b-12fa-49a8-88d4-2c9d85093e95 commented 10 years ago

    http://thread.gmane.org/gmane.comp.python.ideas/28310 discusses the same idea.

    Guido raised a serious problem with either adding an argument to readline and friends, or adding new methods readrecord and friends: It means the hundreds of existing file-like objects that exist today no longer meet the file API.

    Putting the separator in the constructor call solves that problem. Construction is not part of the file API, and different file-like objects' constructors are already wildly different. It also seems to fit in better with what perl, awk, bash, etc. do (you set something globally, or on the file, rather than on the line-reading mechanism). And it seems conceptually cleaner; a file shouldn't be changing line-endings in mid-stream, and if it does, that's similar to changing encodings.

    Whether this should be done by reusing newline, or by adding another new parameter, I'm not sure. The biggest issue with reusing newline is that it has a meaning for write mode, not just for read mode (either terminal '\n' characters, or all '\n' characters, it's not entirely clear which, are replaced with newline), and I'm not sure that's appropriate here. (Or, worse, maybe it's appropriate for text files but not binary files?)

    R. David Murray's patch doesn't handle binary files, or _pyio, and IIRC from testing the same thing there was one more problem to fix for text files as well… but it's not hard to complete. If I have enough free time tomorrow, I'll clean up what I have and post it.

    4037176b-12fa-49a8-88d4-2c9d85093e95 commented 10 years ago

    While we're at it, Douglas Alan's solution wouldn't be an ideal solution even if it were a builtin. A fileLineIter obviously doesn't support the stream API. It means you end up with two objects that share the same file, but have separate buffers and out-of-sync file pointers. And it's a lot slower.

    That being said, I think it may be useful enough to put in the stdlib—even more so if you pull the resplit-an-iterator-of-strings code out:

    def resplit(strings, separator):
        partialLine = None
        for s in strings:
            if partialLine:
                partialLine += s
            else:
                partialLine = s
            if not s:
                break
            lines = partialLine.split(separator)
            partialLine = lines.pop()
            yield from lines
        if partialLine:
            yield partialLine

    Now, you can do this:

    from functools import partial

    with open('rdm-example') as f:
        chunks = iter(partial(f.read, 8192), '')
        lines = resplit(chunks, '\0')
        lines = (line + '\n' for line in lines)
    
    # Or, if you're just going to strip off the newlines anyway:
    with open('file-0-example') as f:
        chunks = iter(partial(f.read, 8192), '')
        lines = resplit(chunks, '\0')
    
    # Or, if you have a binary file:
    with open('binary-example', 'rb') as f:
        chunks = iter(partial(f.read, 8192), b'')
        lines = resplit(chunks, b'\0')
    
    # Or, if I understand ysj.ray's example:
    with open('ysj.ray-example') as f:
        chunks = iter(partial(f.read, 8192), '')
        lines = resplit(chunks, '\r\n')
        records = resplit(lines, '\t')
    
    # Or, if you have something that isn't a file at all:
    lines = resplit((packet.body for packet in packets), '\n')
    4037176b-12fa-49a8-88d4-2c9d85093e95 commented 10 years ago

    One last thing, a quick & dirty solution that works today, if you don't mind accessing private internals of stdlib classes, and don't mind giving up the performance of _io for _pyio, and don't need a solution for binary files:

    class MyTextIOWrapper(_pyio.TextIOWrapper):
        def readrecord(self, sep):
            readnl, self._readnl = self._readnl, sep
            try:
                return self.readline()
            finally:
                self._readnl = readnl

    Or, if you prefer:

    class MyTextIOWrapper(_pyio.TextIOWrapper):
        def __init__(self, *args, separator, **kwargs):
            super().__init__(*args, **kwargs)
            self._readnl = separator

    For binary files, there's no solution quite as simple; you need to write your own readline method by copying and pasting the one from _pyio.RawIOBase, and the modifications to use an arbitrary separator aren't quite as trivial as they look at first (at least if you want multi-byte separators).
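The boundary subtlety alluded to above: a multi-byte separator can straddle two reads, so a byte-at-a-time (or tail-checking) scan is needed rather than a per-chunk `find()`. A deliberately simple editor's sketch:

```python
import io

def read_until(stream, separator):
    """Read up to and including `separator` (possibly multi-byte) from a
    binary stream.  Byte-at-a-time keeps the straddling case trivially
    correct, at an obvious performance cost."""
    buf = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            break
        buf += b
        if buf.endswith(separator):
            break
    return bytes(buf)

raw = io.BytesIO(b"one$#two$#tail")
assert read_until(raw, b"$#") == b"one$#"
assert read_until(raw, b"$#") == b"two$#"
assert read_until(raw, b"$#") == b"tail"   # EOF without separator
```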

    7fe5d93b-2a2c-46a0-b5cd-5602c591856a commented 10 years ago

    To make the discussion more specific, here's a patch that adds support for alternative newlines in _pyio.TextIOWrapper. It also updates the documentation and adds more io tests. It does not provide a C implementation or the extended newline support for binary files.

    As a side-effect it also fixes the bug in line_buffering=True behavior, see issue22069O.

    Note: The implementation does no newline translations except in the legacy special cases, i.e., newline='\0' behaves like newline='\n'. This is a key distinction from the behavior described in http://bugs.python.org/file36008/pep-newline.txt

    The initial specification is from https://mail.python.org/pipermail/python-ideas/2014-July/028381.html

    7fe5d93b-2a2c-46a0-b5cd-5602c591856a commented 10 years ago

    > As a side-effect it also fixes the bug in line_buffering=True behavior, see issue22069O.

    It should be bpo-22069 "TextIOWrapper(newline="\n", line_buffering=True) mistakenly treat \r as a newline"

    Reuploaded the patch so that it applies cleanly on the current tip.

    4037176b-12fa-49a8-88d4-2c9d85093e95 commented 10 years ago

    Akira, your patch does this:

    -        self._writetranslate = newline != ''
    -        self._writenl = newline or os.linesep
    +        self._writetranslate = newline in (None, '\r', '\r\n')
    +        self._writenl = newline if newline is not None else os.linesep

    Any reason you made the second change? Why change the value assigned to _writenl for newline='\n' when you don't want to actually change the behavior for those cases? Just so you can double-check at write time that _writetranslate is never set unless _writenl is '\r', '\r\n', or os.linesep?

    7fe5d93b-2a2c-46a0-b5cd-5602c591856a commented 10 years ago

    > Akira, your patch does this:
    >
    >     -        self._writetranslate = newline != ''
    >     -        self._writenl = newline or os.linesep
    >     +        self._writetranslate = newline in (None, '\r', '\r\n')
    >     +        self._writenl = newline if newline is not None else os.linesep
    >
    > Any reason you made the second change? Why change the value assigned to _writenl for newline='\n' when you don't want to actually change the behavior for those cases? Just so you can double-check at write time that _writetranslate is never set unless _writenl is '\r', '\r\n', or os.linesep?

    If newline='\n' then writenl is '\n' with and without the patch. If newline='\n' then write('\n\r') writes '\n\r' with and without the patch.

    If newline='\n' then writetranslate=False (with the patch). It does not change the result for newline='\n' as it is documented now [1]:

    [newline] can be None, '', '\n', '\r', and '\r\n'. ... If newline is any of the other legal values [namely '\r', '\n', '\r\n'], any '\n' characters written are translated to the given string.

    [...] are added by me for clarity.

    [1] https://docs.python.org/3.4/library/io.html#io.TextIOWrapper

    writetranslate=False so that if newline='\0' then write('\0\n') would
    write '\0\n', i.e., embedded '\n' characters are not corrupted if
    newline='\0'. That is why it is the "no translation" patch:

    + When writing output to the stream:
    +
    + - if newline is None, any '\n' characters written are translated to
    +   the system default line separator, os.linesep
    + - if newline is '\r' or '\r\n', any '\n' characters written are
    +   translated to the given string
    + - no translation takes place for any other newline value [any string].
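The documented write-time behavior for the existing legal newline values can be seen directly, with no patch required (editor's sketch):

```python
import io

# With newline='\r\n', every '\n' written is translated to '\r\n';
# with newline='', no translation takes place.
buf = io.BytesIO()
f = io.TextIOWrapper(buf, encoding="ascii", newline="\r\n")
f.write("a\nb\n")
f.flush()
assert buf.getvalue() == b"a\r\nb\r\n"

buf2 = io.BytesIO()
g = io.TextIOWrapper(buf2, encoding="ascii", newline="")
g.write("a\nb\n")
g.flush()
assert buf2.getvalue() == b"a\nb\n"
```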

    vadmium commented 9 years ago

    Related:

    Fixing this issue for binary files would probably also satisfy bpo-17083.

    811ace97-a738-4117-bd90-4469a41b9e37 commented 3 years ago

    Just wondered whether this is still being considered?

    Cheers, Chris.

    pitrou commented 3 years ago

    I don't think so.

    811ace97-a738-4117-bd90-4469a41b9e37 commented 3 years ago

    Oh, what a pity...

    Seemed like a pretty common use case, which is unnecessarily prone to buggy or inefficient (user-)implementations.

    811ace97-a738-4117-bd90-4469a41b9e37 commented 3 years ago

    btw, just something for the record:

    I think the example given in msg109117 above is wrong:

    Depending on the read size it will produce different results, given how split() works:

    Imagine a byte sequence:
    >>> b"\0foo\0barbaz\0\0abcd".split(b"\0")
    [b'', b'foo', b'barbaz', b'', b'abcd']
    
    Now the same sequence, however with a different read size (here a shorter one):
    >>> b"\0foo\0barbaz\0".split(b"\0")
    [b'', b'foo', b'barbaz', b'']
    >>> b"\0abcd".split(b"\0")
    [b'', b'abcd']

    => it's the same bytes, but in the 2nd case one gets an extra b''.
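For completeness: the extra b'' appears when each chunk is split independently. The recipes earlier in the thread sidestep this by carrying the unterminated tail into the next chunk before splitting, which makes the result independent of the read size (editor's sketch):

```python
# Splitting chunk by chunk, but carrying the unfinished tail forward:
chunks = [b"\0foo\0barbaz\0", b"\0abcd"]

records, partial = [], b""
for chunk in chunks:
    partial += chunk
    parts = partial.split(b"\0")
    partial = parts.pop()        # possibly incomplete final record
    records.extend(parts)
if partial:
    records.append(partial)

# Same answer as splitting the whole byte sequence at once:
assert records == b"\0foo\0barbaz\0\0abcd".split(b"\0")
```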