python / cpython

The Python programming language
https://www.python.org
Other
63.49k stars 30.41k forks source link

can't specify newline string for readline for binary files #61285

Open 19793928-8141-4323-94e9-980252738dc0 opened 11 years ago

19793928-8141-4323-94e9-980252738dc0 commented 11 years ago
BPO 17083
Nosy @pitrou, @bitdancer, @4kir4, @vadmium, @maggyero

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-feature', 'library'] title = "can't specify newline string for readline for binary files" updated_at = user = 'https://bugs.python.org/susurrus' ``` bugs.python.org fields: ```python activity = actor = 'Alex Shpilkin' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'susurrus' dependencies = [] files = [] hgrepos = [] issue_num = 17083 keywords = [] message_count = 9.0 messages = ['180984', '180988', '180994', '181099', '181100', '181101', '181104', '181131', '223996'] nosy_count = 8.0 nosy_names = ['pitrou', 'v+python', 'r.david.murray', 'akira', 'martin.panter', 'susurrus', 'maggyero', 'Alex Shpilkin'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue17083' versions = ['Python 3.4'] ```

19793928-8141-4323-94e9-980252738dc0 commented 11 years ago

When opening binary files in Python 3, the newline parameter cannot be set. While this kind of makes sense, readline() can still be used on binary files. This is great for my usage, but it is doing universal newline mode, I believe, so that any \r, \n, or \r\n triggers an EOL.

The data I'm working with is mixed ASCII/binary, with line termination specified by \r\n. I can't read a line (even though that concept occurs in my file) because some of the binary data includes \r or \n even though they aren't newlines in this context.

The issue here is that if the newline string can't be specified, readline() is useless on binary data, which often uses custom EOL strings. So would it be reasonable to add the newline parameter support to binary files? If not, then shouldn't readline() throw an exception when used on binary files?

I don't know if it's helpful here, but I've written a binary_readline() function supporting arbitrary EOL strings:

def binary_readline(file, newline=b'\r\n'):
    line = bytearray()
    newlineIndex = 0
    while True:
        x = file.read(1)
        if x:
            line += x
        else:
            if len(line) == 0:
                return None
            else:
                return line
        # If this character starts to match the newline string, start that comparison til it matches or doesn't.
        while line[-1] == newline[newlineIndex]:
            x = file.read(1)
            if x:
                line += x
            else:
                return line
            newlineIndex += 1
            if newlineIndex == len(newline):
                return line

        # We failed checking for the newline string, so reset the checking index
        newlineIndex = 0
bitdancer commented 11 years ago

If you are reading in binary mode, then all readline does is get you the next \n terminated chunk of data, which is a convenience in some circumstances. You have to do all the newline handling yourself. Otherwise it isn't a binary read.

I think the "right way" to do what you want is to write a custom subclass of one of the IO classes. Should such a subclass be added to Python? That's an interesting question. A first step toward an answer might be to post it as a recipe on the ActiveState site and see how many people find it useful.

19793928-8141-4323-94e9-980252738dc0 commented 11 years ago

I'm not terribly worried about the "right" way for me to deal with my code, but that Python, in this instance, is inconsistent. While it doesn't want you to apply the concept of a "line" to a binary file in that it prevents you from specifying an EOL string, it does allow you to read that file as lines.

So my question is why shouldn't I be able to specify a newline of b"\r\n" and then use readline() on my binary file? I don't see why that concept shouldn't be applied here when it's definitely applicable in a lot of cases (any binary log format).

To resolve this I really think there're two options to maintain the consistency of Python's approach: 1) Have readline() error out for binary-mode files. 2) Allow specifying a byte-string as the newline string for binary files that readline() would then use.

b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 11 years ago

I think Bryant's request is reasonable, for consistency in functionality. If line oriented operations are allowed on binary files, then a binary newline value should be permitted at the time of open.

I think, for handling binary files, that it would also be interesting to have a version of readline that takes a binary newline file as a parameter to each readline call, because in binary files, the concept of newline can vary from section to section of the file... here, null-terminated records, there CR LF terminated encoded text records, elsewhere fixed-length records, and another place might have records delimited by some binary token of one or more bytes. Readline with a newline parameter could be useful in three of those cases, read in the fixed-length case. But this paragraph would be a new feature.

However, simpler binary files, which may have only one type of "terminated" records, could effectively use the operations Bryant is suggesting, which seems quite reasonable to me, along with a mix of read calls for non-delimited data, fixed-length data, or data requiring complex logic to decode.

bitdancer commented 11 years ago

Anything we do here is a new feature.

I have no objection to adding features in this area myself, but I will note that I was shot down for proposing (in another issue) that the newline attribute for text files be allowed to be an arbitrary string.

19793928-8141-4323-94e9-980252738dc0 commented 11 years ago

While my original description of this issue discussed arbitrary strings, I'd like to limit the scope of this issue down to just supporting the newline parameter to builtin.open() for binary files, just as it's supported for regular files. This adds no real new functionality, just makes the treatment of the concept of a "line" consistent between binary files and regular text files, since that concept *does* exist in many cases in binary files.

Given that everyone here seems to think this is at least reasonable at this point, what would the next step be given this is somewhere between a library fix and a feature addition? I haven't contributed to Python before, but the developer FAQ mentions either python-ideas or a PEP, or should this move right towards a patch?

bitdancer commented 11 years ago

Well, it's a feature by our policy, since it currently works as documented.

Probably the first thing would be to get the opinion of someone who works on the IO module, so I've nosied Antoine. Note that this was obviously a conscious design decision, as it is documented clearly.

Another possible first step is the one I suggested: posting a recipe on the recipe site and seeing if there is any uptake. Python-ideas would be another option. For this level of change, I don't believe a PEP is required :)

pitrou commented 11 years ago

I think readline() is quite clear in its current intent (it reads a line, which conventionally is delimited by '\n'). We mau want to add another method, say read_until_delimiter(), but it's not the same as readline() ;-) I think it would be best for it to be discussed, say on python-ideas (*), before any decision is made.

(*) http://mail.python.org/mailman/listinfo/python-ideas

7fe5d93b-2a2c-46a0-b5cd-5602c591856a commented 10 years ago

Related issue bpo-1152248: Add support for reading records with arbitrary separators to the standard IO stack

It suggests to extend the newline support for both text and binary files.