python / cpython

The Python programming language
https://www.python.org
Other
61.88k stars 29.76k forks source link

bz2.BZ2File doesn't support multiple streams #45966

Closed 55c68216-1b09-421f-b5a4-c375afb3e360 closed 13 years ago

55c68216-1b09-421f-b5a4-c375afb3e360 commented 16 years ago
BPO 1625
Nosy @loewis, @akuchling, @jcea, @pitrou, @vstinner, @djc, @merwok, @bitdancer
Files
  • bz2_patch.tar.bz2: issue1625 - bz2 multiple stream patch v1
  • py3k_bz2.patch: issue1625 - bz2 multiple stream patch v2
  • py3k_bz2.patch: issue1625 - bz2 multiple stream patch v3
  • py3k_bz2.patch: issue1625 - bz2 multiple stream patch v4
  • bz2ms.patch
  • cpython-bz2-streams.patch: Python 3.3 patch to Lib/bz2.py
  • bz2-multiple-stream-support-issue1625.patch: bz2 multiple stream patch for Python 2.7.2
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['type-feature', 'library'] title = "bz2.BZ2File doesn't support multiple streams" updated_at = user = 'https://bugs.python.org/therve' ``` bugs.python.org fields: ```python activity = actor = 'python-dev' assignee = 'nirai' closed = True closed_date = closer = 'nadeem.vawda' components = ['Library (Lib)'] creation = creator = 'therve' dependencies = [] files = ['15001', '15014', '15075', '15177', '18939', '22114', '23231'] hgrepos = [] issue_num = 1625 keywords = ['patch', 'needs review'] message_count = 56.0 messages = ['58619', '59897', '60236', '60268', '93323', '93326', '93405', '93407', '93408', '93721', '93841', '94316', '94318', '94321', '94434', '94446', '94499', '101521', '115198', '116964', '118946', '132842', '136689', '136690', '136723', '136753', '136790', '136799', '136889', '137018', '137019', '137074', '137098', '137197', '137228', '137229', '137230', '144404', '144406', '144443', '144445', '144479', '144551', '144552', '144553', '144782', '152144', '152222', '152257', '152327', '152393', '152513', '152551', '152645', '152674', '152681'] nosy_count = 19.0 nosy_names = ['loewis', 'akuchling', 'niemeyer', 'jcea', 'pitrou', 'therve', 'vstinner', 'thomaslee', 'nadeem.vawda', 'djc', 'eric.araujo', 'nirai', 'r.david.murray', 'dbonner', 'ozan.caglayan', 'Kontr-Olli', 'python-dev', 'gkcn', 'Crispin.Wellington'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue1625' versions = ['Python 3.3'] ```

    55c68216-1b09-421f-b5a4-c375afb3e360 commented 16 years ago

    The BZ2File class only supports one stream per file. It possible to have multiple streams concatenated in one file, it the resulting data should be the concatenation of all the streams. It's what the bunzip2 program produces, for example. It's also supported by the gzip module.

    Once this done, this would add the ability to open a file for appending, by adding another stream to the file.

    I'll probably try to do this, but the fact it's done in C (unlike gzip) makes it harder, so if someone beats me to it, etc.

    2ba56af2-62ea-4996-b0d1-725349dcf07d commented 16 years ago

    If you're referring to an 'append' mode for bz2file objects, it may be a limitation of the underlying library: my version of bzlib.h only provides BZ2_bzWriteOpen and BZ2_bzReadOpen - it's not immediately clear how you would open a BZ2File in append mode looking at this API.

    It may be possible to implement r/w/a using the lower-level bzCompress/bzDecompress functions, but I doubt that's going to happen unless somebody (such as yourself? :)) cares deeply about this.

    akuchling commented 16 years ago

    Like gzip, you can concatenate two bzip2 files:

    bzip2 -c /etc/passwd >/tmp/pass.bz2

    bzip2 -c /etc/passwd >>/tmp/pass.bz2

    bunzip2 will output both parts, generating two copies of the file.

    So nothing needs to be done on compression, but uncompression needs to look for another chunk of compressed data after finishing one chunk.

    55c68216-1b09-421f-b5a4-c375afb3e360 commented 16 years ago

    The gzip module supports reopening an existing file to add another stream. I think the bz2 module should not the same.

    53e4c739-6ca7-415d-8358-bcd9aada3608 commented 14 years ago

    I've got a patch that fixes this. It allows BZ2File to read multi-stream files as generated by pbzip2, allows BZ2File to open files in append mode, and also updates bz2.decompress to allow it to handle multi-stream chunks of data.

    We originally wrote it against 2.5, but I've updated the patch to py3k trunk, and attached it here. If there's interest in a patch against 2.7 trunk, please let me know.

    53e4c739-6ca7-415d-8358-bcd9aada3608 commented 14 years ago

    sorry, the previous patch was from an old version. attaching the correct version now. apologies for the noise.

    pitrou commented 14 years ago

    Some notes about posting patches:

    I'll look at the patch itself another day, I don't have the time right now. But thanks for posting it!

    53e4c739-6ca7-415d-8358-bcd9aada3608 commented 14 years ago

    Thanks for the reply.

    My company's legal dept. told me that we needed to put the boilerplate into the files as part of releasing it under the apache license. I used a tarball because they also recommended including a full copy of the license with the patch.

    I'm reattaching just the patch to the bug now. I'll check with legal and see if they'd have a problem with removing the boilerplate.

    bitdancer commented 14 years ago

    If the patch is substantial enough that legal boilerplate is even an issue, then I'm pretty sure a contributor agreement will be required for patch acceptance, at which point I think the boilerplate won't be needed. The Apache license is certainly acceptable. I'm obviously not the authority on this, though. That would be van Lindburg.

    53e4c739-6ca7-415d-8358-bcd9aada3608 commented 14 years ago

    I can remove the boilerplate from the code as long as I add the following to the submittal:

    VMware, Inc. is providing this bz2 module patch to you under the terms of the Apache License 2.0 with the understanding that you plan to re-license this under the terms and conditions of the Python License. This patch is provided as is, with no warranties or support. VMware disclaims all liability in connection with the use/inability to use this patch. Any use of the attached is considered acceptance of the above.

    pitrou commented 14 years ago

    As far as I can tell, the patch looks mostly good. I just wonder, in Util_HandleBZStreamEnd(), why you don't set self->mode to MODE_CLOSED if BZ2_bzReadOpen() fails.

    As a sidenote, the bz2 module implementation seems to have changed quite a bit between trunk and py3k, so if you want it to be backported to trunk (2.7), you'll have to provide a separate patch.

    53e4c739-6ca7-415d-8358-bcd9aada3608 commented 14 years ago

    Hrm...yeah, I should probably be setting it to closed as soon as BZ2_bzReadClose() returns, and then back to open once BZ2_bzReadOpen succeeds. Wasn't intentional...thanks for the catch. You guys need a new patch with that change in it?

    I'll try and get a 2.7 patch done and uploaded in a day or two.

    bitdancer commented 14 years ago

    A new patch will make it more likely that it will actually get applied :)

    Thanks for your work on this.

    53e4c739-6ca7-415d-8358-bcd9aada3608 commented 14 years ago

    Understandable. New patch attached.

    pitrou commented 14 years ago

    I'm not comfortable with the following change (which appears twice in the patch):

    -           BZ2_bzReadClose(&bzerror, self->fp);
    +           if (self->fp)
    +               BZ2_bzReadClose(&bzerror, self->fp);
                break;
            case MODE_WRITE:
    -           BZ2_bzWriteClose(&bzerror, self->fp,
    -                    0, NULL, NULL);
    +           if (self->fp)
    +               BZ2_bzWriteClose(&bzerror, self->fp,
    +                        0, NULL, NULL);

    If you need to test for the file pointer, perhaps there's a logic flaw in your patch. Also, it might be dangerous in write mode: could it occur that the file isn't closed but the problem isn't reported?

    53e4c739-6ca7-415d-8358-bcd9aada3608 commented 14 years ago

    That was mostly just out of paranoia, since the comments mentioned multiple calls to close being legal. Looking at it again, that particular case isn't an issue, since we don't hit that call when the mode is MODE_CLOSED. The testsuite runs happily with those changes reverted.
    Should I upload a new patch?

    pitrou commented 14 years ago

    That was mostly just out of paranoia, since the comments mentioned multiple calls to close being legal. Looking at it again, that particular case isn't an issue, since we don't hit that call when the mode is MODE_CLOSED. The testsuite runs happily with those changes reverted.
    Should I upload a new patch?

    You don't need to, but on the other hand I forgot to ask you to update the documentation :-) (see Doc/library/bz2.rst)

    53e4c739-6ca7-415d-8358-bcd9aada3608 commented 14 years ago

    Picking this back up again. There's actually no docs changes necessary...the docs never mentioned that the module didn't support multiple logical streams, and I didn't see any other mentions in the docs that seemed to need updating. I supposed I could add something along the lines of "BZ2File supports multiple logical streams in a single compressed file", if that's what you/re looking for.

    Working on a patch for trunk as well.

    a2d4b850-15a1-4ee9-8841-c561519deb94 commented 13 years ago

    Dear all,

    first of all, thank you for the patch making multiple file-streams in bz2 available in python. Yesterday, I've tried to adapt the recent patch for python 3k to the actual python 2.7. Most of the hunks could be easy adapted by editing just one ore two things concerning the different file-layout between p3k and python 2.7.

    Unfortunatelly it wasn't possible for me to completly downgrade this patch to python 2.7. Especially the last hunk in the patch and also the hunks which are related to self->rawfp couldn't be adapted succesfully by me.

    Could anybody assist me to make this patch available for python 2.7 or does a patch for this python version already exist?

    If you like, I can upload my recent changes in this patch for further investigation.

    Thank you!

    best regards, Oliver Deppert

    pitrou commented 13 years ago

    Here is an update of the patch against current py3k. I've added a copyright mention at the top of Modules/bz2module.c which I hope manages to capture the essence of msg93721. Martin, what do you think?

    a2d4b850-15a1-4ee9-8841-c561519deb94 commented 13 years ago

    Thanks for the update....

    Like I mentioned before in my previous comment, I'm still searching for a solution/patch for python 2.x able to handle multiple streams of bz2.

    Does anybody know a work-around or have a solution porting the p3k-patch to the good old python 2.x?!

    At the moment I try to use this patch with py3k and it seems to work. But I need this multistream option for pythone 2.x because the most of the time I deal with matplotlib....and matplotlib at the moment isn't able to deal with py3k....so, I would be very happy for any suggestion running multiple-streams with python 2.x ! !

    Thank you very much, and best regards Kontr-Olli

    pitrou commented 13 years ago

    The patch here is totally out of date, following bpo-5863.

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 13 years ago

    Hi, I attach a patch to Python 3.3 Lib/bz2.py with updated tests: cpython-bz2-streams.patch

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 13 years ago

    Thanks for the patch. I'll review it tomorrow.

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 13 years ago

    Wait, the tests seem wrong. I'll post an update later today.

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 13 years ago

    OK, I'll hold off on doing a detailed review until then.

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 13 years ago

    False alarm; go ahead with the review. I took a look too early in the morning before caffeine kicked in.

    Note Lib/test/test_bz2.py was directly upgraded from bz2ms.patch.

    A note on bz2 behavior: A BZ2Decompressor object is only good for one stream; after that eof is set and it will refuse to continue to the next stream; this seems in line with bzip2 manual: http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#bzDecompress

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 13 years ago

    False alarm; go ahead with the review. I took a look too early in the morning before caffeine kicked in.

    No worries. I know the feeling.

    The tests look fine. The bodies of testRead() and testReadMultiStream() appear to have been swapped, though. I'm guessing testRead() was supposed to remain unmodified, with testReadMultiStream() testing the case of streams=5?

    For the change to BZ2File._fill_buffer(), I'm not sure that the check for end-of-file is correct. It seems that if the end of rawblock lines up with the end of a stream, the mode will be set to EOF even if there are more streams waiting to be read. Is this possible?

    A note on bz2 behavior: A BZ2Decompressor object is only good for one stream; after that eof is set and it will refuse to continue to the next stream; this seems in line with bzip2 manual

    I think this is the right way to do things. BZ2Decompressor is a low- level wrapper around the underlying C API -- it should not grow extra features unnecessarily. Also, certain use-cases might depend on being able to detect the end of a logical stream; this would not be possible if BZ2Decompressor were to be changed.

    The difference in behaviour from BZ2File and decompress() should probably be documented, though.

    7b124e7b-b61d-49f6-9762-fd36b627e237 commented 13 years ago

    Right! I updated the patch and added a test for the aligned stream/buffer case.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 13 years ago

    New changeset 8cebbc6473d9 by Nadeem Vawda in branch 'default': Issue bpo-1625: BZ2File and bz2.decompress() now support multi-stream files. http://hg.python.org/cpython/rev/8cebbc6473d9

    New changeset 0be55601f948 by Nadeem Vawda in branch 'default': Update bz2 docs following issue bpo-1625. http://hg.python.org/cpython/rev/0be55601f948

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 13 years ago

    Committed. Once again, thanks for the patch!

    merwok commented 13 years ago

    I made a few comments and asked two questions on the review page. (I should have said so here.)

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 13 years ago

    I seem to be unable to log in to rietveld, so I'll reply here.

    > result += decomp.decompress(data) Is this efficient? I understood that other Python implementations had poorly performing str.__iadd__, and therefore that using a list was the common idiom (using “return b''.join(result)” at the end).

    Good point. I hadn't thought about other implementations.

    Also, you're right about the superfluous comments in test_bz2; I'll do a general cleanup of the test code soon.

    Looks good. I only have one paranoid comment: since the tests use self.TEXT * 5 to create multiple streams, the ordering of the files is not tested. IOW, the code could mix-up the order of the files and the tests would not catch that. Is it a concern?

    I wouldn't think so. It's not as though there is an index that the code looks at to find the location of each stream. It just reads the data from the file and if it reaches the end of one stream, it assumes that the following data is a new stream, and decompresses it accordingly.

    That said, I wouldn't be opposed to adding a test for that sort of thing (just for paranoia's sake :P) if it doesn't involve adding large amounts of additional binary data in the file. I'll come back to it once I've tidied up the existing code.

    merwok commented 13 years ago

    If you’re logged into Roundup, you should automatically be logged into our Rietveld instance. You can file a bug on the meta-tracker (link in the left sidebar) if this does not work.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 13 years ago

    New changeset 48e837b2a327 by Nadeem Vawda in branch 'default': Miscellaneous cleanups to bz2 and test_bz2 following issue bpo-1625. http://hg.python.org/cpython/rev/48e837b2a327

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 13 years ago

    If you’re logged into Roundup, you should automatically be logged into our Rietveld instance.

    I thought this was the case, but it isn't working for me. I've filed a bug on the meta-tracker.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 13 years ago

    New changeset 3e5200abf8eb by Nadeem Vawda in branch 'default': Issue bpo-1625: Add stream ordering test to test_bz2. http://hg.python.org/cpython/rev/3e5200abf8eb

    290550c4-1290-49b5-be3b-57fda9aa7e97 commented 12 years ago

    I ported the bz2ms.patch to Python 2.7.2 and it works correctly within the bz2 module.

    But when you open a multistream (tar)bz2 with the tarfile module, even the tarfile uses the BZ2File() class, there exists unextracted missing files. I'll now try with the current tip.

    290550c4-1290-49b5-be3b-57fda9aa7e97 commented 12 years ago

    With the current tip everything works correctly. I think it's because of the complete rewrite of the bz2 module with python and the refactoring of _bz2.so.

    290550c4-1290-49b5-be3b-57fda9aa7e97 commented 12 years ago

    Attached patch is a revised version of bz2ms.patch against Python 2.7.2. The patch is tested using tarfile and bz2 modules. It also passes the included tests correctly.

    It also imports a missing class from BytesIO to fix the tests.

    It's up to you to take that into 2.7.x branch or not.

    merwok commented 12 years ago

    We don’t add news features in stable releases. Nadeem has closed this bug as fixed for 3.3 and it can’t go in 2.7, so I think we’re done here.

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 12 years ago

    Ozan: Thanks for taking the time to backport the patch. Unfortunately, as Éric said, 2.7 is in maintenance mode, so it no longer receives new features.

    fc834a70-ae28-4d80-9c32-a67f29245cbc commented 12 years ago

    This is all fine and well, but this is clearly a bug and not a feature. Can we please see this bug fix go into 2.7 at some point?

    e2bbff90-8059-49e9-9894-08caf76c01a5 commented 12 years ago

    +1. If we think this as a bug, python 2.x users will never be able to extract multiple-stream bz2 files.

    e2bbff90-8059-49e9-9894-08caf76c01a5 commented 12 years ago

    I mean "as a feature".

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 12 years ago

    This is all fine and well, but this is clearly a bug and not a feature.

    No, it is not at all clear that this is a bug. I agree that this is a desirable capability to have, but nowhere does the module claim to support multi-stream files. Nor is it an inherent feature of the underlying bzip2 library that we are failing to expose to users.

    [...] python 2.x users will never be able to extract multiple-stream bz2 files.

    Incorrect. It is perfectly possible to extract a multi-stream bz2 file in 2.x - you just need to open it with open() and decompress the data using BZ2Decompressor (pretty much the same way that 3.3's BZ2File does it).

    If there is really a large demand for these facilities in 2.x, I would be willing to port 3.3's BZ2File implementation to 2.x and make it available on PyPI. But this patch is not going in to the 2.7 stdlib; it is simply not the sort of behavior change that is acceptable in a bugfix release.

    jcea commented 12 years ago

    I think this support should be backported to Python 2.7 and 3.2. Current code can't decompress files generated by "pbzip2", fairly popular. I would consider that a bug, not a feature request.

    I am just recompressing a 77GB file because of this :-(.

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 12 years ago

    I am just recompressing a 77GB file because of this :-(.

    Sorry to hear that :(

    I would consider that a bug, not a feature request.

    Semantic issues aside, my concern here is that the patch for 2.7 is considerably larger than the one for 3.3, and the code it modifies is more fragile.

    An alternative solution I'd like to pursue is to backport 3.3's BZ2File implementation to run on 2.7, and release it on PyPI. That way there's no risk of introducing regressions for people who don't need this facility, and those who do can get it without needing to upgrade to 2.7.3 (or even wait for it to be released). It shouldn't take much effort to get it working on 2.6 as well.

    How does that sound?

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 12 years ago

    An alternative solution I'd like to pursue is to backport 3.3's BZ2File implementation to run on 2.7, and release it on PyPI.

    Well, that was easier than I expected. It didn't take much work to get it working under 2.6, 2.7 and 3.2. I've put up this "bz2file" module on GitHub \https://github.com/nvawda/bz2file\. I'll package it up and upload it to PyPI sometime soon.

    merwok commented 12 years ago

    I think this support should be backported to Python 2.7 and 3.2. I think our policy is pretty clear: the module docs did not say multiple streams were supported, so when support for them was added it was clearly a new feature.

    Current code can't decompress files generated by "pbzip2", fairly popular. That is unfortunate, but I don’t think that the features of one tool, even popular, should make us break the policy. That’s why I concur with Nadeem and think that a PyPI backport is the only way to provide multi-streams bz2 to previous versions.

    Best regards