plessbd / ocflzw-decompress

OCF LZW Decompress
GNU General Public License v3.0

IndexError: index out of range #2

Closed mahbuburrahman closed 1 year ago

mahbuburrahman commented 2 years ago

@plessbd @mkzia Thanks for sharing ocflzw-decompress code.

I am getting a few errors with your decompress code.

  1. binascii.Error: Incorrect padding
  2. IndexError: index out of range

If I fix the first error by appending '==' padding to the end of the blob_contents bytes, I get the second error. The code works with the sample blob_contents you provided, but it raises either error 1 or error 2 with any other blob content.

I tried using Python 3.6 and 3.8.

Any thoughts on why this is happening?

Thanks

plessbd commented 2 years ago

Do you have your code in an example, along with the contents of the blob you are trying to decompress?

So far we have successfully decompressed 255k+ blobs from a Cerner database without a decompression error. So if you have one that doesn't work, I would like to fix it.

It sounds like you are using a base64-encoded string, so it will need to be padded with == (if required).
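
A minimal sketch of padding only when required, rather than appending == unconditionally. The helper name `b64decode_padded` is hypothetical and not part of ocflzw-decompress; it uses only the standard-library `base64` module and applies only when the input really is base64 text, not raw bytes from the database.

```python
import base64


def b64decode_padded(data: bytes) -> bytes:
    """Decode base64 input, adding '=' padding only if it is missing."""
    data = data.strip()  # drop surrounding whitespace such as a trailing newline
    missing = -len(data) % 4  # characters needed to reach a multiple of 4
    return base64.b64decode(data + b"=" * missing)


# The sample string from this thread with its '==' padding removed
sample = b"SGVsbG8gZnJvbSBtZS4gSXQgaXMgYSBuaWNlIHdlYXRoZXIgb3V0c2lkZQ"
print(b64decode_padded(sample))
```

Input that is already correctly padded has a length that is a multiple of 4, so nothing is appended to it.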

I put it in that format for ease of use in the test; usually we get the direct bytes from the database.

I have only used this on Python 3.10.

mahbuburrahman commented 2 years ago

We generate the blob from an Oracle Cerner table. These are direct bytes from the Oracle database too. We also tried with a base64-encoded string. A sample base64 content is "SGVsbG8gZnJvbSBtZS4gSXQgaXMgYSBuaWNlIHdlYXRoZXIgb3V0c2lkZQ==".

The code snippet is below.

Read blob file as binary file

def read_content_from_file(infile):
    """Reading content from file"""
    b_content = infile.read()
    b_content = b_content + '=='.encode('utf-8')   #  add padding == to avoid padding errors while base64 decoding
    return b_content 

Decompress code

def lzw_decompress(infile):
    """ Reading content using lzw compression algorithm"""
    blob_contents = read_content_from_file(infile)
    # to test as content from the file may have a problem
    blob_contents = 'SGVsbG8gZnJvbSBtZS4gSXQgaXMgYSBuaWNlIHdlYXRoZXIgb3V0c2lkZQ=='.encode('utf-8')
    lzw = LzwDecompress()
    # uncompressed = lzw.decompress(base64.urlsafe_b64decode(blob_contents))
    uncompressed = lzw.decompress(base64.decodebytes(blob_contents))
    actual = ""
    for bt in uncompressed:
        actual +=chr(bt)
    return actual

I have also just tried with Python 3.10, with the same result.

Thanks for your help.

plessbd commented 2 years ago

That base64 content is not compressed; it is just base64-encoded.

atob('SGVsbG8gZnJvbSBtZS4gSXQgaXMgYSBuaWNlIHdlYXRoZXIgb3V0c2lkZQ==')
"Hello from me. It is a nice weather outside" 

Our database has a COMPRESSION_CD field that equals 728 when the blob is compressed. If it is not 728, the blob is not compressed and should just be decoded from Latin-1.
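
The rule above can be sketched as a small helper. The names `blob_to_text` and `OCF_LZW_COMPRESSION_CD` are hypothetical, and the decompressor is passed in as a parameter (for example `LzwDecompress().decompress`) so the sketch stays self-contained. It assumes the decompressor returns byte values in the range 0-255.

```python
from typing import Callable, Iterable

OCF_LZW_COMPRESSION_CD = 728  # value quoted in this thread for compressed rows


def blob_to_text(
    blob: bytes,
    compression_cd: int,
    decompress: Callable[[bytes], Iterable[int]],
) -> str:
    """Return the blob as text, decompressing only when the row is flagged as compressed."""
    if compression_cd == OCF_LZW_COMPRESSION_CD:
        # bytes() accepts a list of ints, a bytearray, or bytes
        blob = bytes(decompress(blob))
    # Latin-1 maps every byte value to exactly one character, so this never fails
    return blob.decode("latin-1")
```

Checking the code before decompressing avoids feeding plain text to the LZW decoder.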

mahbuburrahman commented 2 years ago

Right, the sample I gave here is just encoded data.

But our original data was generated from an Oracle blob table, and that data is compressed. Since it may contain sensitive patient information, I am not able to share it.

Does the code I shared above look fine?

plessbd commented 2 years ago

As long as the blob contents are actually compressed, it should work. What you have in the example won't work because it isn't compressed. You also would not need to encode it, since it would not be base64 but just the byte stream.

sql_statement = 'SELECT blob_contents, compression_cd, BLOB_LENGTH FROM blob table'
cursor_oracle.execute(sql_statement)
columns = [ele[0].lower() for ele in cursor_oracle.description]
for result in cursor_oracle.fetchall():
    # Assumption: tmp_data maps each lower-cased column name to this row's value
    tmp_data = dict(zip(columns, result))
    blob_contents = tmp_data.get('blob_contents').read()
    # 728 marks a compressed blob; anything else is plain Latin-1 text
    if tmp_data.get('compression_cd') == 728:
        uncompressed_text = ""
        lzw = LzwDecompress()
        uncompressed = lzw.decompress(blob_contents)
        for bt in uncompressed:
            uncompressed_text += chr(bt)
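
The character-by-character loop above can also be written as a single conversion. This is only a sketch: `bytes_to_text` is a hypothetical helper, and it assumes the decompressor returns an iterable of integers in the range 0-255, as the `chr(bt)` loop implies.

```python
def bytes_to_text(uncompressed) -> str:
    """Convert an iterable of byte values (0-255) to a string in one step."""
    # Latin-1 maps byte value n to code point n, which is what chr(n) does
    return bytes(uncompressed).decode("latin-1")


values = [72, 101, 108, 108, 111, 233]
# Same result as building the string with chr() one value at a time
assert bytes_to_text(values) == "".join(chr(b) for b in values)
```

This avoids repeated string concatenation, which can be slow for large blobs.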

What you could try is creating a dummy patient with dummy information that you might be able to share...

mahbuburrahman commented 2 years ago

Thanks for sharing the code. We will generate some fake patient data and share it with you soon.

mahbuburrahman commented 2 years ago

@plessbd : Sorry for the late post. We had to make sure that the data we post here is fake.

I am getting errors for many blobs. A few of them follow. All of the blobs have compression_cd=728 and were retrieved directly from the Oracle Cerner database.

'Error': IndexError('index out of range') b"=\x97\x0eGC0\xc4\xb8a7\x1c\xcd0\x88Q\xa4\xc6p3\x8cFCQ\x91p\xc8e3\x19\x86\x11x\xc9\xb2\x13\x12\x18\x0c\xc6p#1\xbc\xdct:\x18\x8d\x92X\xe1\x98\xe6w4\x9c\xcee\xc31\x8c\xd0a9\x1c\xcc\xa7A\x80\x80\x82r4\x98M\x83\xb3\xec\x96\x0f/\x98\xcc\xe6\xa7\x03\x91\xc6-6\x9cN\xa7\x93\xe2\xb1\x94\xe4d\x84\x98h\xb2Z\x84\xc2e43Si\xf3Y\xbc\xe6w=\x10\x13a\xe7#y\xcc\xdef:\x08\np\x93\x9d\xce\xaei3Qh\xc5\xc3\x19\xbc\xd8o\x82J\xc4\x03\xb8\x19\x94\xc9\x1c3\x9c\x8c\xa6StrVu2\x8c0\xb8\xb3 \xc4s\x16\xc5c\r\xd9x\xb6@\xcb\x9d\xcaa\xe2\x83R\xe6k\x1b\x8f6drxlF\x9b\x17\x8d\xd2\x173\xfa\xcc\xae'a\x8e\xd9\xea\xb2G\xd0h(\xb8v4\x99N\xe6\xb3I\xb8\xc84.\x1dLps\x84\xe4\xc95\x97\x1c\xc6\\x93A\xa4\xceh6u\xcd\x13\xd9\xadxe\x8f.\x1cLb\x02\x1c\x9c\xcci\x8cJ(F\xc1\x018\xdet2\x88\x08\xb2\x8a\xbe\x1c@O7\x0e\x84\x02\x01\x80\xe0/\x0cC@\xbd\xe0\x0cCP\x801\x0cC\xa4\x89\xf3\x11\x05@\x81\xb3G\x1dga\xdav\x16\x91q\xce\x1c\x85\xc8I\xd9v\xdd\xd1\x99\xdfx^7\xf0PU\xd2a\xc8m}\xdf\x97\xed\xfd\x7f\xe0\x18\x0c0\x81x&\x0b\x11\xd0\x80b\x1e^U\xc0f\x19G\x90\xb0 \x13\x06Q\xcd\xda|\x82\x01@C\x83\xa1\x08i\xdb\x85\x1d\xc4\xfa\x17Nd\xb8N\x1dK\xa2\x06\xcd\xe2{`\xf1\x8a\x11\x93%H>\x18\x94\xa1\xc8VUM]8\x85\xedy\x86\xe7\xa1\xea\x1d\x1e\xc7\xb9\xf0\x90\xa5\xa9rS\x98\xe5\xf9F\x1b\x93a\xe9Xq{f\x19\xe6d\x9f\x05\xc1\xa6:\x9a\x9e\x966mP\xe6\xf7\xc5v\x8a\xe0\xf1\xa6s\x98\xa4\xe0\x80k\x1b\x06\xa1\xbci\x1a\x85\xc7hn\x91g\xe9zP\x86g\x89Q\xde\x99^\x07\x88r\x8e\x91\xa8\xf6?\x90d7\n#\x92\x02\x00\xb6,\x80 (\x12\x06\x82 \xa0\xce\x0c\x92j)\xd6\xa0o\x9b\xd0(\x00@@\x00\x00\x00ocf_blob\x00"

'Error': IndexError('list assignment index out of range') b'\x1e\x0f\xcf\x06\xd3\x80\xece9\x1c\xcd&\xf3p\xf4D1\x17\x0c\x04B\x03)\xb8\xc6o2\x1aM\xc6xy\xde6d7\x9d\xceb\xd1\x88\xc8j2\x11\x0f\xc7\xc3\xc1\t\x10\x9eC*\x16J\x04Q\x01\xa0\xe9\x05\x10\x14\n\xa4"a$\x86 \x11\x0bE\xe2\xf2\xb8\xcc\x87D"\x15\x08\x82\x02\xc1 \xa8M&\x08"#\x01\x01L\xe8r4\x98\xce\x94B):("\x9b\x9d\x0e\x03\xaa!\xde\xce.;\x8c\xc5\xc6\xf3\x91\x9c^T)\x0b\xcf\x13x(\xc4_J"\\\xee\xa6\xc1\x88\xb4\xe7X\xad\x1d\x05\xc6C\xa1\x90D>\x05\x0f/\x82\x08!\xb0\xdcs\x87\xd8\xac\x96kE\xaa\xd9n\x17\x8cG9\xcb\xdc\xe0\xd9\x14\xc7d\x07FC$<\x88y7\x18M\xb5\xa2!\xbc\xc6u6\xc5\x8e\x86\x13\xa40\xdd\x88\xc5\x1a\x0c\xa6\x13$\xb3lt6\x19G\xc2cq\x88\xe6p\x1d\x8f\x05\xfc\x0e\x14\xb0_\xbb\xde\xe2G\x86(\xc9\xe7\xa7\x1a;\x080\x07\x9e\x14<\xcd\r:_\xcd\'\xa3(\xea\xa628\x1e\x07b\x0f\x01\xbb\xc4f\xd5\x9aM\x87\x9f>\xd0\xd0o6\x98E\x86\x1a\xc9\x85\xa0v\x06\x97hc\x1b\x06\x11\xcd\x91\x08\x9aQ\xcceV\xdbp\x81\xa5\x1c\x86Q\x9a\x11\x1c\xc6\x81\x84bp\xa0\xf4i\x90BGHLe\x19GA\x94x\x1d!\xa8Dm\x1b\xc7h]\xc2E\x1aP\xea\x0b\x83P\xd4d\x19P\xf10O\x12D\xe1\x0c!\x0cC\x10\xcc6\x0eB\xd0\xdd\x14\x1aZ\x88_\x0c\x84\xdcB\x0c\x040\xe43\x0bCa\x10E\x11\x82\xd0\xd03\x11D \xb49\r\xc4P\xdaV\x95\x04\xe00\x99\x04@\xd00\x11\x91Gq\xde\x08\x87\x06\xf6\x1b\x19\xc2\xd7Tt\x1d\x1f\x97\x9c4z\x9e\xc9\xb9\xa5F\xe7\x19\xd5d\x08\'\x87\xac \x9e\xe7\x00\xb5\xc2\x19\x87I\xdey\xa1&\xf9\xf4-VFt\xde\x8b\x1e\x1b\x90\xf1\xd9\x08 H\x1a\x08ia\xf8\x86#\x89a\x01\x96(\x8a\xa1\x81\x96-\x19\x03\xa1\x88t\x1b\x86a\xb0omg\xd9\xae5\x08\xaa\xea\xc1\xe2\xa4Sy\x12F\x17\xc3@\xc44\r\x83p\xd46\x97\xc3a\x08A\x94&p\xc2A\x10\x841\x047\x94C\x80\xcc9\x11\x84YL0\x0e\x03$M,r\x06\x11\xba\x99\x81t>\x18k\xc6\xb8\x8a$b\x03\xcbv\xdf\xa6\xae 
\x8a\xe4\x18\xc6\xb7\x80r~\xa7A\x95\x85\xa8.\x9b\xad\xdb\x1d\x1d\xda\xd1\xd5\x1c\xa3A\xc8-\x9f\xdeu\xb8b\n\x03\x00\xb3\n\x0c\x02\x90\x800\x1c\x07G\xb3\x00\xc0\xa9\x01\xa6\x92\xa2\x82\x0c\x1f\t\xc2\xf0\xbc;\x10\xc4\x82\x01\xdch\x1ab%\xfen\x18\xdej\x12\x11\x0bGq\xc8arB\x0cQ\t\x9c\x86\xf9\xd2v\xc6\x86|#\x0c\xc7\xb0\xfcE\xec{\x9e!\xdce\xc5\xe90\x80n[_\xa1\xb1\xecF\xf1\xcb\x06\xce\xb1\xcc0)\x9e\xa8\xe4s5\xcd\xc6\xd7\x9f\x10\xa0\xf4\x15\xfe\xfdp\x9e}"\xf4\x7f\xde\xcb\x9d\xe2\x7f\xf1q\xb9\xe7\xa22\x1a\x16\x8f\xc1s\xea\x0fo\xd5\xf6\xddo?\x086p\xb64F2\xf6\xd9\r\xd8\x90\xd1\x95\xec~\x96\xe4ow\x89C\x04o\x13[q]\xdb9\xce\xf1\xdc7>\xdb\xb5i\xc6\xb9\xc6r\x06 Do\x076\x11{\xc2\xf1\x9cu\x7fQ\xabx,u9\xe0\xbd\x89q\w\'\xadr\x03\xbe\xc1\xc9\xeaB\x0e\xac\n\x14\x1c(\x18e{R\x04U\x17\xab\xe0\xb1\x90 \x1eF\xf1\xd7\x9e\x18\xf2A\xb0d\t\x90\xa1\xc4u\x1b\xc3\xb1\xcc \x81p\x06\xd5m\x1eq\xa1\x94s\x1dF\xc1\xd3\xd3\x84\xdf\x9f\x13\xc6\xe7\x87a\xa5\x0b\x89g]\xe5\xbb\x08\x04\x11\xb1\xc2[\xbd\x91\x0clF\xd5\xa0\xb8 \x10\xbb\xbf\r\r\xfa\xfd\xbe\xf1\n=\xd7\xbe\x0b\x19\x13\xffA\x87\xe4\xd9\x06\xe7\x86\x1d\x1fa\xe0}\xe4\x89>\x83\xa7j\xed\xdd\x9b\xb28\xce\xc6\nA#\xa6\xbe\xd5\x99\x0fn\x89\xc59\xa7V\xb4\xdc\x9a\xaa|j\xed\xc5\xaeBE\x0c\xe3\xe1@ p\xa1\x9d\xc3\xb7(b\xd7\x18x.\x06\xae-F\xc2W.\xd1\\xc9\xea_I\xb9o\xc1\xc4\xda\xe5\x9a\xc4!n\xed\xce!\xc2u\x19\x07\x94:\x12\x87\x8a\x0e\x17C\x08Y\x12\xdc\xc4F1\x00\x98\xd6\x1aVlr\x81z\xeb8\x90Y\xd7\xc5\xf8\xeb\xa3\x13\xb1\xe5d\xad\x82\x00\xc2\x1d\x83zE[\xcc\xa4\x10\x06\xf0\xcc\x08\x02\xf8.9qv\x0c\x98\xa86\xd8\x15\xa4K\x84\x0c\xe2)D\x80\xde\xa0$\x049\x89\x8a&#8C\xfb\x0b\xdb\2=P\xca\x1a\xc3x\xa7\x0e\xe2\xaa\xdc\x87\xeb\xf1\x7fA\xd8\x87\x1f\xa1\x14\x84P\xd1&#\xc8XV\xa3"\x84\x8c\x93\xaa>7(\xad\x16\t\x0b\x12\x8e\xd0\xfe/F8),\xa3\x08CyO\r\n\xbcg\x96M\xa3S\xbc\x0c!\x8d\x94\xa0v\xf2\x1b\xdf\xe8 #a\xa9\x06\x1bELEC\x81\x1b\x0c\xa1\xc04\x15\x90\xdc\xef\x03\xa1\x1f\r\xcc\xa05\xc6\x94J\x7f\xc81\xb66OMo<8\x1a\xab\xc3\xb8 
\x08\xc1\xbc\x8c\xbe\xd7\xdeBC;\xd9\x08\xa6\xc8\xb7\x11`\xc6\xf6B\n1[\xee\xe5oGX\xb9+\xe3\xc2\xea\x92\xd1\x06>\xb3h\x8b\x08\xe1\xc4\x9e\x90R"\x82(\xf9E\x13\xe4TQ\x91\xd0\xce\x87\xc9\x00\xdd\ne<\x93\x952UoIt\xd9@\x1a\xcd\x07\x89r~\x89\xb7X\x9bA\xe5%\x1d\x88r\xa1\xaeJ\xa4\x8b+"\xdc]\x96r\xc60\xc1Ij|\xde\x1b\xcfm!\x984\xbd\xb3\xda[W\x03\xf5\x0ca\xa6\x8c=\xc0\xe0B_3\xc2\xa7S\xa1\xe4\xcd\xb2,\x19\xe9\xc9\x1b\x8e\x0f\x1e\x9e\x06\xe2\xb5N\x9c\xf4\x0bw\x93\x8a\x07\x91\xca\x8c\x19\x1e\x9b\xeaSEd3=\x9a\xaa\x83\xe9\xf8gi\x04-\xe9\xc7\x1a\xb5\x1agS\xf1\x823\xe9o\x12\xc0\xc4\xea\xa0\xd4\xfe\x8frfB\xc9\xbaK!h\xfd\x08\xa42\x1e\x81\xd2I\x1a\x1e!\x881\x924\x9a\x8a\xd2\x8a/\x10+\xb4B\xaf\x14\x06?\xc4\xa9\x03 \xec\x8c\xa1\xa4V\x02\x86J[(\xa1\xa9<=\x07\xc1\xbeWW\n]\x05\xe9\x84a\nK\x95\xf3N\xa0@\xfc\xe9\xe8 <\xf1\xd2\xd0\x06\xea\xe3\\xe3\xcdu\x93\x165CW\x9a\x07G\xa85\xba\x88tD\xb8k3\xe8|\x8e\xb0\xb4J\xbe\xc3\xa61%\'\xed@\xb1\x94n\x81JhMo.\x8aq\xb7\xf0\xb6\xccW\xab7b,\xed\x9f\xad\xf6\xc6\xd1F\n_\x18\xc2!@\xb5uD1\xda\xe9\xf3\x1d\xeb\x93\xb6\xae\x976\xdb\\\xfb!((-\x93\xbet&\xcbB\xcb\x03q.\x1d\x0e\xa24\x82\xe4\xb4k\x13s,]\xf0\x93V>NY\xa6\xe1t\xf0M~\x89\xd7\x02E\xdd\x9a)r\xa8\xb4W\xa5Qj\xd8K\x0bGx\xb0\xd4e\x99\xc7\xf4\xda\xbb\xc3\x84\xbd\x9c\xf1\x1f}\xec\xc9\xde\x1f\xa3\xe6\xbd\xd9\x13%\r\x13\x10\x8d\x98\x00\xe4\x1dg\xab\xe0\xa7u"u\x11\xc9z\x1c\x03\x84l=\xf0!\x12\x877\x92\xbd\xde\xea}\xbdS\xee\xf6;|\x07Fc\xe6\x06\xa3\x96\xf6\xbd\xe0\xbb\xed\x83i\x1d\xd8\xb0R>\x1b\\y%\x84\xecM\xe0\x8c\x8e\xd2\xef[+\xda\x02\x82\xa2\x16\r\xd3a\xe2\x87R\x0eB^\xc8m\xc6x\xb9y\xd6\xb9\xc6\x9fA\x00u\xab\x93\r7\x07#mO\x93r"\x98\xab~\xb1\x063\xfa\xef+Nhy\x12\xd8\x10\x93\xa7t\x82\xd4\xcc\xdb\xce\x8fL4\xc7,\xd1.\xc8Di\r\xcfe\xe7\xbd\xb6\xfed#\x83\xc8FD\'M\x9bP@\x0c\xc1\x81~\x06\xe0\xd8\x1a\x82\xd5\x87\xa9#\xaeHu\x96\x93\r\xbbMZ\x02\x82\x99\x1be(D\xfa:\x8de\x97\xc6\xba\xd5\xf8sX\xdb0\x82\x1b\x8dQV\x0e\x18\xcb\x1c\x87(\x06\x14\x02\x08-\x08p\x0c&\xec\xb0\xa7\x1e\x02\x80h\x0f$.\x9fQ\x80\x82\x81\xdf9\xb4=\xf8co\xe6\x03\x96vN
\x98/\xdcf+r\xa0#\x13\xb9\xce\xa8d:\xfb\x9c\xbe\x03\xe2\x02\x00\x00\x00\x00ocf_blob\x00'

'Error': IndexError('list assignment index out of range') b"!\x9b\r&3X\x80\xe8h2\x88\x0c\x86\xf3\x19\xd4\xdae7\x1d\x04\x103t\x18\xc4e6\x1b\xce\xf0sx\x80\xe6e\x85B!FcI\xb8\xc2l\x10\x19\x8d\xe7#i\x84\xe8t2\x99\x04\x07c)\xc8\xe6i7\x9b\x84\x06\xf34\x1e\x13\x142\xcc&\xc0\xa0Q\x12\x1b\x0f\x88\xc4\xcaFS\x19\xa4\xe0i\xa5\x08\t\x86\x93\x99\xd0\x14-\xac\xd6\xabu\xca\xe8\xb6\x8aJ\x88\x9b\x8d&i\xb0\x80\xa9\x112Q\x08\xf1\x19\xb4\xa0@P\x99\x1ae\xe7(!\xcc@D4\x9d\xaa\xb3\x93p(c\x80\x10\x13`\x86\x83I\x9c\xc3; \xcdE\xc2\x02q\\x14W0\x9c\xf0\xa6\xe39\xd2t,\xbc\x10\xc4\x03!\x80\xc0b0\x05\x14\xce\x92\xf3\xa9\xcct !\x9b\xcd\xa7\x03e\nd \x18\x8c\x85\xfb<\xe8\xc4j \x1c\x8e\x86;\xa1\x88\xcc@A&\xd1\xa37\xa9\xb1\xe7\x05B4\x1b\xcc\x9a\x82LJm'\x94\x90\xc9\xc4\x82\x98\x80\x9eu:F\x8d\xf0b,\xb8\xd2l\x05\x1a\xa6&\xebQ\xc8\x80c7\x1bLb\xe9a\x9c\x14\x00\x80\x80\x00\x00ocf_blob\x00"

Thank you very much for your help.

plessbd commented 1 year ago

Sorry, I didn't get notifications for this message. Is it still an issue for you?

If not, I will close this; if it is, we can start looking at it.

mahbuburrahman commented 1 year ago

@plessbd : Yes, it is still an issue. I couldn't resolve this error. I would be grateful if you could have a look at the blobs given above.

Also, some blobs decompress successfully without any error, but the decompressed files contain both readable text and unreadable binary content. Do you know why this is? My guess is that the original blobs contain both text and an image.

mahbuburrahman commented 1 year ago

It looks like I was able to resolve it by following the instructions given in the other thread: https://github.com/plessbd/ocflzw-decompress/issues/4

plessbd commented 1 year ago

Excellent! I will work to update the documentation to be more clear.