wbolster / plyvel

Plyvel, a fast and feature-rich Python interface to LevelDB
https://plyvel.readthedocs.io/

Error "Corruption: corrupted compressed block contents" #137

Closed Avnsx closed 2 years ago

Avnsx commented 2 years ago

I'm trying to read Google Chrome's Local Storage LevelDB, located in %LOCALAPPDATA%\Google\Chrome\User Data\Default\Local Storage\leveldb. If I delete all contents of the folder, browse only a couple of websites, and then close Chrome, I can read it with plyvel. But when I browse too many websites and then close Chrome, it starts outputting the error in the title. (Chrome has to be closed: while it is running it keeps using the database and blocks other programs from reading it, and the most recent Local Storage changes are not yet saved to the leveldb folder, unless you create a temporary copy of the folder and read that instead.) How do I solve this issue?

used code:

import plyvel

db = plyvel.DB(r'C:\Users\me\AppData\Local\Google\Chrome\User Data\Default\Local Storage\leveldb', compression=None)
tok = db.get(b'mykey').decode('utf-8')  # errors here

I'm using the plyvel for Windows 10 fork on Python 3.8.10, but I'm very sure that's not the issue and that your most up-to-date repo, which I can't even install on Windows - the most used operating system in the world -, will replicate the exact same behaviour.
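A minimal sketch of the temporary-copy approach mentioned above. The helper names `snapshot_leveldb` and `read_key` are hypothetical and not part of plyvel. Skipping the `LOCK` file is an assumption: the running browser may hold it, and LevelDB should recreate it when the copy is opened. A copy taken while Chrome is still writing may not be fully consistent.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_leveldb(src_dir):
    """Copy a LevelDB folder to a fresh temporary directory and return its path."""
    dest = Path(tempfile.mkdtemp(prefix="leveldb-copy-")) / "leveldb"
    # Assumption: LOCK can be skipped because LevelDB recreates it on open.
    shutil.copytree(src_dir, dest, ignore=shutil.ignore_patterns("LOCK"))
    return dest


def read_key(src_dir, key):
    """Open a snapshot of src_dir with plyvel and return the value stored for key."""
    import plyvel  # imported lazily so the copy helper works without plyvel

    db = plyvel.DB(str(snapshot_leveldb(src_dir)))
    try:
        return db.get(key)
    finally:
        db.close()
```

With this in place, `read_key(path_to_chrome_leveldb, b'mykey')` reads from the copy, so the original folder is never opened by plyvel.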

wbolster commented 2 years ago

i guess the contents are compressed?

Avnsx commented 2 years ago

i guess the contents are compressed?

Yeah, that would make sense, since this only happens when a lot of Local Storage data has been continuously saved into the Local Storage leveldb folder. I'm very new to LevelDB. Do you know which compression algorithms are commonly used with LevelDB, or have you ever had this issue yourself? In theory I should be able to decompress the data and then read it with plyvel again, but how would I figure out the compression algorithm?

wbolster commented 2 years ago

don't specify anything and it will detect and use snappy if needed

wbolster commented 2 years ago

which I can't even install on windows - the most used operating system in the world

also wondering what you're trying to imply here

Avnsx commented 2 years ago

don't specify anything and it will detect and use snappy if needed

I just read through the documentation again and couldn't find any functionality to decompress. As for not specifying anything, if I understood you correctly, you meant it like this(?):

db = plyvel.DB(dbdir)

That still ends with the db.get afterwards erroring out. Even specifying compression='snappy' ends in the same error.

I also used repair_db, and that causes my database to lose around 80% of its contents, including the key, which I guess was saved in the compressed parts of the leveldb?
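Since a repair can drop data, one cautious option is to run it on a copy only. The sketch below assumes `plyvel.repair_db(name)` as the repair entry point (the function used above); the wrapper name `repair_copy` and its `repair` parameter are hypothetical, added so the copy logic can be exercised without plyvel installed.

```python
import shutil
import tempfile
from pathlib import Path


def repair_copy(src_dir, repair=None):
    """Repair a copy of src_dir and return the copy's path; src_dir is left untouched."""
    if repair is None:
        import plyvel  # lazy import; only needed for the real repair

        repair = plyvel.repair_db
    work = Path(tempfile.mkdtemp(prefix="leveldb-repair-")) / "leveldb"
    # Assumption: the LOCK file does not need to be copied.
    shutil.copytree(src_dir, work, ignore=shutil.ignore_patterns("LOCK"))
    repair(str(work))
    return work
```

Comparing the keys readable from the returned copy with those from the original gives a rough idea of how much a repair would discard before committing to it.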

wbolster commented 2 years ago

just tried this on (a copy of) this directory on my machine:

$ ls ~/.config/google-chrome/Default/'Local Storage'/leveldb/
000005.ldb  003096.ldb  003097.ldb  003099.ldb  003101.log  003102.ldb  CURRENT  LOCK  LOG  LOG.old  MANIFEST-000001

with these versions:

>>> import plyvel
>>> plyvel.__version__
'1.3.0'
>>> plyvel.__leveldb_version__
'1.22'

which gives me

>>> db = plyvel.DB('db/')
>>> next(iter(db))
(b'META:chrome-extension://...', b'...')

and similarly:

>>> db.get(b'META:chrome://bookmarks')
b'...'

i see data from chrome, though it's chrome's internal binary format so good luck interpreting that.

Avnsx commented 2 years ago

I have the exact same versions as you do, but I can't run your code snippet without getting the error in the title. Why is this happening for me?

Using this repo: https://github.com/AustEcon/plyvel-wheels, with Python 3.8.10 and Chrome 96.0.4664.93 for Windows 10 64-bit

wbolster commented 2 years ago

🤔 perhaps your leveldb build lacks snappy support altogether? (ldd on the .so file will tell you on linux, no clue about other operating systems)
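The ldd check suggested above can be scripted. This is a Linux-only sketch under stated assumptions: the helper names are hypothetical, `plyvel._plyvel` is taken to be the compiled extension module, and a statically linked snappy would not show up in ldd output, so a negative result is only a hint.

```python
import shutil
import subprocess


def mentions_snappy(ldd_output):
    """Return True if any line of ldd output names a snappy shared library."""
    return any("libsnappy" in line for line in ldd_output.splitlines())


def plyvel_links_snappy():
    """Run ldd on plyvel's compiled extension and look for libsnappy (Linux only)."""
    import plyvel._plyvel as ext  # lazy import of the compiled extension module

    if shutil.which("ldd") is None:
        raise RuntimeError("ldd not found; this check is Linux-specific")
    result = subprocess.run(
        ["ldd", ext.__file__], capture_output=True, text=True, check=True
    )
    return mentions_snappy(result.stdout)
```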

Avnsx commented 2 years ago

🤔 perhaps your leveldb build lacks snappy support altogether? (ldd on the .so file will tell you on linux, no clue about other operating systems)

First off, thanks a lot for taking the time to actually try to help me, I really appreciate it 👍 I spent the entire time trying to reproduce this and eventually switched to a virtual machine with a fresh Windows 10, English Chrome, and Python 3.10.1.

There I could run your snippet without issues at first. But later I figured out that there is some kind of additional compression once the local storage grows above 800 kb: all the files that were in the folder before 800 kb was reached get merged into one single file, which then ends up being between 200 and 300 kb.

You can reproduce this yourself if you delete all contents of the Local Storage leveldb folder and then browse enough websites until 800 kb is reached.
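To watch for that tipping point while reproducing, a small helper can total the folder's bytes per file type. This is only an observation aid; the function name `size_by_suffix` is hypothetical. In LevelDB, `.log` files hold the write-ahead log and `.ldb` files hold sorted tables whose blocks may be compressed, so a shift of bytes from `.log` to `.ldb` would match the merge described above.

```python
from collections import Counter
from pathlib import Path


def size_by_suffix(db_dir):
    """Return total bytes per file suffix in a LevelDB folder.

    Files without a suffix (CURRENT, LOCK, LOG, MANIFEST-...) are keyed by name.
    """
    totals = Counter()
    for path in Path(db_dir).iterdir():
        if path.is_file():
            totals[path.suffix or path.name] += path.stat().st_size
    return dict(totals)
```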

I prepared this short snippet for the Run window, which comes up in Windows when you press Windows + R. It manages to overload the Local Storage leveldb with enough data every time; once the additional compression kicks in and you try to read the data with plyvel, you get the error from the title.

chrome de-de.facebook.com ups.com www.asds.net twitter.com www.wattpad.com de.wikipedia.org www.facebook.com www.instagram.com www.reddit.com www.apple.com vimeo.com www.google.com www.twitch.tv www.youtube.com

python code I used afterwards to read the leveldb after additional compression:

import plyvel
db = plyvel.DB(r'C:\Users\MyUserName\Appdata\Local\Google\Chrome\user data\Default\Local Storage\leveldb')
for each in db.iterator():print(each)

I don't understand what compression is used for this. Wikipedia says LevelDB only uses snappy compression, and Chrome is listed as using LevelDB, so what are they even doing after 800 kb to cause this error?

i see data from chrome, though it's chrome's internal binary format so good luck interpreting that.

Also, this was not the case for me: not everything uses the internal binary format; around 90% of it was always entirely visible to me as raw, human-readable strings.

wbolster commented 2 years ago

do you have a stack trace? which call fails exactly and where?

never heard of two types of compression in leveldb. missing snappy lib leads to compression error messages that can be confusing. plyvel linux binary wheels accidentally suffered from that at some point in the past

Avnsx commented 2 years ago

do you have a stack trace? which call fails exactly and where?

never heard of two types of compression in leveldb. missing snappy lib leads to compression error messages that can be confusing. plyvel linux binary wheels accidentally suffered from that at some point in the past

Short video I recorded: https://youtu.be/vBLqgjMJelw It does not show how the files went from below 800 kb to a single file, because the links I selected in the first run did not set enough local storage data to get anywhere near the 800 kb limit; I ended up with only 7 kb.

The second time I ran the same code, after opening far more websites, which loaded more than 800 kb into the Local Storage leveldb folder, the extra compression kicked in and the total file size was reduced to 247 kb. I also checked that this data must be compressed rather than discarded: websites that, for example, only use local storage to save account information such as a token still remembered me after the compression. So Chrome somehow decompresses it and feeds it back to local storage in the browser, which you can see if you press F12 > Application tab > Local Storage.

Here's the leveldb that plyvel raises an error with: https://easyupload.io/j2z8mr

Traceback (most recent call last):
  File "C:\Users\Rando\Desktop\dog.py", line 3, in <module>
    for each in db.iterator():print(each)
  File "plyvel\_plyvel.pyx", line 841, in plyvel._plyvel.Iterator.__next__
  File "plyvel\_plyvel.pyx", line 886, in plyvel._plyvel.Iterator.real_next
  File "plyvel\_plyvel.pyx", line 91, in plyvel._plyvel.raise_for_status
plyvel._plyvel.CorruptionError: b'Corruption: corrupted compressed block contents'

Since chrome is based on chromium, I guess this might help if you understand C++ because I don't https://github.com/chromium/chromium/search?q=.ldb

@wbolster

wbolster commented 2 years ago

i tried this on my chrome profile's local storage database which is >10 mb large, and i cannot reproduce at all:

import plyvel
db = plyvel.DB('db')
print(list(db))

this dumps lots of stuff to the screen.

my plyvel is compiled with libsnappy support, as ldd on the installed .so files shows:

$ ldd .direnv/python-3.9.7+/lib/python3.9/site-packages/plyvel/_plyvel.cpython-39-x86_64-linux-gnu.so 
    linux-vdso.so.1 (0x00007ffc64941000)
    libleveldb-44f63a48.so.1.22.0 => /tmp/foo/.direnv/python-3.9.7+/lib/python3.9/site-packages/plyvel/../plyvel.libs/libleveldb-44f63a48.so.1.22.0 (0x00007f81c448b000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f81c424d000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007f81c4109000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f81c40ee000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f81c40cd000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007f81c3eff000)
    libsnappy-63ba3ec5.so.1.1.8 => /tmp/foo/.direnv/python-3.9.7+/lib/python3.9/site-packages/plyvel/../plyvel.libs/libsnappy-63ba3ec5.so.1.1.8 (0x00007f81c3ce1000)
    /usr/lib64/ld-linux-x86-64.so.2 (0x00007f81c492a000)

wbolster commented 2 years ago

i tried the same on your sample database, and it also worked fine:

>>> import plyvel
>>> import pprint
>>> db = plyvel.DB('leveldb')
>>> pprint.pprint(list(db.iterator(include_value=False)))
[b'META:https://de.wikipedia.org',
 b'META:https://vimeo.com',
 b'META:https://www.apple.com',
 ...  # snip
 b'_https://www.youtube.com\x00\x01yt.innertube::nextId',
 b'_https://www.youtube.com\x00\x01yt.innertube::requests',
 b'_https://www.youtube.com\x00\x01ytidb::LAST_RESULT_ENTRY_KEY']

Avnsx commented 2 years ago

i tried the same on your sample database, and it also worked fine:

>>> import plyvel
>>> import pprint
>>> db = plyvel.DB('leveldb')
>>> pprint.pprint(list(db.iterator(include_value=False)))
[b'META:https://de.wikipedia.org',
 b'META:https://vimeo.com',
 b'META:https://www.apple.com',
 ...  # snip
 b'_https://www.youtube.com\x00\x01yt.innertube::nextId',
 b'_https://www.youtube.com\x00\x01yt.innertube::requests',
 b'_https://www.youtube.com\x00\x01ytidb::LAST_RESULT_ENTRY_KEY']

Did you try this with the plyvel for Windows version? Maybe @AustEcon did not compile it properly. Or do you have any idea why I can't run it? If libsnappy or whatever didn't work at all, I shouldn't have been able to build it or use plyvel on the smaller database in the first place, right? But as you can see in the video, it works; unfortunately, as soon as the database gets bigger and I try to read it with plyvel again, I get the error 😨

wbolster commented 2 years ago

my testing was on an up-to-date linux system using the official (built by myself 🙃) plyvel wheel packages. i have not tried on windows, and i cannot / will not either; i have not used windows at all for ~20 years now.

that said, technically, snappy is an optional dependency for leveldb, but not compiling leveldb against it is setting yourself up for nasty surprises… since it means databases using compression (most of them in the real world!) cannot be opened. i further suspect leveldb+snappy use opportunistic compression, meaning only data that benefits from it gets compressed. this could explain the ‘tipping point’ you see.
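A minimal sketch of what opportunistic compression means here. To my understanding, LevelDB's table builder keeps the compressed form of a block only when it is at least 12.5% smaller than the raw block, and records the choice in a per-block type byte (0 for uncompressed, 1 for snappy). The sketch uses `zlib` from the standard library as a stand-in for snappy, and the name `encode_block` is hypothetical.

```python
import zlib

NO_COMPRESSION = 0  # type byte for a block stored raw
COMPRESSED = 1      # type byte LevelDB uses for snappy; zlib stands in here


def encode_block(raw, compress=zlib.compress):
    """Return (type_byte, payload) for one block.

    The compressed form is kept only if it saves at least 12.5%.
    """
    compressed = compress(raw)
    if len(compressed) < len(raw) - len(raw) // 8:
        return COMPRESSED, compressed
    return NO_COMPRESSION, raw
```

Under this rule a small database can end up with no compressed blocks at all, which a build without snappy reads fine; once blocks marked as compressed appear, such a build has no way to decode them.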

wbolster commented 2 years ago

closing since this is very likely not an issue in this repo

Avnsx commented 2 years ago

Since chrome is based on chromium, I guess this might help if you understand C++ because I don't https://github.com/chromium/chromium/search?q=.ldb

Just a side note: I think it's pretty funny how someone from the Chromium project / Google read through this issue ticket and removed every single line of code that was associated with .ldb. I think at this point they're intentionally trying to dodge decompression of Chrome's leveldb.

iamqiz commented 2 years ago

@Avnsx same problem here when using the Windows plyvel build from AustEcon/plyvel-wheels 😂 trying to use leveldb on Windows is very difficult 😂 i am so curious why google doesn't compile leveldb for Windows 😂

iamqiz commented 2 years ago

@Avnsx a workaround is to use leveldb in WSL on Windows, see more here https://gist.github.com/Aceralon/d94a562840b858adc8585d7e44cbaa96

QGB commented 2 years ago

does plyvel have RepairDB?

zmic commented 2 years ago

Did you try this with the plyvel for Windows version? Maybe @AustEcon did not compile it properly. Or do you have any idea why I can't run it? If libsnappy or whatever didn't work at all, I shouldn't have been able to build it or use plyvel on the smaller database in the first place, right? But as you can see in the video, it works; unfortunately, as soon as the database gets bigger and I try to read it with plyvel again, I get the error 😨

FYI, I can confirm this problem occurs on Chromium database if your build of leveldb does not link in the snappy library. I had to rebuild leveldb with snappy (on Windows), then the problem disappeared.

Avnsx commented 1 year ago

had to rebuild leveldb with snappy (on Windows), then the problem disappeared.

Can you publish your build, so I can try it? @zmic