internetarchive / warcprox

WARC writing MITM HTTP/S proxy

If gdbm is not available, fall back to anydbm. #6

Closed jcushman closed 10 years ago

jcushman commented 10 years ago

gdbm isn't available in the built-in Python 2.7.5 on Mac OS X 10.9.1. Falling back to anydbm seems to work, although I don't know what the differences might be.

nlevitt commented 10 years ago

The dbm stuff is kind of a mess unfortunately. I guess anydbm is fine as a fallback. Some dbms don't have sync() so we should probably check for that.
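For illustration, the fallback and the sync() check described here could look roughly like the sketch below. This is not warcprox's actual code: the alias `dbm_impl` and the helper `sync_if_supported` are hypothetical names, and the import chain simply tries the Python 2 and Python 3 module names in turn.

```python
# Prefer gdbm, then fall back to the generic dbm front end.
# Module names differ between Python 2 (gdbm, anydbm) and Python 3 (dbm.gnu, dbm).
try:
    import gdbm as dbm_impl              # Python 2 gdbm
except ImportError:
    try:
        import dbm.gnu as dbm_impl       # Python 3 gdbm
    except ImportError:
        try:
            import anydbm as dbm_impl    # Python 2 generic fallback
        except ImportError:
            import dbm as dbm_impl       # Python 3 generic fallback


def sync_if_supported(db):
    """Call db.sync() if the backend provides it.

    Returns True when sync() was called, False when the backend has no
    such method (hypothetical helper, not part of any dbm module).
    """
    sync = getattr(db, "sync", None)
    if callable(sync):
        sync()
        return True
    return False
```

A caller would open the database with `dbm_impl.open(path, "c")` and use `sync_if_supported(db)` wherever it would otherwise call `db.sync()` directly.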

Python3 dropped dbhash, which was the best of the bunch. (That might be what your mac anydbm chooses, which means if you want to switch to python3, you won't easily be able to carry your dedup db forward.) As you discovered, even though the dbm stuff is a part of core python, parts of it are left out of the install on many systems. Also because it's part of core python, listing it as a requirement in setup.py doesn't work.

It would be preferable to use a 3rd party library that can be specified as a requirement. Do you guys know of something good for that?

jcushman commented 10 years ago

I don't know anything about file-based DBs in general or your deduplication database in particular, so this may not help much. :) But to start with, is this an application that benefits from bdb over sqlite? My impression is that sqlite doesn't have such a complicated ecosystem.
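To make the sqlite idea concrete, here is a minimal sketch of a dict-like key/value store built on the standard-library sqlite3 module. It assumes the dedup database only needs simple get/set/delete by key, which the thread does not confirm. The class name `SqliteDict`, the table name `kv`, and the key format in the usage note are all hypothetical.

```python
import sqlite3

try:
    from collections.abc import MutableMapping   # Python 3
except ImportError:
    from collections import MutableMapping       # Python 2


class SqliteDict(MutableMapping):
    """Hypothetical dict-like key/value store backed by a sqlite3 file."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key BLOB PRIMARY KEY, value BLOB)")
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, value))

    def __delitem__(self, key):
        cur = self.conn.execute("DELETE FROM kv WHERE key = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        for (key,) in self.conn.execute("SELECT key FROM kv"):
            yield key

    def __len__(self):
        return self.conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

    def sync(self):
        # Mirror the sync() method that some dbm backends offer.
        self.conn.commit()

    def close(self):
        self.conn.commit()
        self.conn.close()
```

Usage would look like `db = SqliteDict("dedup.sqlite"); db[b"some-digest"] = b"some-record"; db.sync()`. Since sqlite3 ships with Python itself, this approach would not need a separate entry in setup.py.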

If it has to be bdb, https://pypi.python.org/pypi/bsddb3/ looks robust and actively developed. Installing it on a Mac sounds like a pain (you have to separately install the berkeleydb libraries using homebrew, and then set an environment variable pointing to the libraries before running pip), but at least it can be explicitly required and doesn't involve a Python reinstall.

One good option here might be progressive enhancement. I'm currently using warcprox to create standalone .warc files for individual web pages, so I'm not using the dedupdeebee option at all (much as I love its poetry). If you imported bsddb3 only on demand, with a fallback to anydbm and a warning that the resulting DB would not be portable, that might be the friendliest solution for both power users and dabblers.
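A rough sketch of that on-demand import follows. The function name `open_dedup_db` is hypothetical, and the sketch assumes bsddb3's legacy `hashopen` interface is an acceptable way to open the file. bsddb3 is imported only when the function is called, so users who never enable deduplication never need it installed.

```python
import warnings


def open_dedup_db(path):
    """Open a dedup db, preferring bsddb3 and falling back to a generic dbm.

    Hypothetical helper: bsddb3 is imported lazily so it stays optional.
    """
    try:
        import bsddb3   # third-party, needs the Berkeley DB libraries
    except ImportError:
        bsddb3 = None

    if bsddb3 is not None:
        return bsddb3.hashopen(path, "c")

    warnings.warn(
        "bsddb3 is not installed; falling back to a generic dbm backend. "
        "The resulting db file may not be portable.")
    try:
        import anydbm as generic_dbm   # Python 2
    except ImportError:
        import dbm as generic_dbm      # Python 3
    return generic_dbm.open(path, "c")
```

Either branch returns an object that supports `db[key] = value` and `db.close()`, so the calling code would not need to know which backend it got.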

nlevitt commented 10 years ago

Thanks jcushman. I'm hesitant to switch to bsddb3 because the API looks to be lower-level. By contrast, the dbm modules have the same API as a dict, which is nice. Maybe there's a wrapper out there. However, when I try installing bsddb3 with pip I get "Can't find a local Berkeley DB installation." That means it won't "just work" by adding a requirement to setup.py, which is the same problem we have now with gdbm. :-\

eldondev commented 10 years ago

I don't know that we have anybody worried about migrating from one dedupdb to the other. Let's worry about facilitating people who want to use it first, and then cross the migration bridge when we come to it. Chances are we will want multiple deduplication strategies in the long run anyway.