erickim555 opened 3 years ago
Hey, that's not supported.
@iFA88 Do you happen to know what it would take to get "import external SST files" working together with "perform reads on my RocksDB table from Python"?
One idea is: I could separately run the C++ RocksDB code to do the "import external SST files" step against /path/to/my_db.db, and then load this RocksDB from python-rocksdb. Do you happen to know if this would work?
Given that python-rocksdb's `rocksdb.DB(db_name, opts, read_only=read_only)` is a thin wrapper around `rocksdb::DB::Open()`, I'm ~90% sure that the following pipeline will work for ingesting externally-generated SST files into python-rocksdb:
(1) Generate external SST files, e.g. from a big-data pipeline like Spark/MapReduce.
(2) Ingest the external SST files: on my dev machine, run a simple C++ program that calls `db_->IngestExternalFile()` plus a manual compaction to ingest them into a RocksDB at "/path/to/my_db.db".
(3) Load the RocksDB from python-rocksdb: on my dev machine, create a handle to the new DB with `rocksdb.DB("/path/to/my_db.db", opts)`.
https://github.com/twmht/python-rocksdb/blob/master/rocksdb/_rocksdb.pyx#L1630
https://github.com/twmht/python-rocksdb/blob/3a5df052072dfb23fe63992df02d2e971d640320/rocksdb/db.pxd#L162
https://github.com/twmht/python-rocksdb/blob/3a5df052072dfb23fe63992df02d2e971d640320/rocksdb/db.pxd#L167
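As a toy model of those three steps (no RocksDB involved; the filenames and helper functions here are purely illustrative, not real APIs), the flow can be sketched in pure Python:

```python
import json
import os
import tempfile

# Toy stand-ins for the three pipeline steps above. write_external_file
# and ingest are illustrative helpers, not part of any RocksDB binding.

def write_external_file(path, pairs):
    # Step (1): generate an "external file" of sorted key/value pairs.
    # (SstFileWriter likewise requires keys in sorted order.)
    with open(path, "w") as f:
        json.dump(sorted(pairs), f)

def ingest(store, paths):
    # Step (2): "ingest" the external files by merging them into the store.
    for p in paths:
        with open(p) as f:
            store.update(dict(json.load(f)))

tmp = tempfile.mkdtemp()
f1 = os.path.join(tmp, "file1.json")
f2 = os.path.join(tmp, "file2.json")
write_external_file(f1, [("b", "2"), ("a", "1")])
write_external_file(f2, [("c", "3")])

# Step (3): open the merged store and read.
db = {}
ingest(db, [f1, f2])
print(db["a"], db["c"])  # -> 1 3
```

The real pipeline replaces the JSON files with SST files and the dict merge with `db_->IngestExternalFile()`, but the shape of the data flow is the same.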
I'll give it a try and see how it goes
How did it go? And how do you generate the SST from your data pipeline? Would be really cool to have a Python way of generating and ingesting SST files.
This library has `SstFileWriter` and `ingest_external_file` implemented: https://github.com/Congyuwang/RocksDict.
`pip install rocksdict`. With pre-built wheels, there is no need to compile.
Build & write demo:
```python
from rocksdict import Rdict, Options, SstFileWriter
import random

# generate some random bytes
rand_bytes1 = [random.randbytes(200) for _ in range(100000)]
rand_bytes1.sort()
rand_bytes2 = [random.randbytes(200) for _ in range(100000)]
rand_bytes2.sort()

# write to file1.sst
writer = SstFileWriter(options=Options(raw_mode=True))
writer.open("file1.sst")
for k, v in zip(rand_bytes1, rand_bytes1):
    writer[k] = v
writer.finish()

# write to file2.sst
writer = SstFileWriter(options=Options(raw_mode=True))
writer.open("file2.sst")
for k, v in zip(rand_bytes2, rand_bytes2):
    writer[k] = v
writer.finish()

# create a new Rdict in raw mode and ingest the SST files
d = Rdict("tmp", options=Options(raw_mode=True))
d.ingest_external_file(["file1.sst", "file2.sst"])
d.close()

# reopen, check that all key-value pairs are there
d = Rdict("tmp", options=Options(raw_mode=True))
for k in rand_bytes2 + rand_bytes1:
    assert d[k] == k
d.close()

# delete tmp
Rdict.destroy("tmp")
```
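One detail worth calling out in the demo: the `.sort()` calls are not optional, since SST files store keys in the comparator's sort order and the writer rejects out-of-order inserts. A minimal pure-Python sketch of that invariant (`SortedWriter` is a hypothetical stand-in for illustration, not part of rocksdict):

```python
# SortedWriter is a hypothetical stand-in that mimics the sorted-key
# requirement of SstFileWriter; it is not a real rocksdict class.
class SortedWriter:
    def __init__(self):
        self.last_key = None
        self.entries = []

    def put(self, key, value):
        # SST files store keys in order, so out-of-order inserts are rejected.
        if self.last_key is not None and key <= self.last_key:
            raise ValueError("keys must be added in ascending order")
        self.last_key = key
        self.entries.append((key, value))

w = SortedWriter()
w.put(b"a", b"1")
w.put(b"b", b"2")
try:
    w.put(b"a", b"3")  # out of order: rejected
except ValueError as e:
    print("rejected:", e)
```

This is why the demo sorts each batch before writing it out as an SST file.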
FYI, you should use `Options(raw_mode=True)` with rocksdict so the DB isn't created with a custom comparator.
Yeah, that's right. Let me fix it.
To optimize for initial large bulk loads, this RocksDB blog post recommends creating the SST files externally (e.g. from a big-data pipeline like Spark/MapReduce) and importing them into your DB: http://rocksdb.org/blog/2017/02/17/bulkoad-ingest-sst-file.html
The post refers to the C++ RocksDB API, e.g. `db_->IngestExternalFile()`. Does python-rocksdb support this kind of "ingest external SST files" (e.g. `db_->IngestExternalFile()`) behavior? I didn't see this function listed in python-rocksdb. Thanks!