spotify / sparkey

Simple constant key/value storage library, for read-heavy systems with infrequent large bulk inserts.
Apache License 2.0

does sparkey support open a store for writing subsequent times #32

Closed zqhxuyuan closed 8 years ago

zqhxuyuan commented 8 years ago

After creating an index file with a SparkeyWriter and closing it, can I reopen that same index file and append new data to it? My scenario: the source data comes from many large files. I read the large files one by one and write to several different index files, and each new large file should append data to the existing index files.

For example, file1:

absssss,1111
acsssss,1222

When indexing the original data, I plan to group by the first two characters of the key, so that all keys with prefix ab go into the ab.spi file.

file2:

abssss,23444
cdssss,34444

Finally there are 3 index files: ab.spi, ac.spi, cd.spi. When querying abssss I only query ab.spi, and when querying cdssss I only query cd.spi.

This is essentially query routing to reduce each file's query load, because we have nearly 100 billion records. That's why I want to split the data across index files.

One way is to keep the SparkeyWriters in memory in a map, like:

ab->SparkeyWriter1
ac->SparkeyWriter2
cd->SparkeyWriter3

and read all the original files just once: read a line, take the first two characters of the key, look up the corresponding SparkeyWriter in the map, and write the entry to that index.
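The writer-map routing described above can be sketched in plain Java. Here a `StringBuilder` stands in for a `SparkeyWriter` so the sketch runs without the sparkey dependency; a real version would create and write to actual sparkey writers instead (the sparkey-java calls for that are not shown here and should be taken from the library's own docs).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the prefix-routing idea: one writer per two-character key
// prefix, looked up from a map while scanning the input once.
// StringBuilder is a stand-in for SparkeyWriter so this runs standalone.
public class PrefixRouter {
    private final Map<String, StringBuilder> writers = new HashMap<>();

    // Route one "key,value" line to the writer for its two-char prefix.
    public void write(String line) {
        String key = line.substring(0, line.indexOf(','));
        String prefix = key.substring(0, 2); // e.g. "ab" -> ab.spi
        writers.computeIfAbsent(prefix, p -> new StringBuilder())
               .append(line).append('\n');
    }

    public String contents(String prefix) {
        StringBuilder sb = writers.get(prefix);
        return sb == null ? "" : sb.toString();
    }

    public int writerCount() {
        return writers.size();
    }

    public static void main(String[] args) {
        PrefixRouter router = new PrefixRouter();
        for (String line : List.of("absssss,1111", "acsssss,1222",
                                   "abssss,23444", "cdssss,34444")) {
            router.write(line);
        }
        // Three prefixes seen -> three writers (ab, ac, cd).
        System.out.println(router.writerCount()); // 3
    }
}
```

Each open writer costs only a small buffer and (in the real library) a file descriptor, which is why the single-pass approach is feasible for a few hundred prefixes.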

But having to read every file and keep all the SparkeyWriters open until the work is done seems slow, and memory may be insufficient. So I want to know: does sparkey support opening a store for writing multiple times?

When I checked LinkedIn's paldb project, its FAQ says:

Can you open a store for writing subsequent times?  
No, the final binary file is created when StoreWriter.close() is called.

So I want to know: does sparkey support this feature?
Or maybe I don't need to split the index file at all: would queries against a single file holding all 100 billion entries be fast enough with sparkey?

spkrka commented 8 years ago

Sparkey uses two files, a log file and an index file. The log file can be reopened and appended to. The index file cannot be appended to, but it can be rebuilt from the log file.
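As a toy illustration of the split described above (not sparkey's actual on-disk format): writes go to an append-only log, and the point-lookup index is derived state that can be rebuilt at any time by replaying the log, with the last write for a key winning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a log + rebuildable index. The log can always be appended
// to; the index is thrown away and rebuilt by replaying the log, e.g.
// after reopening the log and appending more entries.
public class LogIndex {
    private final List<String[]> log = new ArrayList<>(); // append-only

    public void append(String key, String value) {
        log.add(new String[] {key, value});
    }

    public Map<String, String> rebuildIndex() {
        Map<String, String> index = new HashMap<>();
        for (String[] entry : log) {
            index.put(entry[0], entry[1]); // later entries win
        }
        return index;
    }
}
```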

I don't see any problems with reading through your input once and maintaining a set of active writers. Each active writer requires a small amount of memory and a file descriptor. If you only need around 100 or even 1000 writers or so, this should not be a problem at all.

You could also open, append and close a writer for each entry. It will work fine, but it might be less efficient.

You might not need to split the file at all. The limiting factor is the memory required to build the index and the index requires about 20 bytes per entry. For quick random queries you would want to fit everything in RAM, which would likely not be possible if everything is stored in a single file.
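The rough arithmetic behind the sizing above: at about 20 bytes of index per entry, 100 billion entries need on the order of 2 TB for the index alone, which is why a single unsplit file is unlikely to fit in RAM on one machine.

```java
// Back-of-the-envelope index sizing for the ~20 bytes/entry figure
// mentioned above, applied to the ~100 billion entries in the question.
public class IndexSizing {
    static final long BYTES_PER_ENTRY = 20L;

    public static long indexBytes(long entries) {
        return entries * BYTES_PER_ENTRY;
    }

    public static void main(String[] args) {
        long entries = 100_000_000_000L; // ~100 billion entries
        long bytes = indexBytes(entries);
        System.out.println(bytes / 1_000_000_000_000L + " TB"); // 2 TB
    }
}
```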