nyurik / timeseriesdb

Automatically exported from code.google.com/p/timeseriesdb
GNU General Public License v3.0

Indexing for BinCompressedFiles #10

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I have built a tick database that stores all ticks indexed by contract, then by 
timestamp (there are 60,000+ contracts, so storing each contract in a separate 
file was not practical).

The source files include NxCore stream tapes (www.nxcoreapi.com) and several 
nightly tapes direct from the exchanges (mostly bonds, etc.). All of these files 
are read into memory, converted to tightly encoded structs, and then sorted 
first by contract Id, then by timestamp. I then write out two files: one holds the 
raw structs in a master file (~15 GB) with all of the ticks for all contracts for 
a single day. The other is an index that includes the contract name to 
contract Id translation, along with pointers (index and length) to each block of 
ticks by session Id (there may be more than one session in a 24-hour day), 
along with metadata like the volume per session, high/low, tick counts, etc.
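
For concreteness, a minimal sketch of what one such index record could look like; the field names and types here are illustrative assumptions, not the actual format:

```csharp
// Illustrative sketch only - field names/types are assumptions, not the actual format.
public struct SessionIndexEntry
{
    public int ContractId;       // resolved via the contract name -> Id translation table
    public int SessionId;        // a 24-hour day may contain more than one session
    public long FirstTickIndex;  // index of the first tick of this block in the raw master file
    public int TickCount;        // length of the block, in ticks
    public long Volume;          // per-session metadata
    public double High;
    public double Low;
}
```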

After building these files each night, I combine all of the index files 
into one master index, which allows coalescing all sessions for a given 
contract and maintains pointers to each file/offset/count for each session.

The main goal of all these layers of abstraction is to allow extremely 
rapid addition of new tapes (~20 minutes to add an entire day's worth of ticks), 
with extremely fast lookup by contract / session: O(log n * log m), where n is the 
number of distinct contract names and m the number of distinct sessions. n is 
generally 500,000 contracts or fewer, and m is roughly 3,600 days in the 
database at most, making this a very fast lookup. Once you have the contract 
and session looked up, it is an O(1) operation to get the ticks.
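
As a sketch of that two-stage lookup (reusing the hypothetical SessionIndexEntry above; all names are illustrative, not the actual index code):

```csharp
using System;

public static class MasterIndexLookup
{
    // Illustrative only: binary search over contract names (log n, n <= ~500,000),
    // then over that contract's session dates (log m, m <= ~3,600), then an O(1)
    // read of the block the entry points to.
    public static SessionIndexEntry Find(
        string[] sortedContractNames,
        DateTime[][] sortedSessionDatesByContract,
        SessionIndexEntry[][] sessionsByContract,
        string contractName,
        DateTime sessionDate)
    {
        int c = Array.BinarySearch(sortedContractNames, contractName, StringComparer.Ordinal);
        if (c < 0) throw new ArgumentException("unknown contract");

        int s = Array.BinarySearch(sortedSessionDatesByContract[c], sessionDate);
        if (s < 0) throw new ArgumentException("no session for that date");

        return sessionsByContract[c][s]; // holds file/offset/count: ticks are one seek away
    }
}
```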

This actually collapses into two groups:

Master index: contract/session tuples with pointers to raw tick file 
locations.

Raw tick data: collections of ticks loaded from a 24-hour exchange file, ~7,200 
files at 1-15 GB/file.

The database as it stands right now is ~10 TB and is actually stored on a ZFS 
SMB share on a FreeBSD machine, as many of my raw tick files were suffering 
from bit rot (I then cache a lot of files locally to improve lookups).

I would like to keep the indices as series files, as they are now, for rapid 
lookup, and store the ticks in compressed files, but I need an O(1) 
lookup mechanism for compressed files. I have tested adding a numerical index 
to each tick and using it as a lookup mechanism, but it increases the size of 
the files by ~10-20%, and seeks over SMB (network share to ZFS) are incredibly 
slow. Thus I actually cache all indexes locally / in memory, and would like to do 
a single seek/read over the network to get the ticks. Having to do a binary 
search on the compressed file is quite expensive.

As I write the bulk tick files, I could use some sort of callback/output 
parameter to keep track of which logical block we are in, and use that along 
with a sequence id to create pointers to the blocks that I care about. In this 
manner, it would be an O(1) operation to load up the blocks over the network, 
and then decompress / further filter exactly the ticks I requested.

This would require a way to request ticks by index (like series files), and a way 
to track the blocks used for each tick as it is written so they can be stored in the index.
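
To make the request concrete, one possible shape for that callback and the pointer it would populate; these names are hypothetical and are not part of the current BinCompressedFile API:

```csharp
// Hypothetical - not the existing library API. The writer would invoke the
// callback once per compressed logical block, and the caller would store the
// resulting pointers in its own session index.
public delegate void BlockWrittenHandler(long firstItemIndex, long blockFileOffset, int itemCount);

public struct BlockPointer
{
    public long FirstItemIndex;  // sequence id of the first tick in the block
    public long FileOffset;      // byte offset of the compressed block within the file
    public int ItemCount;        // number of ticks stored in the block
}
```

With those pointers in the master index, reading a session becomes one seek/read of the covering blocks followed by local decompression and filtering, which is the O(1) behavior described above.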

Let me know what you think / how to best go about this,

-Karl

Original issue reported on code.google.com by kar...@gmail.com on 27 Mar 2012 at 3:22

GoogleCodeExporter commented 9 years ago
Karl, is it 60,000 or 500,000 contracts? You gave both numbers.

In either case, I still think that storing one contract per file will work in 
your case if you put some large number of files (e.g. 10,000) per dir (the dir 
name can be contractId / 10000 * 10000).
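
A minimal sketch of that bucketing scheme; only the contractId arithmetic comes from the suggestion above, the ".bts" extension and layout are assumptions:

```csharp
using System.IO;

public static class ContractPaths
{
    // Group files into directories of 10,000 contracts each; the directory name
    // is the bucket's base id (integer division truncates).
    public static string For(string rootDir, int contractId)
    {
        int bucket = contractId / 10000 * 10000;
        return Path.Combine(rootDir, bucket.ToString(), contractId + ".bts");
        // e.g. contractId 123456 -> <rootDir>/120000/123456.bts
    }
}
```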

You will rely on the file system (which is a good caching database after all) 
to find the needed contract. Sessions are a bit harder - but since the files 
are one per contract, the file is compressed, and the binary searches are 
cached, the binary search within the file might not be a huge perf hit - more 
testing would be needed.

To speed up import, adding data to a massive number of files can be done in an async 
manner: a Dictionary of ContractID => CachingBufferPerContract, where each buffer 
asynchronously writes blocks of data the moment it reaches a threshold, while the main 
thread continues parsing the tape files. The .NET task lib is your friend (can't wait to 
start using the "async" keyword).

Also, depending on your usage pattern, you might consider caching contracts in 
local temp files - whenever a contract is requested, its entire data is 
copied (possibly async too) to a local drive, and once ready, all subsequent 
calls go to the local copy, while still monitoring the source file for increases in 
length.

You might also consider a hybrid DB approach - contract metadata might be better 
off stored in some SQL database, as it would be much better suited for 
various fast lookups and inserts/updates.

And finally, I think I will soon try to implement some sort of network 
database service, since the timeseriesDb lib already supports working with 
streams rather than files. The client would still use the IEnumerableFeed<,> 
interface to get the data, but from a NetworkFeed rather than a local one, and 
all seeking would be done locally on the server.
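
As a rough illustration of that "seek on the server" idea (this is not the planned NetworkFeed, just a sketch: the client sends a file name, byte offset, and length, and the server seeks locally and streams back only that range):

```csharp
using System.IO;
using System.Net;
using System.Net.Sockets;

public static class RangeServer
{
    // Illustrative only: a blocking, single-threaded loop that answers
    // (file, offset, length) requests with the raw bytes of that range.
    public static void Serve(string dataDir, int port)
    {
        var listener = new TcpListener(IPAddress.Any, port);
        listener.Start();
        while (true)
        {
            using (var client = listener.AcceptTcpClient())
            using (var net = client.GetStream())
            using (var reader = new BinaryReader(net))
            {
                string relativePath = reader.ReadString();  // written by BinaryWriter.Write(string)
                long offset = reader.ReadInt64();
                int length = reader.ReadInt32();

                using (var file = File.OpenRead(Path.Combine(dataDir, relativePath)))
                {
                    file.Seek(offset, SeekOrigin.Begin);    // the seek happens locally, not over SMB
                    var buffer = new byte[length];
                    int read = file.Read(buffer, 0, length);
                    net.Write(buffer, 0, read);             // one round trip per block request
                }
            }
        }
    }
}
```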

I will try to post some more ideas if they come to mind.

P.S. I just ran the Demo5 sample with 100 million items (local RAID0). Your case 
would probably be the "no optimizations" one, but you might still gain a lot of 
speed from core IO with compression.

Finished creating files with 100,000,000 items:

 DeltaType.Positive and Calculated index: 100,087,380 bytes in 00:00:15.5549400
                      DeltaType.Positive: 200,172,590 bytes in 00:00:18.1105744
                        No optimizations: 300,239,708 bytes in 00:00:21.0126544
                            Uncompressed: 1,600,000,992 bytes in 00:01:10.0256527

Original comment by yuriastrakhan on 27 Mar 2012 at 2:12

GoogleCodeExporter commented 9 years ago
Hey Yuri,

It is ~60,000 contracts in any given file, but after covering 10 years of data, 
the master index contains ~500,000 contracts (separate files for futures, 
equities, bonds... once all rolled up, and after a lot of futures expirations, we 
arrive at 500k).

I've never seen the file system work well with >> 20,000 files; the B-tree 
scans/lookups begin to take a long time. I actually implemented the async 
mechanism you are talking about in a previous iteration of this database. I 
first coded things up on top of Berkeley DB, using separate flat files per 
contract and a B-tree index. As I read in contracts I would asynchronously dump 
to each contract's flat file. This resulted in 60,000 different stream objects, 
which I would have to scan/write as many as 10 times each, which would be as many 
as 600,000 IOPS. I do have a lot of SSDs in addition to magnetic drives, but 
even on these it took 3-4 hours to write all of my data to disk. Using 
a single flat file and writing all 15 GB of indexed ticks at once, I can read 
the data into memory (the machine has 96 GB to support reading and sorting) and 
index it in about 15 minutes, and the write to disk is ~10 seconds.

I have considered SQL for the indices, but that still results in 500,000 
contracts * 365 * 10 = 1,825,000,000 (1.8 billion) indices (fewer in practice, since each 
contract doesn't span 10 years, but my DB does actually have about 600 million 
indices), which causes even SQL Server to die (I've used it for up to 
~300 million fairly well... after that it just isn't efficient).

I do cache the master index on each machine (several GB), so the index lookups 
are quite fast, and then there is one network request to seek to an offset and 
read a length, which is quite fast.

I will have to take a deeper dive into your Demo5 - my tests showed that writing to disk 
with the equivalent of "no optimizations" saved ~50% of the space compared to 
uncompressed (these are already highly encoded ticks), but the time to write 
was an order of magnitude longer. Uncompressed, I can dump in ~10 seconds; 
compressed, it took ~60 minutes. Even at 60 minutes, since the clear 
limitation is CPU, this could be addressed by background-compressing 
uncompressed files across multiple cores (it just needs a lot of memory). The space 
could be improved considerably if I could remove the indices from each tick - 
specifically no contract Id, no session Id, and no tick Id (the contract and 
session Id do not change often for ordered data, so they are not a big deal, but 
the tick Id is). I could remove the tick Id and just use pointers based on 
contract, session, and time codes, but time codes are not unique (you can have 
two ticks in a row at the same time that are attributed to two different 
sessions). All of this seems best solved by adding indexing to the compressed 
files, as I originally mentioned.
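
To illustrate the size argument, here are two hypothetical tick layouts (the fields and their sizes are assumptions, not the actual encoding): one carrying the ids on every tick, and one relying on block-level index entries instead:

```csharp
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct IndexedTick        // 36 bytes: ids repeated on every tick
{
    public int ContractId;       // rarely changes within a sorted block
    public int SessionId;        // rarely changes within a sorted block
    public long TickId;          // exists only to address ticks inside the compressed file
    public long TimeTicks;       // not unique: consecutive ticks can share a timestamp
    public double Price;
    public int Size;
}

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct BareTick           // 20 bytes: contract/session/tick ids live once per block
{
    public long TimeTicks;
    public double Price;
    public int Size;
}
```

With block pointers stored in the master index, the per-tick ids become recoverable from the index entry covering the block, which is where the bulk of the savings would come from.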

I think that, regardless of solution, a network server that did local seeks would be 
a large improvement. I would really love to do away with my SMB abstraction, and I 
could even do away with index caching with a network server.

Let me know what you think - I really appreciate your responses, and your work 
on this library, it certainly is very powerful,

-Karl

Original comment by kar...@gmail.com on 27 Mar 2012 at 2:42

GoogleCodeExporter commented 9 years ago
One correction: I can write 15 GB to disk in an uncompressed file in about 90 
seconds, not 10 seconds (I was confusing myself with my smaller bond files that 
write in 10 seconds)... though I would love to have 1.5 GB/s disks!

Original comment by kar...@gmail.com on 27 Mar 2012 at 2:45