Karl, is it 60,000 or 500,000 contracts? You gave both numbers.
In either case, I still think that storing one contract per file will work in
your case if you put some large number of files (e.g. 10,000) per dir (the dir
name can be contractId / 10000 * 10000, i.e. integer division down to the nearest 10,000).
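Something like this minimal sketch of the bucketing math (the 10,000 figure, the root dir and the ".bts" extension are just placeholders, not anything the library prescribes):

using System.IO;

// Rough sketch: bucket contract files into dirs of ~10,000 files each.
// Integer division groups ids 0..9999 under "0", 10000..19999 under "10000", etc.
static string GetContractPath(string root, long contractId)
{
    long bucket = contractId / 10000 * 10000;
    return Path.Combine(root, bucket.ToString(), contractId + ".bts");
}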
You will rely on the file system (which is a good caching database after all)
to find the needed contract. Sessions are a bit harder - but since the files
are one per contract, the file is compressed, and the binary searches are
cached, the binary search within the file might not be a huge perf hit - more
testing would be needed.
To speed up import, adding data to a massive number of files can be done in an
async manner: keep a Dictionary of ContractID => CachingBufferPerContract, where each
buffer asynchronously writes a block of data the moment it reaches a threshold, while
the main thread keeps parsing the tape files. The .NET task library is your friend
(I can't wait to start using the "async" keyword).
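A rough sketch of that pattern (all names here are made up for illustration, not part of the timeseriesDb API), assuming each record is already serialized to bytes:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Illustrative only: one buffer per contract; once a buffer crosses the
// threshold it hands a block to a background task while parsing continues.
class ContractBuffer
{
    const int FlushThreshold = 64 * 1024;
    readonly string _path;
    readonly List<byte> _pending = new List<byte>();
    Task _flush = Task.CompletedTask;

    public ContractBuffer(string path) { _path = path; }

    // Called from the parsing thread only.
    public void Add(byte[] record)
    {
        _pending.AddRange(record);
        if (_pending.Count < FlushThreshold) return;
        byte[] block = _pending.ToArray();
        _pending.Clear();
        // Chain flushes so writes to the same file never overlap.
        _flush = _flush.ContinueWith(_ =>
        {
            using (var fs = new FileStream(_path, FileMode.Append))
                fs.Write(block, 0, block.Length);
        });
    }

    public Task Completion { get { return _flush; } }
}

// Usage while parsing the tape (buffers keyed by contract id):
// var buffers = new ConcurrentDictionary<long, ContractBuffer>();
// buffers.GetOrAdd(contractId, id => new ContractBuffer(PathFor(id))).Add(record);
// where PathFor(id) is whatever maps a contract id to its file.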
Also, depending on your usage pattern, you might consider caching contracts in
local temp files - whenever a contract is requested, its entire data is
copied (possibly async too) to a local drive, and once ready, all subsequent
calls go to the local copy, while still monitoring the source file for an increase
in length.
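A hedged sketch of that caching idea (again, illustrative names only), copying just the newly appended tail when the source file has grown:

using System.IO;
using System.Threading.Tasks;

// Illustrative only: keep a local copy of a contract's remote file and
// top it up with whatever bytes were appended since the last copy.
static async Task<string> GetLocalCopyAsync(string remotePath, string cacheDir)
{
    string local = Path.Combine(cacheDir, Path.GetFileName(remotePath));
    long have = File.Exists(local) ? new FileInfo(local).Length : 0;
    long want = new FileInfo(remotePath).Length;
    if (want > have)
    {
        using (var src = File.OpenRead(remotePath))
        using (var dst = new FileStream(local, FileMode.Append))
        {
            src.Seek(have, SeekOrigin.Begin);
            await src.CopyToAsync(dst);   // copy only the missing tail
        }
    }
    return local;   // all subsequent reads go to this local path
}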
You might also consider a hybrid DB approach - contract metadata might be better
off stored in some SQL database, as it would be much better suited for
various fast lookups and inserts/updates.
And finally, I think I will soon try to implement some sort of a network
database service, since the timeseriesDb lib already supports working with
streams rather than files. The client would still use the IEnumerableFeed<,>
interface to get the data, but from a NetworkFeed rather than a local one, and
all seeking would be done locally on the server.
I will try to post some more ideas if they come to mind.
P.S. Just ran the Demo5 sample with 100 million items (local RAID0). Your case
would probably be the "no optimizations" one, but you might still gain a lot of
speed on core IO from compression.
Finished creating files with 100,000,000 items:
DeltaType.Positive and Calculated index: 100,087,380 bytes in 00:00:15.5549400
DeltaType.Positive: 200,172,590 bytes in 00:00:18.1105744
No optimizations: 300,239,708 bytes in 00:00:21.0126544
Uncompressed: 1,600,000,992 bytes in 00:01:10.0256527
Original comment by yuriastrakhan
on 27 Mar 2012 at 2:12
Hey Yuri,
It is ~60,000 contracts in any given file, but after covering 10 years of data,
the master index contains ~500,000 contracts (separate files for futures,
equities, bonds... once everything is rolled up and after a lot of futures
expirations, we arrive at 500k).
I've never seen the file system work well with >> 20,000 files; the B-tree
scans/lookups begin to take a long time. I actually implemented the async
mechanism you are talking about in a previous iteration of this database. I
first coded things up on top of Berkeley DB, using separate flat files per
contract and a B-tree index. As I read in contracts I would asynchronously dump
to each contract's flat file. This resulted in 60,000 different stream objects,
which I would have to scan/write to as many as 10 times, which would be as many
as 600,000 IOPS. I do have a lot of SSDs in addition to magnetic drives, but
even on those that resulted in 3-4 hours to write all of my data to disk. Using
a single flat file, and writing all 15 GB of indexed ticks at once, I can read
the data into memory (the machine has 96 GB to support reading and sorting),
index it in about 15 minutes, and the write to disk is ~10 seconds.
I have considered SQL for the indices, but that still results in 500,000
contracts * 365 * 10 = 1,825,000,000 (1.8 billion) index entries (fewer in
practice, since each contract doesn't span 10 years, but my db does actually have
about 600 million index entries), which causes even SQL Server to die (I've used
it for up to ~300 million fairly well... after that it just isn't efficient).
I do cache the master index on each machine (several GB), so the index lookups
are quite fast, and then there is one network request to seek to an offset and
read a length, which is quite fast.
I will have to take a deeper dive into your Demo5 - my tests showed that writing
to disk with the equivalent of "no optimizations" saved ~50% of the space compared
to uncompressed (these are already highly encoded ticks), but the time to write
was an order of magnitude longer. For uncompressed I can dump in ~10 seconds;
for compressed it took ~60 minutes. Even at 60 minutes, since the clear
limitation is CPU, this could be addressed by background-compressing the
uncompressed files across multiple cores (it just needs a lot of memory).
The space could be improved considerably if I could remove the index fields from
each tick - specifically the contract Id, session id, and tick Id (the contract
and session id do not change often for ordered data, so they are not a big deal,
but the tick id is). I could remove the tick id and just use pointers based on
contract, session and time codes, but time codes are not unique (you can have
two ticks in a row at the same time yet attributed to two different sessions).
All of this seems best solved by adding indexing to the compressed files, as I
originally mentioned.
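For what it's worth, a rough picture of the per-tick layout I'm describing (field names and sizes are mine for illustration, not the actual schema):

// Illustrative only: the three id fields repeat on every tick. ContractId and
// SessionId are nearly constant within a sorted file, so they delta-compress
// well; TickId changes on every record, which is what hurts the compressed size.
struct Tick
{
    public long TickId;      // candidate for removal if (contract, session, time) plus an ordinal could address a tick
    public int ContractId;   // rarely changes within one file
    public int SessionId;    // changes only at session boundaries
    public long TimeTicks;   // not unique: two consecutive ticks can share a timestamp across sessions
    public int Price;        // the actual payload fields...
    public int Size;
}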
I think that regardless of the solution, a network server that does local seeks
would be a large improvement. I would really love to do away with my SMB
abstraction; with a network server I could even do away with index caching.
Let me know what you think - I really appreciate your responses and your work
on this library; it certainly is very powerful.
-Karl
Original comment by kar...@gmail.com
on 27 Mar 2012 at 2:42
One correction: I can write 15 GB to disk in an uncompressed file in about 90
seconds, not 10 seconds (I was confusing myself with my smaller bond files, which
write in 10 seconds)... though I would love to have 1.5 GB/s disks!
Original comment by kar...@gmail.com
on 27 Mar 2012 at 2:45
Original issue reported on code.google.com by
kar...@gmail.com
on 27 Mar 2012 at 3:22