visualfabriq / bquery

A query and aggregation framework for Bcolz (W2013-01)
https://www.visualfabriq.com
BSD 3-Clause "New" or "Revised" License

Parallelised string factorisation #22

Open ARF1 opened 9 years ago

ARF1 commented 9 years ago

Shared variables with manual locking:

Shared variables without locking requirement:

Thread-local variables:

Locking scheme:
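
For illustration, a minimal Python sketch of how these categories could fit together (the real code is Cython with khash; all names below are made up):

```python
import threading

# Illustrative only -- not the actual bquery code.
table = {}                        # shared, manual locking: value -> label
table_lock = threading.Lock()     # locking scheme: taken only on new values

chunk_results = [None] * 16       # shared, no locking: one slot per chunk,
                                  # each worker writes only its own slot

def factorise_chunk(chunk_no, values):
    out = []                      # thread-local: label buffer for this chunk
    for v in values:
        label = table.get(v)
        if label is None:
            with table_lock:      # lock only when inserting a new value
                label = table.setdefault(v, len(table))
        out.append(label)
    chunk_results[chunk_no] = out
```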

ARF1 commented 9 years ago

@CarstVaartjes Had a slight bug resulting in a faulty factor: chunk results were appended out-of-order, e.g. chunk 0, chunk 50, chunk 1, etc.

Possible solutions:

  1. Keep full-sized out-buffer in memory before compressing the whole thing into a carray (The previous fix does this. But the locking needs to be cleaned up if this is to be the final solution.)
  2. Find a way to write carray chunks for labels out of order, e.g. not labels.append(out-buffer) but something like labels.write(out-buffer, chunk_number).
  3. Append chunk results directly to the labels carray but enforce proper order. This would require more careful scheduling of chunk processing than we have currently, and waiting for "slow" chunks to complete before faster threads can work on new chunks. A more elaborate implementation could use a "chunk writing queue" to mitigate the waiting issue; this would be a middle ground between a full-size buffer and a strict waiting scheme.

For solution 2: could we create labels with chunklen=carray_.chunklen, then append(...) the chunks out-of-order, and finally rename the chunk filenames of labels on disk? Of course, even better would be writing them with the correct file names to begin with. The downside might be that reading the carray could be slower, as the files are not arranged in "natural order" on the disk, possibly leading to additional seeks.
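
To make option 3's "chunk writing queue" a bit more concrete, here is a minimal sketch (illustrative names only, not existing bquery code): workers factorise chunks in any order and hand their label buffers to a single writer, which appends them to labels in strict chunk order while keeping only the currently out-of-order results in memory.

```python
def ordered_writer(result_queue, labels, nchunks):
    # result_queue delivers (chunk_number, label_buffer) in arbitrary order;
    # labels only supports in-order append(), so buffer results until the
    # next expected chunk arrives.
    pending = {}
    next_chunk = 0
    while next_chunk < nchunks:
        chunk_no, buf = result_queue.get()
        pending[chunk_no] = buf
        # flush every buffer that is now next in line
        while next_chunk in pending:
            labels.append(pending.pop(next_chunk))
            next_chunk += 1
```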

Any preferences or advice?

CarstVaartjes commented 9 years ago

I was actually wondering about the order with parallelization! (For groupby it's less relevant, but for the factorization it very much is.) Ideally it would be "Find a way to write carray chunks for labels out of order"; other people were asking for that functionality in the bcolz mailing list too, but I do not think it's possible atm, right @esc?

So I think renaming the filenames is a great idea, except for the leftover array (we always need to end with that, so we still have that special case). Also: the carray that we're writing might have a different chunk length than the input, which further complicates matters. Keeping the full-sized out-buffer in memory might kill some use cases (we use x-billion-record ctables that do not fit in memory, for instance).

I'm going to sleep on it for a night, I think. The great thing about your solution is that it also works for other in-core use cases that we have (the groupby functions), so we should be able to speed those up.

ARF1 commented 9 years ago

@CarstVaartjes I managed to find a way to write the carray chunks out-of-order. bcolz does expose methods that make this fairly painless.

What is much more painful is that labels and carray_ must not be forced to have the same chunklen (which would have made reordering easy...): the frequent writes of small labels chunks seem to hinder efficient reading of the carray_ chunks, and performance really drops off...

What I implemented now is forcing labels to have a chunklen that is a multiple of the carray_ chunklen. The results from several carray_ chunks are then collated and written as a single labels chunk. This however requires dealing with yet another "leftover" loop: for the carray_ chunks corresponding to carray_.nchunks % chunk_multiple.
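
For illustration, the collation described above maps carray_ chunks onto labels chunks roughly like this (example numbers and variable names are made up):

```python
# Illustrative only: how carray_ chunks group into collated labels chunks.
carray_chunklen = 4096
labels_chunklen = 4 * carray_chunklen                 # forced to be a multiple
carray_nchunks = 10

chunk_multiple = labels_chunklen // carray_chunklen   # -> 4
full_groups = carray_nchunks // chunk_multiple        # -> 2

for g in range(full_groups):
    # factorise these carray_ chunks, collate their label buffers,
    # and write them (possibly out-of-order) as a single labels chunk g
    group = range(g * chunk_multiple, (g + 1) * chunk_multiple)

# the extra "leftover" loop: the carray_nchunks % chunk_multiple chunks
# that do not fill a whole labels chunk, handled at the end together with
# the usual leftover elements
tail = range(full_groups * chunk_multiple, carray_nchunks)
```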

To make matters worse, writing in-memory chunks out-of-order requires a different API. On the whole, the code is now in urgent need of some refactoring...

But it works... well sort of: out-of-core factorize() is x2.2 faster than master for uncompressed, x1.75 for compressed bcolz for my test case.

What has me completely stumped is that in-memory factorize() is MUCH slower than out-of-core factorize(). It is even slower than the single-threaded code!

CarstVaartjes commented 9 years ago

Trying to summarize it for myself and thinking out loud:

Discussion points are:

God, this is complicated stuff. Really great insights though. And for most functions on the aggregation side your code could already be used -> I think writing to the same ndarray would still require locking (sort of defeating the parallel purpose there, as every row does a write), but instead each thread could have its own array and we could use numpy to add the results of the individual ndarrays together. (For sum and count it will work; for sorted_count_distinct it won't, as that requires sequential reading from a point where you know the previous value was different.)
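
A minimal sketch of that idea (illustrative, not existing bquery code): each thread accumulates into its own array, lock-free, and the partial results are added together with numpy at the end; this works for sum and count, but not for sorted_count_distinct.

```python
import numpy as np

n_groups = 1014           # number of distinct factor values
n_threads = 4

# one private result array per thread; thread t only ever touches
# partials[t], so no locking is needed:
#     partials[t][group_index] += value
partials = [np.zeros(n_groups) for _ in range(n_threads)]

# after all threads have joined, combine the per-thread results
totals = np.sum(partials, axis=0)
```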

ARF1 commented 9 years ago

@CarstVaartjes I was implementing one of your ideas and ran into trouble. You wrote:

We have a list that based on the index lets us know what the original values were: reverse_values. [...] wouldn't it be much smarter to just read the actual kh_str_t table in the end (as it contains all values and all indexes)

I am in the process of implementing this and I think it is not possible: since we are using a hash-table, the position of the elements in the table does not indicate their order of insertion but rather their hash value.

I think the kh_str_t table does not contain the indexes. Am I overlooking a feature of khash here?

@CarstVaartjes @FrancescElies If not, what we could do is make a hash table of a struct type. In the struct we could store the string value (or char*) and the reverse value. We would then need to implement our own comparison function, similar to what was done for the python object hash table implementation. (See pyobject_cmp.)

What do you think? It could make for a cleaner implementation since we do not need to carry around the reverse_values pointer. On the other hand we would have to create a whole slew of hash table implementations (one for each data type) rather than use those already defined.


Re speed: creating the dict from the reverse vector is definitely faster, even for single-threaded operation. See my (now slightly outdated) PR #21 with performance enhancements.

This was surprising to me as well: insertion into a dict should take the same time no matter where it happens. I believe the reason for the increased performance is that without Python objects as arguments the helper function can have a nogil signature, which seems to speed things up somehow. Possibly the performance increase only manifests if the GIL is actually released when calling the helper. Or maybe my recollection is just wrong.
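
For illustration (plain Python standing in for the Cython helper): the hot loop only fills the reverse-values vector, and the Python dict is built from it in one pass afterwards, so the per-row helper can keep a nogil signature.

```python
# Illustrative only: reverse_values is filled by the (nogil) factorisation
# loop in order of first occurrence; the dict is built once at the end.
reverse_values = [b'ABC123456789', b'DEF123456789', b'GHI123456789']
reverse = {label: value for label, value in enumerate(reverse_values)}
# -> {0: b'ABC123456789', 1: b'DEF123456789', 2: b'GHI123456789'}
```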

ARF1 commented 9 years ago

@CarstVaartjes Just wanted to let you know to ignore my previous post. I finally understood how khash works. The keys are indeed stored: table.keys. Should have been obvious...

ARF1 commented 9 years ago

@FrancescElies @CarstVaartjes OK, here are a few revisions that can serve as a basis for further discussion:

  1. rev 5a3b487 ('stock khash') is about a week old and does not yet use iterblocks: it serves me as a reference implementation but is fairly messy.
  2. rev b2453aa ('pandas khash') is identical to the previous, only instead of stock khash it uses pandas' khash v0.2.6: this exposes the performance problems of stock khash.
  3. rev a1092c6 ('iterblocks') is an iterblocks based implementation: looks very nice but is slow.

All revisions pass my own test cases. I finally ferreted out the synchronisation issue I mentioned to @FrancescElies which was leading to duplicate entries in the reverse dictionary.

Performance measurements were done on a 12-character column with 1014 unique values and about 9 million entries in total, using the following commands:

  1. on-disk factorisation: %timeit -r 10 a.cache_factor(['isin'], refresh=True)
  2. in-memory factorisation: %timeit -r 10 bquery.ctable_ext.factorize(a['mycol'])

For comparison the current visualfabriq/bquery master (rev 3ec8eb80cd, 'master') was used as a reference.


| Scenario | master | stock khash | pandas khash | iterblocks |
| --- | --- | --- | --- | --- |
| Uncompressed, in-memory factorisation | 1.98 s | 0.820 s (x2.41) | 0.719 s (x2.6) | 1.73 s (x1.1) |
| Uncompressed, on-disk factorisation | 2.48 s | 0.940 s (x2.6) | 0.808 s (x3.1) | 0.942 s (x2.6) |
| Compressed, in-memory factorisation | 3.34 s | 1.76 s (x1.9) | 1.76 s (x1.9) | 2.85 s (x1.2) |
| Compressed, on-disk factorisation | 3.99 s | 2.24 s (x1.8) | 2.20 s (x1.8) | 2.25 s (x1.8) |

(speedup factors relative to master in parentheses)

Conclusions:

  1. Using stock khash v0.2.8, which we just merged into master, seems to incur a non-negligible performance penalty.
  2. iterblocks seems to incur a significant performance penalty compared to reading in chunks. This could be due either to iterblocks being slow or to my implementation in rev a1092c6 having a problem. To narrow down the problem, I am working on a comparable implementation that avoids iterblocks while keeping the overall interface constant (see the sketch after this list), but the chunk scheduling is non-trivial.
  3. My iterblocks implementation (rev a1092c6) seems to have a problem with in-memory performance: in-memory is consistently slower than on-disk!
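
For reference, the two read patterns compared in point 2 look roughly like this (bcolz.iterblocks is the real API; the chunk-aligned slicing loop merely stands in for the explicit chunk-wise reader):

```python
import numpy as np
import bcolz

carray_ = bcolz.carray(np.arange(9000000))

# pattern of the iterblocks revision: bcolz hands out numpy blocks
for block in bcolz.iterblocks(carray_, blen=carray_.chunklen):
    pass  # factorise block

# chunk-aligned slicing, standing in for reading chunk by chunk
for start in range(0, len(carray_), carray_.chunklen):
    block = carray_[start:start + carray_.chunklen]
    # factorise block
```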

ARF1 commented 9 years ago

To compare the changes while disregarding the iterblocks revision with the Windows line endings, use: https://github.com/ARF1/bquery/compare/b2453aa8f8dde8ea0711ec2773ef2ffee8412202...ARF1:a1092c6

FrancescElies commented 9 years ago

Hi,

Apologies for the late answer; we are under some deadline pressure and I am afraid we do not have much spare time at the moment. I just had a first look. Note that I have no practical experience with C++, so please be gentle with my mistakes.

About benchmarking: keeping track of all the possible scenarios is going to be difficult. Here are some ideas taken from other projects: maybe we could consider using vbench https://github.com/wesm/pandas/tree/master/vb_suite (pandas) or airspeed velocity http://spacetelescope.github.io/asv/ (under consideration in bcolz, https://github.com/Blosc/bcolz/issues/116).
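
Not sure this helps, but an asv benchmark could be a small class along these lines (purely a sketch; the synthetic column and all names are made up, the factorize call is the one timed above):

```python
import numpy as np
import bcolz
import bquery

class TimeFactorize:
    # asv picks up methods starting with time_; setup() builds the column
    def setup(self):
        values = np.array(['%012d' % (i % 1014) for i in range(1000000)],
                          dtype='S12')
        self.col = bcolz.carray(values)

    def time_factorize_in_memory(self):
        bquery.ctable_ext.factorize(self.col)
```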

I saw very nice stuff, even using OMP pragma directives that are not supported directly in Cython. I once tried something similar but found it a bit tricky to use objects inside prange; it seems like you managed to do that without problems. About the EOLs, maybe you could try https://github.com/visualfabriq/bquery/pull/28, what do you think? I am also not sure why new commits don't run against the tests in Travis for this branch.

Your numba suggestion is also very interesting; this topic will certainly require some time. Hopefully this whole situation is fine with you.