ARF1 opened this issue 9 years ago
@CarstVaartjes I had a slight bug that resulted in a faulty factor: chunk results were appended out of order, e.g. chunk 0, chunk 50, chunk 1, etc.
Possible solutions:

1. Keep a full-size out-buffer in memory and write it to `labels` in one go once all chunks have been processed.
2. Find a way to write the carray chunks for `labels` out of order, i.e. not `labels.append(out_buffer)` but something like `labels.write(out_buffer, chunk_number)`.
3. Append to the `labels` carray but enforce proper order. This would require more careful scheduling of chunk processing than currently, and waiting for "slow" chunks to complete before faster threads can work on new chunks. A more elaborate implementation could use a "chunk writing queue" to mitigate the waiting issue; this would be a middle ground between a full-size buffer and a strict waiting scheme.

For solution 2:
Could we create `labels` with `chunklen=carray_.chunklen`, then `append(...)` the chunks out-of-order and finally rename the chunk filenames of `labels` on disk? Of course, even better would be writing them with the correct file names to begin with. The downside might be that reading the carray is slower because the files are not arranged in "natural order" on the disk, possibly leading to additional seeks.
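Roughly, the renaming idea could look like the sketch below. It assumes bcolz's on-disk layout of one chunk file per chunk under the carray's `data/` directory (the `__<n>.blp` naming is an implementation detail that may differ between versions), and a hypothetical `completion_order` list that records which logical chunk number each appended file actually holds:

```python
import os

def reorder_chunk_files(labels_rootdir, completion_order):
    """Rename the on-disk chunk files of a carray so that file names
    reflect logical chunk order instead of completion order.

    completion_order[i] is the logical chunk number of the i-th
    physically appended chunk file.
    """
    data_dir = os.path.join(labels_rootdir, 'data')
    fname = lambda n: os.path.join(data_dir, '__%d.blp' % n)

    # Two-phase rename via temporary names, so we never overwrite a
    # file whose target name is still occupied by another chunk.
    for phys, logical in enumerate(completion_order):
        os.rename(fname(phys), fname(logical) + '.tmp')
    for logical in completion_order:
        os.rename(fname(logical) + '.tmp', fname(logical))
```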
Any preferences or advice?
I was actually wondering about the order with parallelization! (For groupby it's less relevant, but for factorization it very much matters.) Ideally it would be "Find a way to write carray chunks for labels out of order"; other people were asking for that functionality on the bcolz mailing list too, but I do not think it's possible at the moment, right @esc? So I think renaming the filenames is a great idea, except for the leftover array (we always need to end with that, so some ordering requirement remains). Also, the carray that we're writing might have a different chunk length than the input, which further complicates matters. And keeping the full-sized out-buffer in memory might kill some use cases (we use x-billion-record ctables that do not fit in memory, for instance).
I'm going to sleep on it for a night, I think. The great thing about your solution is that it also works for other use cases of ours that are in-core (the groupby functions), so we should be able to speed those up too.
@CarstVaartjes I managed to find a way to write the carray chunks out-of-order. `bcolz` does expose methods that make this fairly painless.

What is much more painful is that `labels` and `carray_` must not be forced to have the same chunklen (which would have made reordering easy...). It seems that the resulting frequent writes of `labels` chunks hinder efficient reading of the `carray_` chunks; performance really drops off...
What I implemented now is forcing `labels` to have a chunklen that is a multiple of the `carray_` chunklen. The results from several `carray_` chunks are then collated and written as a single `labels` chunk. This however requires dealing with yet another "leftover" loop: for the `carray_` chunks corresponding to `carray_.nchunks % chunk_multiple`.
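In outline, the collation looks like the sketch below (a simplified, sequential version: `process_chunk` is a hypothetical stand-in for the per-chunk factorisation kernel, and the real code processes the groups in parallel):

```python
import numpy as np

def factorize_collated(carray_, labels, chunk_multiple, process_chunk):
    """Collate the label results of chunk_multiple input chunks into a
    single output chunk of labels, whose chunklen is assumed to be
    chunk_multiple * carray_.chunklen."""
    in_len = carray_.chunklen
    group_len = chunk_multiple * in_len
    full_groups = carray_.nchunks // chunk_multiple

    # Full groups: chunk_multiple complete input chunks become exactly
    # one full-size labels chunk.
    for g in range(full_groups):
        block = carray_[g * group_len:(g + 1) * group_len]
        labels.append(process_chunk(block))

    # "Leftover" loop: the carray_.nchunks % chunk_multiple remaining
    # complete input chunks, plus the trailing partial chunk of carray_.
    tail_start = full_groups * group_len
    if tail_start < len(carray_):
        labels.append(process_chunk(carray_[tail_start:]))
```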
To make matters worse, writing in-memory chunks out-of-order requires a different API. On the whole, the code is now in urgent need of some refactoring...
But it works... well, sort of: out-of-core `factorize()` is x2.2 faster than master for uncompressed bcolz and x1.75 faster for compressed bcolz in my test case.

What has me completely stumped is that in-memory `factorize()` is MUCH slower than out-of-core `factorize()`. It is even slower than the single-threaded code!
Trying to summarize it for myself and thinking out loud, the discussion points are:
God, this is complicated stuff. Really great insights though. For most functions on the aggregation side your code could already be used. I think writing to the same ndarray would still require locking (sort of defeating the parallel purpose, as every row does a write there), but instead each thread could have its own array and we could use numpy to add the results of the individual ndarrays together; see the sketch below. (For sum and count this will work; for sorted_count_distinct it will not, as that requires sequential reading from a point where you know the previous value was different.)
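A minimal sketch of that per-thread idea, assuming the aggregation is a per-group sum (the names `blocks`, `labels_per_block` and `n_groups` are illustrative, not bquery API):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_group_sum(blocks, labels_per_block, n_groups, n_threads=4):
    """Each task scatter-adds into its own partial array, so no locking
    is needed; the partials are combined with numpy at the end."""
    def aggregate(args):
        block, labels = args
        partial = np.zeros(n_groups)       # thread-private result array
        np.add.at(partial, labels, block)  # partial[labels[i]] += block[i]
        return partial

    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        partials = list(ex.map(aggregate, zip(blocks, labels_per_block)))
    return np.sum(partials, axis=0)        # element-wise sum of partials
```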
@CarstVaartjes I was implementing one of your ideas and ran into trouble. You wrote:
> We have a list that based on the index lets us know what the original values were: reverse_values. [...] wouldn't it be much smarter to just read the actual kh_str_t table in the end (as it contains all values and all indexes)
I am in the process of implementing this and I think it is not possible: since we are using a hash table, the position of the elements in the table does not indicate their order of insertion but rather their hash value.

I think the `kh_str_t` table does not contain the indexes. Am I overlooking a feature of khash here?
@CarstVaartjes @FrancescElies
If not, what we could do is make a hash table of a struct type. In the struct we could store the string value (or `char*`) and the reverse value. We would then need to implement our own comparison function, similar to what was done for the Python object hash table implementation (see `pyobject_cmp`).

What do you think? It could make for a cleaner implementation, since we would not need to carry around the `reverse_values` pointer. On the other hand, we would have to create a whole slew of hash table implementations (one for each data type) rather than use those already defined.
Re speed: creating the dict from the reverse vector is definitely faster, even for single-threaded operation. See my (now slightly outdated) PR #21 with performance enhancements.
This was surprising to me as well: insertion into a dict should take the same time no matter where it happens. I believe the reason for the increased performance is that without Python objects as arguments the helper function can have a `nogil` signature, which seems to speed things up somehow. Possibly the performance increase manifests only if the GIL is actually released when calling the helper. Or maybe my recollection is just wrong.
@CarstVaartjes Just wanted to let you know to ignore my previous post. I finally understood how khash works. The keys are indeed stored: `table.keys`. Should have been obvious...
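In dict terms, the realisation is that insertion order does not matter, because the label stored with each key already is its index in `reverse_values`. A small Python illustration (a plain dict stands in for the khash table):

```python
def build_reverse_values(table):
    """Rebuild reverse_values from a value -> label hash table.
    Iteration order is irrelevant: the stored label is the index."""
    reverse_values = [None] * len(table)
    for value, label in table.items():
        reverse_values[label] = value
    return reverse_values

# e.g. a table built during factorisation:
table = {'b': 0, 'a': 1, 'c': 2}
assert build_reverse_values(table) == ['b', 'a', 'c']
```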
@FrancescElies @CarstVaartjes OK, here are a few revisions that can serve as a basis for further discussion:
All revisions pass my own test cases. I finally ferreted out the synchronisation issue I mentioned to @FrancescElies which was leading to duplicate entries in the reverse dictionary.
Performance measurements were done on a 12-character column with 1014 unique values and about 9 million entries in total, with the following commands:
```
%timeit -r 10 a.cache_factor(['isin'], refresh=True)
%timeit -r 10 bquery.ctable_ext.factorize(a['mycol'])
```
The current visualfabriq/bquery master (rev 3ec8eb80cd, 'master') was used as the reference for comparison.
| Revision | Uncompressed, in-memory | Uncompressed, on-disk | Compressed, in-memory | Compressed, on-disk |
|---|---|---|---|---|
| master | 1.98 s | 2.48 s | 3.34 s | 3.99 s |
| stock khash | 0.820 s (x2.41) | 0.940 s (x2.6) | 1.76 s (x1.9) | 2.24 s (x1.8) |
| pandas khash | 0.719 s (x2.6) | 0.808 s (x3.1) | 1.76 s (x1.9) | 2.20 s (x1.8) |
| iterblocks | 1.73 s (x1.1) | 0.942 s (x2.6) | 2.85 s (x1.2) | 2.25 s (x1.8) |
Conclusions:
To compare the changes while disregarding the iterblocks revision with its Windows line endings, use: https://github.com/ARF1/bquery/compare/b2453aa8f8dde8ea0711ec2773ef2ffee8412202...ARF1:a1092c6
Hi,
Apologies for the late answer; we are under some deadline pressure and I am afraid we do not have much spare time at the moment. I just had a first look. Note that I have no practical experience with C++, so please be gentle with my mistakes.

About benchmarking: keeping track of all the possible scenarios is going to be difficult. Here are some ideas taken from other projects: maybe we could consider using vbench https://github.com/wesm/pandas/tree/master/vb_suite (pandas) or airspeed velocity http://spacetelescope.github.io/asv/ (under consideration in bcolz, https://github.com/Blosc/bcolz/issues/116).

I saw very nice stuff, even using OMP pragma directives that are not supported directly in Cython. I tried something like that once, but I found it a bit tricky to use objects inside `prange`; it seems you managed to do that without problems. About the EOLs, maybe you could try https://github.com/visualfabriq/bquery/pull/28, what do you think? I am also not sure why new commits don't run against the tests in Travis for this branch.

Your numba suggestion is also very interesting. This topic will surely require some time; hopefully this whole situation is fine with you.
- Shared variables with manual locking:
- Shared variables without locking requirement:
- Thread-local variables:
- Locking scheme:
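A minimal sketch of how these categories can fit together, assuming a shared value-to-label table guarded by a lock and thread-local output buffers (all names here are illustrative):

```python
import threading

lock = threading.Lock()   # guards writes to the shared table
table = {}                # shared, manually locked: value -> label

def factorize_block(block, out):
    """Worker: block is read-only shared input, out is a thread-local
    output buffer, so neither needs locking."""
    for i, v in enumerate(block):
        label = table.get(v)          # optimistic lock-free read
        if label is None:
            with lock:                # only table writers take the lock
                label = table.setdefault(v, len(table))
        out[i] = label
```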