Closed wlruys closed 3 years ago
Does this happen in both the CPU and GPU code? Is it related to the number of GPUs? If so, it is a bug.
I ran the old code to see if it was also leaking accuracy. The main conclusion at the moment is that it's hard to tell in the first few iterations, but it's possible we're seeing the same behavior.
I'm running more trials of the new code at the moment but the queue is being slow tonight.
This old version of the code has the larger-magnitude random perturbation in `distributed_median`. I ran the old code exactly as it was in commit f19add0d63e9c9a3d6e083a8207455362e9f8fad (with a new compilation fix from the current code).
^ This is with the URL dataset (2 local tree iterations per iteration, 1170 ppL).
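For context, the kind of tie-breaking perturbation referred to above might look like the sketch below. The function name, magnitude, and seeding are illustrative assumptions, not the actual `distributed_median` implementation:

```python
import numpy as np

def perturb_projections(proj, magnitude=1e-6, seed=0):
    """Add small random noise to projection values before a median split.

    With sparse data many projections can tie exactly; a tiny
    perturbation keeps the split from putting all tied points on
    one side. (Hypothetical sketch; magnitude is an assumption.)
    """
    rng = np.random.default_rng(seed)
    return proj + magnitude * rng.standard_normal(proj.shape)
```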
Can you compare with the final partition that Chao builds? You don't need all nodes. Just look at a couple of leaves. Just make sure that you have the same tree. Also, you can test it on a random dense dataset which you just pad a few extra dimensions with zeros; so that you can ensure that the projections are not nearly zero and that all the points are different.
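One way to build the suggested test set, as a minimal sketch (the helper name and sizes are illustrative):

```python
import numpy as np

def padded_dense_dataset(n=1024, d=16, pad=8, seed=0):
    """Random dense points with `pad` extra all-zero dimensions.

    The dense Gaussian block keeps projections well away from zero
    and makes all points distinct with high probability, while the
    zero columns let the data exercise a sparse code path.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    return np.hstack([X, np.zeros((n, pad))])
```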
Is it possible that the bug is on the neighbor updates and not on the tree construction / distance calculation?
> Is it possible that the bug is on the neighbor updates and not on the tree construction / distance calculation?
It's possible; I changed some of that code as well. I would expect to see duplicate points showing up (multiple points at zero distance, etc.) if there were a merging error, though, and I don't think I'm seeing that in my output.
> Can you compare with the final partition that Chao builds?
I can try; I'll work on it soon. At the moment I'm verifying the dense codes (OLD vs. NEW pyrknn) against each other, which will help narrow down where the error could be (possibly in the common portion). Comparing leaves with Chao's will be slightly tricky since they'll be in random order internally.
> Also, you can test it on a random dense dataset
I'm trying full dense now as a sanity check. If this fails, it will also effectively test the same thing.
Okay, test the dense code. (But I thought the dense code was broken...) Anyway, working with dense codes will be much better for debugging the tree construction.
The chart above is incorrect. I was only comparing accuracy of IDs in that old commit. For sparse data, with many points at the same distance from each other, this is inaccurate. Below is the correction (a point is counted correct if it is closer than the true kth nearest neighbor).
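The corrected metric can be sketched as follows. This is an illustrative reimplementation of the idea, not the actual pyrknn accuracy checker:

```python
import numpy as np

def knn_accuracy(approx_dist, true_dist, eps=1e-12):
    """Fraction of reported neighbors no farther than the true
    k-th nearest neighbor, per query.

    approx_dist: (n, k) distances from the approximate search
    true_dist:   (n, k) exact distances, sorted ascending

    Matching by ID alone undercounts on sparse data where many
    points sit at identical distances; comparing against the true
    k-th distance (with a small tolerance) treats ties fairly.
    """
    kth = true_dist[:, -1:]              # true k-th NN distance per query
    correct = approx_dist <= kth + eps   # within the true k-NN radius
    return correct.mean()
```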
For comparison, here are the corresponding runs on the new distributed code.
I'm not sure why 1 rank shows a difference at the moment; the code should be identical (same dataset, same leafsize).
Both show decay, but the decay in the new code is much more extreme. There may be more than one problem.
> Okay, test the dense code. (But I thought the dense code was broken...)
Dense CPU is broken because Parla was removed and I haven't gotten around to replacing it yet. Dense GPU is "working".
The fact that 1 rank works fine is not surprising at all. Any bugs that have to do with distributed memory are hidden.
I'm surprised that they're different by 10%, they should be closer.
Argh, well, part of this bug is an oversight in the new code: I incorrectly rounded the local tree levels when the number of points isn't a power of 2. Setting leafsize instead (and rounding down) fixes this.
This does not fix all of the problems (dense power-of-2 still shows some decay, and there is decay in the old code), but it will make the sparse results for the new code look a lot better.
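A sketch of the leafsize-based rounding, under the assumption that the local tree is a binary split (the function name is illustrative):

```python
import math

def local_tree_levels(n_points, leafsize):
    """Depth of a binary partitioning tree so that each leaf holds
    at least `leafsize` points.

    Rounding the level count up when n_points is not a power of two
    over-splits the data into too-small leaves; deriving the depth
    from leafsize and rounding down avoids that.
    """
    return max(0, math.floor(math.log2(n_points / leafsize)))
```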
I'll update the table when the full test finishes running. Hopefully it's close to the old code's performance.
> I'll update the table when the full test finishes running.
As expected it's looking a bit better but still showing an accuracy leak. Plan:
Is this sparse or dense?
This is sparse (URL); a dense table is on the Google Drive but not copied here.
Fixed. It was a combination of a few different things: tightening the precision of what I consider identical points in the k-select to be closer to machine epsilon; the leafsize bug mentioned above; the diminishing returns of performing multiple local iterations during a strong-scaling run; and, critically, a bug in checking the accuracy of nearest neighbors. That bug was not always there but, checking the repo, has been around for at least a few months.
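The "identical points" tolerance could be sketched as below. This is a hypothetical helper for illustration, not the actual k-select code, and the tolerance of a few float32 ulps is an assumption:

```python
import numpy as np

def nearly_equal_dist(a, b, rel_tol=8 * np.finfo(np.float32).eps):
    """Treat two distances as the same if they differ by only a few
    multiples of float32 machine epsilon.

    A loose tolerance merges genuinely distinct sparse points whose
    distances are merely close; tightening it toward machine epsilon
    keeps them separate in the k-select.
    """
    scale = np.maximum(np.abs(a), np.abs(b))
    return np.abs(a - b) <= rel_tol * np.maximum(scale, 1.0)
```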
Not closing this yet just in case I've missed something again. I'm running AVAZU jobs now to have another real dataset to verify this with. But checking the trees directly seemed to show good behavior for synthetic examples.
The bad news is that I don't know which version of the accuracy check was used for the NMSlib and FAISS comparisons we've shown before. Those need to be rerun.
Times are slow relative to the single-node C++ call for the sparse CPU search (Chao's code) shown because these runs use the GPU kernel. At each iteration, data is moved to and from the GPU, and the search within the kernel is about twice as slow as the CPU kernel.
AVAZU maintains accuracy over the first 4 iterations out to 16 ranks on both CPU and GPU, on the order of [0.16, 0.20, 0.23, 0.26, 0.3]. I'm closing this.
A previous AVAZU comparison was accidentally run with the wrong dataset: AVAZU 40M vs. AVAZU+Test Points 42M. These have very different convergence properties; AVAZU+Test converges significantly faster. I'm rerunning some old comparisons to see if we still have a speedup w.r.t. NMSlib. AVAZU w/ test converges like [0.23, 0.33, 0.36, 0.4, 0.43] for this leafsize (~600 ppL).
Accuracy drops dramatically on sparse scaling runs past the first few iterations.
Checking the partition on small (N<20) sparse matrices appears correct. The randomization in the distributed k-select only hits O(100) points per iteration in URL/AVAZU, so this shouldn't be the cause.