Optimization of association analysis. Command line interface. Bug fixes. Code cleanup.

andychu commented 9 years ago

Added bin/decode-assoc command line tool for automation. Define and read a new schema so we know whether variables are string or boolean, and what their encoding parameters are.
bin/test.sh: Add tests for bin/decode-assoc, for (string x boolean) problem
- Add Boolean RAPPOR encoding (no hashing step) to the Python client with Encoder.encode_bits
- Add association testdata generation to rappor_sim.py
Optimizations
- In association.R, remove from GetCondProb() any computation that doesn't depend on the report. (e.g. 100x-1000x speedup for the joint conditional stage)
- Use mclapply in R for the steps that did lapply() over all reports (N). The number of cores is controlled by decode-assoc --num-cores.
- Provide an alternative implementation of EM in C++ in analysis/cpp/fast_em.cc (analysis/R/fast_em.R is a wrapper that does serialization and a system() call) ~100x speedup.
Fixed bugs in the k=1 case, due to R matrix/vector confusion (adapted from Ananth's changes)
Allow different variables in association analysis to have different parameters (adapted from Ananth's changes, but tests still pass)
Add hacky support for boolean vars (adapted from Ananth's changes)
Code cleanup in association.R. Rename variables to be more clear.
Minor refactoring of decode_dist.R to resemble decode_assoc.R

andychu commented 9 years ago

OK thanks for the review. I addressed all but the last one. I will do something to compare the results. I think the problem is that Decode() is fairly noisy on a small number of reports.

DONE: Updated docstring for bit_indices

Having both params and params_list is mainly because the unit tests already use a single set of params.

I removed the num_variables > 4 check, because it was arbitrary. The unit tests test 3 variables, so 4 should work.

I only used mclapply() in cases where we're iterating over the number of reports, but not say the number of cohorts. mclapply() forks processes so it has some significant memory overhead, as well as startup overhead.

UpdatePij: This is being called inside for (i in 1:max_em_iters), not mclapply?

Added the check that dimensions == 2 in ConstructFastEM.

Wrapping: I changed this variable type to size_t. num_entries and entry_size are uint32_t, but their product could be more than 4 GB, which should work fine on a 64 bit machine. I don't want to hard-code any assumptions about the machine memory size.

analysis/cpp/run.sh -- removed extra comments

andychu commented 9 years ago

OK, in association_test.R, I added TestTwoImplementations() which runs both algorithms. The L1 difference between the matrices is 9.7e-17 (not 0 curiously).

I also renamed the ugly $rmap and $map stuff, and did a few other cleanups.

PTAL.

andychu commented 9 years ago

OK good point. I changed this and the CSV now looks like this:

(I ended up using something a little different than "melt", since that was doing weird things)

"flag..HTTPS","domain","proportion" "TRUE","bing.com",0.263930065623489 "FALSE","bing.com",0.147970233934715 "TRUE","yahoo.com",0.182032391377609 "FALSE","yahoo.com",0.114596102564608 "TRUE","google.com",0.190706050320775 "FALSE","google.com",0.095164303098314 "TRUE","Other",0.00340437927526885 "FALSE","Other",0.00219647380522421

andychu commented 9 years ago

Any more comments on this?

ananthr commented 9 years ago

[LGTM]

Thanks Andy! Curious to know what problems you have with melt, but anything that flattens the assoc results before printing is sufficient for now.

google / rappor

Optimization of association analysis. Command line interface. Bug fixes. Code cleanup. #52