google / rappor

RAPPOR: Privacy-Preserving Reporting Algorithms
Apache License 2.0
862 stars 163 forks source link

Optimization of association analysis. Command line interface. Bug fixes. Code cleanup. #52

Closed andychu closed 9 years ago

andychu commented 9 years ago
andychu commented 9 years ago

OK thanks for the review. I addressed all but the last one. I will do something to compare the results. I think the problem is that Decode() is fairly noisy on a small number of reports.

DONE: Updated docstring for bit_indices

Having both params and params_list is mainly because the unit tests already use a single set of params.

I removed the num_variables > 4 check, because it was arbitrary. The unit tests test 3 variables, so 4 should work.

I only used mclapply() in cases where we're iterating over the number of reports, but not say the number of cohorts. mclapply() forks processes so it has some significant memory overhead, as well as startup overhead.

UpdatePij: This is being called inside for (i in 1:max_em_iters), not mclapply?

Added the check that dimensions == 2 in ConstructFastEM.

Wrapping: I changed this variable type to size_t. num_entries and entry_size are uint32_t, but their product could be more than 4 GB, which should work fine on a 64 bit machine. I don't want to hard-code any assumptions about the machine memory size.

analysis/cpp/run.sh -- removed extra comments

andychu commented 9 years ago

OK, in association_test.R, I added TestTwoImplementations() which runs both algorithms. The L1 difference between the matrices is 9.7e-17 (not 0 curiously).

I also renamed the ugly $rmap and $map stuff, and did a few other cleanups.

PTAL.

andychu commented 9 years ago

OK good point. I changed this and the CSV now looks like this:

(I ended up using something a little different than "melt", since that was doing weird things)

"flag..HTTPS","domain","proportion" "TRUE","bing.com",0.263930065623489 "FALSE","bing.com",0.147970233934715 "TRUE","yahoo.com",0.182032391377609 "FALSE","yahoo.com",0.114596102564608 "TRUE","google.com",0.190706050320775 "FALSE","google.com",0.095164303098314 "TRUE","Other",0.00340437927526885 "FALSE","Other",0.00219647380522421

andychu commented 9 years ago

Any more comments on this?

ananthr commented 9 years ago

[LGTM]

Thanks Andy! Curious to know what problems you have with melt, but anything that flattens the assoc results before printing is sufficient for now.