shigeki / floc_simulator

FLoC Simulator

Help understanding ApplySortingLsh? #3

stepthom opened this issue 2 years ago (status: Open)

stepthom commented 2 years ago

Hi,

Thank you for your package - it is very nicely written and very easy to use.

I am trying to understand how the ApplySortingLsh function works, in particular how it uses the cluster_data.

shigeki commented 2 years ago

cluster_data combines a blocked_flag (the 7th bit) with the bit width of each cohort Id's SimHash range (the 1st-6th bits). In the figure below (the Japanese labels can be ignored), you can see that each cohort Id spans a range of about 34-36 bits, i.e. 2^34 to 2^36 SimHash values, according to the cluster_data and the SimHash value. A cluster_data value of 98 or 99 has the blocked_flag set in the 7th bit, so that cohort Id is blocked.

[figure: per-cohort SimHash ranges derived from cluster_data]
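In code, the encoding can be sketched like this (a minimal illustration of the bit layout above; the function and constant names are mine, not the ones in the package):

```go
package main

import "fmt"

// Sketch of the cluster_data byte layout described above: bits 1-6
// carry the bit width of the cohort's SimHash range, bit 7 is the
// blocked_flag. Names are mine, not the package's.
const blockedFlag = 1 << 6 // the 7th bit, counting from 1

func decodeClusterData(b byte) (bitWidth uint, blocked bool) {
	return uint(b & 0x3f), b&blockedFlag != 0
}

func main() {
	for _, b := range []byte{35, 36, 98, 99} {
		w, blocked := decodeClusterData(b)
		fmt.Printf("cluster_data=%3d -> bit width=%d, blocked=%v\n", b, w, blocked)
	}
}
```

Note that 98 and 99 decode to bit widths 34 and 35 with the blocked_flag set, which matches the figure.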

stepthom commented 2 years ago

Thank you, that helps. Some more questions:

Thanks, Steve

shigeki commented 2 years ago

Please refer to these docs.

https://docs.google.com/viewer?a=v&pid=sites&srcid=Y2hyb21pdW0ub3JnfGRldnxneDo1Mzg4MjYzOWI2MzU2NDgw
https://www.chromium.org/Home/chromium-privacy/privacy-sandbox/floc/

stepthom commented 2 years ago

Thank you, @shigeki. In the document you link to, I assume that the relevant part is this:

The inputs are turned into a cohort ID using a technique we're calling PrefixLSH. It is similar to a SimHash variant called SortingLSH that was described by our colleagues in Google Research and Google Ads last October.

  • ...

  • The browser then uses the full set of domain name inputs to deterministically produce a 50-bit Locality-Sensitive Hash bitvector, where the i'th bit indicates the sign (positive or negative) of the sum of the i'th coordinates of all the floating-point vectors derived from the domain names.

  • A Chrome-operated server-side pipeline counts how many times each 50-bit hash occurs among qualifying users — those for whom we log cohort calculations along with their sync data.

  • The 50-bit hashes start in two big cohorts: all hashes whose first bit is 0, versus all hashes whose first bit is 1. Then each cohort is repeatedly divided into two smaller cohorts by looking at successive bits of the hash value, as long as such a division yields two cohorts each with at least 2000 qualifying users. (Each cohort will comprise thousands of people total, when including those Chrome users for whom we don't sync cohort data.)

  • ...
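As a side note, here is how I picture that SimHash step in code. This is only a rough sketch: the domain-to-vector mapping below (a Gaussian vector seeded from an FNV hash) is a made-up stand-in for whatever Chrome actually derives from the domain names, and the bit ordering is an arbitrary choice.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// domainVector stands in for however Chrome maps a domain name to a
// floating-point vector: here, a deterministic pseudo-random
// 50-dimensional Gaussian vector seeded by an FNV hash (pure invention).
func domainVector(domain string) [50]float64 {
	h := fnv.New64a()
	h.Write([]byte(domain))
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	var v [50]float64
	for i := range v {
		v[i] = r.NormFloat64()
	}
	return v
}

// simHash sets bit i when the sum of the i'th coordinates over all
// domain vectors is positive, as the quoted passage describes.
func simHash(domains []string) uint64 {
	var sums [50]float64
	for _, d := range domains {
		v := domainVector(d)
		for i := range sums {
			sums[i] += v[i]
		}
	}
	var bits uint64
	for i, s := range sums {
		if s > 0 {
			bits |= 1 << uint(i) // which end holds bit 0 is my arbitrary choice
		}
	}
	return bits
}

func main() {
	fmt.Printf("%050b\n", simHash([]string{"example.com", "news.example"}))
}
```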

Two questions:

I assume that the "Chrome-operated server-side pipeline" has already been computed and the results are in the contents of "cluster_data" - is that correct?

Also, the step "The 50-bit hashes start in two big cohorts" makes sense to me, but I don't see how the code in ApplySortingLsh implements this. The code seems to do the following:

Given a sim_hash (computed from the given URL list):

  1. Set cumulative_sum to 0
  2. Set index to 0
  3. Add either 2^34, 2^35, or 2^36 (depending on the bit width encoded in the cluster_data) to cumulative_sum.
  4. Check if cumulative_sum is bigger than sim_hash:
     a. If yes, done: set cohort_id to the current index and return.
     b. If no, continue.
  5. Increment index by 1
  6. GOTO step 3

I'm having trouble seeing how this is the same as what the document describes.
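In code, my reading of that loop is roughly the following (my own paraphrase, not the package's actual source; the cluster_data decoding follows your earlier comment, and the names are mine):

```go
package main

import "fmt"

// cohortID paraphrases steps 1-6 above: walk the cluster_data entries,
// accumulating each cohort's range size, until the running total
// exceeds sim_hash. The exact decoding is an assumption.
func cohortID(simHash uint64, clusterData []byte) (int, error) {
	var cumulativeSum uint64
	for index, b := range clusterData {
		bitWidth := uint(b & 0x3f) // 1st-6th bits: width of this cohort's range
		blocked := b&(1<<6) != 0   // 7th bit: blocked_flag
		cumulativeSum += 1 << bitWidth
		if cumulativeSum > simHash { // sim_hash falls inside this cohort's range
			if blocked {
				return 0, fmt.Errorf("cohort %d is blocked", index)
			}
			return index, nil
		}
	}
	return 0, fmt.Errorf("sim_hash %d is out of range", simHash)
}

func main() {
	// Made-up cluster_data: three cohorts with range widths 2^34, 2^35, 2^36.
	data := []byte{34, 35, 36}
	fmt.Println(cohortID(1<<34+1, data)) // 1 <nil>: lands in the second range
}
```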

stepthom commented 2 years ago

To expand:

It seems like the document is saying the algorithm should do something like this (numbers made up for illustration):
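Suppose there are 10,000 qualifying users. Splitting on the first bit gives 6,000 users whose hashes start with 0 and 4,000 whose hashes start with 1; both sides have at least 2,000 users, so the split stands. The 0-branch then splits on the second bit into 3,500 and 2,500 and keeps going, while the 1-branch would split into 2,800 and 1,200, and since 1,200 < 2,000 that split is rejected and all 4,000 users with prefix 1 become one cohort. So I would expect the code to walk successive bits of sim_hash through some kind of prefix tree, rather than accumulate range sizes.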

shigeki commented 2 years ago

I assume that the "Chrome-operated server-side pipeline" has already been computed and the results are in the contents of "cluster_data" - is that correct?

Right.

Also, the step "The 50-bit hashes start in two big cohorts" makes sense to me, but I don't see how the code in ApplySortingLsh implements this.

That step ran in a server-side pipeline internal to Google, not in floc_simulator. floc_simulator just uses its results, i.e. the cluster_data file that Google ships with Chrome. I do not know the details of Google's internal pipeline; please ask about it at https://github.com/WICG/floc/issues.