shigeki / floc_simulator

FLoC Simulator

Help understanding ApplySortingLsh? #3

stepthom opened this issue 2 years ago (status: Open)

stepthom commented 2 years ago

Hi,

Thank you for your package - it is very nicely written and very easy to use.

I am trying to understand how the ApplySortingLsh function works, in particular how it uses the cluster_data.

shigeki commented 2 years ago

cluster_data combines a blocked_flag (the 7th bit) with the bit width of each cohort Id's SimHash range (the 1st-6th bits). In the figure below (the Japanese labels can be ignored), you can see that each cohort Id spans a range of about 34-36 bits, i.e. 2^34 to 2^36 SimHash values, according to the cluster_data and the SimHash value. A cluster_data value of 98 or 99 has the blocked_flag set in the 7th bit, so that cohort Id is blocked.

[figure: per-cohort SimHash ranges derived from cluster_data]
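In code, the encoding can be sketched like this (a minimal illustration of the bit layout above; the function and constant names are mine, not the ones in the package):

```go
package main

import "fmt"

// Sketch of the cluster_data byte layout described above: bits 1-6
// carry the bit width of the cohort's SimHash range, bit 7 is the
// blocked_flag. Names are mine, not the package's.
const blockedFlag = 1 << 6 // the 7th bit, counting from 1

func decodeClusterData(b byte) (bitWidth uint, blocked bool) {
	return uint(b & 0x3f), b&blockedFlag != 0
}

func main() {
	for _, b := range []byte{35, 36, 98, 99} {
		w, blocked := decodeClusterData(b)
		fmt.Printf("cluster_data=%3d -> bit width=%d, blocked=%v\n", b, w, blocked)
	}
}
```

Note that 98 and 99 decode to bit widths 34 and 35 with the blocked_flag set, which matches the figure.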

stepthom commented 2 years ago

Thank you, that helps. Some more questions:

Thanks, Steve

shigeki commented 2 years ago

Please refer to these docs.

https://docs.google.com/viewer?a=v&pid=sites&srcid=Y2hyb21pdW0ub3JnfGRldnxneDo1Mzg4MjYzOWI2MzU2NDgw
https://www.chromium.org/Home/chromium-privacy/privacy-sandbox/floc/

stepthom commented 2 years ago

Thank you, @shigeki. In the document you link to, I assume that the relevant part is this:

The inputs are turned into a cohort ID using a technique we're calling PrefixLSH. It is similar to a SimHash variant called SortingLSH that was described by our colleagues in Google Research and Google Ads last October.

  • ...

  • The browser then uses the full set of domain name inputs to deterministically produce a 50-bit Locality-Sensitive Hash bitvector, where the i'th bit indicates the sign (positive or negative) of the sum of the i'th coordinates of all the floating-point vectors derived from the domain names.

  • A Chrome-operated server-side pipeline counts how many times each 50-bit hash occurs among qualifying users — those for whom we log cohort calculations along with their sync data.

  • The 50-bit hashes start in two big cohorts: all hashes whose first bit is 0, versus all hashes whose first bit is 1. Then each cohort is repeatedly divided into two smaller cohorts by looking at successive bits of the hash value, as long as such a division yields two cohorts each with at least 2000 qualifying users. (Each cohort will comprise thousands of people total, when including those Chrome users for whom we don't sync cohort data.)

  • ...
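As a side note, here is how I picture that SimHash step in code. This is only a rough sketch: the domain-to-vector mapping below (a Gaussian vector seeded from an FNV hash) is a made-up stand-in for whatever Chrome actually derives from the domain names, and the bit ordering is an arbitrary choice.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// domainVector stands in for however Chrome maps a domain name to a
// floating-point vector: here, a deterministic pseudo-random
// 50-dimensional Gaussian vector seeded by an FNV hash (pure invention).
func domainVector(domain string) [50]float64 {
	h := fnv.New64a()
	h.Write([]byte(domain))
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	var v [50]float64
	for i := range v {
		v[i] = r.NormFloat64()
	}
	return v
}

// simHash sets bit i when the sum of the i'th coordinates over all
// domain vectors is positive, as the quoted passage describes.
func simHash(domains []string) uint64 {
	var sums [50]float64
	for _, d := range domains {
		v := domainVector(d)
		for i := range sums {
			sums[i] += v[i]
		}
	}
	var bits uint64
	for i, s := range sums {
		if s > 0 {
			bits |= 1 << uint(i) // which end holds bit 0 is my arbitrary choice
		}
	}
	return bits
}

func main() {
	fmt.Printf("%050b\n", simHash([]string{"example.com", "news.example"}))
}
```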

Two questions:

I assume that the "Chrome-operated server-side pipeline" has already been computed and the results are in the contents of "cluster_data" - is that correct?

Also, the step "The 50-bit hashes start in two big cohorts" makes sense to me, but I don't see how the code in ApplySortingLsh implements this. The code seems to do the following:

Given a sim_hash (computed from the given URL list):

  1. Set cumulative_sum to 0
  2. Set index to 0
  3. Add either 2^34, 2^35, or 2^36 (depending on the bit width encoded in the cluster_data) to cumulative_sum.
  4. Check if cumulative_sum is bigger than sim_hash:
     a. If yes, done: set cohort_id to the current index and return.
     b. If no, continue.
  5. Increment index by 1
  6. GOTO step 3

I'm having trouble seeing how this is the same as what the document describes.
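In code, my reading of that loop is roughly the following (my own paraphrase, not the package's actual source; the cluster_data decoding follows your earlier comment, and the names are mine):

```go
package main

import "fmt"

// cohortID paraphrases steps 1-6 above: walk the cluster_data entries,
// accumulating each cohort's range size, until the running total
// exceeds sim_hash. The exact decoding is an assumption.
func cohortID(simHash uint64, clusterData []byte) (int, error) {
	var cumulativeSum uint64
	for index, b := range clusterData {
		bitWidth := uint(b & 0x3f) // 1st-6th bits: width of this cohort's range
		blocked := b&(1<<6) != 0   // 7th bit: blocked_flag
		cumulativeSum += 1 << bitWidth
		if cumulativeSum > simHash { // sim_hash falls inside this cohort's range
			if blocked {
				return 0, fmt.Errorf("cohort %d is blocked", index)
			}
			return index, nil
		}
	}
	return 0, fmt.Errorf("sim_hash %d is out of range", simHash)
}

func main() {
	// Made-up cluster_data: three cohorts with range widths 2^34, 2^35, 2^36.
	data := []byte{34, 35, 36}
	fmt.Println(cohortID(1<<34+1, data)) // 1 <nil>: lands in the second range
}
```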

stepthom commented 2 years ago

To expand:

It seems like the document is saying the algorithm should do something like this (numbers made up for illustration):
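Suppose there are 10,000 qualifying users. Splitting on the first bit gives 6,000 users whose hashes start with 0 and 4,000 whose hashes start with 1; both sides have at least 2,000 users, so the split stands. The 0-branch then splits on the second bit into 3,500 and 2,500 and keeps going, while the 1-branch would split into 2,800 and 1,200, and since 1,200 < 2,000 that split is rejected and all 4,000 users with prefix 1 become one cohort. So I would expect the code to walk successive bits of sim_hash through some kind of prefix tree, rather than accumulate range sizes.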

shigeki commented 2 years ago

I assume that the "Chrome-operated server-side pipeline" has already been computed and the results are in the contents of "cluster_data" - is that correct?

Right.

Also, the step "The 50-bit hashes start in two big cohorts" makes sense to me, but I don't see how the code in ApplySortingLsh implements this.

That step ran in a server-side pipeline internal to Google, not in floc_simulator. floc_simulator just uses its results, i.e. the cluster_data file that Google ships with Chrome. I do not know the details of Google's internal pipeline; please ask about it at https://github.com/WICG/floc/issues.