stepthom opened this issue 2 years ago
cluster_data combines a blocked_flag (the 7th bit) and the bit width of each cohort Id (the 1st-6th bits). In the figure below (ignore the Japanese in the figure), you can see that each cohort Id covers a SimHash range of about 34-36 bits, as given by its cluster_data value. A cluster_data value of 98 or 99 has the blocked_flag set in the 7th bit, so that cohort Id is blocked.
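In code, decoding one entry looks roughly like this (a sketch based on the layout above; the function and variable names are mine, not from the floc_simulator source):

```go
package main

import "fmt"

// decodeClusterData splits one cluster_data byte as described above:
// the 1st-6th bits hold the bit width of the cohort's SimHash range,
// and the 7th bit is the blocked_flag.
func decodeClusterData(v uint8) (bitWidth uint8, blocked bool) {
	return v & 0x3F, v&0x40 != 0
}

func main() {
	for _, v := range []uint8{34, 35, 36, 98, 99} {
		w, b := decodeClusterData(v)
		fmt.Printf("cluster_data=%2d -> bit width=%d, blocked=%v\n", v, w, b)
	}
	// cluster_data=98 -> bit width=34, blocked=true
	// cluster_data=99 -> bit width=35, blocked=true
}
```

So 98 and 99 are just the widths 34 and 35 with the blocked_flag (value 64) added.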
Thank you, that helps. Some more questions:
Thanks, Steve
Thank you, @shigeki. In the document you link to, I assume that the relevant part is this:
The inputs are turned into a cohort ID using a technique we're calling PrefixLSH. It is similar to a SimHash variant called SortingLSH that was described by our colleagues in Google Research and Google Ads last October.
...
The browser then uses the full set of domain name inputs to deterministically produce a 50-bit Locality-Sensitive Hash bitvector, where the i'th bit indicates the sign (positive or negative) of the sum of the i'th coordinates of all the floating-point vectors derived from the domain names.
A Chrome-operated server-side pipeline counts how many times each 50-bit hash occurs among qualifying users — those for whom we log cohort calculations along with their sync data.
The 50-bit hashes start in two big cohorts: all hashes whose first bit is 0, versus all hashes whose first bit is 1. Then each cohort is repeatedly divided into two smaller cohorts by looking at successive bits of the hash value, as long as such a division yields two cohorts each with at least 2000 qualifying users. (Each cohort will comprise thousands of people total, when including those Chrome users for whom we don't sync cohort data.)
...
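If I am reading that right, the bitvector step amounts to something like the sketch below, where domainToVector is a hypothetical stand-in for however Chrome deterministically derives a floating-point vector from a domain name:

```go
// Sketch of the sign-of-sums LSH bitvector quoted above.
// domainToVector is hypothetical: it stands in for the browser's
// deterministic domain-name -> 50-dimensional float vector step.
func simHashSketch(domains []string, domainToVector func(string) [50]float64) uint64 {
	var sums [50]float64
	for _, d := range domains {
		v := domainToVector(d)
		for i := range sums {
			sums[i] += v[i]
		}
	}
	var hash uint64
	for i, s := range sums {
		if s > 0 { // the i'th bit is the sign of the i'th coordinate sum
			hash |= 1 << i // actual bit ordering in Chrome may differ
		}
	}
	return hash
}
```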
Two questions:
I assume that the "Chrome-operated server-side pipeline" has already been computed and the results are in the contents of "cluster_data" - is that correct?
Also, the step "The 50-bit hashes start in two big cohorts" makes sense to me, but I don't see how the code in ApplySortingLsh implements this. The code seems to do the following, given a sim_hash (computed from the given URL list):

1. Set cumulative_sum to 0 and index to 0.
2. Add 2^34, 2^35, or 2^36 (depending on the "cluster data") to cumulative_sum.
3. Check if cumulative_sum is bigger than sim_hash.
   a. If yes, then done: set cohort_id to the current index and return.
   b. If no, increment index by 1 and go back to step 2.

I'm having trouble seeing how this is the same as what the document describes.
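For concreteness, here is my reading of that loop as a sketch (my paraphrase, not the actual floc_simulator source):

```go
// My paraphrase of the loop I see in ApplySortingLsh.
// clusterData holds the 34/35/36/98/99 bytes discussed above.
func applySortingLshSketch(simHash uint64, clusterData []uint8) (cohortID int, blocked bool) {
	var cumulativeSum uint64
	for index, v := range clusterData {
		bitWidth := v & 0x3F                   // 1st-6th bits
		cumulativeSum += uint64(1) << bitWidth // add 2^34, 2^35, or 2^36
		if cumulativeSum > simHash {
			return index, v&0x40 != 0 // blocked_flag in the 7th bit
		}
	}
	return -1, false // sim_hash beyond the table; not expected in practice
}
```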
To expand:
It seems like what the document is saying the algorithm should do is something like this (just making up the numbers for illustration):
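Roughly, a recursive split on successive hash bits that stops when a half would fall below the 2000-user threshold; countUsers is invented here, standing for counting the logged 50-bit hashes that share a given prefix:

```go
// Invented illustration of the server-side splitting quoted above:
// keep dividing a cohort on the next hash bit as long as both halves
// keep at least minUsers qualifying users. countUsers is hypothetical.
func splitCohorts(prefix string, countUsers func(string) int, minUsers int, out *[]string) {
	left, right := prefix+"0", prefix+"1"
	if len(prefix) < 50 && countUsers(left) >= minUsers && countUsers(right) >= minUsers {
		splitCohorts(left, countUsers, minUsers, out)
		splitCohorts(right, countUsers, minUsers, out)
		return
	}
	*out = append(*out, prefix) // this prefix becomes one final cohort
}
```

The document starts from the two big cohorts, i.e. splitCohorts("0", ...) and splitCohorts("1", ...), and if I follow, each final prefix of length k then covers a contiguous block of 2^(50-k) of the 50-bit hash values.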
I assume that the "Chrome-operated server-side pipeline" has already been computed and the results are in the contents of "cluster_data" - is that correct?
Right.
Also, the step "The 50-bit hashes start in two big cohorts" makes sense to me, but I don't see how the code in ApplySortingLsh implements this.
That step was done on Google's internal server, not in floc_simulator. The floc_simulator just uses its results, via the cluster_data file that Google provides in Chrome. I do not know the details of Google's internal pipeline. Please ask them at https://github.com/WICG/floc/issues.
Hi,
Thank you for your package - it is very nicely written and very easy to use.
I am trying to understand how the ApplySortingLsh function works. In particular: what do the values in cluster_data represent? I see that there are 33872 values in the array, so I assume there is one per cohort. But I see that most of the values seem to be 35, 34, 36, 98, or 99. What do these values mean/represent in the real world? Or, why/how are they used?