sonjageorgievska / Arena


Take care of the randomized addresses #3

Closed sonjageorgievska closed 8 years ago

sonjageorgievska commented 8 years ago

One solution is:

  1. Compute a statistic, e.g. what percentage p (average or median?) of the addresses detected in a time window is randomized.
  2. Exclude all randomized addresses when calculating density.
  3. After all calculations are done, scale the histogram to account for the p% of the data that was ignored.
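
A rough sketch of these three steps, in case it helps the discussion. It assumes each detection is a `(timestamp_seconds, address, is_randomized)` tuple; the function names and the 60-second window are hypothetical and not taken from the Arena code.

```python
from collections import defaultdict
from statistics import median


def randomized_fraction_per_window(detections, window_s=60):
    """Fraction of distinct addresses that are randomized, one value per time window.

    `detections` is an iterable of (timestamp_seconds, address, is_randomized)
    tuples (an assumed input format).
    """
    seen = defaultdict(dict)  # window index -> {address: is_randomized}
    for ts, addr, is_rand in detections:
        seen[int(ts // window_s)][addr] = bool(is_rand)
    return [sum(flags.values()) / len(flags) for _, flags in sorted(seen.items())]


def scale_histogram(counts, p):
    """Scale counts built from non-randomized addresses only.

    `p` is the fraction (0 <= p < 1) of all addresses that was ignored.
    """
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return [c / (1.0 - p) for c in counts]


# Step 1: estimate p (here the median over windows; the mean is the alternative).
detections = [
    (0, "a", False), (1, "b", True), (2, "c", False), (3, "d", False),
    (61, "a", False), (62, "e", True),
]
p = median(randomized_fraction_per_window(detections))

# Steps 2 and 3: the histogram is built without randomized addresses, then scaled.
scaled = scale_histogram([3, 6], p)
```

Dividing by `1 - p` is what "take into account that p% was ignored" amounts to when p is measured as a share of all detected addresses.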

Besides, we only detect a fraction x of the people anyway, because not everybody has a smartphone and some people carry two. The number x should be available online in some reports or papers. Our calculations should then be scaled to take x into account, too.

philiprn commented 8 years ago

Let's first try to find randomized addresses, using Alexey's insights.

Maybe investigate whether there is some correlation between addresses disappearing and randomized addresses appearing? I mean, for each randomized address there should be a non-randomized address missing.

On Mon, Jun 13, 2016 at 2:48 PM, sonjageorgievska notifications@github.com wrote:


sonjageorgievska commented 8 years ago

Hi there! This is interesting, but better to leave it for future work (a lot of theoretical work would be required to put it in the present paper). We already have a flag for the randomized addresses, and I did some comparison in March. I will try to estimate the number p so that you can use it directly. This number will also change over months/years, depending on Apple :)

More info: see the pictures addresses_per_second_nonrandomized and randomized in the resultsFromAnalysis folder. I think the Pearson correlation between the two series was quite high, ~0.85.

Edit: just checked the Pearson correlation between detected non-randomized and randomized addresses per minute. It is 0.983387 :) Per two minutes: 0.9895045
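
For reference, a minimal sketch of how such a check can be done, assuming the two series are already available as equal-length lists of per-minute counts. The helper names and the sample numbers are made up for illustration; they are not the data behind the values above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def rebin(counts, factor):
    """Sum consecutive groups of `factor` bins, e.g. per-minute -> per-two-minutes.

    A trailing incomplete group is dropped.
    """
    usable = len(counts) - len(counts) % factor
    return [sum(counts[i:i + factor]) for i in range(0, usable, factor)]


# Illustrative per-minute counts (made-up numbers).
nonrandomized = [120, 135, 150, 110, 90, 160]
randomized = [25, 30, 33, 24, 21, 35]

r_per_minute = pearson(nonrandomized, randomized)
r_per_two_minutes = pearson(rebin(nonrandomized, 2), rebin(randomized, 2))
```

Coarser bins smooth out short-term noise, which is one plausible reason the two-minute correlation comes out slightly higher than the per-minute one.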

sonjageorgievska commented 8 years ago

So, @philiprn, I found out that the ratio randomized/non-randomized is at most 0.225. This means that if you exclude all randomized addresses during calculation of density, then after all calculation is done, you can scale the histogram by 1.225 to make up for the left-out randomized addresses.
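
A small sketch of the arithmetic, assuming the ratio r = randomized/non-randomized is constant over the data. The function names are hypothetical. It also shows how r relates to the percentage p from the first comment, which is a share of all addresses rather than of the non-randomized ones.

```python
def scale_factor_from_ratio(r):
    """Factor to multiply non-randomized counts by, given r = randomized / non-randomized."""
    return 1.0 + r


def fraction_from_ratio(r):
    """Share p of all addresses that is randomized, given r = randomized / non-randomized."""
    return r / (1.0 + r)


r = 0.225                            # upper bound on the ratio reported above
factor = scale_factor_from_ratio(r)  # 1 + r
p = fraction_from_ratio(r)           # r / (1 + r), roughly 0.18

# Scaling by (1 + r) and dividing by (1 - p) are the same operation.
histogram = [40, 80, 120]            # made-up counts from non-randomized addresses only
scaled = [c * factor for c in histogram]
```

Since 0.225 is an upper bound on the ratio, scaling by 1.225 gives an upper estimate of the true counts.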