mozilla / ichnaea

Mozilla Ichnaea
http://location.services.mozilla.com
Apache License 2.0
573 stars 139 forks source link

Use location areas (lac) as fallback estimates #154

Closed hannosch closed 10 years ago

hannosch commented 10 years ago

While we build up our cell databases, in many cases we won't have entries for the actual cell id being searched for. But we might have one or more cell entries for the same location area.

If the cell search fails, we can do a secondary query for all matching radio, mcc, mnc, lac entries and report the average position for the entire area as the result.

In a second step, we could also maintain and store a position estimate for each lac, so we avoid the extra work doing lookup time. We could use a special value like "-2" to signify the combined lac estimate (0 is a valid value and we use -1 to signify "unknown"").

graydon commented 10 years ago

Initial work looking at the LACs of various countries. Histograms are in m, measuring the square root of the area of the bounding box; many of the bounding boxes are highly oblong. Accompanying maps show the bounding boxes.

LACs with fewer than 10 cells measured inside them are excluded, as are most LACs measuring over 300km on a side (there seem to be some outliers / odd data, but very few). I think I included a few of the gigantic ones in the USA map by accident, but the plot shows the basic arrangement.

USA (map):

# NumSamples = 366; Min = 151.00; Max = 294673.00
# Mean = 61122.158470; Variance = 3133448279.860133; SD = 55977.212148
# each * represents a count of 1
  151.0000 - 29603.2000 [   132]: ************************************************************************************************************************************
29603.2000 - 59055.4000 [    89]: *****************************************************************************************
59055.4000 - 88507.6000 [    60]: ************************************************************
88507.6000 - 117959.8000 [    36]: ************************************
117959.8000 - 147412.0000 [    18]: ******************
147412.0000 - 176864.2000 [     7]: *******
176864.2000 - 206316.4000 [    11]: ***********
206316.4000 - 235768.6000 [     6]: ******
235768.6000 - 265220.8000 [     6]: ******
265220.8000 - 294673.0000 [     1]: *

Germany (map):

# NumSamples = 782; Min = 618.00; Max = 200036.00
# Mean = 44241.829923; Variance = 1205651236.481304; SD = 34722.488915
# each * represents a count of 3
  618.0000 - 20559.8000 [   244]: *********************************************************************************
20559.8000 - 40501.6000 [   192]: ****************************************************************
40501.6000 - 60443.4000 [   140]: **********************************************
60443.4000 - 80385.2000 [    90]: ******************************
80385.2000 - 100327.0000 [    60]: ********************
100327.0000 - 120268.8000 [    24]: ********
120268.8000 - 140210.6000 [    17]: *****
140210.6000 - 160152.4000 [     8]: **
160152.4000 - 180094.2000 [     5]: *
180094.2000 - 200036.0000 [     2]: 

Russia (map):

# NumSamples = 274; Min = 290.00; Max = 257452.00
# Mean = 23186.240876; Variance = 1223725886.657307; SD = 34981.793646
# each * represents a count of 2
  290.0000 - 26006.2000 [   211]: *********************************************************************************************************
26006.2000 - 51722.4000 [    33]: ****************
51722.4000 - 77438.6000 [    13]: ******
77438.6000 - 103154.8000 [     7]: ***
103154.8000 - 128871.0000 [     1]: 
128871.0000 - 154587.2000 [     5]: **
154587.2000 - 180303.4000 [     1]: 
180303.4000 - 206019.6000 [     1]: 
206019.6000 - 231735.8000 [     0]: 
231735.8000 - 257452.0000 [     2]: *

Greece (map):

# NumSamples = 87; Min = 1021.00; Max = 174369.00
# Mean = 43754.666667; Variance = 1726208832.452107; SD = 41547.669399
# each * represents a count of 1
 1021.0000 - 18355.8000 [    37]: *************************************
18355.8000 - 35690.6000 [    12]: ************
35690.6000 - 53025.4000 [     7]: *******
53025.4000 - 70360.2000 [    12]: ************
70360.2000 - 87695.0000 [     3]: ***
87695.0000 - 105029.8000 [     8]: ********
105029.8000 - 122364.6000 [     3]: ***
122364.6000 - 139699.4000 [     2]: **
139699.4000 - 157034.2000 [     1]: *
157034.2000 - 174369.0000 [     2]: **

India (map):

# NumSamples = 195; Min = 3.00; Max = 48274.00
# Mean = 8941.425641; Variance = 83569801.126522; SD = 9141.651991
# each * represents a count of 1
    3.0000 -  4830.1000 [    98]: **************************************************************************************************
 4830.1000 -  9657.2000 [    38]: **************************************
 9657.2000 - 14484.3000 [    14]: **************
14484.3000 - 19311.4000 [    19]: *******************
19311.4000 - 24138.5000 [     8]: ********
24138.5000 - 28965.6000 [    11]: ***********
28965.6000 - 33792.7000 [     3]: ***
33792.7000 - 38619.8000 [     2]: **
38619.8000 - 43446.9000 [     0]: 
43446.9000 - 48274.0000 [     2]: **

Thailand (map):

# NumSamples = 229; Min = 58.00; Max = 128466.00
# Mean = 15934.161572; Variance = 412566154.746820; SD = 20311.724564
# each * represents a count of 1
   58.0000 - 12898.8000 [   146]: **************************************************************************************************************************************************
12898.8000 - 25739.6000 [    29]: *****************************
25739.6000 - 38580.4000 [    27]: ***************************
38580.4000 - 51421.2000 [    17]: *****************
51421.2000 - 64262.0000 [     4]: ****
64262.0000 - 77102.8000 [     1]: *
77102.8000 - 89943.6000 [     2]: **
89943.6000 - 102784.4000 [     0]: 
102784.4000 - 115625.2000 [     1]: *
115625.2000 - 128466.0000 [     2]: **
graydon commented 10 years ago

Before digging too far in here I did a little further postGIS investigation on the form of the LACs we have, specifically to see if the same (MCC, LAC) pair gets reused between network operators (MNCs) but also just to get a clearer feel for their structure.

First: this is a sample of 325,005 GSM cells. They group into 14,379 LACs so on average 23 cells per LAC, but the distribution of cells-per-LAC is highly skewed:

# NumSamples = 14379; Min = 1.00; Max = 1457.00
# Mean = 22.594478; Variance = 2442.828876; SD = 49.424982
# each * represents a count of 187
    1.0000 -   146.6000 [ 14089]: ***************************************************************************
  146.6000 -   292.2000 [   230]: *
  292.2000 -   437.8000 [    36]: 
  437.8000 -   583.4000 [     8]: 
  583.4000 -   729.0000 [     4]: 
  729.0000 -   874.6000 [     5]: 
  874.6000 -  1020.2000 [     2]: 
 1020.2000 -  1165.8000 [     3]: 
 1165.8000 -  1311.4000 [     0]: 
 1311.4000 -  1457.0000 [     2]: 

Restricting to LACs with < 100 cells, it's skewed towards smaller LACs:

# NumSamples = 14379; Min = 1.00; Max = 100.00
# 580 values outside of min/max
# Mean = 22.594478; Variance = 2442.828876; SD = 49.424982
# each * represents a count of 111
    1.0000 -    10.9000 [  8369]: ***************************************************************************
   10.9000 -    20.8000 [  1917]: *****************
   20.8000 -    30.7000 [  1053]: *********
   30.7000 -    40.6000 [   807]: *******
   40.6000 -    50.5000 [   509]: ****
   50.5000 -    60.4000 [   391]: ***
   60.4000 -    70.3000 [   276]: **
   70.3000 -    80.2000 [   214]: *
   80.2000 -    90.1000 [   133]: *
   90.1000 -   100.0000 [   130]: *

Even under 10 cells per LAC:

# NumSamples = 14379; Min = 1.00; Max = 10.00
# 6010 values outside of min/max
# Mean = 22.594478; Variance = 2442.828876; SD = 49.424982
# each * represents a count of 38
    1.0000 -     1.9000 [  2886]: ***************************************************************************
    1.9000 -     2.8000 [  1399]: ************************************
    2.8000 -     3.7000 [   968]: *************************
    3.7000 -     4.6000 [   715]: ******************
    4.6000 -     5.5000 [   549]: **************
    5.5000 -     6.4000 [   497]: *************
    6.4000 -     7.3000 [   403]: **********
    7.3000 -     8.2000 [   347]: *********
    8.2000 -     9.1000 [   323]: ********
    9.1000 -    10.0000 [   282]: *******

So to summarize: most LACs we've seen have only one cell sample in them, but there are a "decent number" with tens-to-dozens, a few in the low hundreds, and a very small number of "mega-LACs" with over 300 cells.

Here is a map of the "mega LACs", those with >100 cells, without any attempt to remove outliers / bugs. I have no idea what the big euro-african block is but it's a pile of LACs with a similar extent. I assume it's a bug.

Here is a map of the same set with LACs excluded if they have >300 cells or >2000km width. Most of them (by volume) are in Poland. I have no idea why.

Within the set of LACs -- mega and mini alike -- I went looking for LAC reuse: I found 16 (MCC, LAC) pairs that get reused between 4 different MNCs each; 161 that get reused between 3 MNCs each; and 1828 that get reused between 2 MNCs each. So it definitely happens! But not too often, about 70% of (MCC, LAC) pairs in this sample belong to a single MNC.

So in summary there: most of the benefit of backing off to LAC will be gained by looking for an exact match on MNC as well; a substantial minority, but a minority, of cases may also benefit from considering same-LAC-different-MNC. The question is: how much.

That question is geometric: are the LACs that get reused between MNCs actually denoting the same places? It's difficult to tell exactly, and I'm new to this. I decided to try calculating the "error" of the assumption-of-identity by calculating the extent-of-centroids-between-LACs divided by the extent-of-cell-centroids-within-each-LAC; the idea here being that if two LAC centroids are 10km apart it's maybe ok if the LACs are themselves 100km wide, but not if they're 100m wide.

What I got from this is about a quarter of the reused LACs have terrible error (i.e. greater-than-1.0) and small fraction have errors in the 100x range. Here's the set of errors clamped to 10x-or-less:

# NumSamples = 1909; Min = 0.00; Max = 10.00
# 133 values outside of min/max
# Mean = 38.421201; Variance = 1051412.965885; SD = 1025.384302
# each * represents a count of 19
    0.0000 -     1.0000 [  1462]: ****************************************************************************
    1.0000 -     2.0000 [   185]: *********
    2.0000 -     3.0000 [    48]: **
    3.0000 -     4.0000 [    31]: *
    4.0000 -     5.0000 [    13]: 
    5.0000 -     6.0000 [    11]: 
    6.0000 -     7.0000 [    12]: 
    7.0000 -     8.0000 [     5]: 
    8.0000 -     9.0000 [     5]: 
    9.0000 -    10.0000 [     4]:

And again, clamped to 1.0 or less:

# NumSamples = 1909; Min = 0.00; Max = 1.00
# 447 values outside of min/max
# Mean = 38.421201; Variance = 1051412.965885; SD = 1025.384302
# each * represents a count of 6
    0.0000 -     0.1000 [   500]: ***********************************************************************************
    0.1000 -     0.2000 [   228]: **************************************
    0.2000 -     0.3000 [   170]: ****************************
    0.3000 -     0.4000 [   115]: *******************
    0.4000 -     0.5000 [   106]: *****************
    0.5000 -     0.6000 [    88]: **************
    0.6000 -     0.7000 [    87]: **************
    0.7000 -     0.8000 [    62]: **********
    0.8000 -     0.9000 [    60]: **********
    0.9000 -     1.0000 [    46]: *******

There we're starting to see some structure, seems likely to me those are actually "the same area" LACs. Here's the same group clamped to 0.1 or less:

# NumSamples = 1909; Min = 0.00; Max = 0.10
# 1409 values outside of min/max
# Mean = 38.421201; Variance = 1051412.965885; SD = 1025.384302
# each * represents a count of 1
    0.0000 -     0.0100 [   113]: *****************************************************************************************************************
    0.0100 -     0.0200 [    63]: ***************************************************************
    0.0200 -     0.0300 [    51]: ***************************************************
    0.0300 -     0.0400 [    46]: **********************************************
    0.0400 -     0.0500 [    42]: ******************************************
    0.0500 -     0.0600 [    42]: ******************************************
    0.0600 -     0.0700 [    40]: ****************************************
    0.0700 -     0.0800 [    29]: *****************************
    0.0800 -     0.0900 [    43]: *******************************************
    0.0900 -     0.1000 [    31]: *******************************

In summary here, then: about 75% of the reused LACs seem to me to be "plausibly denoting the same area" (as in: their cell-coverage regions overlap), and about half (that is, 15% of total LACs) overlap coverage regions with a same-LAC-different-MNC by 50%-or-more.

So this could definitely be a detectable win, but it's not entirely clear to me that it's essential. I'll focus on just same-MNC-and-LAC for now.

I also did not look at recycling across radio types. This is just a GSM sample.

tamcap commented 10 years ago

Re: Poland. Didn't we get some funny uploads with single locations in not-so-random areas in Polands? The distribution clustering away from north-eastern Poland seems similar to the distribution of those "funny" points on the coverage map. :interrobang: