mohamedmansour / my-hangouts-extension

My Hangouts for Google Plus Chrome Extension
https://plus.google.com/116935358560979346551/about

Normalize Users Location before Querying #95

Closed mohamedmansour closed 12 years ago

mohamedmansour commented 12 years ago

We need to somehow normalize the location so we don't store duplicates in the internal database.

For example,

Chicago, IL | Chicago, IL, USA | 41.8781136 | -87.62979819999998
chicago, il | Chicago, IL, USA | 41.8781136 | -87.62979819999998
Chicago il  | Chicago, IL, USA | 41.8781136 | -87.62979819999998

As you see, we can do simple normalization that does the following:

By doing that, we can eliminate a good chunk of duplicate data. We could normalize further: for example, if a user just entered "Chicago", we could find the nearest match such as "Chicago, IL", but that might be dangerous. I believe it won't be, though.

mohamedmansour commented 12 years ago

/cc @johnbc @kaktus621

marmat commented 12 years ago

I would suggest only removing spaces and commas. [a-zA-Z] won't do, since many languages use characters outside that range; for Chinese or Japanese locations, for example, it would probably produce empty strings, which would cause massive errors.

I do agree with the lower casing, though. Combined with removing spaces and commas, it should take care of most of the duplicate locations (and also save some geocoding requests).
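
To illustrate the concern about a letters-only filter, here is a minimal sketch comparing the two approaches. The function names `stripToAsciiLetters` and `stripSeparators` are hypothetical and exist only for this comparison; neither is taken from the extension's code.

```javascript
// Hypothetical helper: keeps only ASCII letters, as an [a-zA-Z] whitelist would.
function stripToAsciiLetters(location) {
  return location.toLowerCase().replace(/[^a-z]/g, '');
}

// Hypothetical helper: lower-cases and removes only spaces and commas.
function stripSeparators(location) {
  return location.toLowerCase().replace(/[ ,]/g, '');
}

// Sample inputs chosen for illustration; the last two contain no ASCII letters.
var samples = ['Chicago, IL', 'chicago il', '東京', 'Москва'];

samples.forEach(function (s) {
  console.log(
    JSON.stringify(s),
    'letters only:', JSON.stringify(stripToAsciiLetters(s)),
    'separators removed:', JSON.stringify(stripSeparators(s))
  );
});
```

With the letters-only filter, every location written in a non-Latin script collapses to the same empty key, so unrelated cities would be treated as duplicates of each other. Removing only the separators keeps those names intact.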

marmat commented 12 years ago

Just a small status update: I implemented normalization in my Map My Circles extension for now and tested it:

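// Lower-case the location and strip spaces, commas, semicolons and periods,
// so that variants such as "Chicago, IL" and "chicago il" map to the same key.
Location.normalize = function(locationString) {
  return locationString.toLowerCase().replace(/[ ,;.]/g, '');
};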

Using my circles+follower base of currently 959 users, 832 API requests were made. This means that 959 - 832 = 127 duplicate locations were found using the normalization shown above. This may not sound like much, but it's still a saving of about 13% in traffic and requests compared to requesting all locations.

Furthermore, I think the rate may be even higher with a larger user base. More users will probably mean more people living in the same town, which in turn means even more duplicates that don't have to be requested over and over again.
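
As a rough sketch of how the normalized string can serve as a cache key so that duplicates never reach the geocoding API: the `createGeocodeCache` wrapper and the `fakeGeocode` stub below are hypothetical and not taken from either extension. A real geocoding call would be asynchronous; a synchronous stub is used here only to keep the example short.

```javascript
// Same normalization as the snippet posted earlier in this comment.
function normalize(locationString) {
  return locationString.toLowerCase().replace(/[ ,;.]/g, '');
}

// Hypothetical cache wrapper: geocodeFn stands in for the real geocoding request.
function createGeocodeCache(geocodeFn) {
  var cache = new Map();
  var requests = 0;
  return {
    lookup: function (rawLocation) {
      var key = normalize(rawLocation);
      if (!cache.has(key)) {
        // Only the first spelling of a location triggers a request.
        requests++;
        cache.set(key, geocodeFn(rawLocation));
      }
      return cache.get(key);
    },
    requestCount: function () {
      return requests;
    }
  };
}

// Stub geocoder for illustration only; it always returns the Chicago row from the issue.
function fakeGeocode(rawLocation) {
  return { address: 'Chicago, IL, USA', lat: 41.8781136, lng: -87.62979819999998 };
}

var geocoder = createGeocodeCache(fakeGeocode);
['Chicago, IL', 'chicago, il', 'Chicago il'].forEach(function (loc) {
  geocoder.lookup(loc);
});
console.log(geocoder.requestCount()); // 1
```

The three spellings from the issue description all normalize to the same key, so only one request is counted.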

mohamedmansour commented 12 years ago

@kaktus621 that sounds great. It would be cool if we could abstract our map stuff here to match Map My Circles, so we could reuse lots of code. BTW, Google unblocked our extension, so we can release it to the public when we are ready. Looking at the data in the locations table for My Hangouts, we currently have 1000 locations, which isn't that bad for two weeks' worth of collecting.

mohamedmansour commented 12 years ago

Hey @kaktus621, can we implement normalization in My Hangouts? I like the design you were using, and if we could port that in, that would be great. If you don't have time, I could do it tomorrow morning. I want to release the update to the public tomorrow afternoon.

marmat commented 12 years ago

Hi @mohamedmansour, sorry, I'm currently in the phase where we're writing exams every few days, so I couldn't really work on the feature. I will have some time this evening, though (in your timezone that's probably Sunday morning). I'll migrate some code from the MapMyCircles extension and then we can finalize that.

mohamedmansour commented 12 years ago

Don't worry about it :) I will let you know if I need any help. We can migrate the code and improve the design when you're done :)

mohamedmansour commented 12 years ago

This has been resolved now. I know we could be smarter about normalizing the cities, but we don't want the client to be more complex than it already is. Thanks @kaktus621 !