matomo-org / matomo


Open Source SVG Map to show cities and regions #1652

Closed anonymous-matomo-user closed 11 years ago

anonymous-matomo-user commented 14 years ago

UPDATE! Fund this project now to help us finish the work and release the feature!

It takes 2 minutes: http://crowdfunding.piwik.org/analytics-maps-world-country-city-region/

Right now the world map only shows which country visits are coming from. It would be really useful if we could narrow it down to the state, or even the city, a visitor came from, as in Google Analytics.

It'll be a great feature for marketing so we know where most of the visitors are!

We would also like to move the technology from Flash to a fully open source stack, and have the map displayed as SVG or another vector format.

Once the Maps show cities and region, we definitely have to show the maps in the actual Country report: full awesomeness! see #1821

Once implemented we should remove SWFobject lib: #3666

gka commented 13 years ago

Attachment: A mockup of the new city view cities2.png

gka commented 13 years ago

Attachment: A mockup of the new region view regions1.png

robocoder commented 14 years ago

This was mentioned in #1514.

mattab commented 13 years ago

This should be implemented in the next few weeks.

gka commented 13 years ago

So, I will start the map development in the next few days. There are a couple of things to do:

However, there are some open problems.

gka commented 13 years ago

Note: this image gives an overview of all available region outlines. I think it is pretty much complete.

http://gadm.org/img/gadm_v1_level1_high.png

mattab commented 13 years ago

Greg, great news that you will resume work on this :)

The GADM data looks good and is there for all countries. This sounds fantastic, nice find!

In both ways, this would require some install script to run every time the user updates his GeoIP database. Do you think this is possible?

It is possible to run a script on every GeoIP update, but not a good solution for ease of use. Here is a proposal: Maybe, we could run the script once before Piwik release (and commit the file to SVN after checking it's OK). We would use the latest GeoIP database for this. Then, if users use an older version (or newer version) compared to the pre-generated DB, maybe they could run the script manually themselves?

But, I would like to ask about this script, what exactly will it map?

My current understanding is that Piwik will report:

Is the algorithm designed to map "regions" according to GeoIP, to "regions" in the Piwik map?

I guess that, for each GeoIP region, we could give one lat/long of a city belonging to this region. If so, do we need a database for this, maybe the SWF could map in real time the lat/long to the pixel inside the region?

Thanks for clarifying

gka commented 13 years ago

Replying to matt:

It is possible to run a script on every GeoIP update, but not a good solution for ease of use. Here is a proposal: Maybe, we could run the script once before Piwik release (and commit the file to SVN after checking it's OK). We would use the latest GeoIP database for this. Then, if users use an older version (or newer version) compared to the pre-generated DB, maybe they could run the script manually themselves?

My current understanding is that Piwik will report:

Is the algorithm designed to map "regions" according to GeoIP, to "regions" in the Piwik map?

I didn't know that the GeoIP DB contains a complete list of FIPS regions. So Piwik already knows how many visitors each region had? In this case, there's no need for another update script. Now, all I have to ensure is that the map region ids are the same as the GeoIP/FIPS region ids.

Just to make sure I understand everything, here's a sample request flow:

Did I get it right?

Thanks

mattab commented 13 years ago

Yes it sounds good!

One other piece of feedback: give labels to the icons that switch to city/region view and to the zoom in/zoom out buttons, since these buttons are very important and must be easy to reach.

I think that we must check that GeoIP data will output the regions as we expect them, ie. that all visitors are indeed assigned in one of the regions listed in: http://www.maxmind.com/app/fips10_4

I will double check this and confirm here.

Is there any other open question apart from this one?

mattab commented 13 years ago

Greg, will the Flash maps know about all cities in the GeoIP Database?

Or will the Flash map expect a list of cities with lat/long and plot them "blindly" (projecting from lat/long to pixels, and drawing the city name based on Piwik API input)?

gka commented 13 years ago

The flash map doesn't know anything about the cities. Instead the map is able to project lat/lng coordinates to the exact pixels. I'm not sure if it will display all cities "blindly", maybe there will be some simple clustering of cities that are very close.

mattab commented 13 years ago

One thing I can think of is that sometimes GeoIP will not return region/city info (or, at other times, will return the region but not the city). So maybe we can plan to display the "Unknown" on the map somewhere, discreetly?

Because I guess the percentages displayed will take into account the percentage for the Unknown region/city?

gka commented 13 years ago

You mean that for some visitors GeoIP only knows the country but not the city and/or region? We can add a "Plus X unlocatable visitors" text somewhere to display this data. Good point.

mattab commented 13 years ago

Yes, we had better expect the worst with the GeoIP free edition; all use cases can happen. I think Piwik will return, for a country's regions, an 'Unknown' (or 'Other') row that will contain these. A simple text "X visits couldn't be located" sounds good!

robocoder commented 13 years ago

Sorry, I'm late joining this discussion.

I'm going to add a FIPS data file as part of my work in #1823 to convert the region names into the more compact FIPS 10-4 code when storing it in the log_visit table.

re: the MaxMind FIPS file

mattab commented 13 years ago

Partial feedback:

names are (English) localized without any special characters

Region names are in English charset, but they are not translated into English. For example, French regions are written in French. Maybe this is not true for all countries.

Would it help if I provide a reverse lookup, i.e., FIPS code to region name?

I think that the API output for getRegions (e.g.) should contain the region code and the full region name, like we do for the Country API output (which includes the country code and full name, icon path, etc.).

gka commented 13 years ago

Just wanted to let you know that I've got plenty of stuff to do right now. Hope to be able to continue working on this feature soon. Sorry for the delay..

mattab commented 13 years ago

I have just seen an interesting project: jVectorMap for canvas+JS Map library.

Greg, what do you think about this work? hope your work load is getting better :)

gka commented 13 years ago

Hi Matt,

jvectormap looks quite nice. I'm currently thinking a lot about JS/SVG based mapping myself, even developed some early prototypes while working on other projects. Still, our biggest challenge is how to build the map data files for every country (including admin level2 regions).

Thus, one of my next steps is to set up a Mapnik server and let it export clean SVG projections of shapefiles. My goal is to do that by writing as little code as possible. Mapnik is such a powerful library that it should be almost only a matter of configuration.

After we have created the map files, we still have the choice to either use one of the currently emerging JS/SVG mapping libraries (like jVectorMap) or to develop a mini library especially for Piwik ourselves.

Besides that, I fully agree on JS/SVG instead of Flash. Quite a challenge, but possible.

and, yes, my work load starts getting better. :)

mattab commented 13 years ago

Greg, I'm happy to hear you are keen on SVG for your own work. As you may have seen in the blog recently, we are now using canvas graphs only, and Timo from the team has contributed many patches to jqplot to make it work nicely in our use case. Hopefully, an existing library can meet our needs in terms of performance, maintainability, features, licensing.

In any case this work will be reused, not only in Piwik but I'm sure in hundreds of other projects in the future, since we are building the first truly open source world mapping with region details for all countries.

With shapefiles, I hope you can find a format that is small in size and has good shape quality. That must be quite a challenge.

mattab commented 13 years ago

Once the Maps show cities and region, we definitely have to show the maps in the actual Country report: full awesomeness! see #1821

mattab commented 13 years ago

See also the blog post by greg: http://vis4.net/blog/posts/piwik-maps-2/

gka commented 13 years ago

I am now proceeding with this task. As a first step, I looked at the shapefiles provided by gadm.org. Since some people seem to be very interested in my work (I got plenty of emails after finishing the first version of the map), I decided to blog about my progress. I will post the links here.

Part 1: Finding a map data source: http://vis4.net/blog/en/posts/recreating-the-piwik-map

mattab commented 13 years ago

Greg, excellent first step, looking forward to the next part!

mattab commented 13 years ago

Regarding the simplification of shape files, it is obvious that we don't need any detail around Chile; a rough outline of the coast would do perfectly (we can't afford all the little islands ;).

Also is it possible to add a test not to plot islands less than 10 square km, or something similar?

It is really key for user experience to have the smallest file size possible to ensure fast downloading and JS parsing / CPU usage. :)

gka commented 13 years ago

I just looked into a world shapefile and computed the areas of all 3761 polygons (a country may have multiple polygons for islands etc). Filtering every polygon smaller than 10 square km would remove 434 polygons (=11%). I checked the names of the countries which the removed polygons belong to and found many island states among them (as expected).

However, I think a hard cut at 10 sq km has some problems:

See sample rendering.

I tried a different rule: remove all polygons smaller than 5% of the maximum polygon area of that country. Since the maximum area of the Maldives is 9 sq km, all of their islands are kept, while all small islands of the USA are removed. This also halved the resulting SVG file size while still keeping every country on the map.

See sample rendering.
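
The two filtering rules discussed here can be compared with a minimal Python sketch. The function names, the `(country code, area)` tuple layout and the sample areas are assumptions made for illustration; this is not the script used to produce the renderings.

```python
from collections import defaultdict


def filter_relative(polygons, min_share=0.05):
    """Keep polygons covering at least `min_share` of their country's largest polygon."""
    polygons = list(polygons)
    largest = defaultdict(float)
    for country, area in polygons:
        largest[country] = max(largest[country], area)
    return [(c, a) for c, a in polygons if a >= min_share * largest[c]]


def filter_absolute(polygons, min_area=10.0):
    """The hard cut: drop every polygon smaller than `min_area` square km."""
    return [(c, a) for c, a in polygons if a >= min_area]


# (country code, area in sq km); the areas are made-up illustration values.
sample = [("US", 7_800_000.0), ("US", 12.0), ("MV", 9.0), ("MV", 2.0)]

kept_relative = filter_relative(sample)  # drops the small US island, keeps both MV islands
kept_absolute = filter_absolute(sample)  # keeps both US polygons, loses the Maldives entirely
```

With these sample values the hard cut removes the Maldives completely, while the relative rule keeps every country but drops small islands of large countries.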

However, removing every polygon smaller than 5% of the maximum per country is not satisfactory either. Big and well-known islands like Hawaii or Novaya Zemlya shouldn't be removed in my opinion.

Also there is a problem with those tiny islands countries like the Maldives, which is obvious if you look at the zoomed view on the Maldives. The islands are way too small to be rendered in a meaningful way.

This leads to some important questions: Does the map need to include all countries or is it acceptable to ignore some countries, like the Maldives Islands?

In the old map widget this wasn't important because there was no country level view.

gka commented 13 years ago

I think I would like to keep islands and outlying regions in the country-level views. For instance, I could try to generate composed maps that mix different projections to fit the complete country into the map.

Like in this example for the United States.

gka commented 13 years ago

One note regarding the projection used for the country-level views. I would like to simplify the whole thing by using the same projection for every country, but with different parameters (namely the projection center).

One of the simpler projections is the orthographic projection, which looks the same as looking at a 3D globe. This gives quite good, less distorted views for almost every country. The only country that looks quite distorted is Russia. Because of its huge area, Russia takes up too much space on the globe.

You can have a look at distorted Russia here.

My opinion is that this distortion is acceptable given the simplicity of rendering maps in a single projection. In an ideal mapping world, one would use specialized projection for every country, like the New Zealand Map Grid for New Zealand etc.
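
As a rough illustration of "same projection, different center", here is a minimal Python sketch of the standard spherical orthographic projection. The function name and the unit radius are assumptions; it is not taken from the map generation code discussed in this thread.

```python
import math


def orthographic(lat, lon, center_lat, center_lon, radius=1.0):
    """Project a point given in degrees onto a plane tangent at the projection center.

    Returns (x, y), or None when the point lies on the far side of the globe.
    """
    phi, lam = math.radians(lat), math.radians(lon)
    phi0, lam0 = math.radians(center_lat), math.radians(center_lon)
    # Cosine of the angular distance between the point and the projection center.
    cos_c = (math.sin(phi0) * math.sin(phi)
             + math.cos(phi0) * math.cos(phi) * math.cos(lam - lam0))
    if cos_c < 0:
        return None  # hidden behind the horizon
    x = radius * math.cos(phi) * math.sin(lam - lam0)
    y = radius * (math.cos(phi0) * math.sin(phi)
                  - math.sin(phi0) * math.cos(phi) * math.cos(lam - lam0))
    return x, y


# Only the center changes from country to country, e.g. roughly New Zealand here.
center_point = orthographic(-41.0, 174.0, -41.0, 174.0)
far_side = orthographic(41.0, -6.0, -41.0, 174.0)
```

Points far from the center are compressed towards the edge of the disc, which is why a country as large as Russia looks distorted while a small one barely changes.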

Any more opinions?

gka commented 13 years ago

The next design decision has to do with the aspect ratio of the maps. At some point in the map generation process I need to crop the country-level maps to a bounding box. The idea is to fit the countries as big as possible into the views while also showing a bit of their neighbour countries for navigation.

Now the question arises to which aspect ratio the maps should be cropped. As far as I can see, we have two options:

  1. A fixed aspect ratio (presumably some wide-screenish format) would be the simplest solution. The complete cropping could be done in the preprocessing stage, which would reduce complexity of map rendering. However, the obvious drawback of this solution is that the maps won't look as good as they could when displaying countries in the "opposite" aspect ratio. For some portrait countries (like Germany), this should be acceptable. For others with more extreme aspect ratios (my favourite example is Chile), it is not.
  2. In the latter cases, a dynamic aspect ratio would make much more sense. It would be perfect if we could present the Chilean Piwik users (I assume there are some) with a portrait view of their country. Dynamic aspect ratios could be done either by choosing a fixed ratio per country or by cropping the maps at rendering time. Choosing a fixed ratio per country has the drawback that those countries could not fit a landscape ratio in fullscreen mode. In contrast, cropping at runtime may take more CPU. Also, I don't know if the current dashboard supports dynamic resizing of widgets, but this may be easy to implement.
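
Option 1 can be sketched in a few lines of Python: grow each country's bounding box until it matches the fixed ratio. The function name and the default 16:9 ratio are assumptions made for illustration.

```python
def fit_to_aspect(min_x, min_y, max_x, max_y, aspect=16 / 9):
    """Grow a bounding box around its center until width / height equals `aspect`."""
    width, height = max_x - min_x, max_y - min_y
    center_x, center_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    if width / height < aspect:
        width = height * aspect   # portrait country (e.g. Chile): add space left and right
    else:
        height = width / aspect   # very wide country: add space above and below
    return (center_x - width / 2, center_y - height / 2,
            center_x + width / 2, center_y + height / 2)


tall_box = fit_to_aspect(0, 0, 1, 10, aspect=2.0)
wide_box = fit_to_aspect(0, 0, 10, 1, aspect=2.0)
```

The box is only ever enlarged, so the country itself is never cropped; the extra space shows neighbouring countries or sea.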

mattab commented 13 years ago

I tried a different rule: remove all polygons smaller than 5% of the maximum polygon area of that country.

Sounds good

This leads to some important questions: Does the map need to include all countries or is it acceptable to ignore some countries, like the Maldives Islands?

Removing countries altogether is, I think, not a good idea. Maybe we could still display ALL countries, but keep the 5% rule for all other countries (e.g. the Maldives would be displayed with at least the main islands, if possible, while Canada would still lose many big islands).

I think I would like to keep islands and outlying regions in the country-level views. For instance, I could try to generate composed maps that mix different projections to fit the complete country into the map. Like in this example for the United States.

If you can do that (at least for some selected countries?), it would be pretty cool!

My opinion is that this distortion is acceptable given the simplicity of rendering maps in a single projection. In an ideal mapping world, one would use specialized projection for every country, like the New Zealand Map Grid for New Zealand etc.

Simple is always better, even if the shape is not perfect. We could improve this later anyway. What would it look like for NZ? ;-)

The idea is to fit the countries as big as possible into the views while also showing a bit of their neighbour countries for navigation.

Also, maybe it would be useful if hovering over the countries next to the zoomed country displayed a little tooltip with the country metrics. Maybe it could be displayed next to or below the main country metrics (which would always be displayed?).

fixed aspect ratio (presumably some wide-screenish format) would be the simplest solution.

Agreed. I think it is expected that all country zooms have the same dimensions. The dashboard doesn't support dynamically sized widgets at present (not planned). Chile looks good in the map, even with a lot of sea. We can use the space for legends, metrics, etc.

The dynamic aspect ratio could be implemented later in the lib for static country-specific maps...

Thanks for posting your thoughts and log here, cheers!

gka commented 13 years ago

For New Zealand the two projections don't differ that much. NZ is rather small (at least compared to the Earth), so the orthographic projection is quite a good approximation.

mattab commented 13 years ago

Replying to greg:

For New Zealand the two projections don't differ that much. NZ is rather small (at least compared to the Earth), so the orthographic projection is quite a good approximation.

Indeed the orthographic projection seems really good!

gka commented 13 years ago

Update: I managed to generate SVG maps like this for every country:

[[Image(http://vis4.net/tmp/FR.png)]]

You can check them out here. (ZIP, 4.8MB, contains small PNGs for each map)

This is how it works: For each country the algorithm computes a 'nice' bounding box which includes only the most important polygons. For instance, the US bounding box does not include Alaska and Hawaii and the bounding box of Spain doesn't include all those Spanish islands.

However, for some countries like Japan, the current algorithm doesn't work. We could fix this by manually adjusting the parameters for those "problem" countries.

The next step would be to replace the polygons of the active countries with their sub-region polygons.

Also I'm working on a nice compression algorithm for the vector data. SVG has way too much overhead because of all that XML syntax. The smallest file size can be achieved using a CSV-like format. We could reduce the size even more by kind-of-Base64 encoding the coordinates. For instance, the number "12345" can be stored as something like "zxB", which saves two bytes for each number.
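
To make the "kind-of-Base64" idea concrete, here is a minimal Python sketch that writes non-negative integers in base 64. The alphabet, its symbol order and the function names are assumptions; the encoding actually planned for the maps may differ.

```python
import string

# 64 symbols that are safe inside a CSV field; choice and order are assumptions.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def encode(number):
    """Write a non-negative integer using base-64 digits from ALPHABET."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    digits = []
    while True:
        number, remainder = divmod(number, 64)
        digits.append(ALPHABET[remainder])
        if number == 0:
            break
    return "".join(reversed(digits))


def decode(text):
    """Inverse of encode()."""
    number = 0
    for symbol in text:
        number = number * 64 + ALPHABET.index(symbol)
    return number


encoded = encode(12345)  # three characters instead of five decimal digits
```

With this particular alphabet, 12345 becomes "DA5"; any integer below 64^3 = 262144 fits into three characters.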

gka commented 13 years ago

Here's the mentioned map of Japan

[[Image(http://vis4.net/tmp/JP.png)]]

mattab commented 13 years ago

Thanks for the update, you are making nice progress! :)

AU + JP + AX look OK, but it looks like it is only a zoom problem. Would zooming out slightly fix the display for these countries?

Canada + Greece + Indonesia + Norway + Philippines look very detailed (all the little islands); I'm not sure we need so much detail.

Also I'm working on a nice compression algorithm for the vector data.

Great to hear, I can't stress enough how important it is to have small file sizes and fast SVG rendering. The Map will be displayed in all dashboards by default so should load uber fast so it doesn't slow down the Piwik dashboard experience.

The next step would be to replace the polygons of the active countries with their sub-region polygons.

Does it mean, that you will draw the country regions inside the existing country shape, for these countries for which we have region mapping information?

gka commented 13 years ago

Great to hear, I can't stress enough how important it is to have small file sizes and fast SVG rendering. The Map will be displayed in all dashboards by default so should load uber fast so it doesn't slow down the Piwik dashboard experience.

I see no problems for the Piwik dashboard because the map will only load those maps it really needs. Each country map will be stored in a different file and most users will only need two or three of them. The most important thing is the file size of the world map, which should be way smaller than the current map plugin size (250kB).

However, I am concerned about the total size of all country maps (which, again, aren't loaded before the user clicks on a country). There are 246 countries in the map; even if I manage to reduce the average map size to 10kB, it would take more than 2MB to store all of them.

Does it mean, that you will draw the country regions inside the existing country shape, for these countries for which we have region mapping information?

The resulting maps will look like this: [[Image(http://vis4.net/tmp/DE_lev1.png)]]

[[Image(http://vis4.net/tmp/FR_lev1.png)]]

Just to clear this up: the region maps are available for every country. The mapping between GeoIP cities and country regions must be computed in most cases.

mattab commented 13 years ago

However, I am concerned about the total size of all country maps (which, again, aren't loaded before the user clicks on a country). There are 246 countries in the map; even if I manage to reduce the average map size to 10kB, it would take more than 2MB to store all of them.

One idea: the XML for all countries could be stored in a single ZIP file. Requests for the XML of a given country could then go through a Piwik PHP controller, which would unpack the ZIP on demand and return only the XML (or CSV) that is being requested. That way, the size overhead of all maps is just the size of the ZIP containing them all. The PHP code would be very simple: unpack the ZIP (code already exists, cf. Piwik_Unzip) and return the requested mapping info.

If you have 200 * 10kB = 2MB, zipped should be around 200-400kB maybe, which should be uber fast to unzip, and also be low overhead. Thoughts?
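
The on-demand lookup can be sketched as follows. This is Python rather than the PHP controller proposed above, and the `<code>.svg` naming scheme and the function name are assumptions made for illustration.

```python
import io
import zipfile


def read_country_map(archive, country_code):
    """Return the bytes of one country's map from an open ZipFile, or None if absent."""
    name = f"{country_code.upper()}.svg"  # hypothetical naming scheme
    try:
        return archive.read(name)  # only this one member is decompressed
    except KeyError:
        return None


# Build a tiny archive in memory so that the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("DE.svg", "<svg><!-- Germany --></svg>")
    archive.writestr("FR.svg", "<svg><!-- France --></svg>")

with zipfile.ZipFile(buffer) as archive:
    france = read_country_map(archive, "fr")
    missing = read_country_map(archive, "xx")
```

Because ZIP members are compressed individually, reading one country does not require unpacking the whole archive.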

gka commented 13 years ago

If you have 200 * 10kB = 2MB, zipped should be around 200-400kB maybe, which should be uber fast to unzip, and also be low overhead. Thoughts?

Zipping the kind-of-Base64 encoded CSV files leads to compression rates of approximately 50%, not 10-20%.

Is disk storage really such a big issue for Piwik? Since the MySQL tables are getting large anyway (500kB per month in my own Piwik installation), I think that 2MB is still acceptable. However, serving the maps gzipped to the browser makes sense, since web traffic is a big issue for dashboards, especially when they are also supposed to run on mobile devices.

gka commented 13 years ago

By the way, to get an overview of the GeoIP location database I quickly mapped all stored locations. Maybe we could use this information to focus on maps for those countries where it makes sense at all.

[http://vis4.net/tmp/geoip-locations.png]

Some African countries have fewer GeoIP locations than sub-country regions. We could consider limiting the region-level reporting to those countries whose GeoIP location density is greater than the density of regions.
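
Since both densities refer to the same country area, the comparison reduces to counting: a country qualifies when it has more GeoIP locations than regions. A minimal Python sketch, with a hypothetical function name and made-up counts:

```python
def countries_for_region_reports(location_counts, region_counts):
    """Return the codes of countries with more GeoIP locations than sub-country regions."""
    return sorted(
        country
        for country, regions in region_counts.items()
        if location_counts.get(country, 0) > regions
    )


# Made-up counts for illustration only.
location_counts = {"FR": 9000, "TD": 5, "DE": 7000}
region_counts = {"FR": 22, "TD": 18, "DE": 16}

selected = countries_for_region_reports(location_counts, region_counts)
```

A stricter threshold (for example several locations per region) would be a one-line change to the comparison.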

Close-up view on France:

[[Image(http://vis4.net/tmp/FR-closeup.png)]]

mattab commented 13 years ago

Is disk storage really such a big issue for Piwik? Since the MySQL tables are getting large anyway (500kB per month in my own Piwik installation), I think that 2MB is still acceptable. However, serving the maps gzipped to the browser makes sense, since web traffic is a big issue for dashboards, especially when they are also supposed to run on mobile devices.

Disk storage is not a critical issue, but we try to keep the ZIP as small as possible, since once unzipped it is already 15.5 MB (and 5.5M zip).

We could easily serve the region maps in GZIP using the existing function Piwik::serveStaticFile which returns gzip if supported by server.
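
The general idea of "gzip if supported" looks roughly like the Python sketch below. It is not the implementation of Piwik::serveStaticFile; the function name is hypothetical and the header parsing is deliberately simplified (quality values such as `gzip;q=0` are ignored).

```python
import gzip


def encode_body(body, accept_encoding):
    """Return (payload, extra_headers), compressing when the client advertises gzip."""
    accepted = {token.split(";")[0].strip().lower() for token in accept_encoding.split(",")}
    if "gzip" in accepted:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}


svg = b"<svg>" + b'<path d="M0 0L1 1"/>' * 500 + b"</svg>"
compressed, gzip_headers = encode_body(svg, "gzip, deflate")
plain, plain_headers = encode_body(svg, "identity")
```

Vector map data is repetitive text, so the compressed payload is usually much smaller than the original file.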

mattab commented 13 years ago

Some African countries have fewer GeoIP locations than sub-country regions. We could consider limiting the region-level reporting to those countries whose GeoIP location density is greater than the density of regions.

Interesting map! Did you use the free GeoIP db for it? It would be interesting to plot the same using the commercial geoip DB (which some Piwik users will use) and see if there is any difference. I've sent you an email.

In any case I think it would be nice to have sub-region mapping for all countries, but maybe the region boundaries could be less sharp/accurate for the countries for which GeoIP density is poor (rather than not plotting regions at all for these countries)? Thoughts?

gka commented 13 years ago

The current GeoLiteCity db contains more locations (318814) than the commercial GeoCity db you sent along (168685). The reason could be that the version you have dates from 2005.

However, the distribution pattern looks almost the same.

[http://vis4.net/tmp/geolitecity-locations.png]

[http://vis4.net/tmp/geocity-locations.png]

gka commented 13 years ago

I just checked all the region codes inside the GeoIP database and found out that they are not easily mappable to the region codes in the GADM shapefile. For some countries, like the US, the GeoIP db stores two-letter codes, for many other countries the regions are identified by two-digit numbers. The GADM shapefile uses two-letter ISO codes to identify the inner-country regions.

How do you think we should integrate the GeoIP database with the map plugin?

One option would be to force the user to run some kind of initialization script which calculates the region for each GeoIP location. This bears the risk of false identification of cities that lie on the border between two regions.

Another option would be to try to map the inconsistent GeoIP region ids to the right ISO region ids. We could do this either by matching the region names (which are stored in both databases) or by trying to find mapping tables for all used region identifiers. This bears the risk of false-identification of regions, for instance if there are errors in one of the databases or the automated identification fails. Also, here we would need to run an initialization script that brings both databases together.
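
Matching by region name could look like the Python sketch below. The normalization rules, the function names and all region codes in the sample data are assumptions for illustration; real data would need manual review of whatever stays unmatched.

```python
import unicodedata


def normalize(name):
    """Lower-case a region name and strip accents, spaces and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in without_accents.lower() if ch.isalnum())


def match_regions(geoip_regions, gadm_regions):
    """Map GeoIP region codes to GADM region codes by comparing normalized names.

    Both arguments are dicts of code -> region name for a single country.
    Returns (mapping, unmatched_geoip_codes).
    """
    by_name = {normalize(name): code for code, name in gadm_regions.items()}
    mapping, unmatched = {}, []
    for code, name in geoip_regions.items():
        target = by_name.get(normalize(name))
        if target is None:
            unmatched.append(code)
        else:
            mapping[code] = target
    return mapping, unmatched


# Illustrative codes and names only.
geoip_fr = {"A8": "Ile-de-France", "B9": "Rhone-Alpes", "99": "Nowhere"}
gadm_fr = {"IF": "Île-de-France", "RA": "Rhône-Alpes"}

mapping, unmatched = match_regions(geoip_fr, gadm_fr)
```

Returning the unmatched codes explicitly makes the risk of false identification visible instead of silently dropping regions.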

Does anyone know whether the GeoIP plugin uses the PHP/MySQL approach or is built on top of a GeoIP API which works with the binary database? I don't know if it is possible to access the raw list of locations through the binary API.

gka commented 13 years ago

Just saw that the GeoIP database uses the admin level 1 codes from geonames.org:

[http://download.geonames.org/export/dump/admin1CodesASCII.txt]

robocoder commented 13 years ago

Maybe we can work out the mapping as part of #2379.

gka commented 13 years ago

By the way, I just saw that the next version of GA supports region level mapping.

[http://vis4.net/tmp/ga-regions.png]

gka commented 13 years ago

Ok, I just got the first results.

246 country maps, including regions for the selected country.

SVG

total size: 3.6MB

zipped: 1.0MB

CSV

total size: 2.6MB

zipped: 857kB

This example shows the quality of the maps:

[[Image(http://vis4.net/tmp/ye.png)]]

Given that the SVG files are not much bigger than the CSV files, especially when zipped, I would recommend using the SVG files, which are a lot easier to maintain. However, the server should be configured to serve the SVG files gzipped.

I would not recommend lowering the quality, and I hope 1MB is still acceptable. Comments?

gka commented 13 years ago

Here's the zip file which contains all SVG maps:

[http://vis4.net/tmp/all-svg.zip]

robocoder commented 13 years ago

+1 for SVG

mattab commented 13 years ago

The region maps look beautiful!

1M zipped sounds reasonable, but no more please ;) It is an increase of 20% of the ZIP size. Worth it for beautiful, open technology maps of the whole world for sure!

Btw, I noticed that South Sudan (SS) is not in the list of countries. Is it possible to add it? (It's been independent since July 2011.)

Also, did you see whether GeoIP returns "Tibet" as a country? It would be nice to have a map for it since it should also be an independent country!

What are your thoughts regarding the mapping of GeoIP regions to the Geonames IDs? Probably the Piwik API dataset should contain the right Geonames mapping directly, which means that the mapping should be done in the GeoIP API as part of #1823. Any thoughts?