On Location - Githubissues

danfowler commented 8 years ago

Currently, the specification of the location dimension is limited to the following suggested attributes: code, title, and codeList. This is useful if you have a dataset with a column containing well-formatted codes that you can look up in a well-defined codeList (e.g. one of these). But location information can come in many forms (e.g. city, state, latitude, longitude, etc.) and we might want to explicitly capture this information in the model (see https://github.com/openspending/fiscal-data-package/issues/79 for a discussion on adding latitude and longitude). We have a use case for modeling location information for a budget dataset that specifies region names like the following:

District Council Singerei
District Council Soroca
District Council Edinet

OpenSpending is currently developing a web service that will accept location parameters (e.g. a country code and region name) and return a GeoJSON polygon. We would like to use information from the dataset above to look up polygons for visualizing on a map.

Currently, we are using a combination of the top-level countryCode (in this case "MD") and a single attribute in the location dimension to derive a location for each spending line. This is problematic for two reasons:

countryCode is a top-level element that applies to the entire dataset and, for this reason, can also be an array. It would be preferable to source all information for doing geographic visualization from one place that directly describes the spending line (that is, the model).
How can we specify the type of each attribute in a location dimension so that we know how to do a lookup more generically?

As a first step in being more prescriptive with the location dimension, we need to have some standard, predictable place to put countryCode in the model. One quick way we can do this is to add countryCode as a suggested attribute to a location dimension. In cases where the countryCode is not listed in the dataset, we can provide it via a constant keyword. While we are doing this, we might as well add more fields to get some specificity in the location data we are mapping. I look toward OCDS for some potential fields:

postalCode
countryName
streetAddress
region
locality

So that the location dimension could look like this:

"location": {
  "dimensionType": "location",
  "attributes": {
    "region": {
      "source": "admin1"
    },
    "countryCode": {
      "constant": "MD"
    }
  }
}
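
A minimal sketch of how an implementor might read such a dimension for one spending line, assuming a `constant` always wins and a `source` names a column in the row. The function name `resolve_location` and the sample row are hypothetical and are not part of the spec.

```python
# Hypothetical model fragment mirroring the location dimension above.
LOCATION_DIMENSION = {
    "dimensionType": "location",
    "attributes": {
        "region": {"source": "admin1"},
        "countryCode": {"constant": "MD"},
    },
}


def resolve_location(dimension, row):
    """Return {attribute name: value} for a single spending line.

    Assumption: each attribute carries either a "constant" or a "source".
    """
    resolved = {}
    for name, spec in dimension["attributes"].items():
        if "constant" in spec:
            # Value is fixed for the whole dataset.
            resolved[name] = spec["constant"]
        elif "source" in spec:
            # Value is read from the named column; None if the column is absent.
            resolved[name] = row.get(spec["source"])
        else:
            raise ValueError(f"attribute {name!r} has neither source nor constant")
    return resolved


# Made-up row; only the region name comes from the dataset described above.
row = {"admin1": "District Council Soroca", "amount": "1200"}
location = resolve_location(LOCATION_DIMENSION, row)
print(location)
```

With this shape, everything a lookup needs (region name and country code) comes from the model rather than from the top-level countryCode.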

What do you think @rgrp @pwalsh @akariv ?

rufuspollock commented 8 years ago

I get the logic here and I think this is already possible.

However, I want to flag some bigger points from discussion with @danfowler

Underlying user story: I'm a user with a dataset and I want to visualise it geographically. All I have in my dataset are region names.
- Technical people immediately think: OK, to visualize this we need actual GeoJSON or similar, so we need to run our data against a gazetteer. To do that we need enough info to use the gazetteer.
I get this, but I have concerns about premature automation here, as follows.
Automating this kind of geo-reconciliation is not easy and you need to get it right. 90% reliable is not good enough. Also, it will often turn out that many datasets have very specific needs, e.g. you have a budget dataset for NYC and want ward boundaries, which aren't in your geo-reconciliation service.
It may therefore be better to focus on generating the general framework, e.g. give users some working JS + HTML but where they need to do a bit of manual data processing. Over time we can gradually automate more and more.

Perhaps we are already at this point but I'd then like to see a good bit of analysis and summary of the services we have looked at and a good sample of datasets e.g. at least one municipal and country geo where we have worked this through.

Otherwise my suggestion for right now would be:

Focus on doing the geo-reconciliation in the frontend with manual input if necessary for the demos right now
Meanwhile explore what can be automated easily and in enough cases
Remember that a decent viz probably can't be created entirely in a wizard. Rather, the wizard is about allowing the user to create something they can then export (as HTML+JS) and then go and tweak and integrate themselves.

pwalsh commented 8 years ago

@danfowler @rgrp

I don't like that the original post here ties the clarification of location as a dimension to the web service that we have in OpenSpending. It immediately frames the problem in terms of "how can we make FDP work with the OpenSpending web service", rather than "FDP has a location dimension referred to in the spec: how do we make a location dimension valuable/usable".

Let's not talk directly about the web service or frontend reconciliation. Those are not the issues, and obviously we are already doing that.

Let's talk about how a location dimension is valuable in the spec, and ensure the spec delivers. Otherwise, why do we have a location dimension in the first place?

akariv commented 8 years ago

This is a case in which we feel that the spec is not being descriptive enough to allow implementors to make use of the values stored in the location dimensions.

The solution here is to make the spec more descriptive, very much like we not only provide means to indicate that a dimension is a classification but also which kind of classification.

I am very reluctant to use attribute names for anything, and in fact I would remove the suggestion and limitation for specific attribute names altogether. I think we should approach this in a similar manner to classifications. We should add an optional sub-categorisation for location dimensions, allowing publishers to specify the geo feature that the dimension describes (e.g. city / country / region / address). This subcategory could be used by implementors as a hint for reconciliation mechanisms etc. [Actually, having this on the dimension level might be too restricting and we should probably add it as an attribute property - to be discussed]
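
As a rough sketch of the attribute-property variant, assume each attribute may carry a hint naming the geo feature it describes. The property name `geoType`, its values, and the function `lookup_params` are hypothetical; nothing in the spec defines them. The point is that an implementor keys off the hint rather than the attribute name.

```python
# Attribute names are arbitrary here; only the hypothetical "geoType" hint
# tells an implementor what each value means.
attributes = {
    "council_name": {"source": "admin1", "geoType": "region"},
    "cc": {"constant": "MD", "geoType": "country"},
    "notes": {"source": "comment"},  # no hint, so it is ignored for lookups
}


def lookup_params(attributes, row):
    """Build reconciliation parameters keyed by geo feature, not attribute name."""
    params = {}
    for spec in attributes.values():
        geo_type = spec.get("geoType")
        if geo_type is None:
            continue
        if "constant" in spec:
            params[geo_type] = spec["constant"]
        else:
            params[geo_type] = row.get(spec["source"])
    return params


# Made-up row using a region name from the dataset discussed in this thread.
row = {"admin1": "District Council Edinet", "comment": "n/a"}
params = lookup_params(attributes, row)
print(params)
```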

In case we decide to adopt the OS types as first class citizens in the FDP spec, we should use that mechanism instead.

timgdavies commented 8 years ago

Given this thread, frictionlessdata/specs#79, our experience with the OCDS location extension (lots of in-principle demand, but lots of chicken-and-egg challenges in understanding the granularity of modelling to aim for, and what different forms of analysis this might enable), and emerging work on Ag Investment data, I wonder if there might be scope to do some shared work on better understanding the different user stories, and structures of input data, for location?

For example, whilst schools and hospitals might have an easily described physical location, other analysis may want to know about the intended catchment area for those services, or about whether a service budget which has a broad geographical scope in general is allocated only to a particular kind of sub-region. Aside from the question of which gazetteers to use for location, being clear to publishers whether a location is a 'physical location' or a 'delivery area' etc. may be important.

I also wonder whether it is useful to encourage publishers to provide multiple levels of admin geography where they have it, as:

An analyst looking between countries will want just country codes
An analyst looking country wide, or at second level admin boundaries will want admin codes
A local analysis may well be happy to deal with ward names, and have local capacity to geocode these

Whilst in theory the higher levels could be inferred from the lower level, in practice we know this is very difficult, and for the global analyst relies on having the chimera of a robust and updated gazetteer covering all levels of admin geography.
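
To illustrate why publishing several levels side by side helps, here is a small sketch in which each analyst simply picks the column for their level and nothing has to be inferred. The column names, the amounts, and the function `total_by_level` are made up for illustration.

```python
from collections import defaultdict

# Made-up spending lines that carry two levels of admin geography each.
rows = [
    {"countryCode": "MD", "region": "District Council Soroca", "amount": 100},
    {"countryCode": "MD", "region": "District Council Soroca", "amount": 200},
    {"countryCode": "MD", "region": "District Council Edinet", "amount": 300},
]


def total_by_level(rows, level):
    """Sum amounts per value of the chosen geography column."""
    totals = defaultdict(int)
    for row in rows:
        key = row.get(level)
        if key is None:
            # Rows that lack this level are skipped rather than guessed at.
            continue
        totals[key] += row["amount"]
    return dict(totals)


by_country = total_by_level(rows, "countryCode")  # cross-country analyst
by_region = total_by_level(rows, "region")        # country-wide analyst
print(by_country)
print(by_region)
```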

akariv commented 8 years ago

@timgdavies I totally agree that the spec needs to have better means to describe location data. As you said, just putting a 'location' column with no context is not good enough for more in-depth analyses.

So, besides encouraging publishers to provide more detail in the raw data, we should also make the spec flexible enough to support data sets with missing data. This is achievable in two ways, I think:

Create a rich taxonomy for describing the existing data. To take your example, allow publishers to specify if a location column refers to a 'physical location' or a 'delivery area' etc.
Provide means for augmenting missing data using constant attributes or package metadata. This would remove the need to infer higher levels of location.
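
The second option could be sketched as follows, assuming the only package metadata consulted is the top-level countryCode, which may be a single code or an array. The function name `with_country` is hypothetical. When the package lists several countries, the sketch leaves the value missing instead of guessing.

```python
def with_country(location, package):
    """Fill in countryCode from package metadata when the row does not supply one."""
    if location.get("countryCode"):
        # A value from the model (source or constant) takes precedence.
        return location
    declared = package.get("countryCode")
    if isinstance(declared, str):
        candidates = [declared]
    else:
        candidates = list(declared or [])
    if len(candidates) == 1:
        return {**location, "countryCode": candidates[0]}
    # Zero or several candidates: ambiguous, so leave the location unchanged.
    return location


single = with_country({"region": "District Council Singerei"}, {"countryCode": "MD"})
ambiguous = with_country({"region": "District Council Singerei"}, {"countryCode": ["MD", "RO"]})
print(single)
print(ambiguous)
```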

Are there any taxonomies which we can take inspiration from, and might fit this purpose?

rufuspollock commented 8 years ago

@akariv my request here would still be more and more detailed user stories / walkthroughs so it is clear:

What people are trying to do
What might be needed to support that

pwalsh commented 7 years ago

Moving to https://github.com/frictionlessdata/datapackage-fiscal/issues/3

openspending / fiscal-data-package

On Location #140