open-data-toronto / ckanext-datastore-profiler

creates summaries for toronto open data datastore resources
GNU Affero General Public License v3.0

Dealing with geometry datatypes #18

Open tummala-hareesh opened 2 years ago

tummala-hareesh commented 2 years ago

image

mackeynichols commented 2 years ago

Ooof good question. Some thoughts:

  1. I agree that doing numeric profiling on lats and longs is not useful :+1: (which dataset is this from btw?)
  2. There are also "geometry" columns that contain geographic info that we should consider profiling somehow
  3. I wonder if there's a smart way for us to show (either in a graph, or in a statistic) geographic distribution of these kinds of attributes?
  4. I ALSO worry if this is something so complicated that it will add months of work when it might not be needed

Either way, I agree with your sentiment here, and I think we should talk about it Friday 👍

tummala-hareesh commented 2 years ago

I would like to use the geometry to create a choropleth map, for example US Starbucks store counts by state. image

To create something like the above, I would need the unique (lat, lng) pairs and their respective counts. Of course, we would also have to import the Toronto layer.
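A minimal sketch of the counting step, assuming rows arrive as dictionaries with "lat" and "long" keys. The function name, the key names, and the rounding precision are all hypothetical and not part of the profiler:

```python
from collections import Counter


def count_unique_points(rows, lat_key="lat", lng_key="long", precision=5):
    """Count how many rows share each (lat, lng) pair.

    Coordinates are rounded so that tiny floating-point differences
    do not split one location into several keys.
    """
    counts = Counter()
    for row in rows:
        lat, lng = row.get(lat_key), row.get(lng_key)
        if lat is None or lng is None:
            continue  # skip rows with missing coordinates
        key = (round(float(lat), precision), round(float(lng), precision))
        counts[key] += 1
    return counts


# Made-up sample rows, for illustration only
rows = [
    {"lat": 43.6532, "long": -79.3832},
    {"lat": 43.6532, "long": -79.3832},
    {"lat": 43.7001, "long": -79.4163},
    {"lat": None, "long": -79.4},
]
point_counts = count_unique_points(rows)
```

This only produces per-point counts. Turning them into a choropleth would still need the points assigned to areas from a boundary layer, which is not shown here.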

P.S: Commenting here to have something for our discussion on Friday.

mackeynichols commented 2 years ago

I love the idea - we'd want to do it where we have point data stored in a "geometry" column

Those "lat" and "long" columns are meant to be stored in a "geometry" column, but we know there are many datasets that have lat and long anyways ... so any spatial visualization we do should be on a "geometry" column. We'll also need to consider if/how we visualize line and polygon data.

The lat and long columns will, eventually, be put into "geometry" columns as Open Data cleans our catalog.
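As a rough sketch of what that cleanup could look like for a single row, assuming the "geometry" column holds GeoJSON (an assumption, the thread does not say how geometry is stored). The helper name and key names are hypothetical:

```python
def to_point_geometry(row, lat_key="lat", lng_key="long"):
    """Build a GeoJSON Point from separate lat/long values.

    Returns None when either coordinate is missing.
    Note that GeoJSON orders coordinates as [longitude, latitude].
    """
    lat, lng = row.get(lat_key), row.get(lng_key)
    if lat is None or lng is None:
        return None
    return {"type": "Point", "coordinates": [float(lng), float(lat)]}


# Made-up sample row, for illustration only
geometry = to_point_geometry({"lat": "43.6532", "long": "-79.3832"})
```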

mackeynichols commented 2 years ago

Hey @tummala-hareesh I'm thinking we should rename this issue to "Dealing with 'geometry' columns" and call it an enhancement

The 'lat' and 'long' columns being considered "numeric" is a different Data Quality issue that OD should deal with separately 😅

mackeynichols commented 2 years ago

On this, we should consider adding the following to profiler logic:

Adding analytics for lines and polygons will be too hardcore for a first release of a profile, IMO

mackeynichols commented 2 years ago

Hey I was looking at this again and, tbh, point-level analyses (seeing how many points are in a neighborhood, for example) might also be too hardcore for our first release.

Right now, the profiler says whether geometric data is point, line, or polygon. I am going to say that any profiling past that should be done in a second release
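For reference, a point/line/polygon check of this kind can be sketched as below. This is not the profiler's actual code. It assumes geometry values are GeoJSON objects or GeoJSON strings, and the function names are hypothetical:

```python
import json

# Map GeoJSON geometry types onto the three families mentioned above
GEOMETRY_FAMILIES = {
    "Point": "point",
    "MultiPoint": "point",
    "LineString": "line",
    "MultiLineString": "line",
    "Polygon": "polygon",
    "MultiPolygon": "polygon",
}


def geometry_family(value):
    """Return "point", "line", "polygon", or None if unrecognized."""
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except ValueError:
            return None  # not valid JSON
    if not isinstance(value, dict):
        return None
    return GEOMETRY_FAMILIES.get(value.get("type"))


def profile_geometry_column(values):
    """Return the sorted list of geometry families found in a column."""
    families = {geometry_family(v) for v in values}
    families.discard(None)
    return sorted(families)


# Made-up sample column, for illustration only
column = [
    '{"type": "Point", "coordinates": [-79.38, 43.65]}',
    {"type": "MultiPoint", "coordinates": [[-79.38, 43.65]]},
    "not json",
]
summary = profile_geometry_column(column)
```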

mackeynichols commented 2 years ago

Nvm, I'm reopening this as an enhancement so we don't forget to deal with it later.

However, I still believe what I said above: we shouldn't worry about more complex spatial profiling until we release something.