kuwala-io / kuwala

Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, and Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products, with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) high-resolution demographics data, b) points of interest from OpenStreetMap, and c) Google Popular Times.
https://kuwala.io
Apache License 2.0
787 stars, 52 forks

OSM-POI: Include brand property #25

Open mattigrthr opened 3 years ago

mattigrthr commented 3 years ago

OSM objects may include a brand or operator tag, which you can use to derive the brand of a POI.

The problem is that the values of those tags can be spelled differently across entities (e.g., "McDonalds", "Mc Donald's", or "McDonald's").
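To illustrate the spelling problem, here is a minimal sketch of a normalization step that collapses the three variants above into one comparison key. The function name `normalize_brand` is hypothetical and not part of the Kuwala codebase; real OSM data will contain cases that this simple rule does not cover.

```python
import re
import unicodedata


def normalize_brand(value: str) -> str:
    """Build a rough comparison key for a brand or operator tag value."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Case-fold, then remove whitespace, punctuation, and underscores.
    return re.sub(r"[\W_]+", "", without_accents.casefold())


variants = ["McDonalds", "Mc Donald's", "McDonald's"]
keys = {normalize_brand(v) for v in variants}
print(keys)
```

A key like this only handles differences in case, spacing, and punctuation. Typos or abbreviations would still need a fuzzy comparison.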

There is a repo that tries to unify the spelling across OSM: https://github.com/osmlab/name-suggestion-index/.

Alternatively, we could find a clean list of worldwide brand names and use string distance measures to match a POI to a brand.
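The string distance idea could look roughly like the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure; the brand list, the `best_brand` helper, and the 0.85 threshold are all assumptions made for illustration, not something decided in this issue.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Hypothetical clean brand list; in practice this would come from an
# external source such as the name-suggestion-index.
BRANDS = ["McDonald's", "Burger King", "Starbucks"]


def best_brand(raw: str, brands: List[str], threshold: float = 0.85) -> Optional[str]:
    """Return the most similar brand name, or None if nothing is close enough."""
    best_name, best_score = None, 0.0
    for brand in brands:
        # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, raw.lower(), brand.lower()).ratio()
        if score > best_score:
            best_name, best_score = brand, score
    return best_name if best_score >= threshold else None


print(best_brand("Mc Donald's", BRANDS))
print(best_brand("Shell", BRANDS))
```

The threshold trades false matches against missed ones and would need tuning on real tag values. Other measures such as Levenshtein or Jaro-Winkler distance could be swapped in with the same structure.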

mattigrthr commented 2 years ago

@IritaSee, we already parse the operator and brand tags from the OSM objects and include them in the Parquet files.

The Parquet files are processed in this file, after the osm-parquetizer pipeline has transformed the PBF files to Parquet: https://github.com/kuwala-io/kuwala/blob/master/kuwala/pipelines/osm-poi/src/Processor.py

The idea would be to find the best match of a brand or operator in the name-suggestion-index.

Those are the necessary steps:

mattigrthr commented 2 years ago

More details about how Spark UDFs are used are in the PR discussion: https://github.com/kuwala-io/kuwala/pull/69#issuecomment-1009823381

mattigrthr commented 2 years ago

After some initial tests with the brand and operator name matching, it turns out that including the matching directly in the OSM-POI pipeline would increase the runtime significantly. Therefore, we have decided to store the consolidated list of brand and operator names in a separate table in Postgres. That table can then be used later in transformation blocks, e.g., on a filtered set of POIs, which drastically reduces the runtime.
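A rough sketch of that "filter first, match later" approach follows. It uses an in-memory SQLite database purely as a stand-in for Postgres so that the example is self-contained. The table names (`brand_names`, `poi`), the columns, the sample rows, and the 0.8 cutoff are all hypothetical.

```python
import sqlite3
from difflib import get_close_matches

# SQLite stands in for Postgres here; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE brand_names (name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO brand_names VALUES (?)",
    [("McDonald's",), ("Starbucks",), ("Shell",)],
)
conn.execute("CREATE TABLE poi (id INTEGER PRIMARY KEY, category TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO poi VALUES (?, ?, ?)",
    [(1, "food", "Mc Donald's"), (2, "food", "Starbuks"), (3, "fuel", "Shel")],
)

# Load the consolidated brand list once.
brands = [row[0] for row in conn.execute("SELECT name FROM brand_names")]

# Match only the filtered subset instead of every POI in the pipeline.
filtered = conn.execute(
    "SELECT id, brand FROM poi WHERE category = ?", ("food",)
).fetchall()

matches = {}
for poi_id, raw_brand in filtered:
    candidates = get_close_matches(raw_brand, brands, n=1, cutoff=0.8)
    matches[poi_id] = candidates[0] if candidates else None

print(matches)
```

Because the fuzzy comparison runs only on the rows that survive the filter, its cost scales with the size of the subset rather than with the full OSM extract.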

Since the canvas development currently has a higher priority for the core team, this issue is up for grabs again.