heremaps / here-cli

A command-line interface to work with HERE XYZ Hub and other HERE APIs
https://www.here.xyz/
MIT License

join csv and assign feature ID based on property search #233

Open burritojustice opened 4 years ago

burritojustice commented 4 years ago

If I have a CSV and want to join its properties to relevant geometries in a space via virtual spaces, it's not always going to have the same feature ID.

However, there may be values in a column of the CSV that match a property of features in the geometry space. If so, we could use property search to find a matching feature's ID, and then assign that value as the feature ID of that CSV row upon upload (and refresh). Then the resulting virtual spaces will work.

here xyz join spaceID -f my.csv --keys propInCSV,propInSpace

If the geometry space uses geoids as feature IDs, and the CSV and the space share a meaningful property such as a name, the matching geoid would also become the feature ID of the new feature created from each CSV row.

csv:

country,population (millions)
Germany,86
Canada,35
India,1250

space with geometries:

   {
       "type": "FeatureCollection",
       "features": [
          {
           "id": "DE",
           "type": "Feature",
           "geometry": …,
           "properties": {
               "name": "Germany"
           }
          },
          {
           "id": "CA",
           "type": "Feature",
           "geometry": …,
           "properties": {
               "name": "Canada"
           }
          },
          {
           "id": "IN",
           "type": "Feature",
           "geometry": …,
           "properties": {
               "name": "India"
           }
          }
       ]
   }

here xyz join spaceID -f my.csv --keys country,name

So in this case, as the CLI uploads the first row, it would search for Germany in the space with geometries, grab the feature ID DE, and assign it to the Germany feature in the new space.
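A rough sketch of that lookup, in Python purely for illustration. It assumes the geometry features are already in memory instead of being fetched through property search, and `build_id_index` and `assign_ids` are hypothetical helper names, not part of here-cli:

```python
import csv
import io


def build_id_index(features, space_prop):
    """Map each feature's value for space_prop to that feature's ID."""
    index = {}
    for feature in features:
        value = feature.get("properties", {}).get(space_prop)
        if value is not None:
            # Keep the first feature seen if a value occurs more than once.
            index.setdefault(value, feature["id"])
    return index


def assign_ids(rows, csv_prop, index):
    """Split CSV rows into (matched rows with an 'id' added, unmatched rows)."""
    matched, unmatched = [], []
    for row in rows:
        feature_id = index.get(row.get(csv_prop))
        if feature_id is None:
            unmatched.append(row)
        else:
            matched.append({**row, "id": feature_id})
    return matched, unmatched


# Geometries are left out; only the ID and the join property matter here.
features = [
    {"id": "DE", "type": "Feature", "properties": {"name": "Germany"}},
    {"id": "CA", "type": "Feature", "properties": {"name": "Canada"}},
    {"id": "IN", "type": "Feature", "properties": {"name": "India"}},
]
csv_text = "country,population (millions)\nGermany,86\nCanada,35\nIndia,1250\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

matched, unmatched = assign_ids(rows, "country", build_id_index(features, "name"))
print([row["id"] for row in matched])  # ['DE', 'CA', 'IN']
```

Returning the unmatched rows separately would let the CLI report which rows did not find a feature instead of silently dropping them.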

This would make it more important to save the last few CLI commands in the client block, as referenced in #219: the options could get complex, and if they are not reproduced correctly, updating the CSV space would fail and the virtual space would not update.

burritojustice commented 4 years ago

Here's another example: the state of Georgia publishes its COVID-19 statistics by county, but the CSV identifies each county only by name.

county_resident,Positive,DEATHS,HOSPITALIZATION,case_rate
Appling,417,15,53,2246.65
Atkinson,227,2,30,2725.09
Bacon,345,5,28,3025.25
Baker,45,3,12,1444.16
Baldwin,738,35,84,1661.11
Banks,185,3,28,925.83
Barrow,760,29,137,879.8

I would like to join this with county geometries I already have from the US census from my "library" of counties in shMPSR4R, but those use GEOID10 as the feature ID so I can't use virtual spaces as is.

However, if I select the matching keys from each data source:

    here xyz join shMPSR4R -f countycases.csv --keys county_resident,NAME10

This would upload the CSV and each row, through the magic of property search, would get the same ID as the matching feature in the "library". Then join would return a virtual space ID.

We'd also need to think about how best to update these data spaces, since a plain upload is not going to know how to look up and assign the feature IDs. Perhaps we need to specify a target space if it's not the first time we've used the space.

    here xyz join shMPSR4R -f countycases.csv --keys county_resident,NAME10 --target CSVspaceID

burritojustice commented 4 years ago

We should also make the search case-insensitive. I just saw something that didn't get matched because of a capitalized 'Of'.
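One way to get that behaviour is to normalize both sides before comparing. This is only a sketch in Python: `normalize` and `find_id` are hypothetical names, and the place name and ID below are made up for illustration.

```python
def normalize(value):
    """Case-fold and collapse whitespace so 'Isle Of Wight' equals 'Isle of Wight'."""
    return " ".join(str(value).split()).casefold()


def find_id(name, id_by_name):
    """Look up name in a {property value: feature ID} mapping, ignoring case."""
    folded = {normalize(key): feature_id for key, feature_id in id_by_name.items()}
    return folded.get(normalize(name))


# Made-up example: the CSV capitalizes 'Of', the geometry space does not.
print(find_id("Isle Of Wight", {"Isle of Wight": "example-id"}))  # example-id
```

For a large space it would be cheaper to build the folded mapping once and reuse it across rows instead of rebuilding it on every lookup.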

burritojustice commented 4 years ago

If we don't find a match, let's try a couple of things:

1) search for the ID as a substring in the property field (and vice versa): if the source ID is China, look for it in People's Republic of China in the country column. If the source ID is Madison County and the target property is Madison, check whether Madison is a substring.

2) check the string distance -- I know of stringdist in R and fuzzywuzzy in Python, not sure what the Node equivalent is -- maybe one of these?

https://www.npmjs.com/package/fuzzball
https://www.npmjs.com/package/string-similarity
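The two fallbacks could be chained roughly like this. The sketch is in Python and uses the standard library's `difflib.SequenceMatcher` as a stand-in for whichever Node package ends up being chosen; `fallback_match` is a hypothetical name and the `0.8` cutoff is an arbitrary, untuned assumption.

```python
from difflib import SequenceMatcher


def fallback_match(value, candidates, min_ratio=0.8):
    """Return the best candidate for value, or None if nothing is close enough.

    Step 1: case-insensitive substring test in both directions.
    Step 2: similarity ratio, accepted only at or above min_ratio.
    """
    needle = value.casefold()
    if not needle:
        return None  # an empty string is a substring of everything

    for candidate in candidates:
        hay = candidate.casefold()
        if hay and (needle in hay or hay in needle):
            return candidate

    best, best_ratio = None, 0.0
    for candidate in candidates:
        ratio = SequenceMatcher(None, needle, candidate.casefold()).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best if best_ratio >= min_ratio else None


print(fallback_match("China", ["India", "People's Republic of China"]))
print(fallback_match("Madison County", ["Monroe", "Madison"]))
```

Substring matching can produce false positives on short names, so it is probably worth reporting which rows were matched by a fallback so they can be checked by hand.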