openaddresses / machine

Scripts for running OpenAddresses on a complete data set and publishing the results.
http://results.openaddresses.io/
ISC License

ISSUE-766 | Added new attribute method #767

Closed by macieg 4 years ago

iandees commented 4 years ago

Have you tested this with real data? I'm surprised that we'd get multiple values for a single column back like that.

macieg commented 4 years ago

@iandees It's not about columns in CSV files - it's about nodes in XML files, as described here - https://github.com/openaddresses/machine/issues/766

macieg commented 4 years ago

@iandees to be more specific - some time ago I updated the cache with Polish addresses, because it was outdated. Now I'd like to get rid of that cache and use a frequently updated source of data.

That requires adding this additional attribute method. Apart from this PR, I'll also need to make changes in the main repository and the documentation.

Maybe I'm wrong, but is this the place where I should start?

I'm happy to give a more detailed explanation if needed :)

iandees commented 4 years ago

I understand what the source data looks like, but I'm pretty sure that as that data works its way through our pipeline it will lose multiple values and end up with a single string, not an array of strings. This is why I had asked if you tried this change outside of the unit test.

macieg commented 4 years ago

@iandees - I've tried.

I was doing some experiments with another task:

https://github.com/openaddresses/openaddresses/pull/4668/commits/b14251c6ac20d2bdcba7296ebcfa821705310fa8

If we take a look at the resulting file, we'll see:

LON,LAT,NUMBER,STREET,UNIT,CITY,DISTRICT,REGION,POSTCODE,ID,HASH
15.9878829,54.0127764,30,Kochanowskiego,,Białogard,"['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']","['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']",78-200,PL.ZIPIN.1422.EMUiA_05e9f97a-c860-43ff-b7f1-e6fcd229a7c3,48b297bfd8452a34
15.9832874,54.008211,9,Ludowa,,Białogard,"['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']","['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']",78-200,PL.ZIPIN.1422.EMUiA_05f08ec1-6eaf-484a-aca7-6d0f69c234de,0fe4030acab1328a
15.9705208,54.0123932,14,Królowej Jadwigi,,Białogard,"['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']","['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']",78-200,PL.ZIPIN.1422.EMUiA_05f68238-99d8-4990-a21d-41cd2cabbf85,bf6b1980190d2eaa
16.0040907,54.012632,2,Gryfitów,,Białogard,"['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']","['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']",78-200,PL.ZIPIN.1422.EMUiA_060b6d7e-c2a1-401e-8eb2-eada4e0483c7,9aeb025523996d9

"['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']"

There are quotes around it, but it looks to me like it was treated as an array before being written to the file.

Am I wrong?

I haven't tried to do it on my local computer. I can do that if needed. :)

iandees commented 4 years ago

I think that might be the text coming out of OGR. But let's try it and see what happens!
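
A minimal sketch of why the CSV output alone can't settle this, using only Python's standard `csv` module. The helper name `write_cell` is made up for illustration and is not part of the machine codebase. `csv.writer` calls `str()` on non-string values, so a real list and a string that merely looks like a list produce the same quoted cell:

```python
import csv
import io

values = ['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']


def write_cell(value):
    """Write a single-cell CSV row and return it as text (hypothetical helper)."""
    buf = io.StringIO()
    csv.writer(buf).writerow([value])
    return buf.getvalue().strip()


from_list = write_cell(values)         # a real Python list
from_string = write_cell(str(values))  # a string that only looks like a list

print(from_list)
print(from_list == from_string)  # both cells are byte-for-byte identical
```

So the quoted cell in the result file is consistent with either explanation, and only inspecting the value inside the pipeline can tell them apart.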

macieg commented 4 years ago

@iandees - you were right, it was just a string :/ Fix below, not the prettiest code :)

https://github.com/openaddresses/machine/pull/768
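
For illustration only, and not the code from #768: one way to turn such a string back into separate values, assuming the text has the Python-list-literal form seen in the CSV above. The function name `parse_list_text` is hypothetical. It uses the standard library's `ast.literal_eval` and falls back to a one-element list for ordinary strings:

```python
import ast


def parse_list_text(text):
    """Return the items of a list-like string, or [text] if it is a plain value.

    Hypothetical helper; assumes list-valued fields arrive as text such as
    "['Polska', 'zachodniopomorskie']".
    """
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Ordinary text such as a city name is not a Python literal.
        return [text]
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    # Some other literal (for example a number): keep the original text.
    return [text]


parts = parse_list_text("['Polska', 'zachodniopomorskie', 'białogardzki', 'Białogard']")
print(parts[1])                      # second item of the parsed list
print(parse_list_text("Białogard"))  # plain strings pass through unchanged
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text that comes from an external data source.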