shanecoughlan / data-twist

Experimental script to twist Open Data into new shapes

Data Twist currently fails when trying to import duplicate OpenStreetMap data. #4

Closed shanecoughlan closed 11 years ago

shanecoughlan commented 11 years ago

Data Twist currently fails when trying to import duplicate OpenStreetMap data. For example, if two entries in the data have the same ID, the import fails.

shanecoughlan commented 11 years ago

Kana is working on an automated duplicate detection and removal feature for the script. We hope to go live with it tomorrow.

shanecoughlan commented 11 years ago

It looks like this bug was squashed. For example, a recent run against data from Matsue (using version 0.9 of Data Twist) produced the following output:

I found 36 duplicate entries in the input file.
I wrote 249 locations to the output file.
I processed a total of 285 locations during my analysis.

The remaining issue is how the deduplication works. I understand from Kana that it is pretty brutal at the moment, and we might be inaccurately discarding some data as duplicates. For example, a run against all of Tokyo using Data Twist 0.8 produced the following output:

same data : 83
write data : 9073
all data : 9156

A run against all of Japan produced (after a long, long processing time) 194,776 locations in the output file, but a suspiciously high number of duplicates: over 17,000. It's something we need to look into.
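One way to check whether entries are being discarded inaccurately would be to separate exact duplicates from entries that only share an ID. The Python below is a minimal sketch of that idea, not the actual Data Twist code. The record layout (dictionaries with an `id` key), the `deduplicate` function, and the sample data are all assumptions made up for illustration.

```python
def deduplicate(locations):
    """Split records into kept entries, exact duplicates, and ID conflicts.

    Assumes each record is a dict with an "id" key (hypothetical layout).
    """
    unique = {}           # first record seen for each ID, in input order
    exact_duplicates = 0  # same ID and identical content: safe to drop
    conflicts = []        # same ID but different content: needs review

    for record in locations:
        key = record["id"]
        if key not in unique:
            unique[key] = record
        elif unique[key] == record:
            exact_duplicates += 1
        else:
            conflicts.append(record)

    return list(unique.values()), exact_duplicates, conflicts


# Made-up sample data, not real OpenStreetMap entries.
sample = [
    {"id": 1, "name": "Location A", "lat": 35.0, "lon": 133.0},
    {"id": 1, "name": "Location A", "lat": 35.0, "lon": 133.0},  # exact copy
    {"id": 2, "name": "Location B", "lat": 35.1, "lon": 133.1},
    {"id": 2, "name": "Location C", "lat": 35.2, "lon": 133.2},  # ID clash only
]

kept, dupes, conflicts = deduplicate(sample)
print(f"exact duplicates: {dupes}")
print(f"ID conflicts to review: {len(conflicts)}")
print(f"locations written: {len(kept)}")
print(f"locations processed: {len(sample)}")
```

If the conflict count is large on the all-Japan run, that would suggest a good share of the 17,000+ "duplicates" are distinct entries sharing an ID rather than true copies.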