shanecoughlan / data-twist

Experimental script to twist Open Data into new shapes

Data Twist duplication detection might be too strong #7

Open shanecoughlan opened 11 years ago

shanecoughlan commented 11 years ago

This is related to issue #4, a bug where duplicated data in the output SQL files prevented imports into the WordPress MySQL database. In other words, larger data samples broke our Data Twist outputs.

It looks like that bug has been squashed. For example, a recent run against data from Matsue using Data Twist 0.9 produced the following output:

I found 36 duplicate entries in the input file.
I wrote 249 locations to the output file.
I processed a total of 285 locations during my analysis.

The remaining issue is how the duplicate detection works. I understand from Kana that it is pretty brutal at the moment, and we might be incorrectly discarding some data as duplicates. For example, a run against all of Tokyo using Data Twist 0.8 produced the following output:

same data : 83
write data : 9073
all data : 9156

A run against all of Japan produced (after a long, long processing time) 194,776 locations in the output file, but a suspiciously high number of duplicates: over 17,000. It's something we need to look into.
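To make the concern concrete, here is a minimal Python sketch of how the choice of comparison key changes what counts as a duplicate. It is not Data Twist's actual logic, which this issue does not describe. The `dedupe` function, the field names `name`, `lat` and `lon`, and the sample rows are all hypothetical and made up for illustration.

```python
def dedupe(rows, key_fields):
    """Keep the first row seen for each key; return (kept, dropped)."""
    seen = set()
    kept, dropped = [], []
    for row in rows:
        # Normalise whitespace and case so trivial differences do not hide duplicates.
        key = tuple(str(row[f]).strip().lower() for f in key_fields)
        if key in seen:
            dropped.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, dropped


# Made-up sample rows, not real Data Twist input.
rows = [
    {"name": "City Library", "lat": "35.4681", "lon": "133.0486"},
    {"name": "City Library", "lat": "35.4681", "lon": "133.0486"},  # true duplicate
    {"name": "City Library", "lat": "35.6895", "lon": "139.6917"},  # same name, different place
    {"name": "Central Park", "lat": "35.4700", "lon": "133.0500"},
]

# Assumption: a "brutal" check compares on name only.
loose_kept, loose_dropped = dedupe(rows, ["name"])
# A stricter key also requires the coordinates to match.
strict_kept, strict_dropped = dedupe(rows, ["name", "lat", "lon"])

print(len(loose_dropped), len(strict_dropped))  # prints "2 1"
```

With the name-only key, the second "City Library" in a different location is discarded even though it is a distinct place. If Data Twist's check behaves anything like the loose variant, that could explain an inflated duplicate count on a nationwide run, where common names repeat across cities.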