iMEdD-Lab / open-data

Datasets created by iMEdD Lab that are publicly available

Rapid test locations are well outside of Greece #9

Closed: akosiaris closed this issue 2 years ago

akosiaris commented 3 years ago

Hi,

First, let me thank you profoundly for publishing this clearly painstakingly constructed dataset in the open and keeping it updated. It has been very useful as an alternative to the limited, non-machine-readable official releases.

As for this issue: I've recently been playing around with the rapid_tests part of the dataset. While creating a geographical visualization of it, I noticed that many of the data points lie well outside the geographical limits of Greece, some even in the Western Hemisphere. The screenshot below shows this.

image

I verified this by grepping the dataset for points in the Western Hemisphere. I picked those only because they are extremely easy to search for: they all have a negative longitude, so they match the textual pattern ,-<number>, i.e.

$ grep ',-[[:digit:]]' rapid_tests.csv | wc -l
169

So 169 data points are in the wrong hemisphere. Partially deduplicating based on the actual place (e.g. "ΡΟΔΟΥ, ΡΟΔΟΥ, ΠΛΑΤΕΙΑ ΣΑΝ ΦΡΑΤΖΕΣΚΟ") gives us 58 distinct locations:

$ grep ',-[[:digit:]]' rapid_tests.csv | cut -d, -f4,5,6,13,14 | sort -u |wc -l
58

The top 10, ordered by number of data points, are the following:

$ grep ',-' rapid_tests.csv | cut -d, -f4,5,6,13,14 | sort | uniq -c | sort -rn|head
     28 Π.Ε. ΣΥΡΟΥ,ΣΥΡΟΥ,ΑΘΛΗΤΙΚΟ ΚΕΝΤΡΟ,45.1584051,-93.2261219
     27 Π.Ε. ΖΑΚΥΝΘΟΥ,ΖΑΚΥΝΘΟΥ,ΙΚΑ,34.303205,-77.873179
     18 Π.Ε. ΖΑΚΥΝΘΟΥ,ΖΑΚΥΝΘΟΥ,ΛΙΜΑΝΙ,42.3387933,-70.9745339
      8 Π.Ε. ΘΕΣΣΑΛΟΝΙΚΗΣ,ΘΕΣΣΑΛΟΝΙΚΗΣ,ΛΙΜΑΝΙ,42.3387933,-70.9745339
      7 Π.Ε. ΘΕΣΣΑΛΟΝΙΚΗΣ,ΘΕΣΣΑΛΟΝΙΚΗΣ,ΛΙΜΆΝΙ,33.754185,-118.216458
      6 Π.Ε. ΡΟΔΟΥ,ΡΟΔΟΥ,ΠΛΑΤΕΙΑ ΣΑΝ ΦΡΑΤΖΕΣΚΟ,37.8199286,-122.4782551
      5 Π.Ε. ΡΟΔΟΠΗΣ,ΡΟΔΟΠΗΣ,ΔΗΜΟΤΙΚΗ ΒΙΒΛΙΟΘΗΚΗ,40.8888845,-73.8408546
      5 Π.Ε. ΖΑΚΥΝΘΟΥ,ΖΑΚΥΝΘΟΥ,ΠΕΡΙΦΕΡΕΙΑ,45.2134808,-93.3288325
      3 Π.Ε. ΠΙΕΡΙΑΣ,ΠΙΕΡΙΑΣ,ΠΛΑΤΕΙΑ ΔΗΜΑΡΧΕΙΟΥ,42.3601283,-71.0593203
      3 Π.Ε. ΛΕΣΒΟΥ,ΛΕΣΒΟΥ,ΑΓΟΡΑ,37.09024,-95.712891

As you can see, there is still some duplication depending on whether some words carry an accent (e.g. the Thessaloniki port, which, interestingly, gets different geographical coordinates depending on whether "ΛΙΜΑΝΙ" is accented or not), but that's arguably the lesser of the problems.
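The accent-based duplication could be reduced by normalizing the place names before counting. The following is only a sketch: strip_tonos is a hypothetical helper name, and it assumes the names are upper-case Greek in UTF-8 and that only the seven capitals with tonos need mapping (capitals with dialytika are left untouched).

```shell
# strip_tonos (hypothetical helper): replace accented Greek capital vowels
# with their unaccented forms, reading stdin and writing stdout.
# Plain s///g substitutions are used so the result does not depend on the locale.
strip_tonos() {
    sed -e 's/Ά/Α/g' -e 's/Έ/Ε/g' -e 's/Ή/Η/g' -e 's/Ί/Ι/g' \
        -e 's/Ό/Ο/g' -e 's/Ύ/Υ/g' -e 's/Ώ/Ω/g'
}
```

It could be slotted into the pipeline above, counting on the name columns only, since the accented and unaccented variants carry different coordinates: grep ',-[[:digit:]]' rapid_tests.csv | cut -d, -f4,5,6 | strip_tonos | sort -u | wc -l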

I have done no work to identify wrong data points in the Eastern Hemisphere, as they require a slightly more involved approach: making sure that latitude and longitude fall within the administrative geographical boundaries of Greece. But as you can tell from the screenshot, there are data points in Cyprus, Egypt, Italy, Hungary, Turkey and Germany, all of them clearly incorrect. There is also a sizable number of data points in the Gulf of Guinea, off the coast of Africa, but that's presumably because geographical coordinate discovery failed and returned a latitude and longitude of 0.0,0.0.
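A first step towards that more involved check could be a coarse bounding-box filter. This is a rough sketch under stated assumptions, not a proper boundary test: outside_greece is a hypothetical function name, latitude and longitude are taken to be columns 13 and 14 (as in the cut commands above), no field is assumed to contain a quoted comma, and the box of 34 to 42 degrees north and 19 to 30 degrees east only approximates the extent of Greece.

```shell
# outside_greece FILE (hypothetical helper): print rows whose coordinates are
# 0,0 or fall outside a rough bounding box around Greece, prefixed with a reason.
# Assumed: lat/lon in columns 13 and 14; box limits are approximate.
outside_greece() {
    awk -F, -v lat_col=13 -v lon_col=14 '
        # Skip the header and any row without numeric coordinates.
        $lat_col !~ /^-?[0-9]+(\.[0-9]+)?$/ { next }
        $lon_col !~ /^-?[0-9]+(\.[0-9]+)?$/ { next }
        {
            lat = $lat_col + 0
            lon = $lon_col + 0
            if (lat == 0 && lon == 0)
                reason = "null-island"      # likely a failed geocoding lookup
            else if (lat < 34 || lat > 42 || lon < 19 || lon > 30)
                reason = "out-of-box"
            else
                next                        # inside the box: not reported
            print reason "," $0
        }
    ' "$1"
}
```

Running outside_greece rapid_tests.csv | cut -d, -f1 | sort | uniq -c would give a count per reason. The box still covers parts of neighbouring countries (western Turkey, for instance), so points there would slip through; catching them would need a point-in-polygon test against the actual borders.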

I have no idea how the latitude and longitude are generated and whether they are part of the original dataset or are secondary data, so I don't know if it is fixable.

In any case, I thought I should let you know.

Many thanks again!

troboukis commented 2 years ago

I'm so sorry for the late reply! Unfortunately, I only just noticed that this issue had been opened. The geolocation is automated: the code is fed the area, country, and address provided by EODY. I'll double-check what's going on with the code and will get back to you ASAP.