phinik / LovelyLLamas


20.11 Goal 3 (Pre-Requisite) - Data Collection Issues #13

Closed Ace-Of-Snakes closed 3 hours ago

Ace-Of-Snakes commented 3 hours ago

After enlarging the database of cities to 113k entries, a new approach is needed for crawling them every day.

Ace-Of-Snakes commented 3 hours ago

The finalised approach was to use Python libraries such as pytz to rescrape the 113k data points for latitude and longitude and determine their timezones (this took about 4 hours). Once that was finished, the UTC offset was calculated for each timezone (for example, Berlin = UTC+1). The big database was then split on the integer UTC offset, spanning [UTC-10 : UTC+14], which created roughly 24 new CSVs. The concurrent script was rewritten to detect which UTC offset currently has the local time 00:00 (important for data reasons) and is run every hour on the server accordingly. A rough sketch of both steps is below.
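A minimal sketch of the two steps, assuming the cities CSV has `lat`/`lng` columns; `timezonefinder` and `pandas` are assumptions here (the comment only names pytz), and the file and column names are hypothetical:

```python
from datetime import datetime, timezone

import pandas as pd
import pytz
from timezonefinder import TimezoneFinder  # assumption: used for the lat/lng -> timezone lookup


def utc_offset_hours(tz_name: str, ref: datetime) -> int:
    """Integer UTC offset of an IANA timezone at a reference time (e.g. Europe/Berlin -> +1)."""
    return int(pytz.timezone(tz_name).utcoffset(ref).total_seconds() // 3600)


def split_by_offset(csv_path: str) -> None:
    """Split the big cities CSV into one CSV per integer UTC offset."""
    tf = TimezoneFinder()
    ref = datetime(2024, 1, 1)  # fixed reference date so offsets stay stable across runs
    df = pd.read_csv(csv_path)

    # Look up the IANA timezone for each city's coordinates, then its UTC offset.
    df["timezone"] = df.apply(lambda r: tf.timezone_at(lat=r["lat"], lng=r["lng"]), axis=1)
    df = df.dropna(subset=["timezone"])
    df["utc_offset"] = df["timezone"].apply(lambda tz: utc_offset_hours(tz, ref))

    # Roughly 24 files, spanning UTC-10 .. UTC+14.
    for offset, group in df.groupby("utc_offset"):
        sign = "+" if offset >= 0 else "-"
        group.to_csv(f"cities_utc{sign}{abs(offset)}.csv", index=False)


def offset_at_midnight(now_utc: datetime) -> int:
    """UTC offset whose local time is currently 00:xx (intended to be run hourly, e.g. via cron)."""
    offset = (-now_utc.hour) % 24                     # local midnight when offset == -UTC_hour (mod 24)
    return offset if offset <= 14 else offset - 24    # map 15..23 down to -9..-1


if __name__ == "__main__":
    split_by_offset("cities_113k.csv")                     # one-off preprocessing
    print(offset_at_midnight(datetime.now(timezone.utc)))  # which shard to scrape this hour
```

One subtlety: UTC+14 and UTC-10 reach local midnight at the same UTC hour, so even though the offset range spans 25 integer values, the hourly scheduler only ever needs 24 distinct slots per day.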