simonw / git-history

Tools for analyzing Git history using SQLite
Apache License 2.0

Defining composed ids considering new lines as different items #58

Open mgaitan opened 2 years ago

mgaitan commented 2 years ago

I'm a newbie to the Datasette ecosystem and I'm particularly amazed by the git-scraping technique. Thanks Simon for sharing it!

I need help defining a composed (composite) id for the rows of this CSV, where I'm tracking power outage events in the Buenos Aires metropolitan area every 20 minutes.

https://github.com/OpenDataCordoba/cortes_enre/blob/main/cortes_enre.csv

My problem is that there is no clear ID for each event, and I would like to track how each event changes over time.

Consider this recent commit https://github.com/OpenDataCordoba/cortes_enre/commit/b3cde1c1d3b27dc0a76249d0025e3cbe68d914ed

Here it seems I could use all the columns except the last two as a composed id:

latitud,longitud,nn,tipo,empresa,partido,localidad,subestacion,alimentador

The columns afectados (affected users) and normalizacion estimada (estimated time to normalization) can then change over the next few updates, but eventually the line will be deleted.

The problem is that the composed id basically describes the "place" where the outage is happening, and in the future a totally different, unrelated event could occur in the same place.

So, how could I distinguish different events in the same place? I'm wondering if there is a way to treat it as a new item when the composed id appears again (i.e. the commit is adding the line rather than updating an existing one).
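
To make the rule concrete, here is a minimal sketch of what "a reappearing id is a new item" could mean. It is plain Python and does not rely on any git-history feature. The function name `assign_event_ids`, the `#n` suffix and the idea of passing in one collection of place keys per commit are all assumptions made for illustration.

```python
def assign_event_ids(snapshots):
    """Give each place key a new event number whenever it reappears.

    `snapshots` is an ordered list (oldest first) of collections of place
    keys, one collection per commit. This input shape is hypothetical.
    Returns one dict per snapshot, mapping place key -> event id.
    """
    counters = {}  # place key -> how many distinct events seen so far
    active = {}    # place key -> event id, for keys in the previous snapshot
    result = []
    for keys in snapshots:
        current = {}
        for key in keys:
            if key in active:
                # Line was present in the previous commit: same event.
                current[key] = active[key]
            else:
                # Line was added in this commit: start a new event.
                counters[key] = counters.get(key, 0) + 1
                current[key] = f"{key}#{counters[key]}"
        active = current
        result.append(current)
    return result
```

With this rule, a place that disappears from one commit and comes back in a later one gets a fresh suffix, so the two outages stay separate. The cost is that a line dropped for a single scrape by mistake would also be split into two events.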

simonw commented 2 years ago

This is really difficult!

One idea: you could take the yyyy-mm - the year and month - and use those as part of the ID. This would at least give you a unique new ID for each location for each month.

The two downsides to this are that if an outage starts on August 31st and continues to September the 1st it would be treated as two separate outages. And if an outage finishes September 2nd and then a new one starts on September 28th in the same location they would be treated as the same outage.

But it may be the best you can do in this situation?
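
A rough sketch of the year-and-month idea, assuming the timestamp of each snapshot (for example the commit time) is available to the caller. The helper name `monthly_id`, the short SHA-1 digest and the `|` separator are illustrative choices, not something git-history prescribes. The column list is the one proposed earlier in this issue.

```python
import hashlib
from datetime import datetime

# Columns proposed in the issue as the "place" part of the id.
PLACE_COLUMNS = [
    "latitud", "longitud", "nn", "tipo", "empresa",
    "partido", "localidad", "subestacion", "alimentador",
]


def monthly_id(row, snapshot_time):
    """Hypothetical helper: yyyy-mm of the snapshot plus a hash of the place.

    `row` is a dict keyed by CSV column name; `snapshot_time` is a datetime
    supplied by the caller (assumed here to be the commit timestamp).
    """
    place = "|".join(str(row[column]) for column in PLACE_COLUMNS)
    digest = hashlib.sha1(place.encode("utf-8")).hexdigest()[:12]
    return f"{snapshot_time:%Y-%m}-{digest}"


# Made-up sample row, only to show how the helper is called.
sample = {column: "x" for column in PLACE_COLUMNS}
august = monthly_id(sample, datetime(2022, 8, 31))
september = monthly_id(sample, datetime(2022, 9, 1))
```

Here `august` and `september` differ even though the place is identical, which is the first downside described above. Two snapshots of the same place taken on September 2nd and September 28th would share one id, which is the second.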