noi-techpark / bdp-commons

As a traffic events Data Consumer, I would like duplicate event entries to be removed #599

Closed rcavaliere closed 1 year ago

rcavaliere commented 1 year ago

We have duplicate events for the Province BZ Traffic Events Data Collector.

For example, check this query:

select * from intimev2.event where origin = 'PROVINCE_BZ' and category = 'evento eccezionale - caso particolare | Sonderfälle' and event_interval = '["2022-12-23 00:00:00",)' order by id desc

We obtain 4 events, but we should have just 2.

We need to investigate why 4 events are created instead of 2, and fix the problem so that each event has exactly one entry. If an event is updated as it evolves, we should have multiple metadata records associated with it, not multiple events in the events table. We could rethink how the data is currently stored in the Open Data Hub.

dulvui commented 1 year ago

I just noticed that the "messageType" field is different. One record has the type "Trasporti pubblici - Öffentliche Verkehrsmittel" and the other has "Situazione attuale - Aktuelle Lage".

If you run this query you can see that the messageType in the metadata of the two similar records is different.

select e.name, m.json->'messageTypeDescIt',m.json->'messageTypeDescDe',m.json->'messageTypeId'
from event e
join metadata m on e.meta_data_id = m.id 
where e.origin = 'PROVINCE_BZ'
and e.category = 'evento eccezionale - caso particolare | Sonderfälle'
and e.event_interval = '["2022-12-23 00:00:00",)'
order by e.id desc

Is the current behavior correct, or should I combine both metadata records so that we have only one record? We probably need to discuss how to combine them.

rcavaliere commented 1 year ago

@dulvui ok, at least this explains the situation, and it's good to hear that we don't have issues in the Data Collector. On the other hand, as you say, this might not be the result we want. Let me investigate further, then let's decide how to proceed.

rcavaliere commented 1 year ago

@dulvui let's discuss this together. We need to look more closely at the logic by which the Data Collector considers an event unique. If I remember correctly, the key point is how the value events_series_uuid is calculated.

dulvui commented 1 year ago

@rcavaliere Okay, before our meeting I will check in the data collector how this uuid is composed.

rcavaliere commented 1 year ago

@dulvui I think the bug is still present. Check for example through analytics, and set this configuration:

Screenshot from 2023-03-20 10-15-39

For example, I see the event "Bei Gfrill (km 9,950 - km 10,050)" 6 times! Can you find these records in the database and understand why we have 6 records for the same event?

dulvui commented 1 year ago

@rcavaliere I checked and there are some differences between the records. The fields messageId, acutalMail and publisherDateTime differ across these 3 records. The messageId field is used in the uuid, so removing it from the uuid might solve the problem.

Should I change the data collector so that these events get merged into one, by removing messageId from the uuid?

rcavaliere commented 1 year ago

@dulvui yes, let's try this!

dulvui commented 1 year ago

@rcavaliere I checked again, and the change I made 3 weeks ago (31.03.2023), removing messageId from the uuid field, worked. There are no more duplicate events if you query events starting from 01.04.2023. The fields that now compose the uuid are beginDate, endDate, longitude and latitude, so a new event is created only if one of these fields changes.
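
To illustrate why dropping messageId merges the records, here is a minimal Python sketch. It is not the data collector's actual code: the namespace, the `TrafficMessage` class, the `event_uuid` helper and the sample values are all hypothetical, and it simply assumes the uuid is derived deterministically from the listed fields.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical namespace, chosen only for this illustration.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/province-bz-events")


@dataclass(frozen=True)
class TrafficMessage:
    """Hypothetical subset of the fields delivered for one traffic message."""
    message_id: int
    begin_date: str
    end_date: Optional[str]
    longitude: float
    latitude: float


def event_uuid(msg: TrafficMessage, include_message_id: bool = False) -> uuid.UUID:
    """Derive a deterministic uuid from the fields that identify an event."""
    parts = [msg.begin_date, msg.end_date, msg.longitude, msg.latitude]
    if include_message_id:
        # Old behavior (assumed): every re-published message gets its own uuid.
        parts.append(msg.message_id)
    key = "|".join("" if p is None else str(p) for p in parts)
    return uuid.uuid5(NAMESPACE, key)


# Two messages describing the same event, published with different message ids.
first = TrafficMessage(1001, "2023-04-01T00:00:00", None, 11.25, 46.45)
update = TrafficMessage(1002, "2023-04-01T00:00:00", None, 11.25, 46.45)
# Same event, but with a slightly shifted position.
moved = TrafficMessage(1003, "2023-04-01T00:00:00", None, 11.25, 46.46)

print(event_uuid(first, include_message_id=True) == event_uuid(update, include_message_id=True))
print(event_uuid(first) == event_uuid(update))
print(event_uuid(first) == event_uuid(moved))
```

With messageId in the key the two messages map to different uuids. Without it they map to the same one, while a change in position still yields a new uuid.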

Here is a query to verify that there are no duplicates:

select e.created_on, e.description, m.json -> 'placeIt', l.geometry, event_interval , e.uuid  from "event" e
join metadata m on e.meta_data_id = m.id 
join "location" l on e.location_id = l.id
where e.origin = 'PROVINCE_BZ'
and e.created_on > '2023-04-01 00:00:00.000'
order by  m.json -> 'placeIt' desc 
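
The query above only sorts the rows so that duplicates end up next to each other. As a rough sketch, the rows could also be checked programmatically. The helper below is hypothetical: it assumes the rows were already fetched as dictionaries with the keys `place_it` and `event_interval`, and it treats two rows with the same place and interval as duplicates.

```python
from collections import Counter


def find_duplicates(rows):
    """Return the (place, interval) pairs that occur in more than one row."""
    counts = Counter((row["place_it"], row["event_interval"]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows, not real data from the database.
rows = [
    {"place_it": "A", "event_interval": '["2023-04-02 00:00:00",)'},
    {"place_it": "A", "event_interval": '["2023-04-02 00:00:00",)'},
    {"place_it": "B", "event_interval": '["2023-04-03 00:00:00",)'},
]
print(find_duplicates(rows))
```

An empty result would mean that no place and interval pair appears more than once in the fetched rows.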

On analytics it's a bit difficult to see, because events created before 01.04.2023 are still showing up. So there are still duplicates, but they were created before the change.

I found only one duplicate now, where the position of the event slightly changed. You can find that entry with this query:

select e.created_on, e.description, m.json -> 'placeIt', l.geometry, event_interval , e.uuid  from "event" e
join metadata m on e.meta_data_id = m.id 
join "location" l on e.location_id = l.id
where e.origin = 'PROVINCE_BZ'
and e.id in (3016591, 3016590)

rcavaliere commented 1 year ago

@dulvui that's very good! If you go on analytics and check the data on the map, you can see that the quality is now much better and more "realistic". I need to check further, but I think we also have some issues in the categorization of the events. See for example this:

Screenshot from 2023-04-21 18-39-15

rcavaliere commented 1 year ago

@dulvui I think that for this Data Collector we are now fine with the data stream, but we have issues in the visualization of the information on analytics. Besides the map visualization (see comment above), the events in the tab view also need some corrections.

Screenshot from 2023-04-27 22-02-02

My suggestion is to close this issue and open a new one for the improvements on analytics.