Closed markholdex closed 5 months ago
@markholdex, @zolotokrylin
We have a problem with duplicated events in BigQuery. It is not directly caused by the application inserting duplicated events, but rather by the internal behavior of BigQuery.
I can propose moving our database to MongoDB and using this connector: https://workspace.google.com/u/0/marketplace/app/mongosheet/373297131098
Benefits
Another solution:
Or run a manual cleanup of records in BigQuery once a week:
create or replace table `events.events` as (select distinct * from `events.events`)
@georgeciubotaru thank you for the message. Why is BigQuery behaving like this?
Or run manual cleanup of records in the Bigquery once a week.
If this works, let's think about automating this option.
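A minimal sketch of what that automation could look like, assuming a scheduled job (cron, Cloud Scheduler, or similar) calls a small Python function once a week. The function name `run_weekly_cleanup` is hypothetical, and `client` is assumed to be a `google.cloud.bigquery.Client` or anything with the same `query(sql)` / `result()` shape. The SQL is the statement proposed above, unchanged.

```python
# The cleanup statement proposed in this thread, unchanged.
DEDUP_SQL = (
    "create or replace table `events.events` as "
    "(select distinct * from `events.events`)"
)


def run_weekly_cleanup(client, sql=DEDUP_SQL):
    """Run the dedup statement and wait for it to finish.

    `client` is assumed to be a google.cloud.bigquery.Client; it is passed
    in so that the scheduler (or a test) decides how credentials are set up.
    """
    job = client.query(sql)  # submits the query job
    job.result()             # blocks until the job completes, raises on failure
    return job
```

Note that `select distinct *` only removes rows that are identical in every column, so this only helps if the duplicates are exact copies.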
Log: Some additional info https://dev.to/idrisrampurawala/handling-duplicates-in-bigquery-3aae
Log: I will try to handle the duplicated events together with @markholdex. Further logs will be added.
@georgeciubotaru is it because we send events in batches, and some events in a batch get into BigQuery while others don't?
Log: pr-time-tracker
Log: The existing events PR_APPROVED and PR_REJECTED were updated to comply with our needs: the sender is the owner of the PR, and these events should be treated as an approval or rejection received by that user. The newest events, PR_REVIEW_{APPROVE|REJECT|COMMENT}, are for the reviewer's side (the sender gave a review).
@markholdex FYI
After running an investigation in the spreadsheet, I noticed that we have a lot of duplicated events. They are easy to spot by checking the PR_MERGED or PR_OPENED events where the PR id and timestamp are identical. This flood of duplicated data is distorting my analysis.
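For reference, the check described above can be sketched in Python roughly like this. The field names `type`, `pr_id` and `timestamp` are assumptions for illustration only; the real column names in the export may differ.

```python
from collections import Counter

# Assumed column names; adjust to the actual event schema.
KEY_FIELDS = ("type", "pr_id", "timestamp")


def event_key(event, key_fields=KEY_FIELDS):
    """Build the tuple that identifies an event for duplicate detection."""
    return tuple(event[field] for field in key_fields)


def find_duplicates(events, key_fields=KEY_FIELDS):
    """Return {key: count} for every key that occurs more than once."""
    counts = Counter(event_key(e, key_fields) for e in events)
    return {key: n for key, n in counts.items() if n > 1}


def drop_duplicates(events, key_fields=KEY_FIELDS):
    """Keep the first event for each key and preserve the original order."""
    seen = set()
    unique = []
    for event in events:
        key = event_key(event, key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

Unlike `select distinct *`, this treats two rows as duplicates when only the key fields match, so it would also catch copies that differ in some other column.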