holdex / pr-time-tracker

https://autoinvoice-theta.vercel.app

Problem: GitHub events are duplicated #248

Closed markholdex closed 5 months ago

markholdex commented 5 months ago

After investigating the data in the Spreadsheet, I noticed that we have a lot of duplicated events. They are easy to spot by checking the PR_MERGED or PR_OPENED events where the PR id and timestamp are identical. This flood of duplicate data is skewing my analysis.
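
The check described above (same event type, PR id, and timestamp) can be sketched roughly as follows. This is only an illustration: the field names `type`, `pr_id`, and `timestamp` and the sample rows are assumptions, not the project's actual schema or data.

```python
from collections import Counter

# Hypothetical sample rows; field names and values are assumptions for illustration.
events = [
    {"type": "PR_OPENED", "pr_id": 101, "timestamp": "2023-05-01T10:00:00Z"},
    {"type": "PR_OPENED", "pr_id": 101, "timestamp": "2023-05-01T10:00:00Z"},
    {"type": "PR_MERGED", "pr_id": 101, "timestamp": "2023-05-02T09:30:00Z"},
]


def find_duplicates(rows):
    """Return {(type, pr_id, timestamp): count} for keys that occur more than once."""
    counts = Counter((r["type"], r["pr_id"], r["timestamp"]) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}


duplicates = find_duplicates(events)
print(duplicates)
```

Any key reported here appears more than once and is a candidate for cleanup.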

georgeciubotaru commented 5 months ago

@markholdex, @zolotokrylin

We have this problem with duplicated events in BigQuery. It is not caused directly by the application inserting duplicated events, but rather by how BigQuery works internally.

I propose moving our database to MongoDB and using this connector: https://workspace.google.com/u/0/marketplace/app/mongosheet/373297131098

Benefits

Another solution: run a manual cleanup of the records in BigQuery once a week.

create or replace table `events.events` as (select distinct * from `events.events`)
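
Note that `select distinct *` only drops rows that are identical in every column. If duplicates can differ in some column, a key-based cleanup may be needed instead. A minimal sketch of that idea, assuming hypothetical fields `type`, `pr_id`, `timestamp`, and an illustrative `inserted_at` column that is not confirmed to exist in `events.events`:

```python
def dedupe_by_key(rows, key_fields=("type", "pr_id", "timestamp")):
    """Keep the first row for each key, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: the first two share a key but differ in `inserted_at`,
# so a row-level DISTINCT would keep both of them.
rows = [
    {"type": "PR_OPENED", "pr_id": 7, "timestamp": "t1", "inserted_at": "a"},
    {"type": "PR_OPENED", "pr_id": 7, "timestamp": "t1", "inserted_at": "b"},
    {"type": "PR_MERGED", "pr_id": 7, "timestamp": "t2", "inserted_at": "c"},
]

cleaned = dedupe_by_key(rows)
print(len(cleaned))
```

Which columns form the key is a design choice; the three used here follow the description at the top of this issue.
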
zolotokrylin commented 5 months ago

@georgeciubotaru thank you for the message. Why is BigQuery behaving like this?

Or run manual cleanup of records in the Bigquery once a week.

If this works, let's think about automating it.

georgeciubotaru commented 5 months ago

Log: Some additional info: https://dev.to/idrisrampurawala/handling-duplicates-in-bigquery-3aae

georgeciubotaru commented 5 months ago

Log: I will try to handle the duplicated events together with @markholdex. Further logs will be added.

zolotokrylin commented 5 months ago

@georgeciubotaru is it because we insert in batches, and some events in a batch get into BQ while others don't?

georgeciubotaru commented 5 months ago

Log:

georgeciubotaru commented 5 months ago

Log: The existing events PR_APPROVED and PR_REJECTED were updated to fit our needs: the sender is the owner of the PR, so these events should be read as an approval or rejection received by that user.

The newest events, PR_REVIEW_{APPROVE|REJECT|COMMENT}, are for reviewers (the sender is the one who gave the review).

@markholdex FYI