mozilla / participation-metrics-org

Participation metrics planning repository

Setup the Data Retention + Filter feature #199

Closed by canasdiaz 5 years ago

canasdiaz commented 5 years ago

To set up the recently deployed data-retention feature, we need some parameters from our Mozilla folks. This ticket will be closed when the setup is ready and deployed.

canasdiaz commented 5 years ago

@hmitsch can u please confirm the following:

hmitsch commented 5 years ago

This is correct. The last bullet point

  • Data older than 2 years will be removed

is valid for all data sources (git, github, bugzilla, Discourse, Meetup, etc).

Thanks a lot for checking in!

-Henrik

sduenas commented 5 years ago

We have incorporated these features in the new release of GrimoireLab: https://github.com/chaoss/grimoirelab/blob/b577ef2e7423c23621fa84c96fba44645bc02e2f/releases/NEWS#L1

Meetup fields were updated in: https://github.com/chaoss/grimoirelab-perceval/issues/504

@sanacl we should be ready to deploy the new release and activate data retention when @hmitsch approves it.

hmitsch commented 5 years ago

Niiiice! Yes, I do approve. Let's ship this, @sduenas and @sanacl!

-Henrik

canasdiaz commented 5 years ago

We are about to deploy this, @hmitsch. ~The only con is we'll have to download all the information from scratch, so until Monday/Tuesday we won't have updated data.~ The only con is that we'll have to download the Meetup information from scratch, not everything as I said before.

canasdiaz commented 5 years ago

We are having memory issues, @hmitsch. The new feature consumes more memory, so we would need a VM with more resources. Can you provide this?

canasdiaz commented 5 years ago

ping @hmitsch @johngian

johngian commented 5 years ago

The easiest would be to scale up the cluster. Would this help ?

canasdiaz commented 5 years ago

> The easiest would be to scale up the cluster. Would this help ?

Hi again @johngian , we had memory issues with the VM where the data processing software is running. But, my team just warned me that this may change with the latest release. So, in order to avoid asking for resources we don't need I will monitor this for a week and if the issue is not seen again I will discard this request.


@hmitsch we are having a look at everything to make sure this is running before sending this ticket to the 'Done' column

canasdiaz commented 5 years ago

Hi @hmitsch, we've been reviewing the behavior of the data retention feature. Data is being removed as expected, but we've seen some corner cases that you should be aware of. Data is removed when it is older than 2 years, but in some special cases data is retrieved first and only removed after an intermediate check, which means some items are stored temporarily in the raw index. Let's go through these special cases one by one:

  1. Remo. Because of how the Remo API works, we need to download everything from scratch to get the activities correctly. As soon as this is done, a process runs to keep only the data that falls within the data retention time frame.
  2. GitHub. Some repositories have no activity in the configured data retention time frame, so our software downloads the data for those repos (there are just a few). An intermediate process, in charge of keeping data according to the retention time, then removes that data.
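To illustrate the intermediate check described above, here is a minimal sketch of the idea, not the actual GrimoireLab code. The function name `apply_retention`, the `updated_on` field, and the approximation of "2 years" as 730 days are all assumptions made for the example: items are first downloaded into a list standing in for the raw index, and a later pass drops everything older than the retention cutoff.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "2 years" approximated as 730 days for this sketch.
RETENTION = timedelta(days=2 * 365)


def apply_retention(items, now, retention=RETENTION):
    """Return only the items updated within the retention window.

    Hypothetical helper: `items` is a list of dicts with an
    'updated_on' datetime, standing in for documents in a raw index.
    """
    cutoff = now - retention
    return [item for item in items if item["updated_on"] >= cutoff]


# Items as they might look right after a full download from scratch.
now = datetime(2020, 1, 1, tzinfo=timezone.utc)
raw_index = [
    {"id": "a", "updated_on": datetime(2017, 6, 1, tzinfo=timezone.utc)},
    {"id": "b", "updated_on": datetime(2019, 6, 1, tzinfo=timezone.utc)},
]

# The intermediate pass: item "a" is older than the cutoff and is dropped.
kept = apply_retention(raw_index, now)
```

Between the download and the cleanup pass, item "a" exists in the list, which mirrors the temporary storage in the raw index mentioned above.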

@hmitsch we would like to get your ACK before marking this as Done.

I'm adding @sduenas to the loop just in case we need him.

hmitsch commented 5 years ago

This looks great!


Thanks so much!