mozilla / participation-metrics-org

Participation metrics planning repository

Perceval connector questions: Filtering fields and data retention #184

Closed hmitsch closed 5 years ago

hmitsch commented 5 years ago

We have questions around data collection and retention.

Is it possible to drop (not retrieve) certain data fields from Perceval connectors?

Example: We are not interested in retrieving group_topics for Meetup data.

Additionally, we would like to know if data retention can be configured on a data source basis.

Example: Retain mailing lists for 3 weeks and Github contributions for 2 years.

canasdiaz commented 5 years ago

I'm discussing this with our engineering team today. Stay tuned.

canasdiaz commented 5 years ago

One question to make our discussion more useful, @hmitsch: is your data retention policy aimed at the information shown on the dashboard, or is it also aimed at the raw information we use to produce what we show?

Let me make this clearer with an example. Suppose we provide a GitHub contribution index for the dashboard that follows the policy (last 2 years), but we also keep a raw index (not visible from the dashboard, but reachable with ES queries) that holds all the information. Would that be OK?

hmitsch commented 5 years ago

Is your data retention policy aimed at the information shown on the dashboard, or is it also aimed at the raw information we use to produce what we show?

It is also aimed at the raw information. Basically, we don't want to keep any data that we don't (visually) expose.

Best regards, Henrik

canasdiaz commented 5 years ago

Hi @hmitsch ,

About the data retention question: by default, our product keeps all the information. Let's recap the information we have:

If we don't have all the information:

That said, a brute-force approach could be tested for the raw/enriched indexes: deleting items older than a certain period of time. It would be more complex for the identities database, though, as that database holds no relationship between the identities and the time frame in which they were active.

Regarding dropping certain data fields from the Perceval connector: this is something the tool does not provide. We have two options here:
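As a rough illustration of the brute-force idea for the raw/enriched indexes, the sketch below builds the body of an Elasticsearch delete-by-query request that matches documents older than a cutoff. The function name `build_delete_query` and the date field passed to it are assumptions made for this example, not names confirmed in this thread; the actual date field depends on how each index is mapped.

```python
from datetime import datetime, timedelta, timezone


def build_delete_query(date_field, max_age_days, now=None):
    """Return a delete-by-query body matching documents older than max_age_days.

    date_field is whichever date field the target index uses (an assumption
    here; check the index mapping before running anything destructive).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    # A range query with "lt" matches documents strictly older than the cutoff.
    return {"query": {"range": {date_field: {"lt": cutoff.isoformat()}}}}


# Hypothetical usage: keep roughly the last 3 weeks of a mailing list index.
body = build_delete_query("updated_on", max_age_days=21)
```

The resulting dictionary would be sent to the index's `_delete_by_query` endpoint. This only covers the indexes; it does nothing for the identities database, which is the harder case described above.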

hmitsch commented 5 years ago

Hi @sanacl,

thanks for the detailed answer. This is useful. The brute force approach you mention (delete items older than X) would be interesting. Is there a way you can test if that works?

Would it be possible to add a feature request for filtering of elements in the Perceval connector? Is that something that could be added fairly easily to your product (not asking about timing, only about complexity)?

Best regards, Henrik

canasdiaz commented 5 years ago

Hey @hmitsch , in order to make the conversation with my team more efficient I'm going to try to sum up the two requests you sent us a few days ago.

Data retention

The idea here is to be able to drop data older than a defined threshold. This threshold could be customized for each data source. E.g., we could keep the last 2 years of GitHub activity and the last 6 months of Meetup activity. Ideally, the studies created from the enriched information should not be affected by the data retention policy. The retention policy would be applied to all the information gathered, which includes the Sorting Hat identities information.
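A minimal sketch of what a per-source retention policy could look like, assuming each item carries a source name and a date. The `RETENTION_DAYS` table, the item layout, and the function names are hypothetical and only illustrate the request; 730 and 180 days approximate the 2 years and 6 months from the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy table: data source name -> retention period in days.
RETENTION_DAYS = {"github": 730, "meetup": 180}


def is_expired(source, item_date, now, retention=RETENTION_DAYS):
    """Return True when an item is older than its source's retention period."""
    max_age = retention.get(source)
    if max_age is None:
        return False  # no policy configured for this source: keep the item
    return item_date < now - timedelta(days=max_age)


def apply_retention(items, now):
    """Keep only the items that are still inside their retention window."""
    return [it for it in items if not is_expired(it["source"], it["date"], now)]
```

Sources without an entry in the table are kept by default in this sketch; a stricter design could drop them instead.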

Selective download

To avoid downloading data that will not be analyzed, the product must offer a way to exclude certain fields from being downloaded. This feature should be applicable to all the data sources.
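To make the request concrete, here is a sketch of field filtering applied to an item after retrieval and before storage. It assumes an item is a dictionary whose payload lives under a `data` key; the helper `drop_fields` and the dotted-path convention are hypothetical, not an existing Perceval feature. Note that this removes fields after they have been fetched, so it does not by itself avoid the download.

```python
import copy


def drop_fields(item, excluded):
    """Return a copy of item with the given fields removed from its payload.

    excluded holds field names relative to item["data"]; a dotted path such
    as "group.topics" addresses a nested field.
    """
    filtered = copy.deepcopy(item)  # leave the caller's item untouched
    for path in excluded:
        node = filtered.get("data", {})
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break  # path does not exist in this item
        if isinstance(node, dict):
            node.pop(leaf, None)
    return filtered


# Hypothetical Meetup-like item, using the group_topics example from this thread.
item = {"uuid": "abc", "data": {"name": "x", "group_topics": ["python"]}}
clean = drop_fields(item, ["group_topics"])
```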


Your feedback will be highly appreciated :)

hmitsch commented 5 years ago

Hi @sanacl,

this looks great, with one exception:

Ideally, the studies created based on the enriched information should not be affected by the data retention policy.

We do not need the studies (e.g. contributor onion) to be adhering to a separate retention policy. Let's aim to keep things simple and transparent and only analyze the data for as long as the data retention period is set.

Hope this helps. Best regards, Henrik

canasdiaz commented 5 years ago

Request received and sent to Bitergia's product manager and CEO. Waiting for info :eyes:

canasdiaz commented 5 years ago

I've just seen that this is WIP for Bitergia, so we can close this ticket @hmitsch

canasdiaz commented 5 years ago

I guess this task is finished, so it could be moved to Done. We should start creating tasks for the deployment of the agreed solution.

Do you agree, @hmitsch? I can create those tickets if you want.

hmitsch commented 5 years ago

@sanacl, yes I agree. Please create those tickets. Thank you!