opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0
66 stars 17 forks source link

`int_repo_metrics_by_project` table shows duplicate metrics #1827

Closed baumstern closed 1 month ago

baumstern commented 1 month ago

Which area(s) are affected? (leave empty if unsure)

No response

To Reproduce

run this query:

SELECT a.project_id, a.project_name, b.artifact_id, b.artifact_name, b.count
FROM `oso_production.projects_v1` a
JOIN (
    SELECT project_id, artifact_id, artifact_name, COUNT(*) AS count
    FROM `oso_production.int_repo_metrics_by_project`
    GROUP BY project_id, artifact_name, artifact_id
    HAVING COUNT(artifact_id) > 1
) b ON a.project_id = b.project_id;

Describe the Bug

As of July 22, 2024, 12:09 UTC, int_repo_metrics_by_project table shows that 100 projects have duplicate entries. The common between this dup entries are that an artifact is belong to multiple project.

query result:

[{
  "project_id": "UNKoHLpnT_OziRiPCuX21uhgLNIJygsxuZz3A0TL-cg\u003d",
  "project_name": "protocol-guild",
  "artifact_id": "VLcSRbfxZ-_93qd5YwRajU-iIO8beWwEcuRFXOuDEQo\u003d",
  "artifact_name": "beaconrunner",
  "count": "2"
}, {
  "project_id": "UNKoHLpnT_OziRiPCuX21uhgLNIJygsxuZz3A0TL-cg\u003d",
  "project_name": "protocol-guild",
  "artifact_id": "siL8ezhji41Dn0HMDKo1DHYSoXfuwDc8ft06O7feD60\u003d",
  "artifact_name": "distributed-validator-specs",
  "count": "2"
}, {
  "project_id": "UNKoHLpnT_OziRiPCuX21uhgLNIJygsxuZz3A0TL-cg\u003d",
  "project_name": "protocol-guild",
  "artifact_id": "lRTiBd_yQhrNJJipPl_SwSZDyF6_z9ZxDNiOJe-bPFw\u003d",
  "artifact_name": "eth-portal",
  "count": "2"
}, {
  "project_id": "7oTuZbUFg8jwdGTUp8SnLKd2dqHqoqyMsVXf3Xih6Fw\u003d",
  "project_name": "beaconrunner-ethereum",
  "artifact_id": "VLcSRbfxZ-_93qd5YwRajU-iIO8beWwEcuRFXOuDEQo\u003d",
  "artifact_name": "beaconrunner",
  "count": "2"
}
....
]

see the full query result here.

related issue: https://github.com/opensource-observer/oso/issues/1781

Expected Behavior

no dup

baumstern commented 1 month ago

I suspect the duplication arises because a single artifact may be associated with multiple projects. For example, both the protocol-guild and beaconrunner-ethereum projects include the beaconrunner artifact, which is a pattern that appears to be common among other entries as well.

ccerv1 commented 1 month ago

@baumstern yes! We do, knowingly, have a few repos shared among projects, mostly ones associated with the ethereum org space (or forks of its repos). However, the artifact metrics should not be double-counted. So I'm glad you found this.

It appears that this issue would also affect the privacy-scaling-explorations because we index both the org space and some its individual repos.

ccerv1 commented 1 month ago

hey @baumstern - pushing a simple fix now! thanks for your help testing this. separately will need to do some manual cleanup work to handle the repos that are owned by more than one projects.