thoth-station / storages

Storage and database adapters for project Thoth
https://thoth-station.github.io/
GNU General Public License v3.0
14 stars 16 forks source link

python_package_license table has duplicate value #2713

Closed harshad16 closed 2 years ago

harshad16 commented 2 years ago

Bug description

While working #2703, we found out that python_package_license is having duplicate values.

ID
Name identifier version
1 UNDETECTED UNDETECTED UNDETECTED
2 Apache License 2.0 Apache-2.0 2.0
3 Mozilla Public License 2.0 MPL-2.0 2.0
4 MIT License MIT LICENSE-WITHOUT-VERSION
5 Mozilla Public License 2.0 MPL-2.0 2.0
6 MIT License MIT LICENSE-WITHOUT-VERSION

Steps to Reproduce

Steps to reproduce the behavior:

  1. See the duplication in stage database

Expected behavior

No duplicate values.

harshad16 commented 2 years ago

/triage accepted /sig stack-guidance /priority critical-urgent

harshad16 commented 2 years ago

As there was no distinction or uniqueness added in the table https://github.com/thoth-station/storages/blob/a95f2618ee94f0e519ffa93f5f6c93a9cdba0658/thoth/storages/graph/models.py#L99 the duplication was happening. By making a unique index constriant between license_name, license_identifier, license_version, we resolved the issue. https://github.com/thoth-station/storages/commit/b357f951ec5633facf8e498af5339b108b5b9d49

The database is updated with this constraint with the latest release of storage #2712

Also the data is updated , to fix the issue in already existing data:

update the foreign key:

update python_package_version a
  set package_license = b.newID
  from (select c.id, d.reqID as newID
    from python_package_license c
      inner join (select license_name, license_identifier, license_version, min(id) as reqID
        from python_package_license
        group by license_name, license_identifier, license_version) as d
    on c.license_name = d.license_name and c.license_identifier=d.license_identifier and c.license_version = d.license_version) as b
  where a.package_license = b.id;

delete the existing duplicate data:

delete
from python_package_license
where id not in ( select min(id) as id
        from python_package_license
        group by license_name, license_identifier, license_version);

closing the issue.