molgenis / molgenis-py-bbmri-eric

MOLGENIS Python tooling for BBMRI-ERIC
GNU Lesser General Public License v3.0

Deleted rows from a staging area are not deleted from the published ERIC tables #36

Closed dtroelofsprins closed 3 years ago

dtroelofsprins commented 3 years ago

When records are deleted from the staging area they should also be removed from the published ERIC tables. This does not happen. It seems the _delete_rows function within the Publisher does not work properly.

dtroelofsprins commented 3 years ago

After an investigation, two reasons were identified why records are not deleted from the published ERIC tables. 1: the return statement in _get_production_ids (publisher.py):

    return {row["id"] for row in rows if row.get("national_node", "") == node.code}

never returns any records, because row.get("national_node", "") returns a dictionary rather than a string, so the comparison with node.code is always false. This means that no deleted records can be identified. This should solve the issue:

    return {row["id"] for row in rows if row.get("national_node", {}).get("id", "") == node.code}
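To make the difference between the two filters concrete, here is a minimal sketch. The rows and the node_code variable are made up for illustration (node_code stands in for node.code), and the row shape is assumed to match what the test fixture uses after the fix:

```python
# Hypothetical rows: "national_node" is a nested reference, not a plain string.
rows = [
    {"id": "bbmri-eric:ID:NO_OUS", "national_node": {"id": "NO"}},
    {"id": "ignore_this_row", "national_node": {"id": "XX"}},
    {"id": "row_without_node"},  # no national_node at all
]
node_code = "NO"  # stands in for node.code

# Original filter: compares a dict (or "") with a string, so nothing matches.
old_ids = {row["id"] for row in rows if row.get("national_node", "") == node_code}

# Proposed filter: looks up the "id" inside the nested reference.
new_ids = {
    row["id"]
    for row in rows
    if row.get("national_node", {}).get("id", "") == node_code
}

print(old_ids)  # set()
print(new_ids)  # {'bbmri-eric:ID:NO_OUS'}
```

Using {} as the default for the outer get keeps the chained .get("id", "") safe for rows that have no national_node.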

2: The current get_quality_info function (bbmri_client.py) returns ALL biobank and collection IDs, including the ones without a quality. Biobanks and collections with a quality must not be deleted, but because the function returns every biobank and collection, they all appear to have a quality and none of them is ever deleted. This of course is not correct.

    def get_quality_info(self) -> QualityInfo:
        """
        Retrieves the quality information identifiers for biobanks and collections.
        :return: a QualityInfo object
        """

        biobank_qualities = self.get(
            "eu_bbmri_eric_biobanks", batch_size=10000, attributes="id,quality"
        )
        collection_qualities = self.get(
            "eu_bbmri_eric_collections", batch_size=10000, attributes="id,quality"
        )

        biobanks = utils.to_upload_format(biobank_qualities)
        collections = utils.to_upload_format(collection_qualities)

        return QualityInfo(
            biobanks={row["id"]: row["quality"] for row in biobanks},
            collections={row["id"]: row["quality"] for row in collections},
        )
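A small sketch of why this mapping protects everything. It assumes that a row without a quality comes back with an empty "quality" list and that the deletion logic only checks whether an ID is present in the mapping; both are assumptions for illustration, not confirmed behaviour of the Publisher:

```python
# Hypothetical rows in upload format; "quality" is assumed to be a list of
# quality IDs that is empty when the biobank has no quality.
biobanks = [
    {"id": "bb_with_quality", "quality": ["qual_1"]},
    {"id": "bb_without_quality", "quality": []},
]

# Same dict comprehension as in get_quality_info.
biobank_qualities = {row["id"]: row["quality"] for row in biobanks}

# A membership check (assumed) cannot tell the two biobanks apart:
has_quality_by_membership = "bb_without_quality" in biobank_qualities

# Checking the value instead would tell them apart:
has_quality_by_value = bool(biobank_qualities["bb_without_quality"])

print(has_quality_by_membership)  # True
print(has_quality_by_value)       # False
```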

Only the rows that have a quality should be returned. Therefore, do not select from eu_bbmri_eric_biobanks and eu_bbmri_eric_collections, but from eu_bbmri_eric_bio_qual_info and eu_bbmri_eric_col_qual_info instead:

    def get_quality_info(self) -> QualityInfo:
        """
        Retrieves the quality information identifiers for biobanks and collections.
        :return: a QualityInfo object
        """

        biobank_qualities = self.get(
            "eu_bbmri_eric_bio_qual_info", batch_size=10000, attributes="id,biobank"
        )
        collection_qualities = self.get(
            "eu_bbmri_eric_col_qual_info", batch_size=10000, attributes="id,collection"
        )

        biobanks = utils.to_upload_format(biobank_qualities)
        collections = utils.to_upload_format(collection_qualities)

        # Group quality-info IDs by the biobank or collection they belong to.
        bb_qual = {}
        for row in biobanks:
            bb_qual.setdefault(row["biobank"], []).append(row["id"])
        coll_qual = {}
        for row in collections:
            coll_qual.setdefault(row["collection"], []).append(row["id"])

        return QualityInfo(biobanks=bb_qual, collections=coll_qual)
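The grouping step can be tried in isolation. This is only a sketch: group_quality_ids is a hypothetical helper, the rows are invented, and it assumes that after utils.to_upload_format the "biobank" reference is a plain ID string:

```python
from collections import defaultdict


def group_quality_ids(rows, key):
    """Map each referenced entity ID to the list of its quality-info IDs."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row["id"])
    return dict(grouped)


# Hypothetical rows from eu_bbmri_eric_bio_qual_info in upload format.
quality_rows = [
    {"id": "qual_1", "biobank": "bb_a"},
    {"id": "qual_2", "biobank": "bb_a"},
    {"id": "qual_3", "biobank": "bb_b"},
]

bb_qual = group_quality_ids(quality_rows, "biobank")
print(bb_qual)  # {'bb_a': ['qual_1', 'qual_2'], 'bb_b': ['qual_3']}
print("bb_c" in bb_qual)  # False
```

A biobank that has no row in the quality-info table never becomes a key, so it is no longer treated as having a quality.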

After these changes, the test_delete_rows function (test_publisher.py) needed a fix. Changed:

        {"id": "bbmri-eric:ID:NO_OUS", "national_node": "NO"},
        {"id": "ignore_this_row", "national_node": "XX"},
        {"id": "delete_this_row", "national_node": "NO"},
        {"id": "undeletable_id", "national_node": "NO"},

into:

        {"id": "bbmri-eric:ID:NO_OUS", "national_node": {"id": "NO"}},
        {"id": "ignore_this_row", "national_node": {"id": "XX"}},
        {"id": "delete_this_row", "national_node": {"id": "NO"}},
        {"id": "undeletable_id", "national_node": {"id": "NO"}},
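For reference, a rough sketch of how these fixture rows are expected to behave. It is not the actual _delete_rows implementation: select_deletable_ids is a hypothetical helper, and which ID is still in staging and which one has a quality is assumed from the row names:

```python
def select_deletable_ids(production_rows, staging_ids, quality_ids, node_code):
    """Return production IDs of this node that are gone from staging
    and are not protected by a quality (assumed deletion rule)."""
    production_ids = {
        row["id"]
        for row in production_rows
        if row.get("national_node", {}).get("id", "") == node_code
    }
    return production_ids - set(staging_ids) - set(quality_ids)


production_rows = [
    {"id": "bbmri-eric:ID:NO_OUS", "national_node": {"id": "NO"}},
    {"id": "ignore_this_row", "national_node": {"id": "XX"}},
    {"id": "delete_this_row", "national_node": {"id": "NO"}},
    {"id": "undeletable_id", "national_node": {"id": "NO"}},
]

deletable = select_deletable_ids(
    production_rows,
    staging_ids={"bbmri-eric:ID:NO_OUS"},  # assumed: still present in staging
    quality_ids={"undeletable_id"},        # assumed: has a quality
    node_code="NO",
)
print(deletable)  # {'delete_this_row'}
```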