informatics-isi-edu / pdb-ihm

Deriva Protein Database Project

Review unique constraints in tables with user-uploaded files #186

Open brindakv opened 8 months ago

brindakv commented 8 months ago

Entry_Related_File and ihm_starting_model_details are tables where users can upload files belonging to an entry, in addition to the primary mmCIF and image files.

Unique constraints in these tables need to be reviewed.

Currently, for Entry_Related_File, multiple records with the same File_Type are allowed. However, when a new record is created using a File_Type that already exists and is in RECORD READY status (i.e., data is already populated in the corresponding table), the existing data is deleted before the data from the new file is loaded. Entry_Related_File then shows two records with the same File_Type, with data from only one of them loaded into the database.

To enforce the correct behavior, a (RCB, Structure_Id, File_Type) unique constraint is required so that multiple records with the same File_Type cannot be created. The user can edit the existing record to fix any errors (ERROR status) or to update data (RECORD READY status).

Note: Although data is deleted before loading data from a new file of the same File_Type, ermrest still throws a duplicate-primary-key error if a new record is created with a file whose rows have the same id values as the data already in ermrest. If the same upload is done by editing the existing record, there is no error. This behavior also suggests the need for the (RCB, Structure_Id, File_Type) unique constraint.
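
For illustration, a minimal sketch of the kind of pre-check that would avoid creating a second record: look up any existing Entry_Related_File record for the same structure_id and File_Type via the ermrest entity API and edit it instead of inserting a new one. This assumes the same deriva-py catalog object used by the backend code quoted later in this thread; the function name and URL-encoding details are illustrative, not the actual backend code.

    from urllib.parse import quote

    def find_existing_related_file(catalog, entry_id, file_type):
        """Return any existing Entry_Related_File rows for this entry and File_Type."""
        # ermrest entity API: conjunction of filters on structure_id and File_Type
        url = '/entity/PDB:Entry_Related_File/structure_id={}&File_Type={}'.format(
            quote(entry_id, safe=''), quote(file_type, safe=''))
        resp = catalog.get(url)
        resp.raise_for_status()
        return resp.json()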

brindakv commented 8 months ago

Add two new unique constraints to Entry_Related_File:

- (RCB, structure_id, File_Type)
- (RCB, structure_id, File_MD5)

ihm_starting_model_details can remain as is.

svoinea commented 8 months ago

There are currently conflicts for the (RCB, structure_id, File_Type) UNIQUE CONSTRAINT:

_ermrest_catalog_1=> select "RID", "File_Name", "RCB", "structure_id", "File_Type" from "Entry_Related_File" where ("RCB", "structure_id", "File_Type") in 
(select "RCB", "structure_id", "File_Type" from "Entry_Related_File" group by "RCB", "structure_id", "File_Type" having count(*) > 1)
order by "structure_id"
;
  RID   |                    File_Name                     |                             RCB                              | structure_id |              File_Type               
--------+--------------------------------------------------+--------------------------------------------------------------+--------------+--------------------------------------
 2-FH0W | Chemical-Crosslinks-from-Experiments_DMTMMv2.csv | https://auth.globus.org/6b934fbb-1e96-4664-a02f-f328711dda8d | D_2-6W26     | Chemical Crosslinks from Experiments
 2-FH0T | Chemical-Crosslinks-from-Experiments_BS3.csv     | https://auth.globus.org/6b934fbb-1e96-4664-a02f-f328711dda8d | D_2-6W26     | Chemical Crosslinks from Experiments
 2-FGZY | Chemical-Crosslinks-from-Experiments_DMTMM.csv   | https://auth.globus.org/6b934fbb-1e96-4664-a02f-f328711dda8d | D_2-6W28     | Chemical Crosslinks from Experiments
 2-FGZW | Chemical-Crosslinks-from-Experiments_BS3.csv     | https://auth.globus.org/6b934fbb-1e96-4664-a02f-f328711dda8d | D_2-6W28     | Chemical Crosslinks from Experiments
(4 rows)

In other words, we have multiple files for the same (RCB, structure_id, File_Type).

brindakv commented 8 months ago

@svoinea to review the backend code to see how deletion is carried out before insertion and understand why the error happens.

@brindakv to update the tooltip: "Only one file of a particular type can be uploaded".

@brindakv to review RCT and content of Entry_Related_File and corresponding ermrest table in entries D_2-6W26 and D_2-6W28 on production and delete duplicate instances.
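
For reference, a hedged sketch of how a duplicate Entry_Related_File record could be removed by RID through the ermrest entity API, using the same catalog.delete pattern as the backend code below. Which of the duplicates listed above should be kept is a curation decision; the RID in the usage comment is only an example from that list.

    def delete_related_file_record(catalog, rid):
        """Delete a single Entry_Related_File record identified by its RID."""
        resp = catalog.delete('/entity/PDB:Entry_Related_File/RID={}'.format(rid))
        resp.raise_for_status()

    # Example: remove one of the duplicate records reported above for D_2-6W26.
    # delete_related_file_record(catalog, '2-FH0T')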

svoinea commented 8 months ago

As I anticipated, the backend code deletes just the entries of the table. Here is the relevant sequence of code:

        """
        Empty first the tname table
        """
        self.logger.debug('Checking if PDB:{}/structure_id={} is empty'.format(tname, entry_id))
        url = '/entity/PDB:{}/structure_id={}'.format(tname, entry_id)
        resp = self.catalog.get(
            url
        )
        resp.raise_for_status()
        deleted_rows = resp.json()
        if len(deleted_rows) > 0:
            try:
                """
                The table is not empty
                """
                url = '/entity/PDB:{}/structure_id={}'.format(tname, entry_id)
                resp = self.catalog.delete(
                    url
                )

Deleting the older records of the Entry_Related_File table is OK. I have already done it on catalogs 99 and 50.

brindakv commented 8 months ago

Thanks @svoinea.

I fixed the duplicate records with the same File_Type in D_2-6W26 and D_2-6W28.

svoinea commented 8 months ago

Added the following constraints on all the catalogs:

Created Key Entry_Related_File_RCB_structure_id_File_Type_key for table PDB:Entry_Related_File
Created Key Entry_Related_File_RCB_structure_id_File_MD5_key for table PDB:Entry_Related_File
Set nullok=False in column PDB:Entry_Related_File:File_Name
Set nullok=False in column PDB:Entry_Related_File:File_URL
Set nullok=False in column PDB:Entry_Related_File:File_MD5
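
For reference, a minimal sketch of how these keys and nullok changes could be applied with deriva-py's ermrest_model interface (Key.define, Table.create_key, Column.alter). The hostname below is a placeholder, and this is not necessarily the script that produced the log above.

    from deriva.core import ErmrestCatalog, get_credential
    from deriva.core.ermrest_model import Key

    hostname = 'pdb.example.org'   # placeholder host, not the real server
    catalog_id = '99'              # one of the catalogs mentioned above
    catalog = ErmrestCatalog('https', hostname, catalog_id,
                             credentials=get_credential(hostname))

    table = catalog.getCatalogModel().schemas['PDB'].tables['Entry_Related_File']

    # Add the two composite unique keys named in the log above.
    for cols in (['RCB', 'structure_id', 'File_Type'],
                 ['RCB', 'structure_id', 'File_MD5']):
        table.create_key(Key.define(
            cols,
            constraint_names=[['PDB', 'Entry_Related_File_{}_key'.format('_'.join(cols))]]))

    # Require the file metadata columns to be non-null.
    for cname in ('File_Name', 'File_URL', 'File_MD5'):
        table.columns[cname].alter(nullok=False)
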
brindakv commented 8 months ago

Tooltip updated on dev.

brindakv commented 8 months ago

@svoinea to update if the unique constraints are applied on dev and production.

@monica81567 to test on production.

hongsudt commented 5 months ago

I am moving this to Review.

brindakv commented 2 days ago

@monica81567 please test this on production.