jamesaoverton / cmi-pb-terminology

CMI-PB Controlled Terminology
0 stars 0 forks source link

Consider using _view tables when checking foreign keys in Python code #60

Open jamesaoverton opened 2 years ago

jamesaoverton commented 2 years ago

I found an example in the CMI-PB data where:

  1. A row from olink_prot_info.tsv ({"uniprot_id":"O14625","olink_id":"OID05624"}) was put in 'olink_prot_info_conflict' because 'uniprot_id' "O12625" is a duplicate in a primary key column.

this is the right behaviour, but then

  1. A row in 'olink_prot_exp' failed validation because the foreign key 'olink_id' "OID05624" was not found in 'olink_prot_info'. They message was "Value OID05624 of column olink_id is not in olink_prot_info.olink_id".

This validation error is correct but still somewhat misleading, since "OID05624" can be found in olink_prot_info.tsv and the view 'olink_prot_info_view.olink_id'.

My first thought was that the SQL foreign key constrain could apply to the 'olink_prot_info_view.olink_id' column of the view, but after a bit of research I think won't work since columns in views cannot be UNIQUE as required for a foreign key: https://www.sqlite.org/foreignkeys.html#fk_indexes

My second thought is that we should have a different validation message in this case. Instead of saying "Value OID05624 of column olink_id is not in olink_prot_info.olink_id", we should say "Value OID05624 of column olink_id is in olink_prot_info_conflict.olink_id" (this could be worded better). This would require the Python code to check in both 'olink_prot_info' and 'olink_prot_conflict'.

Thoughts or suggestions?

lmcmicu commented 2 years ago

It seems that your second thought is, from the user's perspective, different from the first. The idea behind the first is to include the contents of both the normal and conflict tables when checking foreign keys, but in terms of the messages being generated it does not actually change anything. Either the value is (logically speaking) in the foreign column or it isn't. The idea behind your second thought is that we should give the user more information, i.e., that we should explicitly tell the user that the value is not in the "normal" foreign table but that it is in the conflict version of the table.

Regarding the first proposal, although you can't create a unique constraint on a view, there nevertheless is an existing unique constraint on the normal table, which I guess will be used underneath the hood when querying over the view. Since the conflict table will also tend to have have very few rows in it, I would expect that a query against the view would not be much more expensive than a query against the normal table alone, but it is something we should benchmark.

I think the second proposal is preferable, though, since in terms of the actual work that needs to be done to get the info from the db it is no more expensive, and this is useful information to give to the user. I would suggest something like "Value OID05624 of column olink_id only exists in olink_prot_info_conflict.olink_id".

lmcmicu commented 2 years ago

I haven't thought of a better idea than your first and second proposals but will comment if I think of something.

jamesaoverton commented 2 years ago

We'll try the second approach.