jhu-bids / TermHub

Web app and CLI tools for working with biomedical terminologies. https://github.com/orgs/jhu-bids/projects/9/views/7
https://bit.ly/termhub
GNU General Public License v3.0

Bug: `test_n3c` schema not fully / properly set up (& tests) #804

Open joeflack4 opened 1 month ago

joeflack4 commented 1 month ago

Overview

Tables are created and data is copied over from n3c. However, no additional constraints, such as PKeys, are set.

Sub-tasks

(3) and (4) are currently lower priority because I'm not sure that any tests rely on these, and especially (4) because it would be time-consuming to do. I think no tests rely on this setup because mostly I'm just using test_n3c for inserts, and other tests are executing on the actual n3c schema.

4. Correct relational data between tables

Basically, we're setting up these tables by copying the first 50 rows from each table. However, this is not correct from a "relational data" perspective.

What's meant by "relational data" is this: a table like code_sets is primary. Perhaps it is the most important, or even the only, primary table. For every code_set in that table, we only need the entries in concept_set_container, concept_set_version_item, and concept_set_members that apply to these code sets. Further, we can then filter the concept table to include only those concepts which are listed in the members table. Then we can filter concept_relationship and concept_ancestor by what's left in concept. Then, once these core tables are set up, any derived tables can be updated by running refresh_derived_tables().
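A minimal sketch of that filtering order, working on in-memory rows rather than the real database. The column names (`codeset_id`, `concept_set_name`, `concept_id`, `concept_id_1`, `concept_id_2`, `ancestor_concept_id`, `descendant_concept_id`) and the function name `subset_relational` are assumptions for illustration, not taken from the actual schema or codebase. Requiring both endpoints of a relationship or ancestor row to survive is also an assumption.

```python
from typing import Dict, List

Row = Dict[str, object]
Tables = Dict[str, List[Row]]


def subset_relational(tables: Tables, n_code_sets: int = 50) -> Tables:
    """Build a relationally consistent subset, driven by the first n code sets.

    Hypothetical helper: table keys match the names in this issue, but the
    column names are assumed.
    """
    # 1. code_sets is the primary table: take the first n rows.
    code_sets = tables["code_sets"][:n_code_sets]
    codeset_ids = {r["codeset_id"] for r in code_sets}
    set_names = {r["concept_set_name"] for r in code_sets}

    # 2. Keep only rows in the concept set tables that apply to those code sets.
    containers = [r for r in tables["concept_set_container"]
                  if r["concept_set_name"] in set_names]
    items = [r for r in tables["concept_set_version_item"]
             if r["codeset_id"] in codeset_ids]
    members = [r for r in tables["concept_set_members"]
               if r["codeset_id"] in codeset_ids]

    # 3. Keep only concepts listed in the filtered members table.
    concept_ids = {r["concept_id"] for r in members}
    concepts = [r for r in tables["concept"] if r["concept_id"] in concept_ids]

    # 4. Keep only relationship / ancestor rows whose two concepts both survive
    #    (assumption: a row with one missing endpoint is dropped).
    relationships = [r for r in tables["concept_relationship"]
                     if r["concept_id_1"] in concept_ids
                     and r["concept_id_2"] in concept_ids]
    ancestors = [r for r in tables["concept_ancestor"]
                 if r["ancestor_concept_id"] in concept_ids
                 and r["descendant_concept_id"] in concept_ids]

    return {
        "code_sets": code_sets,
        "concept_set_container": containers,
        "concept_set_version_item": items,
        "concept_set_members": members,
        "concept": concepts,
        "concept_relationship": relationships,
        "concept_ancestor": ancestors,
    }
```

Derived tables are deliberately left out here, since they would be rebuilt afterwards by refresh_derived_tables().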

Perhaps the best way to achieve this is by updating initialize() so that the "setup test schema" part of it does its own initialization, basically subsetting the code_set dataset first, and then filtering the other datasets like that, and then uploading. But this will also likely be slower than just doing something similar using the already existing SQL tables.

Also, have to consider how slow it is to do this. Right now I'm running remakes of the test schema at the start of every test suite. If it is too slow, we could consider adding some sort of caching. But we'd have to commit those cached files too, otherwise the GitHub action tests would also run quite slowly.
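One possible shape for the caching idea, as a rough sketch only. It assumes the subsetted tables can be serialized to a single JSON file that gets committed to the repo; the function name `load_or_build_subset`, the `build` callable, and the `force` flag are all hypothetical and do not exist in the codebase.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

Tables = Dict[str, List[dict]]


def load_or_build_subset(cache_file: Path,
                         build: Callable[[], Tables],
                         force: bool = False) -> Tables:
    """Return the cached test subset if present, otherwise build and cache it.

    Hypothetical helper: `build` stands in for whatever slow step creates the
    relationally consistent subset.
    """
    if cache_file.exists() and not force:
        # Fast path: reuse the committed cache file, skipping the slow build.
        return json.loads(cache_file.read_text())

    # Slow path: build the subset once, then write it out for later runs.
    tables = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(tables, sort_keys=True))
    return tables
```

With something like this, a test suite would only pay the full cost when the cache file is missing or when `force=True` is passed, for example after the source tables change.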