ncbo / bioportal-project

Serves to consolidate (in Zenhub) all public issues in BioPortal
BSD 2-Clause "Simplified" License

Mapping counts need to be updated/refreshed on a regular basis #210

Open mdorf opened 3 years ago

mdorf commented 3 years ago

There is a scheduled CRON job that runs weekly to re-generate total mapping counts and mapping pair counts between classes in different ontologies. Unfortunately, this job has never been able to execute fully in the production environment due to 4store crashes. This behavior is documented in ncbo/ncbo_cron#39.

We succeeded in completing this job in an isolated 4store environment populated with production data. The newly generated mapping count graphs were then exported into the production 4store instance. Unfortunately, this does not qualify as a permanent solution. We need to find an alternative, whether through code optimization or a separate process, that keeps the mapping counts refreshed regularly. This covers both the total counts from an ontology to all other ontologies and the pair counts of mappings between individual ontologies.

mdorf commented 3 years ago

Temporary Workaround

A temporary workaround involves duplicating the production 4store instance, running the Mapping Counts Generator script against the duplicate instance, and then migrating the newly generated MappingCount graphs back to the production 4store instance.


  1. Stop ncbo_cron in Prod

  2. Copy Prod 4store data to a secondary 4store instance:

    $ 4s-dump http://<prod 4store>/sparql/ -f mapping_count_graph
    $ cat mapping_count_graph
    http://data.bioontology.org/metadata/MappingCount

           4s-dump creates a directory named 'data' containing the graphs listed in the file specified by the -f flag

  3. Change ncbo_cron's config.rb file to point to the secondary 4store instance

  4. Kick off the Mapping Counts Generator script within CRON

  5. After Mapping Counts Generator script completes its run, export the MappingCount graph and import it into the Prod 4store instance:

    find data -type f | 4s-restore <kb_name>

           Make sure that any stale data directory is removed, so that nothing other than the MappingCount graph gets loaded
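
The export and import in step 5 could be wrapped in a small script so that the stale `data` directory check is not forgotten. This is only a sketch, not the script actually used here: the function names, the `DUMP_CMD` / `RESTORE_CMD` overrides, and the choice to abort when `data` already exists are all assumptions made for illustration. The `4s-dump` and `4s-restore` invocations themselves mirror the commands shown in the steps above, and the endpoint and KB name are left as arguments for the operator to fill in.

```shell
#!/usr/bin/env bash
# Sketch of step 5: export the MappingCount graph, then restore it elsewhere.
# Hypothetical names: GRAPH_URI, DUMP_CMD, RESTORE_CMD and the three functions.

GRAPH_URI="http://data.bioontology.org/metadata/MappingCount"
DUMP_CMD="${DUMP_CMD:-4s-dump}"          # override for a dry run
RESTORE_CMD="${RESTORE_CMD:-4s-restore}" # override for a dry run

# Write the graph-list file that 4s-dump reads through its -f flag.
write_graph_list() {
  printf '%s\n' "$GRAPH_URI" > "$1"
}

# Dump the listed graph from a SPARQL endpoint into ./data.
# Aborts if ./data already exists, so leftover files are never restored.
export_mapping_counts() {
  local endpoint="$1" list_file="$2"
  if [ -e data ]; then
    echo "refusing to dump: ./data already exists, remove it first" >&2
    return 1
  fi
  write_graph_list "$list_file"
  "$DUMP_CMD" "$endpoint" -f "$list_file"
}

# Feed every dumped file to 4s-restore for the target knowledge base.
import_mapping_counts() {
  local kb_name="$1"
  find data -type f | "$RESTORE_CMD" "$kb_name"
}
```

After sourcing the file, a run would look like `export_mapping_counts <sparql endpoint> mapping_count_graph` on the instance where the counts were generated, followed by `import_mapping_counts <kb_name>` against Prod, and finally removing `data` by hand.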