re1 / jacq-javaee

https://development.senegate.at/confluence/display/JACQ

[names] Use OpenUp! database for caching #15

Open re1 opened 4 years ago

re1 commented 4 years ago

The caching database is required to work independently of the JACQ services and has to be accessed remotely. A working database for the OpenUp! project already exists and will also be used by the main JACQ services.

The existing OpenUp! database has to be fully implemented to match the OpenUp! functionality. In order to prevent data corruption by new service features, sources without an endpoint in particular are to be preserved!

Caching process

Queries are currently cached in the OpenUp! project as described in Confluence (Common Names Webservice Caching System) and are to be cached the same way by the JACQ Common Names Webservice.

:information_source: Queries are identified by their SHA1 hash.
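Because the hash, not the raw query, is the cache key, a lookup only needs the hex-encoded SHA-1 digest of the query string. A minimal sketch in Java; the class and method names are illustrative and not part of the JACQ code base:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class QueryHash {

    // Hex-encoded SHA-1 digest of a query string, as used to identify cached queries.
    static String sha1Hex(String query) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(query.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha1Hex("Homogyne discolor Cass."));
    }
}
```

A cached entry would then be found by comparing this digest against the stored hash column.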

Implementation

The implementation process can roughly be split into 5 steps for reference:

  1. [x] Define interfaces for all existing tables and prepare for remote caching
  2. [x] Check incoming requests against the remote cache and compare timestamps
  3. [x] Retrieve data from source caches where no source endpoint is available
  4. [x] Write uncached and outdated queries to source caches
  5. [x] Compare local web service cache against remote web service cache and decide on the final implementation.

Step 5 largely depends on the development process as further insight might change the scope of this issue.
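Steps 2–4 above boil down to one decision per request: serve the cached response, query the live source (and write the result back), or fall back to the stale cache entry when the source has no endpoint. A hypothetical sketch of that decision logic; all names and the TTL-based freshness check are assumptions, not existing JACQ code:

```java
import java.time.Duration;
import java.time.Instant;

public class CachePolicy {

    enum Action { SERVE_CACHED, QUERY_SOURCE, SERVE_STALE, NO_RESULT }

    // Decide how to answer a request by comparing the cached entry's timestamp
    // against a freshness window, falling back to the (possibly outdated)
    // cache entry when the source has no live endpoint.
    static Action decide(Instant cachedAt, Instant now, Duration maxAge,
                         boolean sourceHasEndpoint) {
        if (cachedAt == null) {
            // Nothing cached yet: only a live source can answer; the fresh
            // response would then be written back to the cache (step 4).
            return sourceHasEndpoint ? Action.QUERY_SOURCE : Action.NO_RESULT;
        }
        if (!cachedAt.plus(maxAge).isBefore(now)) {
            return Action.SERVE_CACHED; // entry is still fresh (step 2)
        }
        // Outdated entry: refresh if possible, otherwise serve what we have (step 3).
        return sourceHasEndpoint ? Action.QUERY_SOURCE : Action.SERVE_STALE;
    }
}
```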

re1 commented 4 years ago

The following chart shows the currently documented caching system using the OpenUp! database.

Source: JACQ Common Names Webservice Caching System

re1 commented 4 years ago

In the following chart, additional information has been added to better explain the process of caching and using web service results.

webservice-caching

The web service cache might already include responses from sources that are no longer available; these are still to be included in the results.

re1 commented 4 years ago

The following web service sources were originally used in the OpenUp! project and are therefore still cached. Although not all sources are still online, service implementations are to be defined in order to query their cached responses.

| id | url |
| --- | --- |
| 1 | http://www.catalogueoflife.org/col/webservice?response=full&format=php&name= |
| 2 | http://131.130.131.9/taxamatch/jsonRPC/json_rpc_taxamatchMdld.php |
| 3 | http://www.eu-nomen.eu/portal/soap.php?wsdl=1 |
| 4 | http://webtjenester.artsdatabanken.no/Artsnavnebase.asmx?WSDL |
| 5 | http://wboe.oeaw.ac.at/api/taxonid/ |
| 6 | http://ws.luomus.fi/triplestore/search?predicate=dwc:scientificName&object= |
| 7 | https://taxon.artdatabankensoa.se/TaxonService.svc?wsdl |
| 8 | http://www.meertens.knaw.nl/pland/rest/?format=php&plantnaam= |
| 9 | http://localhost:8080/jacq-names/rest/names/common/?query= |
re1 commented 4 years ago

The following services are to be added to the OpenUp! web service cache once they implement WebServiceCache (ids can change!):

| id | url |
| --- | --- |
| 10 | http://www.dnp.go.th |
| 11 | http://ylist.info/ |

```sql
INSERT INTO tbl_service
VALUES (10, 'http://www.dnp.go.th'),
       (11, 'http://ylist.info/');
```
re1 commented 4 years ago

The WBÖ / DBÖ source (5) uses a response format similar to that of the JACQ Legacy source, and in the OpenUp! implementation it also uses the JACQ Legacy (NHMW) service interface to fetch the original web service response.

The following example uses the query hash 40ce6cd2e9e0bf2aee00bd81c569b83ee4509749 (likely some form of Homogyne discolor Cass.).

```
s:856:"[
  {
    "taxon_id": "2892",
    "geography": "aut",
    "scope": 100,
    "dboe_id": "beleg:127439",
    "name": "Rahmblödschal ",
    "language": "bars",
    "match": true,
    "reference": "http:\/\/dboema.acdh.oeaw.ac.at\/dboe\/beleg\/127439",
    "type": ["\/name\/common"]
  }, { ... }, { ... }
]
```

For now, the JACQ CNS will only use cached results unless stated otherwise. So far, no working queries have been found to test this source.
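Since the cached response shown above is a PHP-serialized string whose payload is itself JSON, a consumer on the JACQ side would first have to strip the `s:<length>:"…";` envelope before handing the payload to a JSON parser. A hedged sketch; this helper is hypothetical and not existing JACQ code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhpStringUnwrap {

    // A PHP-serialized string has the form s:<byte length>:"<payload>"; .
    // The declared byte length is not validated here.
    private static final Pattern PHP_STRING =
            Pattern.compile("(?s)^s:\\d+:\"(.*)\";?$");

    // Unwraps the serialized-PHP envelope so the inner JSON payload can be
    // passed to a regular JSON parser.
    static String unwrap(String serialized) {
        Matcher m = PHP_STRING.matcher(serialized.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a serialized PHP string");
        }
        return m.group(1);
    }
}
```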

re1 commented 4 years ago

The caching strategy has been updated with web service response caching functionality, including incremental caching, as described in [names] Incremental Web service caching and timeouts (#21).

The following chart shows how common names are collected from both cached and uncached static and Web Service sources.

webservice-caching

re1 commented 4 years ago

The existing OpenUp! database caches web service responses in serialized PHP format. Most of the already cached responses are therefore hard to read and require a considerable amount of extra work to process. It might be a good idea to reevaluate the serialization or format of responses.

re1 commented 4 years ago

The OpenUp! database table tbl_webservice_cache contains many values that are not relevant to the application. Such values can be removed to improve search times. The following values are to be removed:

In addition, there are about 500 queries with unusable responses that include the query text, in a format like so:

Those values were found using the following query and checking its result:

```sql
SELECT COUNT(response) AS c, response
FROM tbl_webservice_cache
GROUP BY response
HAVING c > 1
ORDER BY LENGTH(response);
```

As deleting by comparing against the query takes considerably long, the values for deletion are added manually:

```sql
DELETE
FROM tbl_webservice_cache
WHERE response IN ('N;', 'b:0;', 's:0:"";', 's:3:"[]
";', 'a:1:{i:0;O:8:"stdClass":0:{}}',
                   'a:2:{s:5:"error";s:0:"";s:6:"result";a:0:{}}',
                   'O:8:"stdClass":1:{s:13:"ArtssokResult";O:8:"stdClass":0:{}}',
                   'a:2:{s:5:"error";s:22:"no database connection";s:6:"result";a:0:{}}',
                   'O:8:"stdClass":1:{s:29:"GetTaxaBySearchCriteriaResult";O:8:"stdClass":0:{}}',
                   'a:1:{i:0;O:8:"stdClass":3:{s:10:"vernacular";s:14:"Chinese Chives";s:13:"language_code";s:2:"eu";s:8:"language";s:21:"English-United States";}}',
                   'a:2:{i:0;O:8:"stdClass":3:{s:10:"vernacular";s:13:"gerani cineri";s:13:"language_code";N;s:8:"language";s:7:"Catalan";}i:1;O:8:"stdClass":3:{s:10:"vernacular";s:11:"silkesnäva";s:13:"language_code";s:2:"sv";s:8:"language";s:7:"Swedish";}}',
                   'a:2:{s:5:"error";s:0:"";s:6:"result";a:1:{i:0;a:7:{s:10:"searchtext";s:2:"Os";s:19:"searchtextNearmatch";s:0:"";s:11:"rowsChecked";i:34709;s:4:"type";s:3:"uni";s:8:"database";s:5:"freud";s:18:"includeCommonNames";b:1;s:12:"searchresult";a:0:{}}}}')
   OR response LIKE
      'a:2:{s:5:"error";s:0:"";s:6:"result";a:1:{i:0;a:7:{s:10:"searchtext";s:%:"%";s:19:"searchtextNearmatch";s:0:"";s:11:"rowsChecked";i:%;s:4:"type";s:3:"uni";s:8:"database";s:5:"freud";s:18:"includeCommonNames";b:1;s:12:"searchresult";a:0:{}}}}';
```
re1 commented 4 years ago

Many rows of the tbl_webservice_cache table share the same column values and can be removed. A simple SQL script for deduplication would be:

```sql
DELETE w1
FROM tbl_webservice_cache w1
         INNER JOIN tbl_webservice_cache w2
                    ON w1.id < w2.id
                        AND w1.service_id = w2.service_id
                        AND w1.query = w2.query
                        AND w1.response = w2.response;
```

As many rows are locked in the process, an error stating "The total number of locks exceeds the lock table size" might occur. This can be avoided by manually setting the service_id value to compare:

```sql
...
  AND w1.service_id = 1
  AND w2.service_id = 1
...
```