chenejac commented 1 year ago

Is your feature request related to a problem? Please describe. It looks performance in communication with external/remote graph bases is quite low. We need to test those performance and create baseline for comparation with further improvements.

Describe the solution you'd like A wiki page for performance testing should be updated - https://wiki.lyrasis.org/display/VIVO/Performance+Testing. The page is documenting the scenario of testing and results. Taking into account that ingestion has been improved recently in this PR, all existing tests at wiki page should be repeated. The definition of the environment should be also included (operating system, processor, RAM, SDD or HDD disk).

Additional context Wiki page - https://wiki.lyrasis.org/display/VIVO/Performance+Testing Take a look in another Lyrasis communities testing performance procedures and persentation of results:

brianjlowe commented 1 year ago

VIVO's internal logging of SPARQL query timings may be useful here: https://wiki.lyrasis.org/display/VIVO/The+Developer+Panel See sections about SPARQL Query Settings and developer.properties.

michel-heon commented 1 year ago

@chenejac , @brianjlowe , @wwelling , @kaladay

VIVO (TDB/Neptune) Performance measurement - Preliminary results-

Here are some of the results of the analysis to date

Prerequisite

read https://wiki.lyrasis.org/display/VIVO/Performance+Testing

System details

VIVO & Vitro code extract from
- https://github.com/vivo-project/VIVO/pull/3836
- https://github.com/vivo-project/Vitro/pull/368
VIVO/Vitro configuration based-on https://wiki.lyrasis.org/display/VIVO/Performance+Testing
- with logging patch
- Inference and search NOT disable
- $VIVO_HOME/config/developer.properties configured as:
```
developer.enabled = true
developer.loggingRDFService.enable = true
developer.loggingRDFService.queryRestriction = .*
developer.loggingRDFService.stackRestriction = .*
```
JAVA- Oracle JDK 11
Env. Vars
```
export CATALINA_OPTS="-Xms6g -Xmx14g"
export JAVA_OPTIONS="-XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
-Dcom.sun.management.jmxremote=true \
-Dcom.sun.management.jmxremote.port=3614 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.autodiscovery=true "
```
- Data:
  - From a UQAM triplestore in pre-production
  - Contains 396047 triples
  - Data are internationalized fr-CA and en-US
  - Represents about 1300 persons, 600 organizations, 360 research entries, 3600 expert concepts and 2070 events
- Computational resources
- Tomcat host: AWS EC2 m6i.2xlarge
  - 8 CPU, 32 GB, 3.5 GhZ, Network up To 12.5 GB/s
- AWS-Neptune instance: db.t3.medium
  - 2 CPU, 4 GB, Moderate Network

Testing Procedure

Configure applicationSetup.n3 for appropriate situation: tdbConfigurationTripleSource or sparqlConfigurationTripleSource with appropriate sparql-endpoint URL
Copy data-graph in $VIVO_HOME/rdf/abox/filegraph
Start Tomcat (make sure solr is running)
Wait for the VIVO boot and data loading to be completed
Log in as administrator in VIVO and go to the administration window
Empty the contents of the tomcat log files => Echoing "" in all vivo.all.log's file in $TOMCAT_HOME/logs/
Start a VIVO reindexing
At the end of the indexing, start the performance evaluation with the attached script
Save the result in a file
Select the 'Person' tab in VIVO
Make steps 8-9

Results analysis

Test with TDB

Script result after reindexing (step 8):

   calls      sec   sec/call               method
==================================================
   52915    21.58     0.0004 sparqlConstructQuery
  211517    85.92     0.0004    sparqlSelectQuery
   10578     2.89     0.0003       sparqlAskQuery
Total time: 110.394 sec (~1 mins, or ~0 hrs)

Script result after reindexing (step 11):

   calls      sec   sec/call               method
==================================================
   53313    21.64     0.0004 sparqlConstructQuery
  212285    86.02     0.0004    sparqlSelectQuery
   10958     2.93     0.0003       sparqlAskQuery
Total time: 110.596 sec (~1 mins, or ~0 hrs)

Remarks:

A total of 1546 SPARQ operations are requested to refresh the Person tab form According to our calculations, these operations are performed in .202 seconds

Test with NEPTUNE

Script result after reindexing (step 8):

   calls      sec   sec/call               method
==================================================
   51760   685.09     0.0132 sparqlConstructQuery
  205288  3036.38     0.0148    sparqlSelectQuery
   10269   126.92     0.0124       sparqlAskQuery
Total time: 3848.392 sec (~64 mins, or ~1 hrs)

Script result after reindexing (step 11):

   calls      sec   sec/call               method
==================================================
   52156   685.89     0.0132 sparqlConstructQuery
  206056  3038.81     0.0147    sparqlSelectQuery
   10647   127.57     0.0120       sparqlAskQuery
Total time: 3852.273 sec (~64 mins, or ~1 hrs)

Remarks:

A total of 1544 SPARQ operations are requested to refresh the Person tab form According to our calculations, these operations are performed in 3,31 seconds

General remarks

It appears that changing pages (person, organization, search, etc.) requires a number of queries proportional to the number of triples in the TriplesStore. For a networked database detached from the VIVO application, the consequences are quite small for a little database. On the other hand, the sensitivity to latency becomes more perceptible for larger databases.

The hypothesis that emerges from the observation is that the high latency is mainly due to repeated sending of small queries. We believe that it would be advantageous to group small queries into larger ones to reduce the number of calls to the triplestore.

chenejac commented 1 year ago

@michel-heon thanks for this. This is a nice first step. I am wondering can we make more concrete definition for further steps including code analysis and listing links to code lines where the improvement is needed (for instance link to a method which should use bulk fetching of data).

michel-heon commented 1 year ago

A comprehensive investigation

This commentary describes the investigations that have been conducted to identify the source of the network latency. In addition, a first correction is proposed in the Vitro PR 378

The investigation's goal is to determine the entry point requiring optimization. Inspired by the defined procedure VIVO-Wiki/Performance+Testing, several executions with a step-by-step trace have been performed.

Experimental procedure to evaluate the performance of the solution

The investigation is carried out by performing five iterations, each comprising two tests

Test setup

Test flow

The tests are produced in the following sequence:

flowchart TD 
    Z[Clean tomcat logs ] --> A[Set parameters]
    A -->|STGE-Type, #-indv, #-Thread| B(compile/start-vivo)
    BB[(Data set)] -. Loading data set .-> C
    B --> C{{Wait for the reindexing to be completed}}
    C -.  trace log .-> E[(test-#.txt)]
    C --> F{{Select Person in VIVO}}
    F-. trace Log .-> G[(test-#+1.txt)]
    F--> H[Stop tomcat]

Test parameters

Iteration are parameterized according to the following three variables

the graph storage type (Local with tdb (T) or remote with Neptune (N))
the number of individuals per page (15 or 30)

the execution approach: monotrhead (1) or multithread (X)

Log compilers

The result of the tests is realized by the execution of this script which is an adaptation of the script proposed in parse
by


#!/bin/bash
###################################################################
# Script Name   :
# Description   :
# Args          : extract from https://wiki.lyrasis.org/download/attachments/183076653/parse.sh?version=1&modificationDate=1583178696512&api=v2&download=true
# Author        : Michel Héon PhD
# Institution   : Université du Québec à Montréal
# Copyright     : Université du Québec à Montréal (c) 2022
# Email         : heon.michel@uqam.ca
###################################################################
export SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd -P)"
source $SCRIPT_DIR/../conf/vs-conf.sh
source $BASH_LIB/cleanup.sh
LOG_FILES="$TOMCAT_HOME/logs/$APP_NAME.all.*"
log_file=$TMPDIR/file.log

for f in $LOG_FILES do echo -e '\n' >> $log_file cat $f >> $log_file done set +x

grep 'ingest:' < $log_file

Output formatting

divider=================================== divider=$divider$divider

header="\n %7s %8s %10s %20s\n" format=" %7d %8.2f %10.4f %20s\n"

width=50

printf "$header" "calls" "sec" "sec/call" "method" ; printf "%$width.${width}s\n" "$divider"

Output formatting - END

total=0 total_count=0 for method in {changeSetUpdate,sparqlConstructQuery,sparqlDescribeQuery,sparqlSelectQuery,sparqlAskQuery,serializeAll,serializeGraph,isEquivalentGraph}; do count=grep RDFServiceLogger $log_file | cut -d' ' -f9,10 | grep -c $method; sum=grep RDFServiceLogger $log_file | cut -d' ' -f9,10 | grep $method | cut -d' ' -f1 | paste -sd+ - | bc -l ; if [ ! -z "$sum" ] then rate=echo "${sum}/${count}" | bc -l printf "$format" "$count" "$sum" "$rate" "$method"; math="${total}+${sum}" total=echo $math | bc -l total_count=echo "${total_count}+${count}"|bc fi done

mins=echo "${total}/60.0" | bc -l hours=echo "${total}/3600.0" | bc -l printf "Total time/calls: %.2f/%d sec (~%.2f mins, or ~%.2f hrs) avrg %f sec.\n" "${total}" "$total_count" "${mins}" "${hours}" "$(echo "${total}/$total_count" | bc -l)" echo "Number of lines $(wc -l < $log_file)"

### Data set
- From a UQAM triplestore in pre-production
- Contains 396047 triples
- Data are internationalized fr-CA and en-US
- Represents about 1300 persons, 600 organizations, 360 research entries, 3600 expert concepts and 2070 events
### Java classes affected by the tests
see details in [Vitro PR 378](https://github.com/vivo-project/Vitro/pull/378)
## Test results

### Summary of experiments
|Test #| STGE Type| nbrindv/page| Threading | after-reindex|after-touch-person|
| -| -| -| -|-|-|
test-1 | T | 30 |1| X|
test-2 | T | 30 | 1| |  X|
test-3 | T | 15 | X | X|
test-4 | T | 15 | X | |  X|
test-5 | N | 30 |1| X|
test-6 | N | 30 | 1| |  X|
test-7 | N | 15 | X | X|
test-8 | N | 15 | X | |  X|
test-9 | N | 15 |1| X|
test-10 | N | 15 |1 | |  X|
### Test details
#### test-1

calls sec sec/call method

59468 27.36 0.0005 sparqlConstructQuery 236584 108.38 0.0005 sparqlSelectQuery 11829 3.80 0.0003 sparqlAskQuery Total time/calls: 139.55/307881 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000453 sec. Number of lines 307929

#### test-2

calls sec sec/call method

59864 27.44 0.0005 sparqlConstructQuery 237352 108.50 0.0005 sparqlSelectQuery 12207 3.84 0.0003 sparqlAskQuery Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec. Number of lines 309471

#### test-3

calls sec sec/call method

51926 24.37 0.0005 sparqlConstructQuery 207613 98.01 0.0005 sparqlSelectQuery 10379 3.70 0.0004 sparqlAskQuery Total time/calls: 126.08/269918 sec (~2.10 mins, or ~0.04 hrs) avrg 0.000467 sec. Number of lines 269963

#### test-4

calls sec sec/call method

52109 24.54 0.0005 sparqlConstructQuery 207997 98.19 0.0005 sparqlSelectQuery 10551 3.74 0.0004 sparqlAskQuery Total time/calls: 126.47/270657 sec (~2.11 mins, or ~0.04 hrs) avrg 0.000467 sec. Number of lines 270703

#### test-5

calls sec sec/call method

55176 352.31 0.0064 sparqlConstructQuery 209343 1613.03 0.0077 sparqlSelectQuery 10465 58.37 0.0056 sparqlAskQuery Total time/calls: 2023.71/274984 sec (~33.73 mins, or ~0.56 hrs) avrg 0.007359 sec. Number of lines 275029

#### test-6

calls sec sec/call method

55574 353.08 0.0064 sparqlConstructQuery 210133 1615.31 0.0077 sparqlSelectQuery 10845 59.00 0.0054 sparqlAskQuery Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec. Number of lines 276599

#### test-7

calls sec sec/call method

56134 351.02 0.0063 sparqlConstructQuery 209061 1598.22 0.0076 sparqlSelectQuery 10453 57.96 0.0055 sparqlAskQuery Total time/calls: 2007.19/275648 sec (~33.45 mins, or ~0.56 hrs) avrg 0.007282 sec. Number of lines 275693

#### test-8

calls sec sec/call method

56301 351.57 0.0062 sparqlConstructQuery 209408 1599.73 0.0076 sparqlSelectQuery 10607 58.34 0.0055 sparqlAskQuery Total time/calls: 2009.64/276316 sec (~33.49 mins, or ~0.56 hrs) avrg 0.007273 sec. Number of lines 276362

#### test-9

calls sec sec/call method

58137 361.48 0.0062 sparqlConstructQuery 217714 1664.85 0.0076 sparqlSelectQuery 10887 61.15 0.0056 sparqlAskQuery Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec. Number of lines 286784

#### test-10

calls sec sec/call method

58328 361.86 0.0062 sparqlConstructQuery 218091 1666.02 0.0076 sparqlSelectQuery 11068 61.46 0.0056 sparqlAskQuery Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec. Number of lines 2875344


# Discussion
## Performance between the local TDB type and the remote triples server (Neptune)
The two tests representing this test case are test-2 and test-6:

Test-2: Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec. Test-3: Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec.

We observe an average resolution of obtaining information of 0.000452 sec for the TdbLocal mode and 0.007331 sec for the TdbDistant mode, that is to say a ratio for **local** TDB of **16 times faster** than the **distant** mode.

## Impact of reducing the number of individuals to read per page
This test case aims at analyzing the performance of the reduction of the individual number to be read to obtain the necessary information to supply the page. In its initial state, Vitro reads the information of 30 individuals. Each individual uses sometimes up to 60 SPARQL queries, which for 30 individuals is equivalent to nearly 1800 queries. 
More specifically, by referring to the results below for these particular tests (309423-307881 = **1542**):

Test-1: Total time/calls: 139.55/307881 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000453 sec. Test-2: Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec.

To evaluate the effect of the decrease in the number of individuals we will review the results of the test-6 and test-10 for a reduction from 30 to 15 individuals on Neptune. 

To make this comparison, it is necessary to consider only the totals obtained when the person page is called. The totals from the indexing must be subtracted from the totals after the call to the person's page. The result of the test-5 and test-9 must be part of the calculation

test-5: Total time/calls: 2023.71/274984 sec (~33.73 mins, or ~0.56 hrs) avrg 0.007359 sec. test-6: Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec. test-9: Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec. Test-10: Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.

**Result**
Total Time (TT) from test-6 (2027.39 sec) minus TT-5 (2023.71 sec) = **3.68 sec**onds to refresh page of (**30 indvs**)
TT-10 (2089.34 sec ) minus TT-9 (2087.49 sec) = **1.85 sec**onds to refresh page of (**15 indvs**)

So the page containing 15 individuals refreshes within a factor of 2... which makes sense!

## Impact of parallel processing (Multithreading) of individuals

test-7: Total time/calls: 2007.19/275648 sec (~33.45 mins, or ~0.56 hrs) avrg 0.007282 sec. test-8: Total time/calls: 2009.64/276316 sec (~33.49 mins, or ~0.56 hrs) avrg 0.007273 sec. test-9: Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec. Test-10: Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.


TT-8  (2009.64 sec) minus TT-7 (2007.19 sec) = **2.45 sec**onds to refresh page **Multi-Thread**
TT-10 (2089.34 sec ) mins TT-9 (2087.49 sec) = **1.85 sec**onds to refresh page of  **Mono-Thread**

It appears in this experiment that the results obtained are in the opposite direction of the expected results. This result can be explained in several ways:  

- The generation of threads takes more time than the transmission of the request  
- A possible overload of the Neptune triplet server  
- Unexpected network overload  
- Other possible factors.  

Without any doubt, this is a case to be reproduced several times in order to extract more significant statistical results.

# Tracks of future investigations

Here are some of the important classes involved in finding the information needed to produce the list of individuals. Classes that are more or less in order of call.

 - `GetRenderedSearchIndividualsByVClass`
 - `IndividualTemplateModelBuilder`
 - `ShortViewServiceImpl`
 - `IndividualSDB`
 - `RequestModelAccessImpl`
 - `RDFServiceSparql`

Two approaches are possible to reduce latency:
1- Redefine the call logic to reduce the number of queries sent to the network.
2- Buffering the treatment by making a first request extracting all the information specific to the individual to form a model in local memory and then applying the treatment logic on the model in local memory.

FYI: @chenejac , @brianjlowe , @wwelling , @litvinovg

vivo-project / VIVO

Testing performance of external graph usage #3823

VIVO (TDB/Neptune) Performance measurement - Preliminary results-

Prerequisite

System details

Testing Procedure

Results analysis

Test with TDB

Test with NEPTUNE

General remarks

A comprehensive investigation

Experimental procedure to evaluate the performance of the solution

Test setup

Test flow

Test parameters

Log compilers

Output formatting

Output formatting - END

calls sec sec/call method

calls sec sec/call method

calls sec sec/call method

calls sec sec/call method

calls sec sec/call method

calls sec sec/call method

calls sec sec/call method

calls sec sec/call method

calls sec sec/call method

calls sec sec/call method