Open chenejac opened 1 year ago
VIVO's internal logging of SPARQL query timings may be useful here: https://wiki.lyrasis.org/display/VIVO/The+Developer+Panel See sections about SPARQL Query Settings and developer.properties.
@chenejac , @brianjlowe , @wwelling , @kaladay
Here are some of the results of the analysis to date
read https://wiki.lyrasis.org/display/VIVO/Performance+Testing
$VIVO_HOME/config/developer.properties
configured as:
developer.enabled = true
developer.loggingRDFService.enable = true
developer.loggingRDFService.queryRestriction = .*
developer.loggingRDFService.stackRestriction = .*
Env. Vars
export CATALINA_OPTS="-Xms6g -Xmx14g"
export JAVA_OPTIONS="-XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
-Dcom.sun.management.jmxremote=true \
-Dcom.sun.management.jmxremote.port=3614 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.autodiscovery=true "
applicationSetup.n3
for appropriate situation: tdbConfigurationTripleSource or sparqlConfigurationTripleSource with appropriate sparql-endpoint URL$VIVO_HOME/rdf/abox/filegraph
$TOMCAT_HOME/logs/
Script result after reindexing (step 8):
calls sec sec/call method
==================================================
52915 21.58 0.0004 sparqlConstructQuery
211517 85.92 0.0004 sparqlSelectQuery
10578 2.89 0.0003 sparqlAskQuery
Total time: 110.394 sec (~1 mins, or ~0 hrs)
Script result after reindexing (step 11):
calls sec sec/call method
==================================================
53313 21.64 0.0004 sparqlConstructQuery
212285 86.02 0.0004 sparqlSelectQuery
10958 2.93 0.0003 sparqlAskQuery
Total time: 110.596 sec (~1 mins, or ~0 hrs)
Remarks:
A total of 1546 SPARQ operations are requested to refresh the Person tab form According to our calculations, these operations are performed in .202 seconds
Script result after reindexing (step 8):
calls sec sec/call method
==================================================
51760 685.09 0.0132 sparqlConstructQuery
205288 3036.38 0.0148 sparqlSelectQuery
10269 126.92 0.0124 sparqlAskQuery
Total time: 3848.392 sec (~64 mins, or ~1 hrs)
Script result after reindexing (step 11):
calls sec sec/call method
==================================================
52156 685.89 0.0132 sparqlConstructQuery
206056 3038.81 0.0147 sparqlSelectQuery
10647 127.57 0.0120 sparqlAskQuery
Total time: 3852.273 sec (~64 mins, or ~1 hrs)
Remarks:
A total of 1544 SPARQ operations are requested to refresh the Person tab form According to our calculations, these operations are performed in 3,31 seconds
It appears that changing pages (person, organization, search, etc.) requires a number of queries proportional to the number of triples in the TriplesStore. For a networked database detached from the VIVO application, the consequences are quite small for a little database. On the other hand, the sensitivity to latency becomes more perceptible for larger databases.
The hypothesis that emerges from the observation is that the high latency is mainly due to repeated sending of small queries. We believe that it would be advantageous to group small queries into larger ones to reduce the number of calls to the triplestore.
@michel-heon thanks for this. This is a nice first step. I am wondering can we make more concrete definition for further steps including code analysis and listing links to code lines where the improvement is needed (for instance link to a method which should use bulk fetching of data).
This commentary describes the investigations that have been conducted to identify the source of the network latency. In addition, a first correction is proposed in the Vitro PR 378
The investigation's goal is to determine the entry point requiring optimization. Inspired by the defined procedure VIVO-Wiki/Performance+Testing, several executions with a step-by-step trace have been performed.
The investigation is carried out by performing five iterations, each comprising two tests
The tests are produced in the following sequence:
flowchart TD
Z[Clean tomcat logs ] --> A[Set parameters]
A -->|STGE-Type, #-indv, #-Thread| B(compile/start-vivo)
BB[(Data set)] -. Loading data set .-> C
B --> C{{Wait for the reindexing to be completed}}
C -. trace log .-> E[(test-#.txt)]
C --> F{{Select Person in VIVO}}
F-. trace Log .-> G[(test-#+1.txt)]
F--> H[Stop tomcat]
Iteration are parameterized according to the following three variables
The result of the tests is realized by the execution of this script which is an adaptation of the script proposed in parse
by
#!/bin/bash
###################################################################
# Script Name :
# Description :
# Args : extract from https://wiki.lyrasis.org/download/attachments/183076653/parse.sh?version=1&modificationDate=1583178696512&api=v2&download=true
# Author : Michel Héon PhD
# Institution : Université du Québec à Montréal
# Copyright : Université du Québec à Montréal (c) 2022
# Email : heon.michel@uqam.ca
###################################################################
export SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd -P)"
source $SCRIPT_DIR/../conf/vs-conf.sh
source $BASH_LIB/cleanup.sh
LOG_FILES="$TOMCAT_HOME/logs/$APP_NAME.all.*"
log_file=$TMPDIR/file.log
for f in $LOG_FILES do echo -e '\n' >> $log_file cat $f >> $log_file done set +x
grep 'ingest:' < $log_file
divider=================================== divider=$divider$divider
header="\n %7s %8s %10s %20s\n" format=" %7d %8.2f %10.4f %20s\n"
width=50
printf "$header" "calls" "sec" "sec/call" "method" ; printf "%$width.${width}s\n" "$divider"
total=0
total_count=0
for method in {changeSetUpdate,sparqlConstructQuery,sparqlDescribeQuery,sparqlSelectQuery,sparqlAskQuery,serializeAll,serializeGraph,isEquivalentGraph};
do
count=grep RDFServiceLogger $log_file | cut -d' ' -f9,10 | grep -c $method;
sum=grep RDFServiceLogger $log_file | cut -d' ' -f9,10 | grep $method | cut -d' ' -f1 | paste -sd+ - | bc -l ;
if [ ! -z "$sum" ]
then
rate=echo "${sum}/${count}" | bc -l
printf "$format" "$count" "$sum" "$rate" "$method";
math="${total}+${sum}"
total=echo $math | bc -l
total_count=echo "${total_count}+${count}"|bc
fi
done
mins=echo "${total}/60.0" | bc -l
hours=echo "${total}/3600.0" | bc -l
printf "Total time/calls: %.2f/%d sec (~%.2f mins, or ~%.2f hrs) avrg %f sec.\n" "${total}" "$total_count" "${mins}" "${hours}" "$(echo "${total}/$total_count" | bc -l)"
echo "Number of lines $(wc -l < $log_file)"
### Data set
- From a UQAM triplestore in pre-production
- Contains 396047 triples
- Data are internationalized fr-CA and en-US
- Represents about 1300 persons, 600 organizations, 360 research entries, 3600 expert concepts and 2070 events
### Java classes affected by the tests
see details in [Vitro PR 378](https://github.com/vivo-project/Vitro/pull/378)
## Test results
### Summary of experiments
|Test #| STGE Type| nbrindv/page| Threading | after-reindex|after-touch-person|
| -| -| -| -|-|-|
test-1 | T | 30 |1| X|
test-2 | T | 30 | 1| | X|
test-3 | T | 15 | X | X|
test-4 | T | 15 | X | | X|
test-5 | N | 30 |1| X|
test-6 | N | 30 | 1| | X|
test-7 | N | 15 | X | X|
test-8 | N | 15 | X | | X|
test-9 | N | 15 |1| X|
test-10 | N | 15 |1 | | X|
### Test details
#### test-1
59468 27.36 0.0005 sparqlConstructQuery 236584 108.38 0.0005 sparqlSelectQuery 11829 3.80 0.0003 sparqlAskQuery Total time/calls: 139.55/307881 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000453 sec. Number of lines 307929
#### test-2
59864 27.44 0.0005 sparqlConstructQuery 237352 108.50 0.0005 sparqlSelectQuery 12207 3.84 0.0003 sparqlAskQuery Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec. Number of lines 309471
#### test-3
51926 24.37 0.0005 sparqlConstructQuery 207613 98.01 0.0005 sparqlSelectQuery 10379 3.70 0.0004 sparqlAskQuery Total time/calls: 126.08/269918 sec (~2.10 mins, or ~0.04 hrs) avrg 0.000467 sec. Number of lines 269963
#### test-4
52109 24.54 0.0005 sparqlConstructQuery 207997 98.19 0.0005 sparqlSelectQuery 10551 3.74 0.0004 sparqlAskQuery Total time/calls: 126.47/270657 sec (~2.11 mins, or ~0.04 hrs) avrg 0.000467 sec. Number of lines 270703
#### test-5
55176 352.31 0.0064 sparqlConstructQuery 209343 1613.03 0.0077 sparqlSelectQuery 10465 58.37 0.0056 sparqlAskQuery Total time/calls: 2023.71/274984 sec (~33.73 mins, or ~0.56 hrs) avrg 0.007359 sec. Number of lines 275029
#### test-6
55574 353.08 0.0064 sparqlConstructQuery 210133 1615.31 0.0077 sparqlSelectQuery 10845 59.00 0.0054 sparqlAskQuery Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec. Number of lines 276599
#### test-7
56134 351.02 0.0063 sparqlConstructQuery 209061 1598.22 0.0076 sparqlSelectQuery 10453 57.96 0.0055 sparqlAskQuery Total time/calls: 2007.19/275648 sec (~33.45 mins, or ~0.56 hrs) avrg 0.007282 sec. Number of lines 275693
#### test-8
56301 351.57 0.0062 sparqlConstructQuery 209408 1599.73 0.0076 sparqlSelectQuery 10607 58.34 0.0055 sparqlAskQuery Total time/calls: 2009.64/276316 sec (~33.49 mins, or ~0.56 hrs) avrg 0.007273 sec. Number of lines 276362
#### test-9
58137 361.48 0.0062 sparqlConstructQuery 217714 1664.85 0.0076 sparqlSelectQuery 10887 61.15 0.0056 sparqlAskQuery Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec. Number of lines 286784
#### test-10
58328 361.86 0.0062 sparqlConstructQuery 218091 1666.02 0.0076 sparqlSelectQuery 11068 61.46 0.0056 sparqlAskQuery Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec. Number of lines 2875344
# Discussion
## Performance between the local TDB type and the remote triples server (Neptune)
The two tests representing this test case are test-2 and test-6:
Test-2: Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec. Test-3: Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec.
We observe an average resolution of obtaining information of 0.000452 sec for the TdbLocal mode and 0.007331 sec for the TdbDistant mode, that is to say a ratio for **local** TDB of **16 times faster** than the **distant** mode.
## Impact of reducing the number of individuals to read per page
This test case aims at analyzing the performance of the reduction of the individual number to be read to obtain the necessary information to supply the page. In its initial state, Vitro reads the information of 30 individuals. Each individual uses sometimes up to 60 SPARQL queries, which for 30 individuals is equivalent to nearly 1800 queries.
More specifically, by referring to the results below for these particular tests (309423-307881 = **1542**):
Test-1: Total time/calls: 139.55/307881 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000453 sec. Test-2: Total time/calls: 139.78/309423 sec (~2.33 mins, or ~0.04 hrs) avrg 0.000452 sec.
To evaluate the effect of the decrease in the number of individuals we will review the results of the test-6 and test-10 for a reduction from 30 to 15 individuals on Neptune.
To make this comparison, it is necessary to consider only the totals obtained when the person page is called. The totals from the indexing must be subtracted from the totals after the call to the person's page. The result of the test-5 and test-9 must be part of the calculation
test-5: Total time/calls: 2023.71/274984 sec (~33.73 mins, or ~0.56 hrs) avrg 0.007359 sec. test-6: Total time/calls: 2027.39/276552 sec (~33.79 mins, or ~0.56 hrs) avrg 0.007331 sec. test-9: Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec. Test-10: Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.
**Result**
Total Time (TT) from test-6 (2027.39 sec) minus TT-5 (2023.71 sec) = **3.68 sec**onds to refresh page of (**30 indvs**)
TT-10 (2089.34 sec ) minus TT-9 (2087.49 sec) = **1.85 sec**onds to refresh page of (**15 indvs**)
So the page containing 15 individuals refreshes within a factor of 2... which makes sense!
## Impact of parallel processing (Multithreading) of individuals
test-7: Total time/calls: 2007.19/275648 sec (~33.45 mins, or ~0.56 hrs) avrg 0.007282 sec. test-8: Total time/calls: 2009.64/276316 sec (~33.49 mins, or ~0.56 hrs) avrg 0.007273 sec. test-9: Total time/calls: 2087.49/286738 sec (~34.79 mins, or ~0.58 hrs) avrg 0.007280 sec. Test-10: Total time/calls: 2089.34/287487 sec (~34.82 mins, or ~0.58 hrs) avrg 0.007268 sec.
TT-8 (2009.64 sec) minus TT-7 (2007.19 sec) = **2.45 sec**onds to refresh page **Multi-Thread**
TT-10 (2089.34 sec ) mins TT-9 (2087.49 sec) = **1.85 sec**onds to refresh page of **Mono-Thread**
It appears in this experiment that the results obtained are in the opposite direction of the expected results. This result can be explained in several ways:
- The generation of threads takes more time than the transmission of the request
- A possible overload of the Neptune triplet server
- Unexpected network overload
- Other possible factors.
Without any doubt, this is a case to be reproduced several times in order to extract more significant statistical results.
# Tracks of future investigations
Here are some of the important classes involved in finding the information needed to produce the list of individuals. Classes that are more or less in order of call.
- `GetRenderedSearchIndividualsByVClass`
- `IndividualTemplateModelBuilder`
- `ShortViewServiceImpl`
- `IndividualSDB`
- `RequestModelAccessImpl`
- `RDFServiceSparql`
Two approaches are possible to reduce latency:
1- Redefine the call logic to reduce the number of queries sent to the network.
2- Buffering the treatment by making a first request extracting all the information specific to the individual to form a model in local memory and then applying the treatment logic on the model in local memory.
FYI: @chenejac , @brianjlowe , @wwelling , @litvinovg
Is your feature request related to a problem? Please describe. It looks performance in communication with external/remote graph bases is quite low. We need to test those performance and create baseline for comparation with further improvements.
Describe the solution you'd like A wiki page for performance testing should be updated - https://wiki.lyrasis.org/display/VIVO/Performance+Testing. The page is documenting the scenario of testing and results. Taking into account that ingestion has been improved recently in this PR, all existing tests at wiki page should be repeated. The definition of the environment should be also included (operating system, processor, RAM, SDD or HDD disk).
Additional context Wiki page - https://wiki.lyrasis.org/display/VIVO/Performance+Testing Take a look in another Lyrasis communities testing performance procedures and persentation of results: