Closed bprusinowski closed 3 months ago
The latest updates on your projects. Learn more about Vercel for Git ↗︎
Name | Status | Preview | Comments | Updated (UTC) |
---|---|---|---|---|
visualization-tool | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jun 14, 2024 6:35am |
@bprusinowski can you share an example of a less performing query? thank you
@Rdataflow sure, this is a query to fetch dimension values for every dimension in the Bathing water quality cube (takes ~12s). Compare this to queries fired on PROD: https://visualize.admin.ch/en/create/lmbY5klvYJAm?dataSource=Prod&flag__debug=true&flag__server-side-cache.disable=true (takes 1.4s for 5 queries, one per each dimension, fired in parallel).
The above example is for a non-filtered query (fetches all dimension values), but we also need to be able to filter and unversion each dimension separately (see PR description). This is why I think we need to combine individual queries like this – let me know if it's enough information or if you see some other direction to try.
@bprusinowski it seems those OPTIONAL
s for schema:version
have a negative impact. You may try to further parallelize using a query pattern of
{
# versioned case
...
?dimension schema:version ?version
...
} UNION {
# nonversioned case
...
FILTER NOT EXISTS { ?dimension schema:version ?version }
...
}
i.e. like https://s.zazuko.com/HenZ6K
HTH for self guiding the next steps 👍
Hey @Rdataflow, I optimized the query based on your suggestion and fixed some issues, it seems to fetch the correct values now.
However it looks like it's less performing than the current approach of parallel queries, both for smaller and bigger cubes. You can see some examples for Photovoltaikanlagen (TEST, PR) and NFI: Change (TEST, PR) cubes (both using INT data source to not rely on cached endpoint and with disabled server-side cache).
It looks like the new approach is ~2x slower than what we currently have on TEST. I will take a deeper look to see if there's something obvious to optimize, but it would be great if you could also take a look in case you have a bit of time.
Let me know what you about the whole direction of combining the queries 👀
@bprusinowski unfortunately those /create/ links won't endure... do you have some permalinks maybe?
i.e. those /create/new?from=<published>
would work IIUC
@Rdataflow of course – I updated the links, should be ok now (I blame it on Monday morning 🤦 😅)
@bprusinowski what happens if you drop #pragma evaluate on
everywhere? with the query in the current form this might help. curious to see...
@Rdataflow I did some tests to fire the query "for the first time", to avoid some apparent caching on LINDAS side (did this by modifying the query to e.g. remove retrieval of color, so the query looks like new); it still looks like #pragma
improves the situation a bit (1.8s for pragma vs 2.2s without) for the Traffic noise pollution cube. The timings change if I "comment out" other properties, but the delta seems to always be in favor of #pragma
...
@bprusinowski the query pattern still would profit of some minor optimization steps...
the 3rd and 4th UNION block of each dimension may be optimized using a inner SELECT to prevent ?obs ?dimensionIri ?versionedValue
from evaluation in case of empty due to FILTER NOT EXISTS { ... }
returning empty set
this might look like i.e.
# 3rd UNION block on a dimension
{
SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
WHERE {
{
SELECT ?observation
WHERE {
VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
<https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
?dimension sh:path ?dimensionIri .
?dimension schema:version ?version .
FILTER NOT EXISTS {
?dimension sh:in ?in .
}
<https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
}
}
VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
?observation ?dimensionIri ?versionedValue .
?versionedValue schema:sameAs ?unversionedValue .
}
}
UNION
{
# 4th UNION block on a dimension
SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
WHERE {
{
SELECT ?observation
WHERE {
VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
<https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
?dimension sh:path ?dimensionIri .
FILTER NOT EXISTS {
?dimension sh:in ?in .
}
FILTER NOT EXISTS {
?dimension schema:version ?version .
}
<https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
}
}
?observation ?dimensionIri ?versionedValue .
BIND(?versionedValue AS ?unversionedValue)
}
}
then as the innermost SELECT ?observation
becomes emtpy (true for many dimensions) the UNION blocks are faster now 😄
nb: or in case you prefer the #pragma
s transfer those pragmas to the inner SELECT ?observation
then
Thanks again @Rdataflow for optimizing the query 💯
Unfortunately it looks that it's still significantly less performing that the ones we have on TEST / INT / PROD. See this combined query that takes 12-13s – the same cube on TEST takes ~7-8s to load values for every dimension when fired separately.
I think we might reach out to Zazuko, seeing that the approach we currently try doesn't seem to improve things – does it sound good? Maybe I miss some additional context, but knowing that we'll use a cached endpoint that will already offload a load of computing power from Stardog, I am not sure if it's worth it to sacrifice 50% of performance (assuming is scales linearly 😅 – but even if not, an overhead of 4s for NFI cubes if noticeable) just to send a smaller number of queries.
Let me know what you think @Rdataflow :)
cc @sosiology @adintegra
@bprusinowski the proposed query obviously misses to constrain the dimensionIri to the relevant dimension only - therefore it suffers heavily degraded performance
see comments inline https://s.zazuko.com/23hyB45
nb: regarding perf on TEST see VSHN SBAR-1122 and comment inline
cc @sosiology @adintegra
Closes #1470
This PR is an exploration of the feasibility to fetch values for multiple dimensions in a single SPARQL query.
Constrains
FILTER(IF(?dimensionIri = <A>, ?dimensionIri = <a> && ?dimensionIri = <a>, ?dimensionIri))
schema:sameAs
is not used to indicate unversioned values, this is only the case when a dimension is versioned).we need to combine individual queries into a big one (at least that's my current assumption).