refactor: Combined query to fetch dimension values

bprusinowski commented 5 months ago

Closes #1470

This PR is an exploration of the feasibility to fetch values for multiple dimensions in a single SPARQL query.

Constrains

We need to be able to filter each dimension individually, due to cascading filters behavior. Could be achieved with FILTER(IF(?dimensionIri = <A>, ?dimensionIri = <a> && ?dimensionIri = <a>, ?dimensionIri))
We need to be able to unversion dimension values per individual dimension (can't unversion values outside of individual SELECT queries, as sometimes schema:sameAs is not used to indicate unversioned values, this is only the case when a dimension is versioned).

we need to combine individual queries into a big one (at least that's my current assumption).

vercel[bot] commented 5 months ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
visualization-tool	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jun 14, 2024 6:35am

Rdataflow commented 5 months ago

@bprusinowski can you share an example of a less performing query? thank you

bprusinowski commented 5 months ago

@Rdataflow sure, this is a query to fetch dimension values for every dimension in the Bathing water quality cube (takes ~12s). Compare this to queries fired on PROD: https://visualize.admin.ch/en/create/lmbY5klvYJAm?dataSource=Prod&flag__debug=true&flag__server-side-cache.disable=true (takes 1.4s for 5 queries, one per each dimension, fired in parallel).

The above example is for a non-filtered query (fetches all dimension values), but we also need to be able to filter and unversion each dimension separately (see PR description). This is why I think we need to combine individual queries like this – let me know if it's enough information or if you see some other direction to try.

Rdataflow commented 5 months ago

@bprusinowski it seems those OPTIONALs for schema:version have a negative impact. You may try to further parallelize using a query pattern of

{ 
    # versioned case
    ...
    ?dimension schema:version ?version 
    ... 
} UNION { 
    # nonversioned case
    ... 
    FILTER NOT EXISTS { ?dimension schema:version ?version } 
    ... 
}

i.e. like https://s.zazuko.com/HenZ6K

HTH for self guiding the next steps 👍

bprusinowski commented 5 months ago

Hey @Rdataflow, I optimized the query based on your suggestion and fixed some issues, it seems to fetch the correct values now.

However it looks like it's less performing than the current approach of parallel queries, both for smaller and bigger cubes. You can see some examples for Photovoltaikanlagen (TEST, PR) and NFI: Change (TEST, PR) cubes (both using INT data source to not rely on cached endpoint and with disabled server-side cache).

It looks like the new approach is ~2x slower than what we currently have on TEST. I will take a deeper look to see if there's something obvious to optimize, but it would be great if you could also take a look in case you have a bit of time.

Let me know what you about the whole direction of combining the queries 👀

Rdataflow commented 5 months ago

@bprusinowski unfortunately those /create/ links won't endure... do you have some permalinks maybe?

Rdataflow commented 5 months ago

i.e. those /create/new?from=<published> would work IIUC

bprusinowski commented 5 months ago

@Rdataflow of course – I updated the links, should be ok now (I blame it on Monday morning 🤦 😅)

Rdataflow commented 5 months ago

@bprusinowski what happens if you drop #pragma evaluate on everywhere? with the query in the current form this might help. curious to see...

bprusinowski commented 5 months ago

@Rdataflow I did some tests to fire the query "for the first time", to avoid some apparent caching on LINDAS side (did this by modifying the query to e.g. remove retrieval of color, so the query looks like new); it still looks like #pragma improves the situation a bit (1.8s for pragma vs 2.2s without) for the Traffic noise pollution cube. The timings change if I "comment out" other properties, but the delta seems to always be in favor of #pragma...

Rdataflow commented 5 months ago

@bprusinowski the query pattern still would profit of some minor optimization steps... the 3rd and 4th UNION block of each dimension may be optimized using a inner SELECT to prevent ?obs ?dimensionIri ?versionedValue from evaluation in case of empty due to FILTER NOT EXISTS { ... } returning empty set

this might look like i.e.

# 3rd UNION block on a dimension
  {
    SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
    WHERE {
      {
        SELECT ?observation
        WHERE {
          VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
          ?dimension sh:path ?dimensionIri .
          ?dimension schema:version ?version .
          FILTER NOT EXISTS {
            ?dimension sh:in ?in .
          }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
        }
      }
      VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
      ?observation ?dimensionIri ?versionedValue .
      ?versionedValue schema:sameAs ?unversionedValue .
    }
  }
  UNION
  {
# 4th UNION block on a dimension
    SELECT DISTINCT ?dimensionIri ?versionedValue ?unversionedValue
    WHERE {
      {
        SELECT ?observation
        WHERE {
          VALUES ?dimensionIri { <https://environment.ld.admin.ch/foen/nfi/inventory> }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationConstraint/sh:property ?dimension .
          ?dimension sh:path ?dimensionIri .
          FILTER NOT EXISTS {
            ?dimension sh:in ?in .
          }
          FILTER NOT EXISTS {
            ?dimension schema:version ?version .
          }
          <https://environment.ld.admin.ch/foen/nfi/nfi_T-changes/cube/2024-1> cube:observationSet/cube:observation ?observation .
        }
      }
      ?observation ?dimensionIri ?versionedValue .
      BIND(?versionedValue AS ?unversionedValue)
    }
  }

then as the innermost SELECT ?observation becomes emtpy (true for many dimensions) the UNION blocks are faster now 😄

nb: or in case you prefer the #pragmas transfer those pragmas to the inner SELECT ?observation then

bprusinowski commented 5 months ago

Thanks again @Rdataflow for optimizing the query 💯

Unfortunately it looks that it's still significantly less performing that the ones we have on TEST / INT / PROD. See this combined query that takes 12-13s – the same cube on TEST takes ~7-8s to load values for every dimension when fired separately.

I think we might reach out to Zazuko, seeing that the approach we currently try doesn't seem to improve things – does it sound good? Maybe I miss some additional context, but knowing that we'll use a cached endpoint that will already offload a load of computing power from Stardog, I am not sure if it's worth it to sacrifice 50% of performance (assuming is scales linearly 😅 – but even if not, an overhead of 4s for NFI cubes if noticeable) just to send a smaller number of queries.

Let me know what you think @Rdataflow :)

cc @sosiology @adintegra

Rdataflow commented 4 months ago

@bprusinowski the proposed query obviously misses to constrain the dimensionIri to the relevant dimension only - therefore it suffers heavily degraded performance

see comments inline https://s.zazuko.com/23hyB45

nb: regarding perf on TEST see VSHN SBAR-1122 and comment inline

cc @sosiology @adintegra

visualize-admin / visualization-tool

refactor: Combined query to fetch dimension values #1487

Constrains