Long wait on "Mapping to class/routine coverage" 15mins+ on local machine

stevelee12 commented 1 week ago

As per subject, the "Mapping to class/routine coverage" process is taking a long time to complete. Running locally takes between 15-20 minutes on average (over 50mins in Azure DevOps).

Run locally:

I've put a couple of debug lines at various points in TestCoverage.Data.Run.MapRunCoverage() to track timings, flagged with the original COS comments where possible. Here are my findings:

[10/29/2024 09:44:54] MapRunCoverage Started, pRunIndex is: 1
[10/29/2024 09:44:54] "Executing worse performing approach query"
[10/29/2024 09:44:54] opening cursor
[10/29/2024 09:44:55] running "Copy any other metrics captured/requested as well"
[10/29/2024 09:44:55] Sql statment is: INSERT OR UPDATE %NOLOCK %NOCHECK INTO TestCoverage_Data.Coverage_RtnLine (Coverage,element_key,RtnLine) SELECT target.ID,map.ToLine,NVL(oldMetric.RtnLine,0) + SUM(metric.RtnLine) FROM TestCoverage_Data.Coverage source JOIN TestCoverage_Data.CodeUnitMap map "_$c(9)_"ON source.Hash = map.FromHash JOIN TestCoverage_Data.Coverage_RtnLine metric "_$c(9)_"ON metric.Coverage = source.ID "_$c(9)_"AND metric.element_key = map.FromLine JOIN TestCoverage_Data.Coverage target "_$c(9)_"ON target.Run = source.Run "_$c(9)_"AND target.Hash = map.ToHash "_$c(9)_"AND target.TestPath = source.TestPath LEFT JOIN TestCoverage_Data.Coverage_RtnLine oldMetric "_$c(9)_"ON oldMetric.ID = target.ID "_$c(9)_"AND oldMetric.element_key = map.ToLine WHERE source.Run = ? "_$c(9)_"AND source.Ignore = 0"_$c(9)_"AND source.Calculated = 0 GROUP BY target.ID,map.ToLine
[10/29/2024 10:01:59] Sql statment completed
[10/29/2024 10:01:59] Exiting with status tSC:1

Size of tables:

Running the SQL statement as a straight count(*), without the insert, in the SMP just sits waiting forever:

However, when I remove the join back to "TestCoverage_Data.Coverage target", the query returns instantly.

Can anyone help me with this please?

Thanks as always :)

EDIT: The straight count did eventually return a result after 46min:

stevelee12 commented 1 week ago

@isc-tleavitt are you or a colleague able to shed any light on this please? Thanks 👍

isc-tleavitt commented 1 week ago

@stevelee12 sorry for the delay. What IRIS version are you running on? Can you paste in the query plan you're getting on your system for the slow query?

isc-tleavitt commented 1 week ago

One possible thought here - in our CI processes we do this before each build to clear out previous data (note, this will delete EVERYTHING from previous TestCoverage runs): do ##class(TestCoverage.Utils).Clear()

Running that could help with performance if past runs' data are a factor.

As a comparison point, I'm seeing this performance on one of our larger applications with a low-resourced build machine running IRIS for UNIX (Red Hat Enterprise Linux 8 for x86-64) 2022.1.2 (Build 574U) Fri Jan 13 2023 14:58:02 EST:

Collecting coverage data for all tests: 13.699757 seconds
Mapping to class/routine coverage: 4.725704 seconds
Aggregating coverage data: .119715 seconds
Code coverage: 23.78%

Codebase size is fairly comparable (not enough smaller to explain a 250x slowdown, and we have much higher coverage too):

select count(*) from TestCoverage_Data.Coverage
union all
select count(*) from TestCoverage_Data.CodeUnitMap
union all
select count(*) from TestCoverage_Data.Coverage_RtnLine

Gives:

1736
246907
380051

The operative query:

SELECT count(*)
FROM TestCoverage_Data.Coverage source
JOIN TestCoverage_Data.CodeUnitMap map
    ON source.Hash = map.FromHash
JOIN TestCoverage_Data.Coverage_RtnLine metric
    ON metric.Coverage = source.ID
    AND metric.element_key = map.FromLine
JOIN TestCoverage_Data.Coverage target
    ON target.Run = source.Run
    AND target.Hash = map.ToHash
    AND target.TestPath = source.TestPath
LEFT JOIN TestCoverage_Data.Coverage_RtnLine oldMetric
    ON oldMetric.ID = target.ID
    AND oldMetric.element_key = map.ToLine
WHERE source.Run = ?
    AND source.Ignore = 0
    AND source.Calculated = 0
GROUP BY target.ID,map.ToLine

Returns in under a second with query plan:

• Read index map TestCoverage_Data.Coverage.MeaningfulCoverageData, using the given Run, Calculated, and Ignore, and looping on Hash and %SQLUPPER(TestPath), and getting ID.
• For each row:
    - Read master map TestCoverage_Data.Coverage_RtnLine.IDKEY, using the given Coverage, and looping on element_key.
    - For each row:
        · Read index map TestCoverage_Data.CodeUnitMap.HashForward, using the given FromHash and FromLine, and looping on ToHash and ToLine.
        · For each row:
            - Read index map TestCoverage_Data.Coverage.UniqueCoverageData, using the given Run, Hash, and %SQLUPPER(TestPath), and getting ID.
            - Check distinct values for ToLine and ID using temp-file A,
                subscripted by values.
            - For each distinct row:
                · Add a row to temp-file A, subscripted by the hash,
                    with node data of ID and ToLine.
            - Update the accumulated count(rows) in temp-file A,
                subscripted by the hash

stevelee12 commented 1 week ago

IRIS for UNIX (Ubuntu Server LTS for x86-64 Containers) 2022.1.5 (Build 940U) Thu Apr 18 2024 14:30:11 EDT

The container spins up fresh, installs the test coverage package and executes tests with coverage, so I wouldn't think there's anything to clear, but I'll try it.

stevelee12 commented 1 week ago

Not sure if it's relevant, but the code I'm analysing for coverage is 100% .mac routines rather than .cls classes.

isc-tleavitt commented 1 week ago

@stevelee12 can you snag the query plan and see if it's the same?

stevelee12 commented 6 days ago

Before executing unit tests:

running the tests now...

stevelee12 commented 6 days ago

I forgot to add: I tried running the SQL in the terminal. The query executes, but when I call RS.Next() to fetch the first row it hangs.

stevelee12 commented 6 days ago

still going..

stevelee12 commented 6 days ago

I quit the tests early with Ctrl+C. The query plan is still the same as above, but executing it does not return as yours does. Happy to show you on a Teams call on Monday or any day next week if you're available?

isc-tleavitt commented 4 days ago

@stevelee12 please drop me an email: tleavitt <at> intersystems.com - we'll set something up.

isc-tleavitt commented 4 days ago

Before executing unit tests:

running the tests now...

This query plan is meaningfully different and I think I see the bad choice: for each routine line we're looping over all of the hashes for the given run and test path! That's a lot of silly extra work.

TuneTable isn't much help here because we're starting out from nothing, but we might be able to nudge the query optimizer in the right direction with a %IGNOREINDEX hint. Unfortunately, we need to use TestCoverage_Data.Coverage.MeaningfulCoverageData on the outer loop. The best possibility/hope would be that ignoring TestCoverage_Data.CodeUnitMap.HashReverse would get it to use HashForward and do so first.

isc-tleavitt commented 4 days ago

Ah - actually we can use %NOINDEX in ON too, just thought to look for that: https://docs.intersystems.com/iris20221/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_join#RSQL_join_performance_on_indexing

isc-tleavitt commented 4 days ago

@stevelee12 - rather than meeting, I'm asking @isc-shuliu to put up a PR with the query optimizer keywords to fix the issue; if that doesn't resolve it we can meet.

isc-tleavitt commented 4 days ago

Proper optimization strategy:

Rewrite the query to change the join order to: Coverage, Coverage_RtnLine, CodeUnitMap, Coverage

And use the %INORDER query optimizer hint.
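
For illustration only, the reordered query might look roughly like the sketch below. It reuses the count(*) form and the table aliases from the earlier comment in this thread. The exact text merged for 4.0.5 is not shown here, so the placement of the join conditions is an assumption.

```sql
-- Sketch, not the merged change: same tables, aliases and conditions as the
-- query quoted earlier, with the joins listed in the proposed order.
-- %INORDER in the FROM clause asks the optimizer to join tables in the
-- order they are written.
SELECT count(*)
FROM %INORDER TestCoverage_Data.Coverage source
JOIN TestCoverage_Data.Coverage_RtnLine metric
    ON metric.Coverage = source.ID
JOIN TestCoverage_Data.CodeUnitMap map
    -- Assumption: both map conditions move here, because metric is now
    -- joined before map.
    ON map.FromHash = source.Hash
    AND map.FromLine = metric.element_key
JOIN TestCoverage_Data.Coverage target
    ON target.Run = source.Run
    AND target.Hash = map.ToHash
    AND target.TestPath = source.TestPath
LEFT JOIN TestCoverage_Data.Coverage_RtnLine oldMetric
    ON oldMetric.ID = target.ID
    AND oldMetric.element_key = map.ToLine
WHERE source.Run = ?
    AND source.Ignore = 0
    AND source.Calculated = 0
GROUP BY target.ID, map.ToLine
```

Written this way, the FROM clause follows the same order as the fast query plan posted above: Coverage on the outer loop, then Coverage_RtnLine, then CodeUnitMap, then the lookup back into Coverage.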

stevelee12 commented 4 days ago

This looks very promising!

isc-tleavitt commented 4 days ago

@stevelee12 thank you for confirming! I've merged and we'll release 4.0.5 today.

isc-tleavitt commented 4 days ago

@stevelee12 we've released 4.0.5 here and via Open Exchange/IPM.

stevelee12 commented 3 days ago

Hi @isc-tleavitt, can you check the 4.0.5.xml release please? I could be going mad, but I don't think the MapRunCoverage query in that XML matches what's in Git.

isc-tleavitt commented 3 days ago

@stevelee12 you're completely right - filed #58 to fix this. There's a new artifact, that'll be the right one.

intersystems / TestCoverage

Long wait on "Mapping to class/routine coverage" 15mins+ on local machine #56