relax test for timing difference

mohamed-barakat commented 11 months ago

see: https://github.com/homalg-project/CategoricalTowers/pull/444#issuecomment-1819187002

zickgraf commented 11 months ago

I don't think this relaxation makes sense without collecting more information first: The errors in the CI might be due to timing fluctuations, but they might also be due to small performance regressions introduced over time. I suggest to either:

Remove the test and call it a day.
Remove the test from the CI but execute it manually every time the compiled code changes.
Manually compare the timings at the time when this test was added to the current timings. If there is a regression, investigate. If not, then the relaxation actually makes sense.

codecov[bot] commented 11 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (c13d54e) 78.97% compared to head (697642d) 75.47%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #446 +/- ## ========================================== - Coverage 78.97% 75.47% -3.50% ========================================== Files 270 379 +109 Lines 33956 56120 +22164 ========================================== + Hits 26816 42357 +15541 - Misses 7140 13763 +6623 ``` | [Flag](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | Coverage Δ | | |---|---|---| | [Algebroids](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `75.73% <100.00%> (ø)` | | | [CatReps](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `83.16% <ø> (?)` | | | [CategoriesWithAmbientObjects](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `89.30% <ø> (ø)` | | | [ExteriorPowersCategories](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `43.70% <ø> (?)` | | | [FiniteCocompletions](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `92.11% <ø> (?)` | | | [FpCategories](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `86.70% <ø> (?)` | | | [FunctorCategories](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `64.49% <ø> (?)` | | | [GradedCategories](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `86.57% <ø> (ø)` | | | [InternalModules](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `81.66% <ø> (ø)` | | | [IntrinsicCategories](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `82.57% <ø> (ø)` | | | [IntrinsicGradedModules](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `50.89% <ø> (ø)` | | | [IntrinsicModules](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `82.15% <ø> (ø)` | | | [LazyCategories](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `74.20% <ø> (ø)` | | | [Locales](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `85.44% <ø> (ø)` | | | [PreSheaves](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `92.42% <ø> (ø)` | | | [QuotientCategories](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `91.58% <ø> (ø)` | | | [SubcategoriesForCAP](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `81.32% <ø> (ø)` | | | [ToolsForCategoricalTowers](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `83.32% <ø> (ø)` | | | [Toposes](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `90.26% <ø> (ø)` | | | [ZariskiFrames](https://app.codecov.io/gh/homalg-project/CategoricalTowers/pull/446/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project) | `74.31% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=homalg-project#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mohamed-barakat commented 11 months ago

I don't think this relaxation makes sense without collecting more information first: The errors in the CI might be due to timing fluctuations, but they might also be due to small performance regressions introduced over time. I suggest to either:

Remove the test and call it a day.

Remove the test from the CI but execute it manually every time the compiled code changes.

Manually compare the timings at the time when this test was added to the current timings. If there is a regression, investigate. If not, then the relaxation actually makes sense.

I never encountered a regression on my laptop. And since the compiled code did not change for a while I believe that the 7 / 10 is too tight when considering the fluctuations of a CI running on a virtual machine.

kamalsaleh commented 11 months ago

On my laptop this test-line fails almost 60% of the times and setting the quotient to 8/10 would allow tests to pass. Since this PR is created, I would suggest merging it, and investigate further if the issue remains.

mohamed-barakat commented 11 months ago

On my laptop this test-line fails almost 60% of the times and setting the quotient to 8/10 would allow tests to pass. Since this PR is created, I would suggest merging it, and investigate further if the issue remains.

I also see that the 4 / 10 test also sporadically fails. I would wait a bit longer and maybe relax both.

zickgraf commented 11 months ago

On my laptop this test-line fails almost 60% of the times and setting the quotient to 8/10 would allow tests to pass. Since this PR is created, I would suggest merging it, and investigate further if the issue remains.

Ah, that's a different situation than CI failures now and then. Could you

execute AdelmanCategoryOfAdditiveClosureOfAlgebroidVsAdelmanCategoryOfQuiverRows.g 5 times and report the numbers runtime and runtime_quiver,
repeat this with commit 31de779bda22f02b0890865ce76c646b883116af?

Then we know which range of results to expect and if there has been a regression since beginning of this year.

kamalsaleh commented 11 months ago

I also see that the 4 / 10 test also sporadically fails. I would wait a bit longer and maybe relax both.

At the event just now :)

if runtime <= runtime_quiver * 4 / 10 then Display( true ); else Display( runtime ); Display( runtime_quiver ); fi;
# Expected output:
true
# But found:
2027
4896

kamalsaleh commented 11 months ago

On master f0654eb56d2a37b0bdd12ea32ddcfef1b02ff3e3 [ [ 17329, 34929 ], [ 20397, 28526 ], [ 18249, 28558 ], [ 19977, 31687 ], [ 20543, 29235 ] ] producing the quotients [ 0.496121, 0.715032, 0.639015, 0.630448, 0.702685 ]

On commit 31de779bda22f02b0890865ce76c646b883116af [ [ 17102, 29659 ], [ 21527, 34266 ], [ 21474, 28003 ], [ 19019, 28918 ], [ 16074, 28570 ] ] producing the quotients: [ 0.576621, 0.628232, 0.766846, 0.657687, 0.562618 ]

zickgraf commented 11 months ago

On master f0654eb [ [ 17329, 34929 ], [ 20397, 28526 ], [ 18249, 28558 ], [ 19977, 31687 ], [ 20543, 29235 ] ] producing the quotients [ 0.496121, 0.715032, 0.639015, 0.630448, 0.702685 ]

On commit 31de779 [ [ 17102, 29659 ], [ 21527, 34266 ], [ 21474, 28003 ], [ 19019, 28918 ], [ 16074, 28570 ] ] producing the quotients: [ 0.576621, 0.628232, 0.766846, 0.657687, 0.562618 ]

Wow, those numbers are really unstable. I guess this might be due to the P and E cores of your CPU, with results changing depending on which cores run which part of the test.

Some numbers from my laptop for comparison:

[ [ 27027, 37421 ], [ 26355, 37113 ], [ 25695, 36010 ], [ 24504, 34867 ], [ 24439, 34402 ] ]

producing the quotients

[ 0.72, 0.71, 0.71, 0.70, 0.71 ]

Here, all numbers are very close to each other.

The intention of this test was to capture small regressions of < 10%. To accommodate for the above numbers, we would have to relax the bounds to maybe 45% and 75% or even more, which means we only notice regressions > 30%. Such regressions should also be noticeable elsewhere, so I don't think the test is very useful then. Hence, I suggest to simply disable this test.

zickgraf commented 11 months ago

I have now also tested 31de779bda22f02b0890865ce76c646b883116af and get the following results:

[ [ 23363, 34735 ], [ 23123, 34800 ], [ 23362, 33842 ], [ 22077, 33028 ], [ 21984, 32944 ] ]

producing the quotients

[ 0.67, 0.66, 0.69, 0.66, 0.66 ]

So there indeed has been a regression since the beginning of this year and the CI failures have pointed out a valid problem, which would have gone unnoticed without the very strict bounds.

mohamed-barakat commented 11 months ago

The absolute values of the timings show that there is definitely a regression > 10%:

On master https://github.com/homalg-project/CategoricalTowers/commit/c1711cb6d4c1274a7e2eb41930f1bac83e9f76b9 [ [ 11039, 15411 ], [ 11041, 15814 ], [ 10853, 15671 ], [ 10845, 15607 ], [ 11008, 15566 ] ] producing the quotients [ 0.716307, 0.698179, 0.692553, 0.694881, 0.707182 ]

On commit https://github.com/homalg-project/CategoricalTowers/commit/31de779bda22f02b0890865ce76c646b883116af (https://github.com/homalg-project/CAP_project/commit/3a6f758af8b51895f5206f7368c38dc80704cc26, https://github.com/homalg-project/FinSetsForCAP/commit/9f23f456b2608a802519f80575701a865457cc00) [ [ 9764, 14761 ], [ 9855, 14687 ], [ 9886, 15026 ], [ 9681, 14680 ], [ 9771, 14646 ] ] producing the quotients: [ 0.661473, 0.671002, 0.657926, 0.659469, 0.667145 ]

mohamed-barakat commented 11 months ago

It is very hard to bisect having three different repos, any ideas?

zickgraf commented 11 months ago

It is very hard to bisect having three different repos, any ideas?

I have once written a script for this situation: it checks out all git repos in a directory with the last commit before a given date and time:

#!/bin/bash

set -e

if [ "$#" -lt 1 ]; then
    echo "you must pass at least one arguments"
    exit 1
fi

if [ "$#" -gt 2 ]; then
    echo "you must pass at most two arguments"
    exit 1
fi

DATE="$1"

if ! [[ "$DATE" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]]; then
    echo "Year, month and day must be given as \"YYYY-MM-DD\""
    exit 1
fi

if [ "$#" -eq 2 ]; then
    TIME="$3"
else
    # include the whole day by default
    TIME="23:59"
fi

if ! [[ "$TIME" =~ ^[0-9]{2}:[0-9]{2}$ ]]; then
    echo "Time must be given as \"HH:MM\""
    exit 1
fi

# append 59 seconds to always include the whole minute
TIME="$TIME:59"

for repo in *; do
    cd "$repo"
    commit=$(git rev-list --first-parent -n1 --before "$DATE $TIME" origin/master)
    current_commit=$(git rev-parse --verify HEAD)
    if [[ "$commit" == "$current_commit" ]]; then
        modified=""
    else
        modified="+"
    fi
    printf "%-30s %1s %s\n" "$repo" "$modified" "$(git show --date=iso-local -s --format="%cd %h %s" "$commit")"
    git checkout -q "$commit"
    cd ..
done

mohamed-barakat commented 11 months ago

It is very hard to bisect having three different repos, any ideas?

I have once written a script for this situation: it checks out all git repos in a directory with the last commit before a given date and time:

Wonderful, thank you very much.

homalg-project / CategoricalTowers

relax test for timing difference #446

Codecov Report