hendrikmuhs / ccache-action

github action to speedup building using ccache
MIT License
114 stars 52 forks source link

Version 1.2.12 fails to work on CentOS7, but 1.2.11 was fine #178

Open iakov opened 7 months ago

iakov commented 7 months ago

Thus, this action fails to work where it was used via @v1.2. Please, remove this release or re-release as v1.2.14 the backward-compatible version. The problem is hidden in the cache action version 3.2.3 (PR #175), that introduces node20 dependency. I suggest to re-release the update as v1.3 or even v2, because actually It is not a "minor fix".

hendrikmuhs commented 7 months ago

Thanks for the report.

Is this a custom runner image? CentOs7 is not listed at https://github.com/actions/runner-images.

To set expectations right: I can't provide any support for this.

that introduces node20 dependency

you probably saw the merge commit, the change is here: https://github.com/hendrikmuhs/ccache-action/pull/177

The reason is this.

It is not a "minor fix". You are right about classification, however I would also not make a big deal out of it. It should be totally fine for you to stay on version 1.2.11. I don't have the capacity to maintain 2 versions, especially as node 16 is EOL anyway. You might as well fix it on your side if you are using a custom runner.

Anyway, can you explain your setup in more detail and if possible provide logs?

Are you effected by this? Can you confirm that you can execute node 18?

If so, downgrading to node 18 might be the better alternative. I would love to avoid maintaining 2 versions.

firewave commented 7 months ago

Also occurs with workflows utilizing older docker images.

See https://github.com/danmar/cppcheck/actions/runs/7677304162/job/20926809147?pr=5917 https://github.com/danmar/cppcheck/pull/5918

firewave commented 7 months ago

It also seems like this caused a major slowdown in the Post ccache step - it now takes way above 2 minutes instead of being of a few seconds:

Before: https://github.com/danmar/cppcheck/actions/runs/7646437504/job/20835239981 After: https://github.com/danmar/cppcheck/actions/runs/7681894296/job/20935549647?pr=5918

The jobs fixated on the 1.2.11 version are not experiencing this and are much faster with way less data to cache: https://github.com/danmar/cppcheck/actions/runs/7681894302/job/20935550032?pr=5918

hendrikmuhs commented 7 months ago

@firewave

You're 1st problem is similar to the reported one, you are running it inside a custom image. That image seems to lack node 20. It would be nice if you could find out if node 18 would work.

It would be technically better if you run the action outside of docker(on the official github runner) and use docker only for building. This requires exposing the host fs and setting the ccache path accordingly like it is done here.

There is no alternative to an upgrade of the nodejs version, node 16 is EOL as clearly stated by github. However we are flexible w.r.t. upgrading it to 18 vs. 20.

Your 2nd problem: It ran slower, because it did not found a cache. This seems like a different problem, which might or might not be caused by the upgrade of nodejs. For such cases it is best to let it run a 2nd time to see if the problem persists. It might be that the underlying github cache layer puts something more into the key. A 2nd run should get the cache from the 1st run.

iakov commented 7 months ago

@hendrikmuhs , we use containers for CentOS:7. Imho, there is no need to maintain two versions. Please, do not introduce such a dramatical change with the patch version increment next time :) Probably, there is no action actually needed ATM, because all poor projects, unhappy with the nodejs hidden version shift will stick with the 1.2.11 after reading this thread.

hendrikmuhs commented 7 months ago

@iakov

As said, there has never be a guarantee that custom runners work. All official github runners support node 20.

Repeating my question: Would node 18 work? - if you or anyone else affected can answer this question, I happily change this, otherwise it stays at it is. I don't have the resources to test this myself.

iakov commented 7 months ago

Thank you. Node18 does not help. Node18 requires newer GLIBC and is incompatible with Ubuntu 18.04 and other obsolete/EOL releases. The only thing I ask you is about versioning. We are happy to use unsupported older 1.2 version of your action as we do with checkout@v3, it works fine. I kindly as you to tag release as v2 or at least v1.3 for new features with new incompatible dependencies. However, maybe those few projects with broken CI and CD will come to this thread for solution and choose to stick with 1.2.11 until build server update.

firewave commented 7 months ago

I second to start a new version for this change and leave v1.2 on Node 16.

Your 2nd problem: It ran slower, because it did not found a cache. This seems like a different problem, which might or might not be caused by the upgrade of nodejs. For such cases it is best to let it run a 2nd time to see if the problem persists. It might be that the underlying github cache layer puts something more into the key. A 2nd run should get the cache from the 1st run.

It did find a cache. Otherwise it would not have been able to restore that and have any hits. Here's the same results from a main branch build after the version has been fixated for the docker-based workflows:

1.2.11: https://github.com/danmar/cppcheck/actions/runs/7687753684/job/20948156884 1.2.12: https://github.com/danmar/cppcheck/actions/runs/7687753675/job/20948156975

The stats are different because different ccache version are being used. So I wondered if this might just be an issue with the newer version of that but that is refuted by a build on a newer platform which also uses 1.2.11: https://github.com/danmar/cppcheck/actions/runs/7687753684/job/20948157383

More 1.2.12 cases on various runners without docker involved: https://github.com/danmar/cppcheck/actions/runs/7687753687

I do assume this is actually a Node issue and I will file an issue upstream. We should also track this in a different issue within this repo.

hendrikmuhs commented 7 months ago

@firewave

You might have a problem with cache eviction as I see many many jobs. Back to your initial post:

It did find a cache. Otherwise it would not have been able to restore that and have any hits.

This is from your run:

Run hendrikmuhs/ccache-action@v1.2
Install ccache
Restore cache
  No cache found.
Configure ccache, linux

Source:

After: https://github.com/danmar/cppcheck/actions/runs/7681894296/job/20935549647?pr=5918

Update:

Or do you mean: It finds the cache now? I am not sure I fully understand. What we know so far is that after the update, the cache was not found.

The inner workings of the cache, meaning the github backend side (not this action, not ccache), are best to my knowledge not open/public source. I found other cases where it behaved strangely, maybe the node version is somehow backed into the key in an implicit way. We also had similar behaviour after github upgraded the client libraries. In summary it happens that a cache entry is for some reason missing, but usually it fixes itself with consecutive runs. This is just the usual SLA of a cache, but as long as it works most of the time, this seems fine.

hendrikmuhs commented 7 months ago

The parallel issue https://github.com/actions/runner/issues/2906 reports some success after downgrade, however https://github.com/nodejs/node/blob/v18.x/BUILDING.md#platform-list clearly states the newer glibc for x64.

In general I advise not to use tag sliding. I slide the tag on release, but I consider this bad practice. The suggested practice is an explicit version in combination with dependabot.

The action has very little QA coverage especially when it comes to exotic setups.

johnwason commented 7 months ago

I am seeing the same issue with older Ubuntu docker images. Version @1.2.11 seems to work, but @v1 fails.

https://github.com/robotraconteur/robotraconteur/actions/runs/7691430275

hendrikmuhs commented 7 months ago

Alright, I think here is what we should do:

If you want to help:

firewave commented 7 months ago

Alright, I think here is what we should do:

  • revert the switch to node 20 back to 16 and release it as 1.2.13 with tags for v1 and v1.2
  • add a note in the Readme
  • jump to node 18 on main and release it as v2 and go forward from there

Agreed!

If you want to help:

  • it would be nice to setup a test that executes the action in a custom environment, e.g. an ubuntu 16.04 container

Can do.

iakov commented 7 months ago

I am ready to test new tags for obsolete builders. You are free to send a PRs with changes in this line https://github.com/trikset/trikRuntime/blob/master/.github/workflows/centos.yml#L73 , or I can do it by request.

firewave commented 2 months ago

Even with fixating to version 1.2.11 the two minute hang is now present because the underlying runners have all been updated to Node 20.