publish_sdks workflow needs to be retry-able by language

t0yv0 commented 1 month ago

AWS v6.45.0 failed to publish Java artifacts to Maven central due to a CI issue.

https://github.com/pulumi/pulumi-aws/actions/runs/9962834625

As maintainers, we would like to retry Java SDK publishing, and only that, now that the credentials are fixed. However, SDK publishing is currently a monolithic step involving every language.

guineveresaenger commented 1 month ago

@t0yv0 can you link a bit more context on the CI issue here? Presumably this was not a Maven issue?

t0yv0 commented 1 month ago

Maven Central credentials required rotation.

danielrbradley commented 1 month ago

Oh hey there .. I think this overlaps with this issue:

https://github.com/pulumi/pulumi-package-publisher/issues/16

The intention within pulumi-package-publisher is that creating each release is idempotent and can safely be retried, because it will just skip if the release was already created. However, this interacts badly with the fact that we don't fail the job when Java fails, because of historical flakiness. This means that Java just gets skipped and the job is marked as complete, so we can't then retry the failure.

I think the solution here is to:

Stop skipping Java errors.
Build in retries for Java to alleviate the pain when it does flake.
Just use the GHA failed job retry mechanism for retries.

I think this work could be included in the epic to cut a GA of the pulumi-package-publisher action.
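As a rough illustration of the "build in retries" point above, here is a minimal Python sketch of retry with exponential backoff. It is not the pulumi-package-publisher implementation; `retry`, `attempts`, `base_delay`, and the injectable `sleep` are hypothetical names chosen for this example. The final failure is re-raised rather than swallowed, which matches the "stop skipping Java errors" point.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on any exception with exponential backoff.

    All names here are hypothetical; this only illustrates the idea.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                # Out of attempts: surface the error so the job fails
                # visibly and can be re-run, instead of being skipped.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A caller would wrap only the flaky step, for example `retry(publish_java, attempts=3)`, where `publish_java` stands in for whatever function performs the Java publish.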

t0yv0 commented 1 month ago

To me as a user it seems like a separate issue from silently ignoring failures. I need to be able to retry Java publishing manually without retrying other SDKs that successfully published. I don't think publishing is idempotent in general; it 100% is not for Maven, and I'd love us not to count on it being idempotent.

guineveresaenger commented 1 month ago

I believe most publishing processes can be retried idempotently at this point: PyPI and npm do so out of the box, and we run nuget push with --skip-duplicate. I'm not sure about Go, but I think Go is idempotent too.

I think the two issues are related, and maybe it boils down to a design decision on whether we're able/willing to have separate publishing runs for each relevant language.

t0yv0 commented 1 month ago

What's the reason these are coupled currently? Even if other languages are idempotent, rerunning them just to get Java to publish is not ideal.

guineveresaenger commented 1 month ago

I believe if we publish them with the same runner, we a) save runners and b) cut down on artifact download time overall, but I may be overestimating how much of an issue that would be.

danielrbradley commented 1 month ago

I think it's reasonable to assume we can implement idempotent behaviour here even if the service doesn't support it directly. Checking whether a package version exists should be possible in all package managers, and failing that it should just be first-write-wins, with the re-pushed package ignored.
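A minimal sketch of the check-then-publish idea, assuming each package manager can be wrapped in an `exists` lookup and a `publish` call. Both callables and the `publish_if_missing` name are hypothetical; no real registry API is modelled here.

```python
from typing import Callable


def publish_if_missing(
    version: str,
    exists: Callable[[str], bool],
    publish: Callable[[str], None],
) -> str:
    """Publish `version` only if the registry does not already have it.

    `exists` and `publish` are hypothetical wrappers around a registry.
    Returns "skipped" or "published" so callers can log what happened.
    """
    if exists(version):
        # First write wins: a repeated push is treated as a no-op.
        return "skipped"
    publish(version)
    return "published"
```

Note that this is only as reliable as the `exists` check: if a registry reports a new version with a delay, a retry inside that window would still attempt a second publish.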

Publishing in a single job is almost certainly going to be faster overall for us than using separate parallel jobs due to runner contention and the overheads.

What we've got is pretty good and working well, so we should just focus on making the Java release reliable and automatically retried, not ignoring errors, and allowing the whole job to be retried when one or more languages fail.

t0yv0 commented 1 month ago

It's not reasonable for Maven Central. There are hours of delay in the OSSRH<->Central publishing pipe. The only chance to make an idempotent solution is trying to publish and then interpreting error codes as "already published" to count them as success, or else using a side channel such as an S3 sentinel to make the step artificially idempotent.
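The two workarounds described above could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: `PublishError`, the `is_duplicate` predicate, and the in-memory `sentinels` set (standing in for a side channel such as an S3 marker object) are hypothetical. Which error a real Maven Central or OSSRH publish returns for a duplicate is not stated in this thread, so the predicate is left to the caller.

```python
from typing import Callable, MutableSet


class PublishError(Exception):
    """Hypothetical error type raised by a publish step."""


def publish_treating_duplicates_as_success(
    publish: Callable[[], None],
    is_duplicate: Callable[[PublishError], bool],
) -> str:
    """Try to publish; count an 'already published' error as success."""
    try:
        publish()
    except PublishError as err:
        if is_duplicate(err):
            return "already-published"
        raise  # Any other failure must still fail the step.
    return "published"


def publish_with_sentinel(
    key: str,
    sentinels: MutableSet[str],
    publish: Callable[[], None],
) -> str:
    """Skip publishing when a sentinel for `key` was recorded earlier.

    `sentinels` is an in-memory stand-in for an external marker store.
    """
    if key in sentinels:
        return "skipped"
    publish()
    # Record the sentinel only after a successful publish.
    sentinels.add(key)
    return "published"
```

The sentinel variant does not depend on the registry's own visibility delay, at the cost of maintaining a second source of truth.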

I concede that reliability (https://github.com/pulumi/pulumi-package-publisher/issues/16) is more important to work on first, but I'm really wondering why we are prioritizing runner contention over usability. I am guessing that in an ideal world GHA would allow steps to be scheduled on a single runner yet be independently retryable, so this could be a win-win. As things stand, though, does adding 4 more steps to a 30-step workflow really have any observable effect on runner contention? I think separate GHA steps would also make it much easier for the operator to locate errors and logs. At the very least, maybe break the languages into separate steps; e.g. see how they all run in a single step, mixing up the logs: https://github.com/pulumi/pulumi-aws/actions/runs/10064400756/job/27825467506#step:4:82
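To make the "retry-able by language" request concrete, here is a small Python sketch of a publisher that runs each language independently, records a per-language result, and accepts an optional `only` filter so that a re-run can target just the failed language. It is not how the publish_sdks workflow is implemented; `publish_selected`, `failed_languages`, and the `publishers` mapping are hypothetical names for this example.

```python
from typing import Callable, Dict, Iterable, List, Optional


def publish_selected(
    publishers: Dict[str, Callable[[], None]],
    only: Optional[Iterable[str]] = None,
) -> Dict[str, str]:
    """Run the publisher for each selected language and collect results.

    A failure in one language does not stop the others, and is recorded
    rather than ignored. `publishers` maps a language name to a
    hypothetical zero-argument publish function.
    """
    selected = list(publishers) if only is None else list(only)
    unknown = [lang for lang in selected if lang not in publishers]
    if unknown:
        raise ValueError(f"unknown languages: {unknown}")
    results: Dict[str, str] = {}
    for lang in selected:
        try:
            publishers[lang]()
            results[lang] = "ok"
        except Exception as err:
            results[lang] = f"failed: {err}"
    return results


def failed_languages(results: Dict[str, str]) -> List[str]:
    """List the languages to pass as `only` on a retry."""
    return [lang for lang, status in results.items() if status != "ok"]
```

A first run would call `publish_selected(publishers)`, fail the job if `failed_languages(...)` is non-empty, and a manual retry would call `publish_selected(publishers, only=["java"])`.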

pulumi / ci-mgmt

publish_sdks workflow needs to be retry-able by language #1043