vemonet / setup-spark

:octocat:✨ Setup Apache Spark in GitHub Action workflows
https://github.com/marketplace/actions/setup-apache-spark
MIT License

Action sometimes takes too long to complete #8

Closed by panicoenlaxbox 1 year ago

panicoenlaxbox commented 3 years ago

Hello,

I don't know if it's a bug or an issue with the server the file is downloaded from, but I have seen that this task can take anywhere from seconds to minutes, for example:

[screenshot: the setup-spark step duration varying between runs]

Do you know how I can investigate whose fault it is?

Thank you so much

vemonet commented 3 years ago

Hi @panicoenlaxbox, thanks for the report! I have never experienced such behavior though; usually it completes in a few seconds.

I run this workflow to test the action: https://github.com/vemonet/setup-spark/blob/main/.github/workflows/test-setup-spark.yml and you can see the output of a run here: https://github.com/vemonet/setup-spark/actions/runs/1151223338

I retrieve the file from https://archive.apache.org/dist/spark/ because it gives access to more versions than the official mirrors (which only give access to the last 2 versions, if I remember correctly).

If the issue comes from the URL, you can change the URL used to download the Spark binary via the spark-url input (cf. README).
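For example, a step along these lines should work; the version numbers, the @v1 tag and the mirror URL below are just placeholders to adapt, and the exact input names are in the README:

```yaml
# Placeholder example: adjust the versions and point spark-url at any mirror you trust
- uses: vemonet/setup-spark@v1
  with:
    spark-version: '3.1.2'
    hadoop-version: '3.2'
    spark-url: 'https://<your-mirror>/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz'
```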

I don't see where else it could come from; the rest is mostly setting up environment variables.

Note that we use wget to download the binary; you can see the whole process here: https://github.com/vemonet/setup-spark/blob/main/src/setup-spark.ts#L25

Maybe I could add some console.log calls with timestamps to better see which part is causing the problem.

vemonet commented 3 years ago

I added some logs with timestamps, let me know if that helps clarify the problem!

brianpep commented 2 years ago

I've seen this issue happen on macOS runners periodically. It is an issue with the runners themselves, though if you used actions/cache (or if this module had an option to turn caching on, like many other setup modules :P) it would probably not happen anymore; see the sketch below.
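For reference, the generic actions/cache pattern looks roughly like this; the path and key are placeholders, and the setup step would still need to skip the download when the cache is restored, which is exactly the option that is currently missing:

```yaml
# Illustrative only: the cached path is an assumption, and setup-spark
# would still have to detect a restored copy instead of re-downloading.
- name: Cache Spark distribution
  uses: actions/cache@v3
  with:
    path: ~/spark  # placeholder, adjust to the real download/install location
    key: spark-3.1.2-bin-hadoop3.2
```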

panicoenlaxbox commented 2 years ago

I will take a look at the pipelines next week to see the differences, but I have to say that it has been working like a charm since I wrote this message :) Thank you so much.

Scot3004 commented 1 year ago

I think this action needs a cache for the downloaded artifacts; it downloads them every time I run the build.

marius-xentral commented 1 year ago

This happens quite often:

[screenshot: the setup-spark step taking a long time]

vemonet commented 1 year ago

Thanks @marius-xentral, it seems to have gotten worse lately; it was also reported by @pullepuvinay in https://github.com/vemonet/setup-spark/issues/19

Currently the action downloads the Spark binary from the Apache archive because it provides the widest coverage of Spark versions: https://archive.apache.org/dist/spark/

But it seems that downloading from the Apache archive has been particularly slow lately (e.g. https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3-scala2.13.tgz)

A potential solution would be to try to download Spark from the mirror recommended by the official download page (probably extracting the URL with a regex): https://www.apache.org/dyn/closer.lua/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz

On my laptop, downloading Spark 3.4.0 from the archive link takes more than 1 hour, while using the recommended mirror takes less than 2 minutes (for 370 MB).

But this would only work for the versions available on the download page (which seem to be the 3 latest releases: 3.2.4, 3.3.2 and 3.4.0 at the moment), so it would not help for other versions such as 3.1.2.

Does anyone know another mirror URL or direct download that covers more versions than the official download page and is faster than the archive links?

If someone wants to start implementing the extraction of the mirror URL from the official Spark download page, feel free to do it, it should be quite easy! (https://github.com/vemonet/setup-spark/blob/main/src/setup-spark.ts). Otherwise I'll do it when I have some time.
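In the meantime, a workflow can resolve the preferred mirror itself and hand it to the action through spark-url. A rough sketch, assuming the as_json=1 parameter of closer.lua returns the preferred mirror as described and that jq is available on the runner:

```yaml
# Sketch only: closer.lua?as_json=1 is assumed to return a JSON object
# with a "preferred" mirror base URL and the "path_info" of the file.
- name: Resolve preferred Apache mirror
  id: mirror
  run: |
    FILE="spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz"
    URL=$(curl -s "https://www.apache.org/dyn/closer.lua/${FILE}?as_json=1" \
      | jq -r '"\(.preferred)\(.path_info)"')
    echo "url=${URL}" >> "$GITHUB_OUTPUT"

- name: Setup Spark from the resolved mirror
  uses: vemonet/setup-spark@v1
  with:
    spark-version: '3.4.0'
    hadoop-version: '3'
    spark-url: ${{ steps.mirror.outputs.url }}
```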

r-t-m commented 1 year ago

@vemonet thanks for the nice action. If the wget/curl step (with a file-exists check) were split into a separate command from the untar step, it would be much easier to use GitHub Actions in scenarios like this with slow or unavailable downloads.

vemonet commented 1 year ago

For anyone who would like to improve the speed: we currently use the archive.apache.org download URL for all Spark versions, but this archive website can be really slow sometimes.

The official download page for Spark only supports the 3 latest versions, but the download is an order of magnitude faster (cf. my previous test: more than 1 hour using the archive URL vs. about 2 minutes using the official download URL).

A fix that would be easy to implement: when the requested version is available on the official download page, resolve the recommended mirror URL from that page and download from it, falling back to the archive URL for older versions.

Example of official download URL: https://www.apache.org/dyn/closer.lua/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz

I am not actively using setup-spark at the moment, so I am not really in a hurry to implement this, but if someone wants to do it I'll be happy to merge! (Just @-tag me in the PR or issue to make sure I notice it in less than a month!)

benjamincitrin commented 1 year ago

@vemonet I just put in a pull request.

vemonet commented 1 year ago

Thanks a lot @benjamincitrin !

The test workflow runtime has gone down from 20-30 minutes to 2 minutes :D