Closed panicoenlaxbox closed 1 year ago
Hi @panicoenlaxbox, thanks for the report! I have never experienced this behavior though; usually it completes in a few seconds.
I run this workflow to test the action: https://github.com/vemonet/setup-spark/blob/main/.github/workflows/test-setup-spark.yml You can see the output of a run here: https://github.com/vemonet/setup-spark/actions/runs/1151223338
I retrieve the file from https://archive.apache.org/dist/spark/ It gives access to more versions than the latest mirrors (which only give access to the last 2 versions, if I remember correctly)
If the issue comes from the URL, you can change the URL used to download the Spark binary using `spark-url` (cf. README)
I don't see where else it could come from; the rest is mostly setting up environment variables
Note that we use `wget` to download the binary; you can see the whole process here: https://github.com/vemonet/setup-spark/blob/main/src/setup-spark.ts#L25
I could add some `console.log` calls with timestamps to better see which part is causing the problem
I added some logs with timestamps, let me know if that helps clarify the problem!
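For anyone following along, here is a minimal sketch of what timestamped step logging could look like. The `timed` helper is a hypothetical name used for illustration only; it is not the logging code that was added to the action.

```typescript
// Hypothetical helper (not part of setup-spark): runs a step and logs
// an ISO timestamp when it starts and when it ends, plus the duration.
function timed<T>(
  label: string,
  step: () => T,
  log: (msg: string) => void = console.log
): T {
  const start = Date.now();
  log(`${new Date(start).toISOString()} start: ${label}`);
  try {
    return step();
  } finally {
    // Runs whether the step returned or threw, so slow failures are timed too.
    const elapsedMs = Date.now() - start;
    log(`${new Date().toISOString()} done: ${label} (${elapsedMs} ms)`);
  }
}
```

Wrapping the download and the environment setup in separate `timed(...)` calls would show which of the two accounts for the time.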
I've seen this issue happen on macOS runners periodically. It is an issue with the runners themselves, though if you used actions/cache (or if this module had an option to turn cache on like many other setup modules :P) it would probably not happen anymore.
I will take a look at the pipelines next week to see the differences but, I have to say, it has been working like a charm since I wrote this message :) Thank you so much.
I think this action needs a cache for the downloaded artifacts; it downloads them every time I run the build
This happens quite often
Thanks @marius-xentral, it seems to have gotten worse lately, and was also reported by @pullepuvinay in https://github.com/vemonet/setup-spark/issues/19
Currently the action downloads the Spark binary from the Apache archive because it provides the largest coverage of Spark versions: https://archive.apache.org/dist/spark/
But it seems that downloading from the Apache archive has been particularly slow lately (e.g. https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3-scala2.13.tgz)
A potential solution would be to try to download Spark from the mirror recommended by their download page (extracting the URL with a regex probably): https://www.apache.org/dyn/closer.lua/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
On my laptop, downloading Spark 3.4.0 from the archive link takes more than 1h, while downloading it from the recommended mirror takes less than 2 min (for 370M)
But this would only work for the versions available on the download page (which seem to be the 3 latest releases: 3.2.4, 3.3.2 and 3.4.0 at the moment), so it would not help for other versions such as 3.1.2
Does anyone know of another mirror URL or direct download that covers more versions than the official download page and is faster than the archive links?
If someone wants to start implementing the extraction of the mirror URL from the official spark download page, feel free to do it, it should be quite easy! (https://github.com/vemonet/setup-spark/blob/main/src/setup-spark.ts). Otherwise I'll do it when I have some time
@vemonet thanks for the nice action. If the wget/curl download, guarded by a file-exists condition, were split into a separate command from the untar, it would be much easier to use GitHub Actions in scenarios like this, where the download is slow or unavailable.
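A rough sketch of that suggestion, assuming the download is passed in as a callback so it stays separate from extraction. `ensureArchive` is a hypothetical name and none of this is taken from the action's source.

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical helper (not part of setup-spark): only download when the
// archive is not already on disk, for example because an earlier cache
// step restored it. Extraction (untar) is left to a separate step.
function ensureArchive(
  archivePath: string,
  download: (dest: string) => void
): "cached" | "downloaded" {
  // Treat an empty file as missing, since an interrupted download can leave one behind.
  if (fs.existsSync(archivePath) && fs.statSync(archivePath).size > 0) {
    return "cached";
  }
  fs.mkdirSync(path.dirname(archivePath), { recursive: true });
  download(archivePath);
  return "downloaded";
}
```

In a real workflow the `download` callback would run wget or curl, and the archive path would sit in a directory that a cache step saves and restores between runs.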
For anyone who would like to improve the speed: we currently use the archive.apache.org download URL for all Spark versions, but this archive website can be really slow sometimes
The official download page for Spark only supports the 3 latest versions, but the download is an order of magnitude faster (cf. my previous test, where it took 1h using the archive URL and 2 min using the official download URL)
A fix that would be easy to implement would be:
Example of official download URL: https://www.apache.org/dyn/closer.lua/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
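To illustrate the idea discussed above (try the fast official download location first, then fall back to the archive), here is a minimal sketch. It is not the code that was later merged into the action; `candidateUrls`, `downloadFirstAvailable` and the `tryDownload` callback are hypothetical names. Note that the closer.lua URL is a download page rather than the tarball itself, so resolving the actual mirror link is assumed to happen inside `tryDownload`.

```typescript
// Assumption: file names follow the pattern of the URLs quoted in this thread.
function sparkFileName(sparkVersion: string, hadoopVersion: string): string {
  return `spark-${sparkVersion}-bin-hadoop${hadoopVersion}.tgz`;
}

// Returns the locations to try, fastest first.
function candidateUrls(sparkVersion: string, hadoopVersion: string): string[] {
  const relPath = `spark/spark-${sparkVersion}/${sparkFileName(sparkVersion, hadoopVersion)}`;
  return [
    // Official download page: fast, but only lists the most recent releases.
    `https://www.apache.org/dyn/closer.lua/${relPath}`,
    // Archive: can be slow, but covers older releases too.
    `https://archive.apache.org/dist/${relPath}`,
  ];
}

// Tries each URL in order and returns the first one the callback reports as successful.
function downloadFirstAvailable(
  urls: string[],
  tryDownload: (url: string) => boolean
): string {
  for (const url of urls) {
    if (tryDownload(url)) {
      return url;
    }
  }
  throw new Error(`No download location worked: ${urls.join(", ")}`);
}
```

With this ordering, a recent version is fetched from the fast location, while an older one such as 3.1.2 still resolves through the archive.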
I am not actively using `setup-spark` at the moment, so I am not really in a hurry to implement this, but if someone wants to do it I'll be happy to merge! (just @ tag me in the PR or issue to make sure I notice in less than a month!)
@vemonet Just put in a pull request
Thanks a lot @benjamincitrin !
The test workflow runtime has gone down from 20-30 min to 2 min :D
Hello,
I don't know if it's a bug or an issue with the server the file is downloaded from, but I have seen that this task can take anywhere from seconds to minutes, for example:
Do you know how I can investigate whose fault it is?
Thank you so much