vemonet / setup-spark

:octocat:✨ Setup Apache Spark in GitHub Action workflows
https://github.com/marketplace/actions/setup-apache-spark
MIT License
20 stars 12 forks

🐞 Spark build started taking longer than usual #28

Open spagnoloe-amenitiz opened 2 days ago

spagnoloe-amenitiz commented 2 days ago

Describe the bug

Hi there,

Starting today, we have seen a significant increase in the time the setup-spark action takes to run. Until yesterday this step never took longer than 10 minutes; today it takes at least 30 minutes, and many runs take more than 1 hour.

We are currently running the action with Spark version 3.3.0 on ubuntu-20.04.

- uses: vemonet/setup-spark@v1
  timeout-minutes: 10
  with:
    spark-version: '3.3.0'
    hadoop-version: '3'
    spark-url: 'https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz'

Is anyone else experiencing the same issue?

Reproduction

No response

Which version of the action are you using?

v1

With which versions of Spark is it happening?

3.3.0

Operating System and environment

ubuntu-20.04

Additional context

No response

vemonet commented 2 days ago

Hi @spagnoloe-amenitiz

Downloading from https://archive.apache.org can be quite slow (but it has the best coverage in terms of old versions). I would recommend taking a look here to find the officially recommended mirror: https://spark.apache.org/downloads.html

In the code we build on this URL: https://github.com/vemonet/setup-spark/blob/main/src/setup-spark.ts#L44 but I am not sure if it has changed since we

That makes me think we should probably not use archive.apache.org for the example in the readme, as it might be misleading.

spagnoloe-amenitiz commented 1 day ago

Hi @vemonet,

Thanks for the quick reply. I tried removing the URL so that the action would try to download from https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz, but apparently this binary is not available there, so it falls back to the archive again :(
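To make the behaviour I am seeing concrete, here is a minimal Python sketch of the fallback order as I understand it from the logs. It is not the action's actual code: the function names (`candidate_urls`, `pick_url`, `head_ok`) are hypothetical, and the only things taken from this thread are the two URL patterns (dlcdn.apache.org first, archive.apache.org second).

```python
from typing import Callable, List, Optional
import urllib.error
import urllib.request

# Base URLs as they appear in this thread; the ordering is an assumption
# based on the observed behaviour (CDN first, archive as fallback).
CDN_BASE = "https://dlcdn.apache.org/spark"
ARCHIVE_BASE = "https://archive.apache.org/dist/spark"


def candidate_urls(spark_version: str, hadoop_version: str) -> List[str]:
    """Build the download URLs to try, in order of preference."""
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}.tgz"
    return [
        f"{base}/spark-{spark_version}/{name}"
        for base in (CDN_BASE, ARCHIVE_BASE)
    ]


def pick_url(urls: List[str], is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL for which is_available(url) is true, else None."""
    for url in urls:
        if is_available(url):
            return url
    return None


def head_ok(url: str, timeout: float = 10.0) -> bool:
    """Hypothetical availability check: HEAD request, True on a 2xx/3xx reply."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 404 as well as network failures.
        return False


# Example (needs network access, so it is left commented out):
# print(pick_url(candidate_urls("3.3.0", "3"), head_ok))
```

If the CDN no longer hosts 3.3.0, a check like this would skip the first URL and land on the archive, which would match what I see in the run.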

I tried to look into what you mentioned about the mirrors, but couldn't figure out where to find this info. The archives are listed here, but I could not find any details on the recommended mirror.

Do you know where this info is available? And how can I specify the mirror in the GitHub Action step?

Many thanks,