springml / spark-sftp

Spark connector for SFTP
Apache License 2.0

ArrayIndexOutOfBoundsException #41

Open pawelantczak opened 5 years ago

pawelantczak commented 5 years ago

Hello.

Everything runs smoothly in local mode, but when I execute the application on a remote cluster I get this error:

java.lang.ArrayIndexOutOfBoundsException: 0
    at com.springml.spark.sftp.DefaultSource.copiedFile(DefaultSource.scala:287)
    at com.springml.spark.sftp.DefaultSource.writeToTemp(DefaultSource.scala:262)
    at com.springml.spark.sftp.DefaultSource.createRelation(DefaultSource.scala:124)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)

In addition, when tempLocation is set, I can see the files on the Spark server.
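
For context, the write that ends up in DefaultSource.createRelation -> writeToTemp -> copiedFile (the frames in the stack trace above) looks roughly like the sketch below. This is only a sketch: host, credentials and paths are placeholders, and the option names are the ones I recall from the connector's README, with tempLocation being the option mentioned above.

    // Hedged sketch of a spark-sftp write; host, credentials and paths are placeholders.
    df.write
      .format("com.springml.spark.sftp")
      .option("host", "sftp.example.com")
      .option("username", "user")
      .option("password", "secret")
      .option("fileType", "csv")
      .option("tempLocation", "/tmp/spark-sftp") // temp dir mentioned above
      .save("/upload/output.csv")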

shockdm commented 5 years ago

@pawelantczak did you ever figure this out?

fabian-fuentealba commented 4 years ago

And... what is the solution? Did you run it in Docker?

sukanya-pai commented 3 years ago

I encountered this problem while running Spark in cluster mode. From what I researched and understood, when Spark runs in cluster mode it first writes to a temporary file, and that file can end up on any of the worker nodes.
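
That would explain why it surfaces as ArrayIndexOutOfBoundsException: 0: copiedFile appears to list the temporary output directory and take the first part file, so if the write landed on a different node, the listing is empty on the machine doing the copy. The snippet below is an illustration of that failure mode only, not the connector's actual code.

    // Illustration only: listing an empty temp directory and taking element 0
    // throws the ArrayIndexOutOfBoundsException: 0 seen in the stack trace.
    import java.io.File

    def firstPartFile(tempDir: String): File = {
      val parts = new File(tempDir).listFiles().filter(_.getName.startsWith("part-"))
      parts(0) // empty array => ArrayIndexOutOfBoundsException: 0
    }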

I also saw that for some people, the problem was solved by using the latest version, but it did not help me.

Running Spark in local mode solved this problem for me. To run Spark in local mode, set the master while creating the SparkSession:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName(yourAppName)
      .master("local")
      .getOrCreate()
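Presumably this works because in local mode the driver and the tasks run in one JVM on one machine, so the temporary directory written by the task is visible when the connector copies the file to the SFTP server.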

Hope this helps.