harsha2010 / magellan

Geo Spatial Data Analytics on Spark
Apache License 2.0

ArrayIndexOutOfBoundsException while reading shape files with around 22K shapes #167

Closed khajaasmath786 closed 6 years ago

khajaasmath786 commented 6 years ago

Hi Harsha,

I am trying to load a shape file that has around 22K shapes, and it is resulting in an exception. Is there any size limit on the number of polygons when reading shape files?

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 51.0 failed 1 times, most recent failure: Lost task 0.0 in stage 51.0 (TID 411, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.hadoop.io.Text.append(Text.java:237)
        at magellan.mapreduce.DBReader.initialize(DBReader.scala:132)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

I have also uploaded three shape files on GitHub for your reference:

https://github.com/khajaasmath786/OozieSamples/tree/master/oozieProject/data/airawat-syslog
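
For reference, this is roughly how such a shapefile is loaded with magellan's Spark data source (a minimal sketch; the path is hypothetical, magellan is assumed to be on the classpath, and inside spark2-shell the session is already available as spark):

    import org.apache.spark.sql.SparkSession

    // Hypothetical session setup; in spark2-shell this step is not needed.
    val spark = SparkSession.builder().appName("magellan-shapefile").getOrCreate()

    // magellan's reader takes the directory containing the .shp/.dbf files;
    // each shape in the file becomes one row of the resulting DataFrame.
    val shapes = spark.read
      .format("magellan")
      .load("/path/to/shapefile/dir/")   // hypothetical path

    // With ~22K polygons this should simply load; the stack trace above points
    // into DBReader rather than at any documented size limit.
    println(shapes.count())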

Thanks, Asmath

harsha2010 commented 6 years ago

@khajaasmath786 Hmm, not sure... it looks like a bug at first glance; let me try testing these files out. Unfortunately I won't have time to test until tomorrow, but I will get to it first thing tomorrow.

harsha2010 commented 6 years ago

@khajaasmath786 #169 fixes this. I have added tests to verify the fix, let me know if it works.

khajaasmath786 commented 6 years ago

Hi Harsha,

I am only going to use the library version that is supported on Spark 2.1. Has this fix been applied to the 2.1 build, or is it only available for the 2.2 version of Spark?

Thanks, Asmath

On Tue, Sep 19, 2017 at 9:01 PM, Ram Sriharsha notifications@github.com wrote:

Closed #167 https://github.com/harsha2010/magellan/issues/167 via #169 https://github.com/harsha2010/magellan/pull/169.


harsha2010 commented 6 years ago

@khajaasmath786 Do you mean 2.0.0? magellan master should work with 2.1+. I don't know about Cloudera versions, but if this doesn't work on Apache Spark 2.1+, let me know.

khajaasmath786 commented 6 years ago

Hi Harsha,

I was able to run magellan earlier by adding --jars to spark2-shell, using the command below:

spark2-shell --jars /hoome/yyy251/harsha2010:magellan:1.0.4-s_2.11

I cannot run it with the --packages option as described in your README instructions for magellan; please find the error below. This is forcing me to download the latest version of the jar, push it to the cluster, and access it with spark2-shell --jars /hoome/yyy251/harsha2010:magellan:1.0.4-s_2.11 instead of spark2-shell --packages harsha2010:magellan:1.0.4-s_2.11.

I cannot confirm whether the issue is resolved, since the packages cannot be downloaded directly. Can you please share the latest jar for magellan:1.0.4-s_2.11 that resolves this issue?

    Using username "yyy1k78".
    yyy1k78@brksvl168's password:
    Last login: Wed Sep 20 09:03:39 2017 from whqpc-l82713.whq.navistar.com
    [yyy1k78@brksvl168 ~]$ clear
    [yyy1k78@brksvl168 ~]$ kinit yyy1k78
    Password for yyy1k78@AD.NAVISTAR.COM:
    [yyy1k78@brksvl168 ~]$ clear
    [yyy1k78@brksvl168 ~]$ spark2^Chell --packages harsha2010:magellan:1.0.4-s_2.11
    [yyy1k78@brksvl168 ~]$ clear
    [yyy1k78@brksvl168 ~]$ spark2-shell --packages harsha2010:magellan:1.0.4-s_2.11
    Ivy Default Cache set to: /home/yyy1k78/.ivy2/cache
    The jars for the packages stored in: /home/yyy1k78/.ivy2/jars
    :: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    harsha2010#magellan added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found harsha2010#magellan;1.0.4-s_2.11 in spark-packages
    :: resolution report :: resolve 1296ms :: artifacts dl 11ms
        :: modules in use:
        harsha2010#magellan;1.0.4-s_2.11 from spark-packages in [default]

    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   1   |   0   |
    ---------------------------------------------------------------------

    :: problems summary ::
    :::: WARNINGS
        module not found: commons-io#commons-io;2.4

    ==== local-m2-cache: tried

      file:/home/yyy1k78/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.pom

      -- artifact commons-io#commons-io;2.4!commons-io.jar:

      file:/home/yyy1k78/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar

    ==== local-ivy-cache: tried

      /home/yyy1k78/.ivy2/local/commons-io/commons-io/2.4/ivys/ivy.xml

      -- artifact commons-io#commons-io;2.4!commons-io.jar:

      /home/yyy1k78/.ivy2/local/commons-io/commons-io/2.4/jars/commons-io.jar

    ==== central: tried

      https://repo1.maven.org/maven2/commons-io/commons-io/2.4/commons-io-2.4.pom

      -- artifact commons-io#commons-io;2.4!commons-io.jar:

      https://repo1.maven.org/maven2/commons-io/commons-io/2.4/commons-io-2.4.jar

    ==== spark-packages: tried

      http://dl.bintray.com/spark-packages/maven/commons-io/commons-io/2.4/commons-io-2.4.pom

      -- artifact commons-io#commons-io;2.4!commons-io.jar:

      http://dl.bintray.com/spark-packages/maven/commons-io/commons-io/2.4/commons-io-2.4.jar

            ::::::::::::::::::::::::::::::::::::::::::::::

            ::          UNRESOLVED DEPENDENCIES         ::

            ::::::::::::::::::::::::::::::::::::::::::::::

            :: commons-io#commons-io;2.4: not found

            ::::::::::::::::::::::::::::::::::::::::::::::

    :::: ERRORS

    Server access error at url https://repo1.maven.org/maven2/commons-io/commons-io/2.4/commons-io-2.4.pom (javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)

    Server access error at url https://repo1.maven.org/maven2/commons-io/commons-io/2.4/commons-io-2.4.jar (javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target)

    :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

    Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: commons-io#commons-io;2.4: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1078)
        at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:296)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:160)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    [yyy1k78@brksvl168 ~]$
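
As the log above shows, the --packages run fails because Ivy cannot reach the repositories (the SSL/PKIX errors on commons-io), whereas --jars bypasses dependency resolution entirely. A minimal sketch of the two invocation styles, with a hypothetical jar path:

    # Use a jar already downloaded to the edge node; no repository access needed.
    spark2-shell --jars /path/to/magellan-1.0.4-s_2.11.jar

    # Resolve the package by its Maven coordinates; this needs network access to
    # spark-packages / Maven Central for the package and its dependencies.
    spark2-shell --packages harsha2010:magellan:1.0.4-s_2.11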

khajaasmath786 commented 6 years ago

Hi Harsha,

I have downloaded the latest jar via the spark-shell command. Here is the output of my spark-shell:

[screenshot: spark-shell output]

I looked inside the jar file to check whether DBReader.scala is updated in version 1.0.4.

C:\Users\yyy1k78\.ivy2\cache\harsha2010\magellan\jars

[screenshot]

The updated fix is not applied in this older version. Can you confirm whether I can use the fix only in newer versions of the magellan jar, or is there another workaround to use it with the 1.0.4 version of magellan?
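
As an aside, one way to check what a downloaded jar actually contains is to list its entries (the jar file name below is hypothetical); note that the jar ships compiled classes rather than the DBReader.scala source, so this only confirms the class is present, not which revision it was built from:

    jar tf magellan-1.0.4-s_2.11.jar | grep DBReader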

Thanks, Asmath

harsha2010 commented 6 years ago

Can you use magellan 1.0.5? https://dl.bintray.com/spark-packages/maven

    <dependency>
      <groupId>harsha2010</groupId>
      <artifactId>magellan</artifactId>
      <version>1.0.5-s_2.11</version>
      <type>jar</type>
    </dependency>

Since you are using Spark 2.1, 1.0.5 should work. Why do you need 1.0.4?
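
For reference, the same version can be pulled in from spark-packages on the command line (a sketch; the --repositories flag is only needed if your environment does not already resolve against the spark-packages repository, and it still requires network access from the cluster):

    spark2-shell --packages harsha2010:magellan:1.0.5-s_2.11 \
      --repositories https://dl.bintray.com/spark-packages/maven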

khajaasmath786 commented 6 years ago

Let me try that.

BTW, I changed the shape files to GeoJSON and it worked; I can see the output. Strange, but I want to try this out and resolve it.

I have a question: what is the maximum number of shapes that we can include in one GeoJSON file?
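
For context, a minimal sketch of how the converted file would be read inside spark2-shell (this assumes magellan's GeoJSON reader is selected via the data source option; the path and session setup are hypothetical):

    import org.apache.spark.sql.SparkSession

    // Hypothetical session setup; in spark2-shell the session already exists as spark.
    val spark = SparkSession.builder().appName("magellan-geojson").getOrCreate()

    // Read the converted file with magellan's data source. The "type" option
    // selects the GeoJSON reader instead of the default ESRI shapefile reader.
    val polygons = spark.read
      .format("magellan")
      .option("type", "geojson")
      .load("/path/to/geojson/")   // hypothetical path

    polygons.printSchema()
    println(polygons.count())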
