microsoft / hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
https://aka.ms/hyperspace
Apache License 2.0

Cannot run Spark 3 tests in IntelliJ IDEA #440

Open andrei-ionescu opened 3 years ago

andrei-ionescu commented 3 years ago

Describe the issue

I get the following error when trying to run the E2EHyperspaceRulesTest tests:

An exception or error caused a run to abort: org.apache.parquet.hadoop.ParquetOutputFormat.getJobSummaryLevel(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/parquet/hadoop/ParquetOutputFormat$JobSummaryLevel; 
java.lang.NoSuchMethodError: org.apache.parquet.hadoop.ParquetOutputFormat.getJobSummaryLevel(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/parquet/hadoop/ParquetOutputFormat$JobSummaryLevel;
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:131)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:132)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
    at com.microsoft.hyperspace.SampleData$.save(SampleData.scala:48)
    at com.microsoft.hyperspace.index.E2EHyperspaceRulesTest.beforeAll(E2EHyperspaceRulesTest.scala:50)
    at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
    at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
    at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
    at com.microsoft.hyperspace.index.E2EHyperspaceRulesTest.org$scalatest$BeforeAndAfter$$super$run(E2EHyperspaceRulesTest.scala:34)
    at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:258)
    at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:256)
    at com.microsoft.hyperspace.index.E2EHyperspaceRulesTest.org$scalatest$BeforeAndAfterAllConfigMap$$super$run(E2EHyperspaceRulesTest.scala:34)
    at org.scalatest.BeforeAndAfterAllConfigMap.liftedTree1$1(BeforeAndAfterAllConfigMap.scala:248)
    at org.scalatest.BeforeAndAfterAllConfigMap.run(BeforeAndAfterAllConfigMap.scala:245)
    at org.scalatest.BeforeAndAfterAllConfigMap.run$(BeforeAndAfterAllConfigMap.scala:242)
    at com.microsoft.hyperspace.index.E2EHyperspaceRulesTest.run(E2EHyperspaceRulesTest.scala:34)
    at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
    at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1346)
    at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1340)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1340)
    at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:1031)
    at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:1010)
    at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1506)
    at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
    at org.scalatest.tools.Runner$.run(Runner.scala:850)
    at org.scalatest.tools.Runner.run(Runner.scala)
    at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:38)
    at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:25)
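
For what it's worth, a NoSuchMethodError like this usually means an older parquet-hadoop jar (one that predates ParquetOutputFormat.getJobSummaryLevel) is being picked up on the test classpath instead of the version Spark 3 expects. One generic way to check which jar the class is actually loaded from is to print its code source, sketched here as a temporary line that could be dropped into the failing setup code:

    // temporary diagnostic: print the jar that ParquetOutputFormat is loaded from
    println(classOf[org.apache.parquet.hadoop.ParquetOutputFormat[_]]
      .getProtectionDomain.getCodeSource.getLocation)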

To Reproduce

To set up the Hyperspace project in IntelliJ IDEA I followed the steps provided by @imback82 in https://github.com/microsoft/hyperspace/pull/434#issuecomment-829801997.

  1. Marked src/main/scala and src/main/scala-spark3 as sources.
  2. Marked src/test/scala and src/test/scala-spark3 as test sources.
  3. Marked src/test/resources as test resources.
  4. Compiled from terminal with sbt clean compile.
  5. From IntelliJ IDEA I right-clicked the E2EHyperspaceRulesTest file and chose Run 'E2EHyperspaceRulesTest'...

Expected behaviour

E2EHyperspaceRulesTest to run successfully.

Environment

imback82 commented 3 years ago

Thanks @andrei-ionescu for reporting this. @clee704 Could you take a look when you get a chance? It's crucial to get this running in IntelliJ since most of us use it for development/debugging.

andrei-ionescu commented 3 years ago

There is another nuisance: after setting up the project according to @imback82's https://github.com/microsoft/hyperspace/pull/434#issuecomment-829801997, whenever I close the IDE everything needs to be set up once more. I have to remove the root and root-build modules again and mark all the sources and tests. This takes time.

clee704 commented 3 years ago

There is another nuisance: after setting up the project according to @imback82's #434 (comment), whenever I close the IDE everything needs to be set up once more. I have to remove the root and root-build modules again and mark all the sources and tests. This takes time.

Are you sure that the .idea directory is created in the repository root and that its contents are retained when you close and reopen the IDE? I don't have to do any of that and everything works fine for me. By the way, you don't have to remove the root-build project (at least I don't).

clee704 commented 3 years ago

I'm looking into the issue. As a workaround, you can use the built-in sbt shell and run the test from there with testOnly.

Another workaround is to right-click the file, click "Modify Run Configuration...", check "Use sbt" in the dialog, and click "OK" to close it. The test will then run with sbt even if you use the built-in buttons and commands. After a successful run with sbt, you can uncheck "Use sbt" and run the test without sbt if you want. Strangely, this worked for me.

Also, it might help: don't forget to set "Project SDK" and "Project language level" to Java 8, as well as "Module SDK" for each module.

andrei-ionescu commented 3 years ago

@clee704, @imback82, @sezruby, @rapoth

I think more time has to be invested to make it easier for developers to contribute to the project.

Getting back to the issues discussed here...

1) I do have the "Use sbt" option enabled, but that does NOT work at all if you need to debug a test with breakpoints. In my case the tests don't even start when I click "Debug".

2) I used the steps provided by @imback82. Can you, together with @imback82, put together a document on how to develop in IntelliJ IDEA?

3) I do have the .idea folder present. But the changes we make by removing the root module, etc. cannot be persisted because the project structure is defined in the build.sbt file. Whenever the IDE decides to reload the project, any such change is lost. For example, whenever I do a Git pull I run into this issue.

After multiple days of trying out different things, the only way I found to debug tests from IntelliJ IDEA was by:

  1. Enabling "Use sbt" everywhere.
  2. Opening the sbt shell tool window and starting the shell (View -> Tool Windows -> sbt shell).
  3. Attaching the debugger to the sbt shell (clicking the bug icon on the left side of the sbt shell window).
  4. Executing the specific test from the sbt shell window by typing testOnly *ExplainTest -- -z "Testing subquery scenario" and pressing Enter.

Now that the project has become a bit more complex with support for multiple Spark versions, we need some docs on how to develop (compile, run, debug, execute tests, etc.) in the most widespread IDEs.

andrei-ionescu commented 3 years ago

@clee704, @imback82, @sezruby

I tried today to get back to the "Nested fields" support in Hyperspace but I cannot get the tests green.

I have a test that seems to fail on a specific Spark version.

[error] (spark3_1 / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful

1) I don't know how to run a test just for a specific Spark version.
2) After the latest addition of support for Spark 3.1, the solution that I previously used to debug is no longer working.

clee704 commented 3 years ago

You should be able to use the sbt shell to run and debug a specific test. Set the project to spark2_4, spark3_0, or spark3_1 in the sbt shell with project <project name>, and any task will then be performed for the selected project. Or you can prefix a task with <project name>/ to run it for a specific project, e.g. spark3_1/testOnly *E2EHyperspaceRulesTest. Debugging should work as well, although you should check "Enable debugging" in Preferences > Build, Execution, Deployment > Build Tools > sbt. I'll update the README with this information.
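
For example, a session in the sbt shell might look like this (the suite and test names below are just examples taken from this thread):

    project spark3_1
    testOnly *E2EHyperspaceRulesTest

or, without switching projects first:

    spark3_1/testOnly *ExplainTest -- -z "Testing subquery scenario"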

clee704 commented 3 years ago

I've tried for several hours to make IntelliJ work with our sbt build, but it just didn't work no matter what I tried. I even removed the symlinks and copied the source directory for each subproject (each of which is mapped to a module in IntelliJ), but that did not solve the test issue (it did solve other issues - without symlinks, you don't have to manually remove the root/hyperspace-sources modules and set source/test folders yourself). IntelliJ just doesn't seem to work well with multi-project sbt builds (and symlinked source files).

Since there is a workaround - using the sbt shell - and it doesn't prevent you from developing Hyperspace with IntelliJ as long as you know how to use sbt, I'd like to close this issue as won't fix. Contributors should be familiar with sbt, as Hyperspace is mostly written in Scala and sbt is the de facto standard build tool for Scala. Personally, I didn't have difficulty debugging in IntelliJ with the sbt shell. If you need further help, please let me know. And if you can come up with a better build that works well with IntelliJ, that would be awesome - I'd be really happy if someone could make a multi-project sbt build that IntelliJ handles well, as I'm somewhat disappointed with IntelliJ's sbt support.

By the way, I've found one thing that might make things easier for IntelliJ: in the sbt tool window, try ignoring the root and hyperspace-sources projects by selecting "Ignore sbt Project" in the right-click menu. I'll update the README file with this and more detailed information on working with IntelliJ.

andrei-ionescu commented 3 years ago

Please don't close this issue. Or if you do close it, please create another ticket for documentation explaining how to develop on Hyperspace with IntelliJ IDEA (it should cover how to debug a test - some screenshots would be great too).

It is not a common way of developing, and because of that it should be documented.

Not helping the community and letting people bang their heads for hours (or even days) just to be able to debug a test will make developers reluctant to contribute to this project.

If it helps, I can share my way of debugging, which I found by trying out different approaches.

clee704 commented 3 years ago

As I said already, if you use the sbt shell in IntelliJ, everything works fine, including debugging with breakpoints. It's been only two months since I started developing in Scala, so this should be easy for anyone who develops in Scala. I'll update the README.

andrei-ionescu commented 3 years ago

@clee704 In my case it is NOT working, even if I use the sbt shell inside IntelliJ. When I add breakpoints and run the test from the sbt shell, it does NOT stop at the breakpoints. If it works for you, it doesn't mean it works for everyone.

The only way I could make it work was to run the sbt shell in a terminal with a debug port open, connect from IntelliJ IDEA with a remote debug configuration, and then run the tests from that terminal sbt shell.
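
Concretely, something like this (the port is arbitrary and the JDWP flag is a standard JVM debugging option, nothing Hyperspace-specific):

    # start sbt from a terminal with a debug port open
    SBT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" sbt

then create a remote debug run configuration in IntelliJ IDEA pointing at localhost:5005, start it, and finally run the test from the terminal sbt shell:

    testOnly *ExplainTest -- -z "Testing subquery scenario"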

There are multiple development-related problems in the project right now, and they are a nuisance for the community.

For example, every time I close IntelliJ IDEA, I have to go through the process of removing the root module, etc. and marking the source folders once more, which is time-consuming and a pain.

I may have different options in IntelliJ than the ones you have, and that also needs to be documented.

@imback82, @sezruby, @rapoth: For me, right now, after support for Spark 3 has been added, it is a pain to develop in the Hyperspace project.

What can be done to make it better? How can I help?

clee704 commented 3 years ago

I just realized that I omitted crucial information for debugging in IntelliJ. You should set Test/fork to false in the sbt shell to use breakpoints. This can be done by typing

  1. set Test/fork := false if you've set the current project, or
  2. set <project name>/Test/fork := false, e.g. set spark3_1/Test/fork := false

in the sbt shell.

This is because with Test/fork set to true, tests run in a forked JVM. Since the debugger is attached to the shell process, tests must run in the same process as the shell for breakpoints to work. I'll update the README with this information.
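
Putting the pieces together, a debugging session in the sbt shell (with the debugger already attached via the bug icon in the sbt shell tool window) would look roughly like this:

    set spark3_1/Test/fork := false
    spark3_1/testOnly *E2EHyperspaceRulesTest

Breakpoints set in the test sources should then be hit.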

I acknowledge that the current situation is not ideal. If you have any ideas to improve the IntelliJ experience, please share them with us or make a PR.