Closed ssiegel95 closed 1 year ago
Can you update the readme? https://github.com/linkedin/spark-tfrecord/blob/master/README.md#including-the-library
Adding a line about 0.6.0 and spark 3.4
@junshi15 I don't see a clean way of supporting spark 3.2/3.3 and 3.4 simultaneously. This could be because my scala skills are very poor (apologies). I tried using `new Path(file.toString)` instead of `file.toPath`, since `toString` is the only common method I can find on the `PartitionedFile` class between v3.2/3.3 and v3.4.
That fails with
Cause: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: path:%20file://home/stus/git/github.com/spark-tfrecord/tf-sandbox/example_overwrite.tfrecord/part-00001-9119ef7a-ba07-4f4a-9548-6ea6a85902ea-c000.tfrecord,%20range:%200-828,%20partition%20values:%20%5Bempty%20row%5D
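The failure above can be reproduced without Spark or Hadoop. The sketch below decodes the percent-encoded text from the exception and shows that it is a descriptive string ("path: ..., range: ..., partition values: ...") rather than a file location. It then splits that string at the first colon and hands the pieces to `java.net.URI`, which is an approximation of what Hadoop's `Path(String)` is assumed to do internally. The object name, the helper `buildUri`, and the shortened file name are hypothetical and used only for illustration.

```scala
import java.net.{URI, URISyntaxException, URLDecoder}
import scala.util.{Failure, Try}

object ToStringAsPathDemo {
  // Text shaped like the one in the exception message, with a shortened,
  // hypothetical file name in place of the real one.
  val encoded: String =
    "path:%20file://home/user/example.tfrecord,%20range:%200-828," +
      "%20partition%20values:%20%5Bempty%20row%5D"

  // Undo the percent-encoding to see what was actually passed as the "path".
  val decoded: String = URLDecoder.decode(encoded, "UTF-8")

  // Assumption: treat everything before the first colon as the URI scheme and
  // the remainder as the path, roughly as a string-based path parser would.
  def buildUri(text: String): Try[URI] = {
    val colon = text.indexOf(':')
    val scheme = text.substring(0, colon)
    val rest = text.substring(colon + 1)
    Try(new URI(scheme, null, rest, null, null))
  }

  def main(args: Array[String]): Unit = {
    println(decoded)
    buildUri(decoded) match {
      case Failure(e: URISyntaxException) => println(e.getReason)
      case other                          => println(other)
    }
  }
}
```

Because the text after the scheme `path` starts with a space instead of a slash, the URI is rejected as a relative path in an absolute URI, which suggests `toString` cannot stand in for a real path accessor.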
No worries. This is an incompatible change from Spark (https://issues.apache.org/jira/browse/SPARK-41970). I have created a branch for Spark-3.2. Let's support spark-3.4 on the master branch. This PR should only support Spark 3.4. Please address the comments above.
@junshi15 It seems that the current master also supports spark-3.3. Do you think we should create a branch for spark-3.3?
The branching strategy seems reasonable. I'll address the other issues this week. Thanks again.
Does it work out of the box, or do you have to change pom.xml and recompile? I have not tested either of them. If you can verify that the binary works for both 3.2 and 3.3, then we can just rename the branch to spark-3.2-3.3. If you have to change pom.xml, then let's create a separate branch and spin a binary. My guess is the former, but please verify it if you have time.
@junshi15 I think the pom.xml holds the default configuration, which can be overridden by parameters passed to the `mvn` command. I've verified that the current master branch passes all unit tests for spark-3.2 and spark-3.3; however, spark-3.4 fails with the error @ssiegel95 fixed in this PR.
`mvn -Pscala-2.13 -Dspark.version=3.2.0 test`:
...
Run completed in 9 seconds, 690 milliseconds.
Total number of tests run: 31
Suites: completed 6, aborted 0
Tests: succeeded 31, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
`mvn -Pscala-2.13 -Dspark.version=3.3.0 test`:
Run completed in 10 seconds, 397 milliseconds.
Total number of tests run: 31
Suites: completed 6, aborted 0
Tests: succeeded 31, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
`mvn -Pscala-2.13 -Dspark.version=3.4.0 test`:
[ERROR] /home/mizhou/tmp/0611/spark-tfrecord/src/main/scala/com/linkedin/spark/datasources/tfrecord/TFRecordFileReader.scala:26: type mismatch;
found : org.apache.spark.paths.SparkPath
required: String
[ERROR] new Path(new URI(file.filePath)),
[ERROR] ^
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7.222 s
[INFO] Finished at: 2023-06-11T22:56:05-07:00
[INFO] Final Memory: 94M/1444M
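The compile error above comes from `file.filePath` no longer being a `String` under Spark 3.4. The sketch below imitates that situation with two hypothetical stand-in classes (`FakeSparkPath` and `FakePartitionedFile`), which are not Spark classes, and uses `java.net.URI` in place of Hadoop's `Path` so that it runs on a plain JVM. It only illustrates the shape of the change: asking the file object for its path (the `toPath` accessor mentioned earlier in this thread) instead of feeding `filePath` to a `String`-based constructor. The member names on the stand-ins are assumptions.

```scala
import java.net.URI

// Hypothetical stand-in for a wrapper type around a URL-encoded path string.
final case class FakeSparkPath(urlEncoded: String) {
  def toUri: URI = new URI(urlEncoded)
}

// Hypothetical stand-in for a partitioned-file record whose filePath field
// is a wrapper type rather than a String.
final case class FakePartitionedFile(filePath: FakeSparkPath) {
  // Stand-in for a toPath accessor; returns a URI here instead of a Hadoop Path.
  def toPath: URI = filePath.toUri
}

object TypeMismatchDemo {
  val file: FakePartitionedFile =
    FakePartitionedFile(FakeSparkPath("file:/tmp/example.tfrecord"))

  // The next line would not compile, mirroring the build log:
  //   found: FakeSparkPath, required: String
  // val broken = new URI(file.filePath)

  // Going through the accessor avoids depending on the field's type.
  val uri: URI = file.toPath

  def main(args: Array[String]): Unit =
    println(s"${uri.getScheme} ${uri.getPath}")
}
```

Calling an accessor on the file object keeps the reader code independent of how the path is stored, at the cost of requiring a Spark version that provides that accessor.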
@mizhou-in, thanks for checking. I have renamed the branch to spark-3.2-3.3
Published 0.6.0 for both scala-2.12 and scala-2.13.
Makes TFRecordFileReader.scala compatible with spark 3.4.0's `SparkPath`.