linkedin / spark-tfrecord

Read and write Tensorflow TFRecord data from Apache Spark.
BSD 2-Clause "Simplified" License
290 stars 57 forks source link

make spark 3.4.0 compatible #64

Closed ssiegel95 closed 1 year ago

ssiegel95 commented 1 year ago

Makes TFRecordFileReader.scala compatible with spark 3.4.0's SparkPath.

junshi15 commented 1 year ago

Can you update the readme? https://github.com/linkedin/spark-tfrecord/blob/master/README.md#including-the-library

Adding a line about 0.6.0 and spark 3.4

ssiegel95 commented 1 year ago

@junshi15 I don't see a clean way of supporting spark 3.2/3.3 and 3.4 simultaneously. This could be due to the fact that my scala skills are very poor (apologies). I tried using

new Path(file.toString),

instead of file.toPath since toString is the only common method I can find on the PartitionedFile class between v3.2/3.3 and v3.4

That fails with

 Cause: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: path:%20file://home/stus/git/github.com/spark-tfrecord/tf-sandbox/example_overwrite.tfrecord/part-00001-9119ef7a-ba07-4f4a-9548-6ea6a85902ea-c000.tfrecord,%20range:%200-828,%20partition%20values:%20%5Bempty%20row%5D
junshi15 commented 1 year ago

No worries. Incompatible change from Spark: https://issues.apache.org/jira/browse/SPARK-41970 I have created a branch for Spark-3.2 Let's support spark-3.4 on the master branch. This PR should only support Spark 3.4. Please address the comments above.

mizhou-in commented 1 year ago

No worries. Incompatible change from Spark: https://issues.apache.org/jira/browse/SPARK-41970 I have created a branch for Spark-3.2 Let's support spark-3.4 on the master branch. This PR should only support Spark 3.4. Please address the comments above.

@junshi15 it seems that the current master also supports spark-3.3, do you think we should create a branch for spark-3.3?

ssiegel95 commented 1 year ago

The branching strategy seems reasonable. I'll address the other issues this week. Thanks again.

junshi15 commented 1 year ago

No worries. Incompatible change from Spark: https://issues.apache.org/jira/browse/SPARK-41970 I have created a branch for Spark-3.2 Let's support spark-3.4 on the master branch. This PR should only support Spark 3.4. Please address the comments above.

@junshi15 it seems that the current master also supports spark-3.3, do you think we should create a branch for spark-3.3?

Does it work out of box or you have to change pom.xml and recompile? I have not tested either of them. If you can verify the binary works for both 3.2 and 3.3, then we can just rename the branch as spark-3.2-3.3. If you have to change pom.xml, then let's create a separate branch and spin a binary. My guess is the former, but please verify it if you have time.

mizhou-in commented 1 year ago

No worries. Incompatible change from Spark: https://issues.apache.org/jira/browse/SPARK-41970 I have created a branch for Spark-3.2 Let's support spark-3.4 on the master branch. This PR should only support Spark 3.4. Please address the comments above.

@junshi15 it seems that the current master also supports spark-3.3, do you think we should create a branch for spark-3.3?

Does it work out of box or you have to change pom.xml and recompile? I have not tested either of them. If you can verify the binary works for both 3.2 and 3.3, then we can just rename the branch as spark-3.2-3.3. If you have to change pom.xml, then let's create a separate branch and spin a binary. My guess is the former, but please verify it if you have time.

@junshi15 I think the pom.xml is the default configuration, which can be changed by the passed parameters to mvn command. I've verified that current master branch can pass all unit tests for spark-3.2 and spark-3.3, however, spark-3.4 will fail with the error @ssiegel95 fixed in this PR.

junshi15 commented 1 year ago

@mizhou-in, thanks for checking. I have renamed the branch to spark-3.2-3.3

junshi15 commented 1 year ago

published 0.6.0 for both scala-2.12/2.13