[spark-tensorflow-connector] All input data are loaded to infer schema

tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks

Open manuzhang opened 5 years ago

manuzhang commented 5 years ago

I find that all input data is loaded even when I only want to print the schema with the following code:

spark.read \
    .option("recordType", "Example") \
    .format("tfrecords") \
    .load(path) \
    .printSchema()

WegenPan commented 4 years ago

You can load the data like this, but you need to prepare the schema beforehand:

sparkSession.read.format("tfrecords").schema(schema).load(inputPattern)

manuzhang commented 4 years ago

Unless I don't know the schema and want to find out with printSchema

WegenPan commented 4 years ago

> Unless I don't know the schema and want to find out with printSchema

I guess the process needs to scan the whole dataset to infer the schema. Maybe you can try inferring it from less data.
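
A rough sketch of that "use less data" idea, not something the connector documents: infer the schema from a single file, then reuse it for the full load so the second read skips inference. The helper names (`pick_sample_files`, `infer_schema_from_sample`, `load_with_schema`) are hypothetical, and the sketch assumes the input is a local directory of `part-*` files that all share one schema. If some features only appear in files outside the sample, the inferred schema may be incomplete.

```python
import glob
import os


def pick_sample_files(input_dir, pattern="part-*", limit=1):
    """Return up to `limit` matching files from a local directory, sorted by name.

    Assumption: the data lives on the local filesystem; for HDFS or object
    storage you would list files with that system's own tooling instead.
    """
    files = sorted(glob.glob(os.path.join(input_dir, pattern)))
    return files[:limit]


def infer_schema_from_sample(spark, sample_path, record_type="Example"):
    """Let the reader infer the schema from one sample path only."""
    return (
        spark.read.format("tfrecords")
        .option("recordType", record_type)
        .load(sample_path)
        .schema
    )


def load_with_schema(spark, path, schema, record_type="Example"):
    """Load the full input with an explicit schema, as suggested earlier in the thread."""
    return (
        spark.read.format("tfrecords")
        .option("recordType", record_type)
        .schema(schema)
        .load(path)
    )
```

With an existing `spark` session, usage might look like `sample = pick_sample_files(input_dir)[0]`, then `schema = infer_schema_from_sample(spark, sample)`, then `load_with_schema(spark, input_dir, schema).printSchema()`.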