Closed Riser01 closed 2 years ago
Do you think it's a performance problem (running slowly), or does it just hang? What would happen if you tried your code on a small subset of the table? Would it work fine?
Also, please note that `make_spark_converter` takes a Spark DataFrame and materializes it into a Parquet file. If your data is big, this materialization step may take a long time.
@selitvin it completed, but took a long time.
I am also getting the following warnings while training:
```
/databricks/python/lib/python3.7/site-packages/petastorm/arrow_reader_worker.py:53: FutureWarning: Calling .data on ChunkedArray is provided for compatibility after Column was removed, simply drop this attribute
  column_as_pandas = column.data.chunks[0].to_pandas()
WARNING:tensorflow:From /databricks/python/lib/python3.7/site-packages/tensorflow/python/ops/summary_ops_v2.py:1277: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
```
This is how the GPU and CPU utilization looks while training:
How does `make_spark_converter` compare with using TFRecords directly?

After `make_spark_converter` is done converting a PySpark DataFrame into Parquet, we instantiate `make_batch_reader()` to read the data from the materialized Parquet. I assume your question is how `make_batch_reader()` compares to reading directly from TFRecords? If so, TFRecords would typically be faster, but it's not exactly comparing apples to apples: the added value of Petastorm is that it saves the typical ETL step required to produce TFRecords and lets you read directly from a Parquet store.

Can I use `workers_count` to increase the speed? Does `workers_count` mean the number of CPU cores on a single-node GPU instance?

Tweaking `workers_count` could help speed up your training. You can also try setting `reader_pool_type='process'`. It's hard to say upfront, since the particular choice of these parameters is a function of the data and the processing you are doing.
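Setting Petastorm itself aside, here is a stdlib-only illustration of why a process pool (`reader_pool_type='process'`) can beat a thread pool for CPU-bound row decoding: Python threads contend on the GIL, while processes run truly in parallel. The `decode` function and `max_workers=4` are stand-ins for the real decoding work and for `workers_count`.

```python
# Compare a thread pool vs. a process pool on CPU-bound work.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def decode(seed):
    # Stand-in for CPU-heavy per-row decoding (pure-Python arithmetic).
    acc = 0
    for i in range(300_000):
        acc = (acc + i * seed) % 1_000_003
    return acc

if __name__ == '__main__':
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        # max_workers plays the role of workers_count in petastorm readers.
        with pool_cls(max_workers=4) as pool:
            results = list(pool.map(decode, range(8)))
        print(pool_cls.__name__, round(time.perf_counter() - start, 2), 's')
```

On most machines the process pool finishes the same batch noticeably faster; the trade-off is extra serialization overhead, which is why the best choice depends on your data and processing, as noted above.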
I have 2 very large (multi-TB) datasets (using Petastorm to train a TF model).
What I am doing is loading the datasets using Petastorm and then creating a single (features, labels) dataset, since I can't pass two separate datasets to the model.
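One common way to do that pairing in plain `tf.data` is `Dataset.zip`. This is a hedged sketch with toy tensors, not the poster's actual Petastorm pipeline; the dataset names are illustrative:

```python
# Combine two element streams into one (features, labels) dataset.
import tensorflow as tf

features_ds = tf.data.Dataset.from_tensor_slices([[1.0, 2.0], [3.0, 4.0]])
labels_ds = tf.data.Dataset.from_tensor_slices([0, 1])

# Keras's model.fit accepts a single dataset yielding (features, labels) pairs.
train_ds = tf.data.Dataset.zip((features_ds, labels_ds)).batch(2)

for x, y in train_ds:
    print(x.shape, y.shape)
```

`Dataset.zip` pairs elements positionally, so both input datasets must be in the same row order for the features and labels to line up.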
Using Petastorm:
Model function:
Training loop:
Error:
Any help would be great.