Open yajunwong opened 4 years ago
Another option is to try using generate_statistics_from_dataframe
if you can load your dataset as a pandas dataframe.
import tensorflow_data_validation as tfdv
import pandas as pd
CSV_FILE_PATH = ''
df = pd.read_csv(CSV_FILE_PATH)
stats = tfdv.generate_statistics_from_dataframe(df)
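If the in-memory route works for your data, the `n_jobs` argument of `generate_statistics_from_dataframe` may be worth trying too. The sketch below is only an illustration. The helper name `stats_from_csv_in_memory` is hypothetical. It assumes the CSV fits in memory and that the installed TFDV version accepts `n_jobs`, with -1 meaning "use all CPU cores".

```python
def stats_from_csv_in_memory(csv_path, n_jobs=1):
    """Load a CSV with pandas and compute TFDV statistics in memory.

    Hypothetical helper for illustration; it is not part of TFDV.
    """
    # Reject values assumed to be invalid for n_jobs before doing any work.
    if n_jobs == 0 or n_jobs < -1:
        raise ValueError("n_jobs must be a positive integer or -1")

    # Imported lazily so the argument check above runs without TFDV installed.
    import pandas as pd
    import tensorflow_data_validation as tfdv

    df = pd.read_csv(csv_path)
    # n_jobs is assumed to control the number of worker processes.
    return tfdv.generate_statistics_from_dataframe(df, n_jobs=n_jobs)
```

Calling `stats_from_csv_in_memory(CSV_FILE_PATH, n_jobs=-1)` would then request one process per core, if that assumption holds for your version.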
Hi Yajunwang,
When executing your pipeline locally, the default values for the properties in PipelineOptions are generally sufficient, and the direct runner executes on a single machine. See https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
(In reply to Paul Suganthan's comment of Sat, Jan 4, 2020, 04:18, which suggested generate_statistics_from_dataframe.)
This option does not seem to have any effect. Please refer to this gist: https://gist.github.com/yajunwong/f317c565f375125fd3ec2963967ba164
I tried the generate_statistics_from_dataframe API, but it reported an error. Please refer to this issue comment: https://github.com/tensorflow/data-validation/issues/98#issuecomment-570701242
Hi,
Following the tfx examples, I pass pipeline_options to generate_statistics_from_csv with --direct_num_workers=16. This option does not seem to speed up the API: when I set direct_num_workers=1, the run time is the same as with 16 workers. Could someone help me?
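For anyone trying to reproduce the comparison, here is a minimal sketch of how the two runs might be timed. The helper names `direct_runner_flags` and `time_csv_stats` are hypothetical. The `--direct_running_mode` flag does not appear in the original post: it is a Beam DirectRunner option added here as an assumption, and whether it changes the timing in this case is untested.

```python
import time


def direct_runner_flags(num_workers, running_mode="multi_processing"):
    """Build Beam DirectRunner flags as a list of strings.

    running_mode is an assumption added for this sketch; the original
    post only sets --direct_num_workers.
    """
    return [
        f"--direct_num_workers={num_workers}",
        f"--direct_running_mode={running_mode}",
    ]


def time_csv_stats(csv_path, num_workers):
    """Run generate_statistics_from_csv once; return (stats, seconds).

    Hypothetical helper for illustration; it is not part of TFDV.
    """
    # Imported lazily so the flag helper above can be used on its own.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(direct_runner_flags(num_workers))
    start = time.perf_counter()
    stats = tfdv.generate_statistics_from_csv(
        data_location=csv_path, pipeline_options=options
    )
    return stats, time.perf_counter() - start
```

Calling `time_csv_stats(path, 1)` and `time_csv_stats(path, 16)` on the same file gives a like-for-like comparison of the two settings.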