tensorflow / data-validation

Library for exploring and validating machine learning data
Apache License 2.0

generate_statistics_from_csv is very slow for a large dataset on a single server #98

Open yajunwong opened 4 years ago

yajunwong commented 4 years ago

Hi. Following the TFX examples, I pass pipeline_options to generate_statistics_from_csv with --direct_num_workers=16 set, like this:

pipeline_options = PipelineOptions(['--direct_num_workers=16'])

This option does not seem to speed up the API: with direct_num_workers=1, the run takes about as long as with 16 workers:

# direct_num_workers=1
python prep.py  99.27s user 5.84s system 99% cpu 1:45.67 total

# direct_num_workers=16
python prep.py  101.92s user 5.22s system 98% cpu 1:48.44 total
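
A quick way to read these two timings is to divide CPU time (user + system) by wall-clock time, which gives the average number of cores kept busy. The sketch below does that arithmetic for the two lines above. The regular expression assumes the `time` output format shown here, and the helper names are made up for illustration.

```python
import re

# Timing lines copied from the two runs reported above.
RUNS = {
    1: "python prep.py  99.27s user 5.84s system 99% cpu 1:45.67 total",
    16: "python prep.py  101.92s user 5.22s system 98% cpu 1:48.44 total",
}

# Assumes the "<user>s user <sys>s system <pct>% cpu <m>:<s> total" layout.
TIME_RE = re.compile(
    r"(?P<user>[\d.]+)s user (?P<system>[\d.]+)s system \d+% cpu "
    r"(?P<minutes>\d+):(?P<seconds>[\d.]+) total"
)


def parse_time_line(line):
    """Return (cpu_seconds, wall_seconds) parsed from one timing line."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized time output: {line!r}")
    cpu = float(match["user"]) + float(match["system"])
    wall = int(match["minutes"]) * 60 + float(match["seconds"])
    return cpu, wall


def busy_cores(line):
    """Average number of cores kept busy: CPU time / wall-clock time."""
    cpu, wall = parse_time_line(line)
    return cpu / wall


for workers, line in RUNS.items():
    print(f"direct_num_workers={workers}: {busy_cores(line):.2f} cores busy")
```

Both runs come out just under one busy core, which suggests the work is still running on a single core regardless of the worker count.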

Could someone help me?

paulgc commented 4 years ago

Another option is to try using generate_statistics_from_dataframe if you can load your dataset as a pandas dataframe.

import tensorflow_data_validation as tfdv
import pandas as pd
CSV_FILE_PATH = ''
df = pd.read_csv(CSV_FILE_PATH)
stats = tfdv.generate_statistics_from_dataframe(df)

IveJ commented 4 years ago

Hi Yajunwong,

When you execute your pipeline locally, the default values for the properties in PipelineOptions are generally sufficient, and the direct runner executes on a single machine. See https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
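
For anyone who wants to experiment with the direct runner on one machine, here is a minimal sketch that pairs `--direct_num_workers` with Beam's `--direct_running_mode` flag. It rests on assumptions not confirmed in this thread: that the installed Apache Beam supports `--direct_running_mode`, and that multi-process execution helps for this dataset. The helper names `direct_runner_flags` and `generate_stats` are hypothetical.

```python
def direct_runner_flags(num_workers, running_mode="multi_processing"):
    """Build Beam direct-runner flags as a list of strings.

    Assumption: the installed Apache Beam accepts --direct_running_mode
    with one of in_memory, multi_threading or multi_processing.
    """
    allowed = {"in_memory", "multi_threading", "multi_processing"}
    if running_mode not in allowed:
        raise ValueError(f"unknown running mode: {running_mode!r}")
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    return [
        f"--direct_num_workers={num_workers}",
        f"--direct_running_mode={running_mode}",
    ]


def generate_stats(csv_path, num_workers=16):
    """Run TFDV's CSV statistics generation with the flags built above."""
    # Imported lazily so the flag helper can be used without TFDV installed.
    from apache_beam.options.pipeline_options import PipelineOptions
    import tensorflow_data_validation as tfdv

    options = PipelineOptions(direct_runner_flags(num_workers))
    return tfdv.generate_statistics_from_csv(
        data_location=csv_path, pipeline_options=options)
```

Calling `generate_stats("data.csv")` (a placeholder path) and timing it against the single-worker run would show whether the extra flag changes anything in a given environment.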

yajunwong commented 4 years ago

The direct_num_workers option does not seem to have any effect. Please refer to this gist: https://gist.github.com/yajunwong/f317c565f375125fd3ec2963967ba164

yajunwong commented 4 years ago

I tried the generate_statistics_from_dataframe API suggested above, but it reports an error. Please refer to this issue: https://github.com/tensorflow/data-validation/issues/98#issuecomment-570701242