abhalawat opened 2 years ago
@abhalawat It's processing the data in a distributed way. This is the ability of Flink runtime.
How will I achieve that? Will PyFlink help me do that?
Yes, when you submit a PyFlink to a remote cluster, e.g. YARN, K8s, etc, it will execute it in a distributed manner (you need to set the parallelism to a value other than 1). See https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/cli/#submitting-pyflink-jobs for more details on how to submit PyFlink jobs to a remote cluster.
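For reference, a submission command along these lines sets the parallelism with `-p` (the binary path, the `--target` value, and the script path below are placeholders; adjust them to your deployment):

```shell
# Submit a PyFlink job to a YARN cluster with parallelism 4.
# Paths and the deployment target are illustrative, not from this thread.
./bin/flink run \
    --target yarn-per-job \
    -p 4 \
    --python /opt/examples/table/1_word_count.py
```

With `-p` greater than 1, the Flink runtime can spread the operators of the job across the available task slots.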
I followed this doc and created a task manager and a job manager, but while running the first example, i.e. 1_word_count.py, I am facing this error:
Could you post the full exception stack?
Sure here it is:
@abhalawat From the exception stack, it failed to create the output directory, so there is likely a permission issue. You could change the sink connector to another connector such as [print](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/print/) to work around this problem.
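As a sketch, the same sink table rewritten to use the print connector (which writes rows to the TaskManager logs and needs no output path) would look like this:

```sql
-- Print sink: no filesystem path, so no directory permissions involved.
CREATE TABLE mySink (
    word STRING,
    `count` BIGINT
) WITH (
    'connector' = 'print'
)
```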
Should I change this command:

```python
t_env.execute_sql("""
    CREATE TABLE mySink (
        word STRING,
        `count` BIGINT
    ) WITH (
        'connector' = 'filesystem',
        'format' = 'csv',
        'path' = '/opt/examples/table/output/word_count_output'
    )
""")
```
I created a new folder for the output, named result, but it still showed the error.
Also, the connector here has to be filesystem because I am attaching a file in the SQL query.
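Note that the source and sink connectors are configured independently, so the source table can keep reading from a file while only the sink is swapped out. A sketch (the input path is a placeholder, not from this thread):

```sql
-- Source: still reads from the filesystem, unchanged by the sink workaround.
CREATE TABLE mySource (
    word STRING
) WITH (
    'connector' = 'filesystem',
    'format' = 'csv',
    'path' = '/opt/examples/table/input'
)
```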
I am trying to process about 14 million records and want this to run faster. Is there any way PyFlink could help?
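Scaling out is mainly a matter of raising the parallelism and giving the cluster enough task slots. As a rough sketch, the cluster-wide defaults can be set in `flink-conf.yaml` (the values below are illustrative, not a recommendation for 14 million records specifically):

```yaml
# flink-conf.yaml: cluster-wide defaults (illustrative values)
parallelism.default: 4
taskmanager.numberOfTaskSlots: 4
```

Per-job parallelism set via `-p` on `flink run`, or in the Table API configuration, takes precedence over these defaults.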