pyflink / playgrounds

Provide docker environment and examples for PyFlink
Apache License 2.0
186 stars 84 forks source link

Can we do multiprocessing or distributed computing through PyFlink? #21

Open abhalawat opened 2 years ago

abhalawat commented 2 years ago

I am trying to get about 14million data and want this process to work faster.Is there any way PyFlink could help?

dianfu commented 2 years ago

@abhalawat It's processing the data in a distributed way. This is the ability of Flink runtime.

abhalawat commented 2 years ago

How will I achieve that?Will PyFlink help me do that?

dianfu commented 2 years ago

Yes, when you submit a PyFlink to a remote cluster, e.g. YARN, K8s, etc, it will execute it in a distributed manner (you need to set the parallelism to a value other than 1). See https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/cli/#submitting-pyflink-jobs for more details on how to submit PyFlink jobs to a remote cluster.

abhalawat commented 2 years ago

I followed this doc and created task manager and job manager.But while running the first code i.e. 1_word_count.py.

I ma facing the error as: flink

dianfu commented 2 years ago

Could you post the full exception stack?

abhalawat commented 2 years ago

Sure here it is: error1 error2 error3 error4

dianfu commented 2 years ago

@abhalawat From the exception stack, it failed to create the output directory. There should be some permission issues. You could change the sink connector to other connectors such as [print](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/print/) to work around this problem.

abhalawat commented 2 years ago

Should I change this command:

t_env.execute_sql("""
         CREATE TABLE mySink (
           word STRING,
           `count` BIGINT
         ) WITH (
           'connector' = 'filesystem',
           'format' = 'csv',
           'path' = '/opt/examples/table/output/word_count_output'
         )
     """)
abhalawat commented 2 years ago

I created new folder for output named as result then too it was showing error:

output1 output2 output3 output4

abhalawat commented 2 years ago

Also,In here connector has to be file system because I am attaching file in SQL query.