pathwaycom / pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
https://pathway.com

getting error in "Streaming ETL pipelines" showcase #48

Open xHishamSaeedx opened 1 month ago

xHishamSaeedx commented 1 month ago

I tried to run the "Streaming ETL pipelines in Python with Airbyte and Pathway" showcase with many sources, and I kept getting the following error:

Traceback (most recent call last):
  File "/home/hisham/Documents/GitHub/Hubspot-activity/test.py", line 3, in <module>
    commits_table = pw.io.airbyte.read(
  File "/usr/local/lib/python3.10/dist-packages/pathway/io/airbyte/__init__.py", line 276, in read
    for stream in source.configured_catalog["streams"]:
  File "/usr/local/lib/python3.10/dist-packages/airbyte_serverless/sources.py", line 169, in __getattr__
    return getattr(self.source, name)
  File "/usr/local/lib/python3.10/dist-packages/airbyte_serverless/sources.py", line 102, in configured_catalog
    configured_catalog = self.catalog
  File "/usr/local/lib/python3.10/dist-packages/airbyte_serverless/sources.py", line 97, in catalog
    message = self._run_and_return_first_message('discover')
  File "/usr/local/lib/python3.10/dist-packages/airbyte_serverless/sources.py", line 73, in _run_and_return_first_message
    message = next(
  File "/usr/local/lib/python3.10/dist-packages/airbyte_serverless/sources.py", line 74, in <genexpr>
    (message for message in messages if message['type'] not in ['LOG', 'TRACE']),
  File "/usr/local/lib/python3.10/dist-packages/airbyte_serverless/sources.py", line 68, in _run
    raise AirbyteSourceException(json.dumps(message['trace']['error']))
airbyte_serverless.sources.AirbyteSourceException: {"message": "Something went wrong in the connector. See the logs for more details.", "internal_message": "[Errno 2] No such file or directory: '/mnt/temp/config.json'", "stack_trace": "Traceback (most recent call last):\n File \"/airbyte/integration_code/main.py\", line 8, in <module>\n run()\n File \"/airbyte/integration_code/source_hubspot/run.py\", line 14, in run\n launch(source, sys.argv[1:])\n File \"/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py\", line 235, in launch\n for message in source_entrypoint.run(parsed_args):\n File \"/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py\", line 108, in run\n raw_config = self.source.read_config(parsed_args.config)\n File \"/usr/local/lib/python3.9/site-packages/airbyte_cdk/connector.py\", line 51, in read_config\n config = BaseConnector._read_json_file(config_path)\n File \"/usr/local/lib/python3.9/site-packages/airbyte_cdk/connector.py\", line 61, in _read_json_file\n with open(file_path, \"r\") as file:\nFileNotFoundError: [Errno 2] No such file or directory: '/mnt/temp/config.json'\n", "failure_type": "system_error", "stream_descriptor": null}

janchorowski commented 1 month ago

Hi, we are investigating the issue. Can you clarify if:

  1. all connectors are failing, or
  2. some connectors work, but others fail?

What is your setup (operating system, architecture)?
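The setup details requested here can be gathered with a short script and pasted into a reply. This is only a sketch using the Python standard library; `package_version` and `environment_report` are hypothetical helper names, and the two package names are assumed to be the pip distribution names used in this thread:

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def package_version(name: str) -> str:
    """Return the installed version of a pip package, or 'not installed'."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"


def environment_report() -> dict:
    """Collect the OS, CPU architecture, Python version, and package versions."""
    return {
        "os": platform.system(),
        "architecture": platform.machine(),
        "python": sys.version.split()[0],
        "pathway": package_version("pathway"),
        "airbyte-serverless": package_version("airbyte-serverless"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it prints one `key: value` line per entry, which is enough to answer the questions about operating system and architecture.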

berkecanrizai commented 1 month ago

Hey @xHishamSaeedx, the issue comes from the auto-generated config; I was able to replicate the error. First, make sure version 0.23 of airbyte-serverless is installed. You can check with: pip show airbyte-serverless.

Create the config with abs create github --source "airbyte/source-github" (I believe you have already done this).

The error comes from several places in the config:

* the `repository` entry (right above the `repositories` entry) shouldn't be left empty; you can remove it completely or comment it out.

* `start_date` also throws that error when it is left empty; you can remove it or populate it.

* lastly, `repositories` should take a list in YAML format (rather than a single string); replace the autogenerated snippet with:
repositories: 
  - "pathwaycom/pathway"

If you want to customize the app for yourself, the changes above should fix it.

If you just want to run the pipeline as in the showcase, you can remove the entire contents of the file after creating the config.

Then copy and paste the config defined in the showcase:

source:
  docker_image: "airbyte/source-github"  # Here the airbyte connector type is specified
  config: 
    credentials:
      option_title: "PAT Credentials"  # The second authentication option you've uncommented
      personal_access_token: "<TOKEN>"  # Taken from https://github.com/settings/tokens
    repositories:
      - pathwaycom/pathway  # Pathway repository
    api_url: "https://api.github.com/"
  streams: commits

Make sure to replace <TOKEN> with your own GitHub token (and make sure the "read public repo" permission is selected when creating the token).

Make sure that the path passed to pw.io.airbyte.read points to the YAML file above:

commits_table = pw.io.airbyte.read(
    "./connections/github.yaml",  # our yaml config
    streams=["commits"],
)

This should run successfully when you call pw.run(). Hope this helps.

xHishamSaeedx commented 1 month ago

It still doesn't work.

YAML file:

source:
  docker_image: "airbyte/source-github"  # Here the airbyte connector type is specified
  config: 
    credentials:
      option_title: "PAT Credentials"  # The second authentication option you've uncommented
      personal_access_token: "<TOKEN>"  # Taken from https://github.com/settings/tokens
    repositories:
      - pathwaycom/pathway  # Pathway repository
    api_url: "https://api.github.com/"
  streams: commits

And the file I ran was:

import pathway as pw

commits_table = pw.io.airbyte.read(
    "./connections/github.yaml",
    streams=["commits"],
)

pw.io.jsonlines.write(commits_table, "commits.jsonlines")
pw.run()

It was running perfectly before; it stopped working only a few days ago, without my making any changes to the YAML file. At that point, GitHub issues weren't being pulled, but comments and commits somehow still were.
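To rule out the config problems listed earlier in the thread (an empty `repository` or `start_date`, or `repositories` given as a single string), the parsed config can be checked before it is handed to Pathway. The following is a minimal sketch under stated assumptions: `check_github_config` is a hypothetical helper, not part of Pathway or airbyte-serverless, and the `broken` dictionary is a made-up illustration of those problems, not the actual auto-generated file.

```python
def check_github_config(config: dict) -> list:
    """Return a list of problems found in a parsed GitHub connector config."""
    problems = []
    source_config = (config.get("source") or {}).get("config") or {}

    # An empty `repository` or `start_date` should be removed or filled in.
    for key in ("repository", "start_date"):
        if key in source_config and not source_config[key]:
            problems.append(f"`{key}` is present but empty")

    # `repositories` must be a list, not a single string.
    repositories = source_config.get("repositories")
    if not isinstance(repositories, list) or not repositories:
        problems.append("`repositories` should be a non-empty list")

    return problems


# Parsed form of the showcase config quoted in this thread.
showcase = {
    "source": {
        "docker_image": "airbyte/source-github",
        "config": {
            "credentials": {
                "option_title": "PAT Credentials",
                "personal_access_token": "<TOKEN>",
            },
            "repositories": ["pathwaycom/pathway"],
            "api_url": "https://api.github.com/",
        },
        "streams": "commits",
    }
}

# Hypothetical config showing all three problems at once.
broken = {
    "source": {
        "docker_image": "airbyte/source-github",
        "config": {
            "repository": "",
            "start_date": None,
            "repositories": "pathwaycom/pathway",
        },
    }
}

print(check_github_config(showcase))
print(check_github_config(broken))
```

With PyYAML installed, the dictionary could be obtained from the file with `yaml.safe_load(open("./connections/github.yaml"))`. A clean result only rules out these three config issues; it does not explain a failure that appears without any config change.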

xHishamSaeedx commented 1 month ago

I got this error a few days ago. Before that it was running smoothly, and the error appeared without my making any changes to the YAML file. I checked other streams, such as commits and comments, and those were still working; then even they stopped.

I also tried the HubSpot connector, but it gave the same error.

The operating system is Linux (Ubuntu) and the architecture is x86_64.

berkecanrizai commented 1 month ago

Hey, I just ran the code from your test-pathway repository without any errors. A few things: