signcl / docsearch-scraper-action

Algolia DocSearch Scraper in Docker for GitHub Actions
MIT License

Having trouble getting this to run #1

Closed galligan closed 2 years ago

galligan commented 2 years ago

Thanks for putting this together! Hoping it's going to work for our use case.

I tossed the config that you've got as an example directly into algolia.json and the Action spit out this:

2022-02-24T18:51:29.1064679Z ##[group]Run signcl/docsearch-scraper-action@master
2022-02-24T18:51:29.1064976Z env:
2022-02-24T18:51:29.1065591Z   APPLICATION_ID: ***
2022-02-24T18:51:29.1065858Z   API_KEY: ***
2022-02-24T18:51:29.1066830Z   CONFIG: {"index_name":"xmtp_docs","start_urls":["https://docs.xmtp.org/","https://mg0716-docs-updates.xmtp-docs-test.pages.dev/"],"sitemap_urls":["https://docs.xmtp.org/sitemap.xml","https://mg0716-docs-updates.xmtp-docs-test.pages.dev/sitemap.xml"],"sitemap_alternate_links":true,"stop_urls":[],"selectors":{"lvl1":"header h1","lvl2":"article h2","lvl3":"article h3","lvl4":"article h4","lvl5":"article h5, article td:first-child","lvl6":"article h6","text":"article p, article li, article td:last-child"},"strip_chars":" .,;:#","custom_settings":{"separatorsToIndex":"_","attributesForFaceting":["language","version","type","docusaurus_tag"],"attributesToRetrieve":["hierarchy","content","anchor","url","url_without_anchor","type"]}}
2022-02-24T18:51:29.1067792Z ##[endgroup]
2022-02-24T18:51:29.1292370Z ##[command]/usr/bin/docker run --name db2d71370be3b957d46a3bae3ffc9bfb22e1e_6d377d --label 7db2d7 --workdir /github/workspace --rm -e APPLICATION_ID -e API_KEY -e CONFIG -e HOME -e GITHUB_JOB -e GITHUB_REF -e GITHUB_SHA -e GITHUB_REPOSITORY -e GITHUB_REPOSITORY_OWNER -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RETENTION_DAYS -e GITHUB_RUN_ATTEMPT -e GITHUB_ACTOR -e GITHUB_WORKFLOW -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GITHUB_EVENT_NAME -e GITHUB_SERVER_URL -e GITHUB_API_URL -e GITHUB_GRAPHQL_URL -e GITHUB_REF_NAME -e GITHUB_REF_PROTECTED -e GITHUB_REF_TYPE -e GITHUB_WORKSPACE -e GITHUB_ACTION -e GITHUB_EVENT_PATH -e GITHUB_ACTION_REPOSITORY -e GITHUB_ACTION_REF -e GITHUB_PATH -e GITHUB_ENV -e RUNNER_OS -e RUNNER_ARCH -e RUNNER_NAME -e RUNNER_TOOL_CACHE -e RUNNER_TEMP -e RUNNER_WORKSPACE -e ACTIONS_RUNTIME_URL -e ACTIONS_RUNTIME_TOKEN -e ACTIONS_CACHE_URL -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/work/docs/docs":"/github/workspace" 7db2d7:1370be3b957d46a3bae3ffc9bfb22e1e
2022-02-24T18:51:30.5196856Z Traceback (most recent call last):
2022-02-24T18:51:30.5197212Z   File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
2022-02-24T18:51:30.5197465Z     "__main__", mod_spec)
2022-02-24T18:51:30.5197708Z   File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
2022-02-24T18:51:30.5197935Z     exec(code, run_globals)
2022-02-24T18:51:30.5198201Z   File "/root/src/index.py", line 119, in <module>
2022-02-24T18:51:30.5198730Z     run_config(environ['CONFIG'])
2022-02-24T18:51:30.5198972Z   File "/root/src/index.py", line 33, in run_config
2022-02-24T18:51:30.5199204Z     config = ConfigLoader(config)
2022-02-24T18:51:30.5199460Z   File "/root/src/config/config_loader.py", line 84, in __init__
2022-02-24T18:51:30.5199708Z     self._parse()
2022-02-24T18:51:30.5199938Z   File "/root/src/config/config_loader.py", line 120, in _parse
2022-02-24T18:51:30.5200237Z     self.selectors = SelectorsParser().parse(self.selectors)
2022-02-24T18:51:30.5200562Z   File "/root/src/config/selectors_parser.py", line 69, in parse
2022-02-24T18:51:30.5200842Z     config_selectors[selectors_key])
2022-02-24T18:51:30.5201107Z   File "/root/src/config/selectors_parser.py", line 10, in _parse_selectors_set
2022-02-24T18:51:30.5201393Z     selectors_set[key] = config_selectors[key]
2022-02-24T18:51:30.5201640Z TypeError: string indices must be integers
2022-02-24T18:51:30.7383868Z Cleaning up orphan processes

Here's the config file:

{
  "index_name": "docs",
  "start_urls": ["https://example.com/"],
  "sitemap_urls": ["https://example.com/sitemap.xml"],
  "sitemap_alternate_links": true,
  "stop_urls": [],
  "selectors": {
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": ["language", "version", "type", "docusaurus_tag"],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  }
}

Any thoughts?

york1to commented 1 year ago

@galligan I ran into the same problem. Could you share how you solved it?
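
For anyone landing here with the same traceback: one plausible reading is that the `selectors` object in the config has no `lvl0` key. The sketch below is a simplified stand-in for the scraper's selector parsing, not its actual source. It assumes the parser treats `selectors` as a single default set only when `lvl0` is present, and otherwise treats every top-level key as the name of its own selector set. Under that assumption the string "header h1" gets iterated character by character and indexed with a string, which raises the same kind of `TypeError`. The function names and the `lvl0` selector value are hypothetical.

```python
def parse_selectors_set(config_selectors):
    # Mirrors the failing line in the traceback: index the set by each of its keys.
    selectors_set = {}
    for key in config_selectors:
        selectors_set[key] = config_selectors[key]
    return selectors_set


def parse(config_selectors):
    # Assumption: a flat mapping is wrapped as the "default" set only when "lvl0" exists.
    if "lvl0" in config_selectors:
        config_selectors = {"default": config_selectors}
    return {name: parse_selectors_set(group) for name, group in config_selectors.items()}


# Same shape as the config in this issue (trimmed): no "lvl0" key.
without_lvl0 = {"lvl1": "header h1", "text": "article p, article li"}
# Hypothetical "lvl0" selector added; replace it with one that fits your site.
with_lvl0 = {"lvl0": "nav .active", **without_lvl0}

try:
    parse(without_lvl0)
    error = None
except TypeError as exc:
    error = exc

parsed = parse(with_lvl0)

print("without lvl0:", type(error).__name__)
print("with lvl0:", sorted(parsed["default"]))
```

If that reading is right, adding an `lvl0` entry to `selectors` in the config should get past this error. Pick a selector that matches your own site's markup.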