typesense / typesense-docsearch-scraper

A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
https://typesense.org/docs/guide/docsearch.html

Selenium test failing on the master branch #53

Closed by CodeSandwich 8 months ago

CodeSandwich commented 8 months ago

Description

The scraper/src/tests/config_loader/open_selenium_browser_test.py::TestOpenSeleniumBrowser::test_browser_needed_when_config_contains_automatic_tag test is failing.

Steps to reproduce

Follow the steps under "Setting up the Python environment" in CONTRIBUTING.md, then run ./docsearch test.

Expected Behavior

All the tests are passing.

Actual Behavior

All the tests pass except scraper/src/tests/config_loader/open_selenium_browser_test.py::TestOpenSeleniumBrowser::test_browser_needed_when_config_contains_automatic_tag, which fails with selenium.common.exceptions.JavascriptException: Message: javascript error: $ is not defined.

The full output

```
['pytest', './scraper/src']
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/zuczek/workspace/typesense-docsearch-scraper
collected 98 items

scraper/src/tests/config_loader/anchors_test.py ...                      [  3%]
scraper/src/tests/config_loader/basic_test.py ....                       [  7%]
scraper/src/tests/config_loader/domains_test.py ....                     [ 11%]
scraper/src/tests/config_loader/get_extra_facets_test.py ....            [ 15%]
scraper/src/tests/config_loader/open_selenium_browser_test.py ..F        [ 18%]
scraper/src/tests/config_loader/selectors_exclude_test.py ...            [ 21%]
scraper/src/tests/config_loader/sitemap_test.py ...                      [ 24%]
scraper/src/tests/config_loader/start_urls_test.py ......                [ 30%]
scraper/src/tests/config_loader/stop_urls_test.py ..                     [ 32%]
scraper/src/tests/default_strategy/custom_attributes_test.py .           [ 33%]
scraper/src/tests/default_strategy/default_value_test.py ......          [ 39%]
scraper/src/tests/default_strategy/get_anchor_test.py .......            [ 46%]
scraper/src/tests/default_strategy/get_hierarchy_radio_test.py ...       [ 50%]
scraper/src/tests/default_strategy/get_level_weight_test.py .            [ 51%]
scraper/src/tests/default_strategy/get_records_from_dom_test.py ................. [ 68%]
scraper/src/tests/default_strategy/get_settings_test.py .                [ 69%]
scraper/src/tests/default_strategy/globals_test.py ......                [ 75%]
scraper/src/tests/default_strategy/meta_test.py .........                [ 84%]
scraper/src/tests/default_strategy/min_indexed_level_test.py .           [ 85%]
scraper/src/tests/default_strategy/page_rank_test.py ....                [ 89%]
scraper/src/tests/default_strategy/searchable_level_test.py ..           [ 91%]
scraper/src/tests/default_strategy/strip_chars_test.py ..                [ 93%]
scraper/src/tests/default_strategy/tags_test.py ...                      [ 96%]
scraper/src/tests/default_strategy/xpath_test.py ...                     [100%]

=================================== FAILURES ===================================
_ TestOpenSeleniumBrowser.test_browser_needed_when_config_contains_automatic_tag _

self = , monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f3ef24e2230>

    def test_browser_needed_when_config_contains_automatic_tag(self, monkeypatch):
        from .mocked_init import MockedInit
        monkeypatch.setattr("selenium.webdriver.chrome", lambda x: MockedInit())
        monkeypatch.setattr("time.sleep", lambda x: "")

        # When
        c = config({
            "start_urls": [
                {
                    "url": "https://symfony.com/doc/(?P.*?)/(?P.*?)/",
                    "variables": {
                        "version": {
                            "url": "https://symfony.com/doc/current/book/controller.html",
                            "js": """\
    var versions = $('.doc-switcher .versions li').map(function (i, elt) {\
      return $(elt).find('a').html().split('/')[0].replace(/ |\\n/g,'');\
    }).toArray();\
    versions.push('current');\
    return JSON.stringify(versions);"""
                        },
                        "type_of_content": ["book", "bundles", "reference", "components", "cookbook", "best_practices"]
                    }
                }
            ]
        })

>       actual = ConfigLoader(c)

scraper/src/tests/config_loader/open_selenium_browser_test.py:61:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
scraper/src/config/config_loader.py:85: in __init__
    self._parse()
scraper/src/config/config_loader.py:125: in _parse
    self.start_urls = UrlsParser.parse(self.start_urls)
scraper/src/config/urls_parser.py:54: in parse
    values[match] = executor.execute(
scraper/src/js_executor.py:16: in execute
    result = self.driver.execute_script(js)
../../.local/share/virtualenvs/typesense-docsearch-scraper-OXKn5A9S/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py:500: in execute_script
    return self.execute(command, {"script": script, "args": converted_args})["value"]
../../.local/share/virtualenvs/typesense-docsearch-scraper-OXKn5A9S/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py:440: in execute
    self.error_handler.check_response(response)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
response = {'status': 500, 'value': '{"value":{"error":"javascript error","message":"javascript error: $ is not defined\\n (Sess...\\n#18 0x55acb52230f5 \\u003Cunknown>\\n#19 0x55acb5231cce \\u003Cunknown>\\n#20 0x7f27c94aa9eb \\u003Cunknown>\\n"}}'}

    def check_response(self, response: Dict[str, Any]) -> None:
        """Checks that a JSON response from the WebDriver does not have an error.

        :Args:
         - response - The JSON response from the WebDriver server as a dictionary object.

        :Raises: If the response contains an error message.
        """
        status = response.get("status", None)
        if not status or status == ErrorCode.SUCCESS:
            return
        value = None
        message = response.get("message", "")
        screen: str = response.get("screen", "")
        stacktrace = None
        if isinstance(status, int):
            value_json = response.get("value", None)
            if value_json and isinstance(value_json, str):
                import json

                try:
                    value = json.loads(value_json)
                    if len(value) == 1:
                        value = value["value"]
                    status = value.get("error", None)
                    if not status:
                        status = value.get("status", ErrorCode.UNKNOWN_ERROR)
                        message = value.get("value") or value.get("message")
                        if not isinstance(message, str):
                            value = message
                            message = message.get("message")
                    else:
                        message = value.get("message", None)
                except ValueError:
                    pass

        exception_class: Type[WebDriverException]
        if status in ErrorCode.NO_SUCH_ELEMENT:
            exception_class = NoSuchElementException
        elif status in ErrorCode.NO_SUCH_FRAME:
            exception_class = NoSuchFrameException
        elif status in ErrorCode.NO_SUCH_SHADOW_ROOT:
            exception_class = NoSuchShadowRootException
        elif status in ErrorCode.NO_SUCH_WINDOW:
            exception_class = NoSuchWindowException
        elif status in ErrorCode.STALE_ELEMENT_REFERENCE:
            exception_class = StaleElementReferenceException
        elif status in ErrorCode.ELEMENT_NOT_VISIBLE:
            exception_class = ElementNotVisibleException
        elif status in ErrorCode.INVALID_ELEMENT_STATE:
            exception_class = InvalidElementStateException
        elif (
            status in ErrorCode.INVALID_SELECTOR
            or status in ErrorCode.INVALID_XPATH_SELECTOR
            or status in ErrorCode.INVALID_XPATH_SELECTOR_RETURN_TYPER
        ):
            exception_class = InvalidSelectorException
        elif status in ErrorCode.ELEMENT_IS_NOT_SELECTABLE:
            exception_class = ElementNotSelectableException
        elif status in ErrorCode.ELEMENT_NOT_INTERACTABLE:
            exception_class = ElementNotInteractableException
        elif status in ErrorCode.INVALID_COOKIE_DOMAIN:
            exception_class = InvalidCookieDomainException
        elif status in ErrorCode.UNABLE_TO_SET_COOKIE:
            exception_class = UnableToSetCookieException
        elif status in ErrorCode.TIMEOUT:
            exception_class = TimeoutException
        elif status in ErrorCode.SCRIPT_TIMEOUT:
            exception_class = TimeoutException
        elif status in ErrorCode.UNKNOWN_ERROR:
            exception_class = WebDriverException
        elif status in ErrorCode.UNEXPECTED_ALERT_OPEN:
            exception_class = UnexpectedAlertPresentException
        elif status in ErrorCode.NO_ALERT_OPEN:
            exception_class = NoAlertPresentException
        elif status in ErrorCode.IME_NOT_AVAILABLE:
            exception_class = ImeNotAvailableException
        elif status in ErrorCode.IME_ENGINE_ACTIVATION_FAILED:
            exception_class = ImeActivationFailedException
        elif status in ErrorCode.MOVE_TARGET_OUT_OF_BOUNDS:
            exception_class = MoveTargetOutOfBoundsException
        elif status in ErrorCode.JAVASCRIPT_ERROR:
            exception_class = JavascriptException
        elif status in ErrorCode.SESSION_NOT_CREATED:
            exception_class = SessionNotCreatedException
        elif status in ErrorCode.INVALID_ARGUMENT:
            exception_class = InvalidArgumentException
        elif status in ErrorCode.NO_SUCH_COOKIE:
            exception_class = NoSuchCookieException
        elif status in ErrorCode.UNABLE_TO_CAPTURE_SCREEN:
            exception_class = ScreenshotException
        elif status in ErrorCode.ELEMENT_CLICK_INTERCEPTED:
            exception_class = ElementClickInterceptedException
        elif status in ErrorCode.INSECURE_CERTIFICATE:
            exception_class = InsecureCertificateException
        elif status in ErrorCode.INVALID_COORDINATES:
            exception_class = InvalidCoordinatesException
        elif status in ErrorCode.INVALID_SESSION_ID:
            exception_class = InvalidSessionIdException
        elif status in ErrorCode.UNKNOWN_METHOD:
            exception_class = UnknownMethodException
        else:
            exception_class = WebDriverException
        if not value:
            value = response["value"]
        if isinstance(value, str):
            raise exception_class(value)
        if message == "" and "message" in value:
            message = value["message"]

        screen = None  # type: ignore[assignment]
        if "screen" in value:
            screen = value["screen"]

        stacktrace = None
        st_value = value.get("stackTrace") or value.get("stacktrace")
        if st_value:
            if isinstance(st_value, str):
                stacktrace = st_value.split("\n")
            else:
                stacktrace = []
                try:
                    for frame in st_value:
                        line = frame.get("lineNumber", "")
                        file = frame.get("fileName", "")
                        if line:
                            file = f"{file}:{line}"
                        meth = frame.get("methodName", "")
                        if "className" in frame:
                            meth = f"{frame['className']}.{meth}"
                        msg = " at %s (%s)"
                        msg = msg % (meth, file)
                        stacktrace.append(msg)
                except TypeError:
                    pass
        if exception_class == UnexpectedAlertPresentException:
            alert_text = None
            if "data" in value:
                alert_text = value["data"].get("text")
            elif "alert" in value:
                alert_text = value["alert"].get("text")
            raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
>       raise exception_class(message, screen, stacktrace)
E       selenium.common.exceptions.JavascriptException: Message: javascript error: $ is not defined
E         (Session info: headless chrome=119.0.6045.159)
E       Stacktrace:
E       #0 0x55acb52326d4
E       #1 0x55acb4f3748e
E       #2 0x55acb4f3c570
E       #3 0x55acb4f3e0d4
E       #4 0x55acb4fb69cf
E       #5 0x55acb4f9f132
E       #6 0x55acb4fb5f65
E       #7 0x55acb4f9eed3
E       #8 0x55acb4f71420
E       #9 0x55acb4f72a93
E       #10 0x55acb52054c0
E       #11 0x55acb5208780
E       #12 0x55acb52081fa
E       #13 0x55acb5208c95
E       #14 0x55acb51f765b
E       #15 0x55acb5209080
E       #16 0x55acb51e2830
E       #17 0x55acb5222ee7
E       #18 0x55acb52230f5
E       #19 0x55acb5231cce
E       #20 0x7f27c94aa9eb

../../.local/share/virtualenvs/typesense-docsearch-scraper-OXKn5A9S/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py:245: JavascriptException
=============================== warnings summary ===============================
scraper/src/tests/config_loader/get_extra_facets_test.py::TestGetExtraFacets::test_extra_facets_should_be_set_from_start_urls_variables_browser
scraper/src/tests/config_loader/get_extra_facets_test.py::TestGetExtraFacets::test_extra_facets_should_be_set_from_start_urls_variables_with_two_start_url_browser
scraper/src/tests/config_loader/get_extra_facets_test.py::TestGetExtraFacets::test_extra_facets_should_be_set_from_start_urls_variables_with_multiple_tags_browser
scraper/src/tests/config_loader/open_selenium_browser_test.py::TestOpenSeleniumBrowser::test_browser_needed_when_js_render_true
scraper/src/tests/config_loader/open_selenium_browser_test.py::TestOpenSeleniumBrowser::test_browser_needed_when_config_contains_automatic_tag
scraper/src/tests/config_loader/start_urls_test.py::TestStartUrls::test_start_urls_should_be_generated_when_there_is_automatic_tagging_browser
  /home/zuczek/workspace/typesense-docsearch-scraper/scraper/src/config/browser_handler.py:35: DeprecationWarning: executable_path has been deprecated, please pass in a Service object
    driver = webdriver.Chrome(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED scraper/src/tests/config_loader/open_selenium_browser_test.py::TestOpenSeleniumBrowser::test_browser_needed_when_config_contains_automatic_tag - selenium.common.exceptions.JavascriptException: Message: javascript error: $ is not defined
================== 1 failed, 97 passed, 6 warnings in 6.62s ===================
```

Metadata

Typesense Version: 0.9.1

OS: Linux Manjaro

CodeSandwich commented 8 months ago

So the problem is that this test builds its list of start URLs by scraping https://symfony.com/doc/current/book/controller.html for the list of Symfony versions. The page has apparently been updated since the test was written: the JS snippet relies on jQuery's $ and on specific DOM elements, and it no longer works against the current page, so the test fails.
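To illustrate why depending on a live page is fragile, here is a minimal sketch of how the version lookup could be exercised without a network or a real browser. FakeDriver and resolve_versions are hypothetical names invented for this example, not part of the scraper, and the version "6.4" is made-up sample data. Only the two WebDriver methods the sketch needs (get and execute_script) are imitated.

```python
import json


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver.

    It records what it was asked to do and returns a canned result
    instead of running JavaScript against a live page.
    """

    def __init__(self, canned_result):
        self.canned_result = canned_result
        self.visited = []
        self.scripts = []

    def get(self, url):
        # A real driver would load the page here.
        self.visited.append(url)

    def execute_script(self, js):
        # A real driver would run `js` in the page and return its value.
        self.scripts.append(js)
        return self.canned_result


def resolve_versions(driver, page_url, js):
    """Load page_url, run js, and decode the JSON list it returns.

    Hypothetical helper: it mirrors the general idea of the lookup
    described above, not the scraper's actual code.
    """
    driver.get(page_url)
    return json.loads(driver.execute_script(js))


driver = FakeDriver('["6.4", "current"]')  # made-up sample data
versions = resolve_versions(
    driver,
    "https://symfony.com/doc/current/book/controller.html",
    "return JSON.stringify([]);",  # placeholder script, never executed here
)
print(versions)  # ['6.4', 'current']
```

With a stand-in like this, the test result would no longer depend on whether the remote page still ships jQuery or keeps the same markup.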