mozilla / fathom

A framework for extracting meaning from web pages
http://mozilla.github.io/fathom/
Mozilla Public License 2.0

fathom train fails: 'WebDriver' object has no attribute 'find_element_by_id' #308

Open Trikolon opened 1 year ago

Trikolon commented 1 year ago

The train command fails because of an incompatibility with more recent versions of geckodriver.

I've tested this with multiple versions of geckodriver. Versions >= 0.30.0 fail with the stack trace below, while versions < 0.30.0 suffer from https://github.com/mozilla/fathom/issues/295.

macOS 12.4 (M1 Max)

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python fathom train samples/training --ruleset rulesets.js --trainee a
Building FathomFox with your ruleset...done.
Running Firefox...done.
Starting HTTP server...done.
Configuring Vectorizer...Traceback (most recent call last):
  File "/Users/pbz/Library/Python/3.8/bin/fathom", line 8, in <module>
    sys.exit(fathom())
  File "/Users/pbz/Library/Python/3.8/lib/python/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/pbz/Library/Python/3.8/lib/python/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/pbz/Library/Python/3.8/lib/python/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/pbz/Library/Python/3.8/lib/python/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/pbz/Library/Python/3.8/lib/python/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/pbz/Library/Python/3.8/lib/python/site-packages/fathom_web/commands/train.py", line 223, in train
    make_or_find_vectors(ruleset,
  File "/Users/pbz/Library/Python/3.8/lib/python/site-packages/fathom_web/vectorizer.py", line 66, in make_or_find_vectors
    vectorize(ruleset, trainee, sample_set, sample_cache, show_browser, kind_of_set, delay, tabs)
  File "/Users/pbz/Library/Python/3.8/lib/python/site-packages/fathom_web/vectorizer.py", line 158, in vectorize
    run_vectorizer(firefox, trainee_id, sample_filenames, output_path, kind_of_set, port, delay, tabs)
  File "/Users/pbz/Library/Python/3.8/lib/python/site-packages/fathom_web/vectorizer.py", line 478, in run_vectorizer
    ruleset_dropdown_selector = Select(firefox.find_element_by_id('trainee'))
AttributeError: 'WebDriver' object has no attribute 'find_element_by_id'

Trikolon commented 1 year ago

It looks like the firefox.find_element_by_id calls need to be replaced with firefox.find_element(By.ID, [...]. See https://stackoverflow.com/questions/69875125/find-element-by-commands-are-deprecated-in-selenium
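
A minimal sketch of that replacement, for illustration only. The helper name find_by_id and the FakeDriver stand-in are hypothetical and not part of Fathom or Selenium; the locator string "id" is the value of Selenium's By.ID constant, written out so the sketch runs without Selenium installed.

```python
# By.ID in selenium.webdriver.common.by is the plain string "id"; it is
# spelled out here so the sketch runs without Selenium installed.
BY_ID = "id"


def find_by_id(driver, element_id):
    """Locate an element by id through the generic find_element call.

    Hypothetical helper, not part of Fathom or Selenium.
    """
    return driver.find_element(BY_ID, element_id)


class FakeDriver:
    """Stand-in for a WebDriver so the call shape can be checked offline."""

    def find_element(self, by, value):
        # A real driver returns a WebElement; this just echoes the arguments.
        return (by, value)


# With a real driver, the failing line from the traceback would become roughly:
#   Select(firefox.find_element(By.ID, 'trainee'))
print(find_by_id(FakeDriver(), "trainee"))
```

With a real WebDriver the same call returns the element that Select(...) wraps in run_vectorizer.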

DimiDL commented 1 year ago

According to the Selenium changelog, the deprecated find_element_by_* methods were removed in 4.3.0. So I guess we probably want to change https://github.com/mozilla/fathom/blob/2b2c84eace185b4cc6fa4f75d00d028728a30f8a/cli/setup.py#L21 to 'selenium>=3.141.0,<4.3.0'?

OR

use find_element as Paul suggested and set 'selenium>=4.3.0'.
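
To make the 4.3.0 boundary concrete, here is a rough sketch of a check for which side of it a given Selenium version falls on. The names parse_version and has_legacy_finders are hypothetical, and the parsing assumes simple dotted numeric versions such as "3.141.0".

```python
def parse_version(version):
    """Turn a dotted version such as '4.3.0' into a tuple like (4, 3, 0).

    Assumes simple numeric components; any non-digit suffix on a
    component (e.g. '0rc1') is ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_legacy_finders(selenium_version):
    """True if this Selenium version still ships find_element_by_* helpers."""
    return parse_version(selenium_version) < (4, 3, 0)


for candidate in ("3.141.0", "4.2.0", "4.3.0"):
    print(candidate, has_legacy_finders(candidate))
```

Pinning below 4.3.0 keeps the first two cases working unchanged; requiring 4.3.0 or later means every find_element_by_* call site has to be migrated.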

Trikolon commented 1 year ago

I think find_element is better. It was an easy fix for me locally. However, I'm not very familiar with the codebase, so there may be more breaking changes I haven't triggered yet.
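
One way to look for other call sites that have not been triggered yet is to scan the source for the removed method names. This is only a sketch: find_legacy_calls is a hypothetical helper, and the sample string is made up for illustration (only its first line mirrors the traceback).

```python
import re

# Matches the per-locator helpers such as find_element_by_id or
# find_elements_by_css_selector.
LEGACY_CALL = re.compile(r"\bfind_elements?_by_\w+")


def find_legacy_calls(source):
    """Return (line_number, method_name) pairs for each legacy call found."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in LEGACY_CALL.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Illustrative input: the first line mirrors the traceback, the second is
# an invented example of the already-migrated form.
sample = (
    "dropdown = Select(firefox.find_element_by_id('trainee'))\n"
    "button = firefox.find_element(By.ID, 'go')\n"
)
print(find_legacy_calls(sample))
```

Running the same function over each .py file in a checkout would list the remaining lines to migrate. It only catches the find_element(s)_by_* family, not other API changes between Selenium versions.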