rbaron / dict.cc.py

📘 Unofficial command line client for dict.cc

More robust HTML parsing #15

Closed by rbaron 6 years ago

rbaron commented 6 years ago

The current HTML parsing is hacky and breaks easily when the results page changes. I think we would benefit from a more robust strategy for locating the relevant tags.

The parsing is currently done here.
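
To make "a more robust strategy" concrete, here is a minimal sketch of selecting table cells by tag name and class instead of by position or string offsets. It uses only the standard library's `html.parser`. The helper names (`CellTextParser`, `extract_cells`), the class name `entry`, and the sample markup are all hypothetical and do not reflect the real dict.cc results page or the project's actual parser.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of <td> cells that carry a given CSS class.

    Hypothetical helper for illustration only; the class name and markup
    used below are assumptions, not the real dict.cc page structure.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.cells = []       # finished cell texts, in document order
        self._buffer = None   # text pieces while inside a wanted cell

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.wanted_class in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            # Join the pieces and normalise runs of whitespace.
            self.cells.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_cells(html, wanted_class="entry"):
    """Return the text of every <td class="...wanted_class..."> in html."""
    parser = CellTextParser(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.cells


# Made-up markup: an unrelated cell followed by two result cells.
SAMPLE = """
<table>
  <tr><td class="ad">Advertisement</td></tr>
  <tr>
    <td class="entry"><a href="#">body</a></td>
    <td class="entry"><a href="#">Leib</a> <var>{m}</var></td>
  </tr>
</table>
"""

print(extract_cells(SAMPLE))
```

Because the cells are matched on their attributes, an extra row or wrapper element added elsewhere on the page should not shift the result, which is the kind of fragility described above. Nested tables are not handled in this sketch.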

aaschmid commented 6 years ago

What about something like this diff? https://github.com/rbaron/dict.cc.py/compare/master...aaschmid:master It might solve #15; it also solves #12 and suggests an improvement for copying and pasting words from the CLI.

Any feedback is welcome, and I am happy to improve it. I can also create separate PRs for each topic if some of it is already OK for you...

aaschmid commented 6 years ago

@rbaron: Any thoughts on my first steps in order to improve it and create a PR?

rbaron commented 6 years ago

Hi @aaschmid,

it seems like your branch is not working correctly on my computer with Python 3:

% dict.cc.py en de body
Traceback (most recent call last):
  File "/private/tmp/dict.cc.py/venv/bin/dict.cc.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/private/tmp/dict.cc.py/scripts/dict.cc.py", line 81, in <module>
    run()
  File "/private/tmp/dict.cc.py/scripts/dict.cc.py", line 62, in run
    args.output_language)
  File "/private/tmp/dict.cc.py/dictcc/dictcc.py", line 60, in translate
    result = cls._parse_response(response_body)
  File "/private/tmp/dict.cc.py/dictcc/dictcc.py", line 101, in _parse_response
    languages = [language.strings.next() for language in soup.find_all("td", class_="td2", attrs={'dir': "ltr"})]
  File "/private/tmp/dict.cc.py/dictcc/dictcc.py", line 101, in <listcomp>
    languages = [language.strings.next() for language in soup.find_all("td", class_="td2", attrs={'dir': "ltr"})]
AttributeError: 'generator' object has no attribute 'next'
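
For context on the `AttributeError` above: Python 3 renamed the iterator method `next()` to `__next__()`, so calling `.next()` on a generator fails there. The built-in `next()` function works on both Python 2.6+ and Python 3. The sketch below shows that pattern. `fake_strings` is a hypothetical stand-in for the generator in the traceback, and `first_string` is an illustrative helper, not necessarily how the branch was actually fixed.

```python
def fake_strings():
    """Hypothetical stand-in for the generator seen in the traceback."""
    yield "English"
    yield "German"


def first_string(strings):
    """Return the first item of an iterator, or None if it is empty.

    Uses the built-in next(), which works on Python 2.6+ and Python 3,
    instead of the Python 2-only method call `strings.next()`.
    """
    return next(strings, None)


gen = fake_strings()
# On Python 3, generators expose __next__ but have no `next` attribute.
print(hasattr(gen, "next"))
print(first_string(gen))
```

Passing a default to `next()` also avoids a `StopIteration` when a matched cell contains no text at all.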

aaschmid commented 6 years ago

Better now, @rbaron?

rbaron commented 6 years ago

Hi @aaschmid, I just tested it and it works great. Feel free to send a PR!