jeroenjanssens / data-science-at-the-command-line

Data Science at the Command Line
https://datascienceatthecommandline.com

first edition, ch5, heading 5.4, scrape script throws error AttributeError: 'str' object has no attribute 'decode' #113

Closed armenic closed 3 years ago

armenic commented 3 years ago

Dear Author,

Thank you so much for an awesome book! I enjoy every page of it. While working through the examples in chapter 5, section 5.4, I encountered the following error. Here are the steps to reproduce it:

docker image:

datasciencetoolbox/dsatcl2e latest ae2a972cf9a6 3 weeks ago 2.67GB

commands submitted:

curl -sL 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' > wiki.html
< wiki.html scrape -b -e 'table.wikitable > tr:not(:first-child)' > table.html

error:

Traceback (most recent call last):
  File "/usr/bin/dsutils/scrape", line 78, in <module>
    exit(main())
  File "/usr/bin/dsutils/scrape", line 37, in main
    args.expression = [e.decode('utf-8') for e in args.expression]
  File "/usr/bin/dsutils/scrape", line 37, in <listcomp>
    args.expression = [e.decode('utf-8') for e in args.expression]
AttributeError: 'str' object has no attribute 'decode'

aborruso commented 3 years ago

@armenic, I do not see the CSS selector query table.wikitable > tr:not(:first-child) in the book.

In any case, the selector looks wrong to me. Try table.wikitable tr:not(:first-child) instead. You can also test it in the browser.
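
One possible reason the child combinator (>) finds nothing, assuming the table rows sit inside a tbody element (the thread does not confirm this for the Wikipedia page), is that tr is then a grandchild of table rather than a direct child. A minimal sketch of the difference, using a made-up table and the standard library's ElementTree paths as stand-ins for the two CSS selectors:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table for illustration only.
# Assumption: rows are wrapped in <tbody>, as parsed HTML tables usually are.
table = ET.fromstring(
    "<table class='wikitable'><tbody>"
    "<tr><th>Country</th></tr>"
    "<tr><td>A</td></tr>"
    "<tr><td>B</td></tr>"
    "</tbody></table>"
)

# "./tr" matches only direct children, like the CSS selector 'table > tr'.
direct_children = table.findall("./tr")

# ".//tr" matches descendants at any depth, like the CSS selector 'table tr'.
descendants = table.findall(".//tr")

print(len(direct_children))  # no rows: <tbody> sits between <table> and <tr>
print(len(descendants))      # all three rows
```

If that assumption holds, table.wikitable > tbody > tr would be another way to keep the stricter child match.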


armenic commented 3 years ago

Thanks @aborruso, and sorry, my mistake: the command is from the first edition, https://www.datascienceatthecommandline.com/1e/chapter-5-scrubbing-data.html#working-with-xmlhtml-and-json. I appreciate the note about the selector, but please also look at the error itself. Apparently the scrape Python script calls the decode method on a string, and str objects have no such method in Python 3 (the image runs Python 3.9.5).
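
For anyone who wants to patch a local copy, here is a minimal sketch of one way to make that line tolerate both bytes and str. The helper name to_text is hypothetical and not part of the scrape script; the idea is only to decode when the value is actually bytes:

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding only when it is bytes.

    Hypothetical helper, not part of the original scrape script.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# In Python 3, argparse already yields str, so those values pass through
# unchanged; bytes values (as in Python 2) are decoded.
expressions = ["table.wikitable tr:not(:first-child)", b"//tr"]
decoded = [to_text(e) for e in expressions]
print(decoded)
```

Under this assumption, the failing line from the traceback could become args.expression = [to_text(e) for e in args.expression].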


jeroenjanssens commented 3 years ago

Thank you @armenic and @aborruso for your feedback. In the second edition scrape has been replaced by pup, so this is no longer an issue.