google / corpuscrawler

Crawler for linguistic corpora

Does not run in Python 3.7 or Python 2.7 #73

Open ftyers opened 4 years ago

ftyers commented 4 years ago
$ python2 --version
Python 2.7.16+

$ python3 --version
Python 3.7.2+

$ python3 ./corpuscrawler --language tzh --output output-tzh/
Cache-Hit:      http://listen.bible.is/robots.txt
Traceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/crawl_tzh.py", line 21, in crawl
    crawl_bibleis(crawler, out, bible='TZHSBM')
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 776, in crawl_bibleis
    init = crawler.fetch(firsturl)
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 136, in fetch
    if not self.is_fetch_allowed_by_robots_txt(url):
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 259, in is_fetch_allowed_by_robots_txt
    checker.parse(robots_txt.decode('utf-8'))
AttributeError: 'str' object has no attribute 'decode'
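
The Python 3 traceback above ends with `robots_txt.decode('utf-8')` being called on a value that is already a `str`. Below is a minimal sketch of one possible workaround, assuming the cached robots.txt content can arrive as either `bytes` or `str`: decode only when the value is `bytes`. The helper names `to_text` and `is_allowed` are hypothetical and not part of corpuscrawler; `RobotFileParser` is the Python 3 standard-library parser, used here only for illustration.

```python
from urllib.robotparser import RobotFileParser


def to_text(data, encoding='utf-8'):
    """Return data as str, decoding only when it is bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def is_allowed(robots_txt, url, user_agent='*'):
    """Check url against robots.txt content given as bytes or str."""
    checker = RobotFileParser()
    # parse() expects an iterable of lines, not one large string.
    checker.parse(to_text(robots_txt).splitlines())
    return checker.can_fetch(user_agent, url)
```

With a guard like this, the same call works whether the cache layer hands back raw bytes or already-decoded text.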

$ python2 ./corpuscrawler --language tzh --output output-tzh/
Traceback (most recent call last):
  File "./corpuscrawler", line 24, in <module>
    import corpuscrawler.main
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/main.py", line 20, in <module>
    from corpuscrawler import (
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/crawl_aaz.py", line 16, in <module>
    from corpuscrawler.util import crawl_bibleis
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 18, in <module>
    from builtins import open, bytes, chr
ImportError: No module named builtins
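
The ImportError above arises because `builtins` is a standard-library module only on Python 3; on Python 2 it is supplied by the third-party `future` package. A minimal sketch of an import guard that turns the bare traceback into an actionable message follows. The wording of the message is an assumption for illustration, not something corpuscrawler prints.

```python
import sys

try:
    # Standard library on Python 3; provided by the "future" package on Python 2.
    from builtins import open, bytes, chr
except ImportError:
    sys.exit("The 'builtins' module is missing; on Python 2, install the "
             "'future' package (python-future on Debian).")
```
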

ftyers commented 4 years ago

Fixed the Python 2 ImportError by installing python-future on Debian and running with python2.

But downloading the Ts'eltal corpus still doesn't work. The crawler reports two cache hits and then exits without any error:

$ ./corpuscrawler --language tzh --output output-tzh/
Cache-Hit:      http://listen.bible.is/robots.txt
Cache-Hit:      http://listen.bible.is/TZHSBM/Matt/1
$