google / corpuscrawler

Crawler for linguistic corpora

Error when crawling Kaqchikel #42

Closed ftyers closed 5 years ago

ftyers commented 5 years ago

Not sure what is going on here:

fran@ipek:~/source/corpuscrawler$ ./corpuscrawler --output ~/corpora/languages/kaqchikel/corpcrawl/ --language cak
Downloading:    http://listen.bible.is/robots.txt
Downloading:    http://listen.bible.is/CAKSBG/Matt/1
Traceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/crawl_cak.py", line 21, in crawl
    crawl_bibleis(crawler, out, bible='CAKSBG')
  File "/home/fran/source/corpuscrawler/Lib/corpuscrawler/util.py", line 718, in crawl_bibleis
    jsonraw = json.loads(content.split('var chaptersByBook = ')[1].split(';\n')[0])
IndexError: list index out of range

brawer commented 5 years ago

Looks like the website listen.bible.is has changed their HTML, so the code for crawl_bibleis() in Lib/corpuscrawler/util.py would need to be adjusted. Do you want to make the change and send a pull request?
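
For illustration, the IndexError in the traceback arises because `content.split('var chaptersByBook = ')` returns a one-element list when the marker is no longer in the page, so indexing `[1]` fails. A minimal sketch of a more defensive extraction follows, assuming Python 3. The helper name `extract_chapters_by_book` and the two sample page strings are hypothetical and are not taken from corpuscrawler or from the actual fix.

```python
import json

# Marker string taken from the failing line in the traceback.
MARKER = 'var chaptersByBook = '


def extract_chapters_by_book(content):
    """Return the JSON value assigned to chaptersByBook, or None if absent.

    Hypothetical helper: it reports a missing marker as None instead of
    raising IndexError, so the caller can decide how to handle a changed page.
    """
    # partition() always returns a 3-tuple; `found` is empty if no marker.
    _, found, rest = content.partition(MARKER)
    if not found:
        return None
    # Same convention as the original code: the value ends at ';\n'.
    raw = rest.split(';\n', 1)[0]
    try:
        return json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# Made-up page contents, only to exercise both paths.
old_page = 'var chaptersByBook = {"Matt": [1, 2]};\nvar x = 1;\n'
new_page = '<html><body>redesigned page</body></html>'

print(extract_chapters_by_book(old_page))
print(extract_chapters_by_book(new_page))
```

A caller could then log a clear message such as "page layout changed" when the result is None, rather than surfacing a bare IndexError.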

ftyers commented 5 years ago

Ooh, that would be cool, thanks! :)

cash commented 5 years ago

This can be closed now that #45 has been merged.