google / corpuscrawler

Crawler for linguistic corpora

404 error with Myanmar Zawgyi #50

Closed blackblitz closed 4 years ago

blackblitz commented 4 years ago

I ran ./corpuscrawler --language=my-t-d0-zawgyi --output=./corpus (with Python 2.7 on Ubuntu 18.04), and the program crashed with an AssertionError when one of the URLs it tried to download returned a 404. The output is shown below.

Downloading:    http://thanlwintimes.com/robots.txt
Downloading:    http://thanlwintimes.com/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/4/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/5/
Downloading:    http://thanlwintimes.com/category/%e1%80%b1%e1%80%86%e1%80%ac%e1%80%84%e1%80%b9%e1%80%b8%e1%80%95%e1%80%ab%e1%80%b8/page/6/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%84%e1%80%b9%e1%80%90%e1%80%ac%e1%80%97%e1%80%ba%e1%80%b4%e1%80%b8/page/4/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%9a%e1%80%b9%e1%80%92%e1%80%ae%e1%80%90%e1%80%ac%e1%80%b7-%e1%80%a1%e1%80%ac%e1%80%b1%e1%80%98%e1%80%ac%e1%80%b9/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%a1%e1%80%9a%e1%80%b9%e1%80%92%e1%80%ae%e1%80%90%e1%80%ac%e1%80%b7-%e1%80%a1%e1%80%ac%e1%80%b1%e1%80%98%e1%80%ac%e1%80%b9/
Downloading:    http://thanlwintimes.com/category/%e1%80%a1%e1%80%9a%e1%80%b9%e1%80%92%e1%80%ae%e1%80%90%e1%80%ac%e1%80%b7-%e1%80%a1%e1%80%ac%e1%80%b1%e1%80%98%e1%80%ac%e1%80%b9/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/2/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/3/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/4/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/5/
Downloading:    http://thanlwintimes.com/category/%e1%80%80%e1%80%ac%e1%80%90%e1%80%bc%e1%80%94%e1%80%b9%e1%80%b8/page/6/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/
Cache-Hit:      http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/
Downloading:    http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/page/2/
Traceback (most recent call last):
  File "./corpuscrawler", line 28, in <module>
    sys.exit(corpuscrawler.main.main())
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/main.py", line 1249, in main
    crawls[args.language](crawler)
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/crawl_my_t_d0_zawgyi.py", line 24, in crawl
    _crawl_than_lwin_times(crawler, out)
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/crawl_my_t_d0_zawgyi.py", line 28, in _crawl_than_lwin_times
    urls = find_wordpress_urls(crawler, 'http://thanlwintimes.com/')
  File "/home/dell/nlp/corpuscrawler/Lib/corpuscrawler/util.py", line 767, in find_wordpress_urls
    assert pgdoc.status == 200, (pgdoc.status, pgurl)
AssertionError: (404, u'http://thanlwintimes.com/category/%e1%80%9e%e1%80%90%e1%80%84%e1%80%b9%e1%80%b8%e1%80%93%e1%80%ab%e1%80%90%e1%80%b9%e1%80%95%e1%80%af%e1%80%b6/page/2/')

sffc commented 4 years ago

Seems like the bug is here:

https://github.com/google/corpuscrawler/blob/master/Lib/corpuscrawler/util.py#L756

That function should be more lenient when there is a 404. In this case, it found a link to /page/2 in this category, but that page is actually a 404, and that should not be a fatal error.

I would probably change

assert pgdoc.status == 200, (pgdoc.status, pgurl)

to something more like

if pgdoc.status != 200:
    print("Error %3d:      %s" % (pgdoc.status, pgurl))
    # Skip this page and keep crawling the remaining ones.
    continue
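To make the idea concrete, here is a minimal, self-contained sketch of the log-and-skip pattern. It is not the actual code from util.py: the Response tuple, fetch_listing_pages, fake_fetch, and the example.com URLs are all hypothetical stand-ins, and the only thing assumed about the real crawler is that a fetched document exposes status and content.

```python
from collections import namedtuple

# Hypothetical stand-in for the crawler's response object; only the
# `status` and `content` attributes are assumed here.
Response = namedtuple('Response', ['status', 'content'])


def fetch_listing_pages(fetch, page_urls):
    """Return (url, content) pairs for pages that answer 200; skip the rest."""
    fetched = []
    for pgurl in page_urls:
        pgdoc = fetch(pgurl)
        if pgdoc.status != 200:
            # A dead pagination link is reported and skipped, not fatal.
            print("Error %3d:      %s" % (pgdoc.status, pgurl))
            continue
        fetched.append((pgurl, pgdoc.content))
    return fetched


def fake_fetch(url):
    """Hypothetical fetcher: pretends that /page/2/ is a dead link."""
    if url.endswith('/page/2/'):
        return Response(404, '')
    return Response(200, '<html>ok</html>')


pages = fetch_listing_pages(fake_fetch, [
    'http://example.com/category/news/',
    'http://example.com/category/news/page/2/',
    'http://example.com/category/news/page/3/',
])
```

With the assert, the first bad page aborts the whole crawl; with the check above, the 404 is printed in the same column layout as the other log lines and the loop moves on to the next page.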

Can you open a PR?

sffc commented 4 years ago

Closing as fixed; let me know if you have any more issues.