pythonhacker / harvestman-crawler

Automatically exported from code.google.com/p/harvestman-crawler

Error crawling sites containing characters with encoding standards different than Latin-1 #20

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. svn up
2. run in shell: "harvestman -C config.xml"
3.

What is the expected output? What do you see instead?

Expected:

crawl the website

Actual:

Traceback (most recent call last):
  File "/usr/lib/python2.5/logging/__init__.py", line 750, in emit
    self.stream.write(fs % msg)
ValueError: I/O operation on closed file
[21:03:10] Done.
Traceback (most recent call last):
  File "/usr/lib/python2.5/logging/__init__.py", line 750, in emit
    self.stream.write(fs % msg)
ValueError: I/O operation on closed file
Exception in thread fetcher0 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
  File "/home/stefan/work/HarvestMan-2.0/build/lib/harvestman/lib/crawler.py", line 251, in run
  File "/home/stefan/work/HarvestMan-2.0/build/lib/harvestman/lib/crawler.py", line 572, in action
  File "/home/stefan/work/HarvestMan-2.0/build/lib/harvestman/lib/crawler.py", line 272, in sleep
  File "/home/stefan/work/HarvestMan-2.0/build/lib/harvestman/lib/common/common.py", line 97, in sleep
  File "/usr/lib/python2.5/threading.py", line 353, in set
  File "/usr/lib/python2.5/threading.py", line 268, in notifyAll
<type 'exceptions.TypeError'>: 'NoneType' object is not callable

What version of the product are you using? On what operating system?

version 83, Ubuntu 8.04, X86_64

Please provide any additional information below.

Used a Japanese website (see attached config file).

Original issue reported on code.google.com by andrei.p...@gmail.com on 22 Jul 2008 at 6:06

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by abpil...@gmail.com on 6 Oct 2008 at 11:25

GoogleCodeExporter commented 9 years ago
It is 3.30 am here and I have not yet slept... keeping this for tomorrow !

Original comment by abpil...@gmail.com on 11 Oct 2008 at 10:10

GoogleCodeExporter commented 9 years ago
Wow. Get some sleep :) That is late.
I tried this config file but I got this error:
  File "/home/lucas/projects/harvestman-crawler/trunk/HarvestMan/harvestman/apps/spider.py", line 420, in init_config
    self.get_options()
  File "/home/lucas/projects/harvestman-crawler/trunk/HarvestMan/harvestman/apps/appbase.py", line 81, in get_options
    objects.config.get_program_options()
  File "/home/lucas/projects/harvestman-crawler/trunk/HarvestMan/harvestman/lib/config.py", line 1477, in get_program_options
    res = self.parse_arguments()
  File "/home/lucas/projects/harvestman-crawler/trunk/HarvestMan/harvestman/lib/config.py", line 1034, in parse_arguments
    if SUCCESS(self.check_value(option,value)): self.set_option_xml('cache_status', self.process_value(value))
  File "/home/lucas/projects/harvestman-crawler/trunk/HarvestMan/harvestman/lib/config.py", line 721, in set_option_xml
    self.assign_option(option_val, value)
  File "/home/lucas/projects/harvestman-crawler/trunk/HarvestMan/harvestman/lib/config.py", line 590, in assign_option
    fval = (eval(typ))(value)
ValueError: invalid literal for int() with base 10: 'tmp/config-bug20.xml'

Somehow the name of the config file is being passed in as an option value?
Also, check-in 148 has one unit test failing. I am not sure whether the two are connected.
Thanks,
Lucas

Original comment by szybal...@gmail.com on 12 Oct 2008 at 5:05
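
The failing line in the traceback above, `fval = (eval(typ))(value)`, turns a type name into a constructor with `eval` and applies it to the raw option string, which is how the path 'tmp/config-bug20.xml' ended up inside `int()`. A minimal sketch of the same conversion with an explicit lookup table is shown here, written for Python 3. The names `CONVERTERS`, `OptionError` and `coerce_option` are hypothetical and are not part of HarvestMan.

```python
# Hypothetical sketch: none of these names exist in HarvestMan.
CONVERTERS = {"int": int, "float": float, "str": str}


class OptionError(ValueError):
    """Raised when an option value cannot be converted to its declared type."""


def coerce_option(name, type_name, value):
    """Convert the raw string `value` with the converter registered for `type_name`."""
    try:
        converter = CONVERTERS[type_name]
    except KeyError:
        raise OptionError("option %r has unknown type %r" % (name, type_name))
    try:
        return converter(value)
    except ValueError as exc:
        # Name the option so a misplaced value (such as a file path) is easy to spot.
        raise OptionError("option %r: cannot convert %r to %s (%s)"
                          % (name, value, type_name, exc))
```

With this shape, a misrouted value fails with a message that names the option (here 'cache_status') rather than a bare `int()` error.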

GoogleCodeExporter commented 9 years ago
I am not seeing any error like this when trying with this config.xml. Also, no unit test is failing for me. Can you let me know the full command line you used to run the program?

Original comment by abpil...@gmail.com on 12 Oct 2008 at 6:44

GoogleCodeExporter commented 9 years ago
The previous comment was a reply to Lukasz's comment, not to the original bug.
Lukasz, please reply.

For the original bug, I could not reproduce it on my Ubuntu 8.04, i686, Python 2.5.2.
After the fix for issue #21, it looks like the encoding issues are fixed.

I could not test it on x86_64 since I don't have a 64-bit Linux to test on. Andrei, could you check it again on your system with the latest code from the trunk?

Marking this as "Worksforme".

Original comment by abpil...@gmail.com on 12 Oct 2008 at 7:22
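
For readers wondering what handling non-Latin-1 pages involves, here is a minimal sketch in Python 3 of decoding fetched bytes with a declared charset and a list of fallbacks. It is not the fix made for issue #21, whose details are not shown in this thread. The function name `decode_page` and the fallback order are assumptions chosen for illustration.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "shift_jis", "euc_jp")):
    """Decode `raw` bytes, trying the declared charset first, then the fallbacks.

    Hypothetical helper, not HarvestMan code. Latin-1 with replacement is the
    last resort because it accepts any byte sequence.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset name: move on to the next candidate.
            continue
    return raw.decode("latin-1", errors="replace")


if __name__ == "__main__":
    page = "日本語".encode("shift_jis")
    print(decode_page(page, declared="shift_jis"))
```

Guessing by trial can pick a wrong charset for short inputs, so a charset declared by the server or the page should take priority whenever it is available.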

GoogleCodeExporter commented 9 years ago

Original comment by abpil...@gmail.com on 12 Oct 2008 at 7:24

GoogleCodeExporter commented 9 years ago

Original comment by abpil...@gmail.com on 12 Oct 2008 at 7:25

GoogleCodeExporter commented 9 years ago
My feeling is that this is a "random" bug. It happens in HarvestMan because it uses many threads, and they can sometimes produce "chaotic" bugs that are difficult to reproduce. Let me know if this is a repeating bug for you and I will test it further.

Original comment by abpil...@gmail.com on 12 Oct 2008 at 7:26
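
The original traceback says the exception in fetcher0 was "most likely raised during interpreter shutdown", and the logging errors report writes to a closed file. One possible explanation, not confirmed in this thread, is that worker threads were still running while the main thread was tearing things down. A minimal Python 3 sketch of an orderly shutdown follows. The `Fetcher` class and `crawl` function are hypothetical stand-ins, not HarvestMan's crawler code.

```python
import queue
import threading


class Fetcher(threading.Thread):
    """Hypothetical worker; stands in for a crawler thread such as fetcher0."""

    def __init__(self, tasks, results):
        super().__init__(daemon=False)  # non-daemon: the interpreter waits for it
        self.tasks = tasks
        self.results = results
        self.stop_requested = threading.Event()

    def run(self):
        while not self.stop_requested.is_set():
            try:
                url = self.tasks.get(timeout=0.1)
            except queue.Empty:
                continue  # nothing queued yet; re-check the stop flag
            self.results.append("fetched " + url)  # placeholder for real work
            self.tasks.task_done()

    def stop(self):
        self.stop_requested.set()


def crawl(urls, workers=2):
    """Process every URL, then stop and join all workers before returning."""
    tasks, results = queue.Queue(), []
    threads = [Fetcher(tasks, results) for _ in range(workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    tasks.join()  # wait until every queued URL has been processed
    for t in threads:
        t.stop()
    for t in threads:
        t.join()  # no worker is alive once this loop finishes
    return results
```

Because every worker is joined before `crawl` returns, log handlers and other shared resources can be closed afterwards without a thread still trying to use them.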

GoogleCodeExporter commented 9 years ago
Did an svn update and python setup.py install, but got the following error when running harvestman --selftest:

 harvestman --selftest
Traceback (most recent call last):
  File "/usr/bin/harvestman", line 8, in <module>
    load_entry_point('HarvestMan==2.0.3dev-r156', 'console_scripts', 'harvestman')()
  File "/usr/lib/python2.5/site-packages/pkg_resources.py", line 277, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.5/site-packages/pkg_resources.py", line 2179, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.5/site-packages/pkg_resources.py", line 1912, in load
    entry = __import__(self.module_name, globals(),globals(), ['__name__'])
  File "/usr/lib/python2.5/site-packages/HarvestMan-2.0.3dev_r156-py2.5.egg/harvestman/apps/spider.py", line 92, in <module>
    from harvestman.lib.event import HarvestManEvent
  File "/usr/lib/python2.5/site-packages/HarvestMan-2.0.3dev_r156-py2.5.egg/harvestman/apps/harvestman.py", line 90, in <module>
    from event import HarvestManEvent
ImportError: No module named event

Original comment by andrei.p...@gmail.com on 12 Oct 2008 at 4:02

GoogleCodeExporter commented 9 years ago
harvestman.py is no longer in the repository. It was replaced by spider.py.

Please check your installation.

You could, for example, cd into the harvestman directory (the one containing folders such as lib and apps) and run:
rm -r ./*
cd ../
svn update --force

Be careful with the rm -r...

Then try again.
Lucas

Original comment by szybal...@gmail.com on 12 Oct 2008 at 4:09

GoogleCodeExporter commented 9 years ago
Removed my repo, then did the checkout and install. Went into /harvestman-crawler/HarvestMan/harvestman/apps and typed: python spider.py -C config-sample.xml

Output is:

Loading system configuration...
Loading user configuration...
Error assigning option "proxyport_value" => Error: invalid literal for int() with base 10: ''
Pass option -h for command line usage.
Printing error traceback for debugging...
  File "/usr/lib/python2.5/site-packages/HarvestMan-2.0.3dev_r156-py2.5.egg/harvestman/lib/config.py", line 697, in set_option_xml_attr
    self.assign_option(option_val, value, attrs)
  File "/usr/lib/python2.5/site-packages/HarvestMan-2.0.3dev_r156-py2.5.egg/harvestman/lib/config.py", line 623, in assign_option
    raise HarvestManConfigError, "Error: " + str(e)
Error: invalid literal for int() with base 10: ''

Original comment by andrei.p...@gmail.com on 12 Oct 2008 at 4:19
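
The error above comes from converting an empty string with `int()`, which always raises ValueError. As a hedged illustration only, here is one way a config loader could treat an empty numeric value as "not set" instead of failing. The helper `parse_int_option` is hypothetical and does not exist in HarvestMan.

```python
def parse_int_option(raw, default=None):
    """Return int(raw), or `default` when raw is None, empty or whitespace only.

    Hypothetical helper for illustration; not part of HarvestMan.
    """
    if raw is None or not raw.strip():
        return default
    # Non-empty but non-numeric input still raises ValueError, on purpose.
    return int(raw)


if __name__ == "__main__":
    print(parse_int_option(""))      # empty value falls back to the default
    print(parse_int_option("8080"))  # numeric strings convert as usual
```

Whether an empty value should silently fall back to a default or be reported is a design choice; the sketch only shows the fallback variant.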

GoogleCodeExporter commented 9 years ago
cd /harvestman-crawler/HarvestMan/
python setup.py install
harvestman -c ./harvestman/apps/config-sample.xml

I am getting it too. Will let you know as soon as we fix it.

Original comment by szybal...@gmail.com on 13 Oct 2008 at 3:38

GoogleCodeExporter commented 9 years ago
Strange, I am not getting this error. Maybe I am missing something?
Lukasz, comment out the exception-handling code in assign_option (let it raise the exception and die) and print out the variables (option_val, value, attrs). This will tell you which one is causing the problem.

Original comment by abpil...@gmail.com on 13 Oct 2008 at 5:06

GoogleCodeExporter commented 9 years ago
Guys, this is the problem. The error is not coming from config-sample.xml but from loading your user configuration from ~/.harvestman/config/config.xml. This is the way to fix it.

$ rm -rf ~/.harvestman

Then run harvestman again. I think you basically have an old config.xml file that was copied there a long time ago and conflicts with the current code.

Btw, there seems to be a problem creating the crawl database in ~/.harvestman, at least on darwin (Mac OS X). So I fixed it in db.py in trunk. Sync the trunk, do this and let me know.

Thanks!

Original comment by abpil...@gmail.com on 13 Oct 2008 at 5:19
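
The `rm -rf ~/.harvestman` step in the comment above discards the stale user configuration for good. A more cautious variant, sketched here in Python 3, renames the directory to a timestamped backup so the old config.xml can still be inspected later. The function name `set_aside_user_config` and the backup naming scheme are assumptions for illustration; only the `~/.harvestman` path comes from the thread.

```python
import os
import shutil
import time


def set_aside_user_config(home=None):
    """Rename <home>/.harvestman to a timestamped backup.

    Returns the backup path, or None if the directory does not exist.
    Hypothetical helper; not part of HarvestMan.
    """
    home = home or os.path.expanduser("~")
    config_dir = os.path.join(home, ".harvestman")
    if not os.path.isdir(config_dir):
        return None
    backup = "%s.bak-%s" % (config_dir, time.strftime("%Y%m%d-%H%M%S"))
    shutil.move(config_dir, backup)
    return backup


if __name__ == "__main__":
    print(set_aside_user_config())
```

After the directory has been moved aside, the next run starts without the old user configuration, which has the same effect as the `rm -rf` for the purpose of this bug.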

GoogleCodeExporter commented 9 years ago
It works now: both "python spider.py --selftest" and "python spider.py -C
config-sample.xml".

Original comment by andrei.p...@gmail.com on 13 Oct 2008 at 7:57

GoogleCodeExporter commented 9 years ago
Thanks for the quick verification, Andrei. Lukasz, I guess you don't need to investigate this any more.

Original comment by abpil...@gmail.com on 13 Oct 2008 at 8:00

GoogleCodeExporter commented 9 years ago

Original comment by abpil...@gmail.com on 11 Feb 2010 at 7:13