niklasb / dryscrape

[not actively maintained] A lightweight Python library that uses Webkit to enable easy scraping of dynamic, Javascript-heavy web pages
http://dryscrape.readthedocs.io/
MIT License

Method Not Allowed (error code 202) #8

Closed: voxsim closed this issue 12 years ago

voxsim commented 12 years ago

This is my code:

import dryscrape

# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://www.udacity.com/')

# there are some failing HTTP requests, so we need to enter
# a more error-resistant mode (like real browsers do)
sess.set_error_tolerant(True)

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit homepage and log in
print "Logging in..."
sess.visit('/')

email_field = sess.at_xpath('//input[@name="email"]')
print email_field
password_field = sess.at_xpath('//input[@name="password"]')
print password_field

email_field.set(USERNAME)
password_field.set(PASSWORD)
email_field.form().submit()

And this is the output:

Logging in...
<Node #/html/body/div[@id='not-footer']/div[@id='top_bin']/div[@id='top_content']/div/div[@id='user-topbar-button-overlay']/form[@id='signin-form']/div[1]/input[1]>
<Node #/html/body/div[@id='not-footer']/div[@id='top_bin']/div[@id='top_content']/div/div[@id='user-topbar-button-overlay']/form[@id='signin-form']/div[1]/input[2]>
<Node #/html/body/div[@id='not-footer']/div[@id='top_bin']/div[@id='top_content']/div/div[@id='user-topbar-button-overlay']/form[@id='signin-form']>
Traceback (most recent call last):
  File "prova.py", line 30, in <module>
    email_field.form().submit()
  File "/home/simon/projects/udacity_downloader/dryscrape/driver/webkit_server/__init__.py", line 97, in submit
    self.client.wait()
  File "/home/simon/projects/udacity_downloader/dryscrape/driver/webkit_server/__init__.py", line 224, in wait
    self.conn.issue_command("Wait")
  File "/home/simon/projects/udacity_downloader/dryscrape/driver/webkit_server/__init__.py", line 429, in issue_command
    return self._read_response()
  File "/home/simon/projects/udacity_downloader/dryscrape/driver/webkit_server/__init__.py", line 438, in _read_response
    raise InvalidResponseError, self._read_message()
dryscrape.driver.webkit_server.InvalidResponseError: Error while loading URL http://www.udacity.com/: Error downloading http://www.udacity.com/ - server replied: Method Not Allowed (error code 202)

Any suggestions on how to resolve this problem?

voxsim commented 12 years ago

I found the problem: submitting the form directly isn't the right way, because the site hides how the data is actually submitted. I tried simulating a click on the "GO" button, but that doesn't do anything.

niklasb commented 12 years ago

You're not the first to request more error tolerance to make these server-side errors non-fatal. I will try and find a solution as soon as I find the time.

voxsim commented 12 years ago

It's not your fault! It's Udacity's fault, and jQuery.ajax()'s. I solved it by injecting JavaScript into the page (https://github.com/voxsim/udacity_downloader/blob/master/udacity.py)
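
For readers following along, here is a minimal sketch of what a JavaScript-injection login could look like. It is not the code from the linked udacity.py. It assumes the session object exposes an `exec_script` method (as the webkit_server client does), and the helper names and the `button_selector` default are hypothetical. Only the `signin-form` id and the `email`/`password` field names are taken from the output above.

```python
import json


def build_login_script(username, password,
                       button_selector="#signin-form button"):
    """Return JavaScript that fills the sign-in form and clicks a button.

    button_selector is a hypothetical placeholder; the real selector of the
    "GO" button has to be looked up in the page itself.
    """
    # json.dumps produces valid JavaScript string literals, so quotes or
    # backslashes in the credentials cannot break out of the script.
    return (
        "(function () {"
        "  var form = document.getElementById('signin-form');"
        "  form.querySelector('input[name=\"email\"]').value = %s;"
        "  form.querySelector('input[name=\"password\"]').value = %s;"
        "  document.querySelector(%s).click();"
        "})();"
    ) % (json.dumps(username), json.dumps(password),
         json.dumps(button_selector))


def login_via_script(sess, username, password):
    """Run the generated script in the page held by sess.

    sess is assumed to provide exec_script(js); any object with that
    method works, which also makes the helper easy to try out offline.
    """
    sess.exec_script(build_login_script(username, password))
```

Whether a scripted click actually triggers the site's own submit handler depends on how the page wires up its events, so this should be treated as a starting point rather than a guaranteed fix.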

niklasb commented 12 years ago

@voxsim: It's nice that you could fix it on your side in this case, but dryscrape is specifically designed to be able to scrape real-world web pages, and those have bugs (which you usually can't fix on the server side).

voxsim commented 12 years ago

@niklasb: Maybe you're right. Right now I really don't understand how to debug dryscrape and webkit_server; I sniffed the packet traffic with Wireshark. I intend to use dryscrape in several of my projects, so maybe I can help fix something.

niklasb commented 12 years ago

@voxsim: I usually just use cout/cerr for C++ debugging, especially because in the case of Qt, a lot of multi-threading is going on. What we need in particular is a way to make failures on intermediate requests (CSS resources, Javascripts etc.) non-fatal, but still finish loading the page. Without looking into it myself, I can't tell you what the actual caveats might be here. I remember that the SetErrorTolerant command was doing something similar, but it was quite a hack (and doesn't seem to work as expected in many cases).
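
The policy described above (a failing sub-resource is non-fatal, a failing main document is fatal) can be illustrated with a small Python sketch. This is not webkit_server code, which is written in C++ on top of Qt. Every name below (`PageLoad`, `MainRequestFailed`, `on_request_failed`) is hypothetical and only models the decision, not the actual network handling.

```python
from dataclasses import dataclass, field


class MainRequestFailed(Exception):
    """Raised when the page's own document could not be loaded."""


@dataclass
class PageLoad:
    """Tracks one page load and decides which request failures are fatal."""
    main_url: str
    warnings: list = field(default_factory=list)

    def on_request_failed(self, url, reason):
        # A failure of the main document means there is no page to work
        # with, so it stays fatal.
        if url == self.main_url:
            raise MainRequestFailed("%s: %s" % (url, reason))
        # Failures of intermediate requests (stylesheets, scripts, images)
        # are only recorded, so the rest of the page can finish loading.
        self.warnings.append((url, reason))
```

Under this policy the error reported at the top of this issue would still be fatal, because the failing request there was for the page URL itself.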

voxsim commented 12 years ago

@niklasb: OK, I understand. I now have forks of dryscrape and webkit-server; if I find a way to fix some of the problems, I can send a pull request so you can merge my patch, OK? (I'm new to GitHub, but an old hand with git XD) I saw many developers working on capybara's webkit-server, so I'll see whether they have already solved some of these problems.

niklasb commented 12 years ago

@voxsim: Yes, that's the way it works best :) It's best to create a topic branch for work like this. Pull requests are basically just notifications of changes on a branch to the original author. I actually have to check the current status of the "real" webkit_server myself. Have contributed quite a bit to it already, but they are constantly adding new features.

Thanks for your interest and participation by the way :) It's highly appreciated!

voxsim commented 12 years ago

@niklasb: Good :D I sent you an email so we can talk about webkit-server and dryscrape once I have news, rather than continuing the discussion here XD