niklasb / dryscrape

[not actively maintained] A lightweight Python library that uses Webkit to enable easy scraping of dynamic, Javascript-heavy web pages
http://dryscrape.readthedocs.io/
MIT License

Requesting a page, then visiting another causes issues #53

Open danrossi opened 8 years ago

danrossi commented 8 years ago

Sorry, this is a question. There seems to be a problem when I request a page to scrape a special link and then visit that link: the second page does not render or parse correctly. It seems I have to create a second session, but even then xpath does not parse the page correctly.

For example:

import time

import dryscrape

sess = dryscrape.Session(base_url = 'host')

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit the first page and extract the link to follow
sess.visit('/path')

links = sess.xpath('//a[contains .. ]')
link = links[0]["href"]

time.sleep(10)

sess = dryscrape.Session(base_url = 'host')

sess.visit(link)

items = sess.xpath("//div[@class='searchitem']")

The problem is that I have to parse the whole body first, like this:

tree = fromstring(sess.body())

Unfortunately, clicking the link to visit it does not work; I have to open it with the visit method.

Is there a special way to reuse the session so that xpath works?
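
For reference, the single-session flow being asked about could be sketched as below. This is only an illustration and not a confirmed fix: the function name `scrape_linked_page` and its parameters are hypothetical, and it assumes only what the snippet above already uses, namely a session object with `visit()` and `xpath()` and nodes that support `node["href"]`.

```python
def scrape_linked_page(session, start_path, link_xpath, item_xpath):
    """Visit start_path, follow the first matching link, return matching nodes.

    Hypothetical helper: `session` only needs visit(url) and xpath(expr),
    and the nodes it returns must support node["href"].
    """
    session.visit(start_path)
    links = session.xpath(link_xpath)
    if not links:
        raise LookupError("no link matched %r" % link_xpath)

    # Copy the attribute into a plain value before navigating away.
    # Assumption: nodes from the first page should not be relied on
    # once another page has been loaded.
    href = links[0]["href"]

    # Reuse the same session instead of constructing a second one.
    session.visit(href)
    return session.xpath(item_xpath)
```

With the session from the snippet above, this would be called with `sess`, `'/path'`, the same link expression, and `"//div[@class='searchitem']"`. Keeping one session also keeps whatever state the first page set up, which a fresh session would not have.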

danrossi commented 8 years ago

I can't explain it, but the same code that works on OS X does not work on Ubuntu. On Ubuntu the newly visited link does not seem to be registered properly by the site, so the request fails and the HTML parsing breaks.

It can extract the link from the first page, but the second page has issues.

Any ideas?