stdex opened this issue 9 years ago
Hello. I also have this problem, and I have tried to solve it for several hours. Could you tell me how to fix it? Thank you.
I gave up on Ghost.py because there are many problems in it, and I'm unable to fix them or otherwise help. Recently I switched to PhantomJS (headless WebKit) with the Selenium Python wrapper. Example of use:
```python
#!/usr/bin/env python
from time import sleep

from selenium import webdriver
from bs4 import BeautifulSoup

link = 'https://m.avito.ru/sankt-peterburg/predlozheniya_uslug/almaznoe_burenie_almaznaya_rezka_usilenie_79225740'


class AvitoScraper(object):
    def __init__(self):
        # Headless PhantomJS driver; ignore SSL errors so the https page loads
        self.driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'])
        self.driver.set_window_size(1120, 550)

    def scrape_phone(self):
        self.driver.get(link)
        sleep(1)
        # Click the "show number" button, then give the page a moment to update
        self.driver.find_element_by_class_name("action-show-number").click()
        sleep(1)
        # The revealed phone number sits in the href of the same link
        s = BeautifulSoup(self.driver.page_source, "lxml")
        phone = s.find('a', {"class": "action-show-number"}).attrs['href']
        print(phone)

    def scrape(self):
        self.scrape_phone()
        self.driver.quit()


if __name__ == '__main__':
    scraper = AvitoScraper()
    scraper.scrape()
```
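A small variation, for what it's worth: Selenium's explicit waits can replace the fixed `sleep(1)` calls, so the script only waits as long as the button actually takes to appear. A sketch using the same URL and class name as above; the 10-second timeout is an arbitrary choice:

```python
# Sketch: the same click, but with an explicit wait instead of sleep(1).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = 'https://m.avito.ru/sankt-peterburg/predlozheniya_uslug/almaznoe_burenie_almaznaya_rezka_usilenie_79225740'

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'])
driver.set_window_size(1120, 550)
driver.get(link)

# Wait (up to 10 s) for the "show number" button to become clickable, then click it
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CLASS_NAME, "action-show-number"))
)
button.click()
driver.quit()
```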
You didn't enumerate those problems; maybe someone could help you with them. The only thing I can't do right now is uploading files through inputs that accept multiple files; everything else, until I discover another issue, I've managed to work around.
I used to work with CasperJS, but I wanted something nicer and Ghost.py filled the need. For now, I'm sticking with Ghost.py.
I've also tried Selenium with PhantomJS, but it chokes on some sites. To fill my needs I've had to patch Ghost.py in a few places; maybe that's what you could have done too.
Thank you for your answers.
And I have found a way to solve it. Maybe the code is not very beautiful and the approach is not perfect.
Following is the site:
And the link is:
I need to click the link to go to the next page.
The main part of the code is:
```python
def parse_geguyanbao(self, response):
    # Assumes: import ghost; from scrapy.selector import Selector
    item = response.meta['item']
    day = {}
    g = ghost.Ghost()
    with g.start() as session:
        session.display = True
        session.wait_timeout = 999
        session.download_images = False
        page, extra_resources = session.open(response.url)
        page, extra_resources = session.wait_for_page_loaded()
        response = response.replace(body=session.content)
        lo = 1
        loo = []
        while not loo:
            # Dispatch `lo` synthetic clicks on the pager link, then collect the
            # date column (first li of each ul inside #dt_1)
            dates, extra_resources = session.evaluate("""
                (function () {
                    var i = 0;
                    for (i = 0; i < %s; i++) {
                        var element = document.querySelector(%s);
                        var evt = document.createEvent("MouseEvents");
                        evt.initMouseEvent("click", true, true, window, 1, 1, 1, 1, 1,
                                           false, false, false, false, %s, element);
                        element.dispatchEvent(evt);
                    }
                    elems = document.getElementById('dt_1').getElementsByTagName('ul');
                    var dates = [];
                    for (i = 0; i < elems.length; i++) {
                        dates[i] = elems[i].getElementsByTagName('li')[0].innerText;
                    }
                    return dates;
                })();
            """ % (str(lo), repr('#PageCont > a:nth-child(9)'), str(0)))
            page, extra_resources = session.wait_for_page_loaded()
            response = response.replace(body=session.content)
            session.show()
            session.sleep(1)
            if dates:
                sStr1 = str(dates[0])
            else:
                sStr1 = "null"
            geguyanbao = 0
            for a in day:
                if cmp(a, sStr1) == 0:
                    geguyanbao = day[str(sStr1)]
                    break
            # Count how many rows on this page share the current date
            for data in dates:
                if cmp(data, sStr1) == 0:
                    geguyanbao = geguyanbao + 1
                else:
                    day[str(sStr1)] = geguyanbao
                    k = 0
                    for key in day:
                        k = k + 1
                    if k >= 30:
                        item['geguyanbao'] = day
                        return item
                    geguyanbao = 1
                    sStr1 = data
            day[str(sStr1)] = geguyanbao
            lo = lo + 1
            # loo stays falsy (keep paging) while the pager still contains a.nolink
            if Selector(response).xpath('//*[@id="PageCont"]/a[@class="nolink"]').extract():
                loo = False
            else:
                loo = True
    item['geguyanbao'] = day
    return item
```
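For readers who just want the core trick, here is a minimal sketch of the same pattern outside Scrapy: dispatch a synthetic click on the pager link from inside `session.evaluate()` and read the scraped values back as its return value. The selector is the one from the snippet above; the URL is a placeholder, since the real one is not shown here:

```python
# Minimal sketch of the click-via-evaluate pattern used above.
# "http://example.com/report" is a placeholder URL; the selector comes from
# the snippet above. The Ghost.py calls follow the original code.
from ghost import Ghost

ghost = Ghost()
with ghost.start() as session:
    session.wait_timeout = 999
    session.download_images = False
    page, resources = session.open("http://example.com/report")  # placeholder URL
    session.wait_for_page_loaded()

    # Fire a synthetic mouse click on the "next page" link, then collect the
    # first li of each ul inside #dt_1, exactly as in the code above
    dates, resources = session.evaluate("""
        (function () {
            var element = document.querySelector('#PageCont > a:nth-child(9)');
            var evt = document.createEvent("MouseEvents");
            evt.initMouseEvent("click", true, true, window,
                               1, 1, 1, 1, 1, false, false, false, false, 0, element);
            element.dispatchEvent(evt);

            var elems = document.getElementById('dt_1').getElementsByTagName('ul');
            var dates = [];
            for (var i = 0; i < elems.length; i++) {
                dates[i] = elems[i].getElementsByTagName('li')[0].innerText;
            }
            return dates;
        })();
    """)
    session.wait_for_page_loaded()
    print(dates)
```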
I do not understand the Russian text, but I cooked up something for you to at least stop the crashing. Please focus on the last two lines and the new line where Session is added. This is what has worked for me to stop crashes on sites that crash all the time.
Please note that I still got an error, but this time it was:
ghost.ghost.TimeoutError: Can't find element matching ".button-green"
```python
from bs4 import BeautifulSoup
from ghost import Ghost, Session

work_url = "https://m.avito.ru/sankt-peterburg/predlozheniya_uslug/almaznoe_burenie_almaznaya_rezka_usilenie_79225740"
timeout = 100

ghost = Ghost()
with ghost.start():
    # The explicit Session object mentioned above
    session = Session(ghost, display=True, wait_timeout=timeout)
    page, extra_resources = session.open(work_url, timeout=timeout)
    assert page.http_status == 200
    session.click(".action-show-number", 0)
    session.wait_for_selector(".button-green")
    soup = BeautifulSoup(page.content, "lxml")
    phone = soup.find('a', {"class": "action-show-number"}).attrs['href']
    print(phone)
    # The last two lines: clear the webview and shut the session down cleanly
    session.webview.setHtml('')
    session.exit()
```
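If `.button-green` never appears (which is exactly the TimeoutError shown above), the wait can be wrapped in a try/except so the script degrades gracefully instead of dying. A minimal sketch, assuming the same `session` as in the snippet above; the fallback behaviour is only illustrative:

```python
# Sketch: guard the wait against the TimeoutError reported earlier.
# Assumes the `session` object from the snippet above.
from ghost.ghost import TimeoutError  # the exception class named in the traceback

try:
    session.wait_for_selector(".button-green")
except TimeoutError:
    # The selector never showed up before wait_timeout expired;
    # continue with whatever is already in the DOM instead of crashing.
    print("'.button-green' did not appear before the timeout")
```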
Hello. I'm trying to emulate a simple AJAX request.
Scheme:
When I click the button, it should make a GET request to the server and return JSON, then render the response information on the page. I use the wait_for_selector method to wait until the response from the AJAX request should have arrived, but the DOM is not updated. Can someone help me with it?
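For reference, a minimal sketch of what that click-then-wait flow usually looks like in Ghost.py; the URL and the `#load-btn` / `.ajax-result` selectors are made-up placeholders, not the real page:

```python
# Sketch of the click-then-wait-for-AJAX pattern described above.
# "http://example.com/page", "#load-btn" and ".ajax-result" are placeholders.
from ghost import Ghost

ghost = Ghost()
with ghost.start() as session:
    session.wait_timeout = 30
    page, resources = session.open("http://example.com/page")
    # Click the button that fires the GET request
    session.click("#load-btn")
    # Block until the element rendered from the JSON response shows up in the DOM
    session.wait_for_selector(".ajax-result")
    # session.content now holds the updated HTML
    print(session.content)
```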
My code:
Also, I have a problem with this code. Sometimes it returns:
Segmentation fault (core dumped)
Where can I find logs about this fault?
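A general way to get at least a Python-level trace when the interpreter segfaults is the standard-library faulthandler module (Python 3.3+, or the faulthandler backport package on Python 2); this is generic Python advice, not something specific to Ghost.py:

```python
# Enable faulthandler at the very top of the script: on SIGSEGV it dumps the
# Python traceback of every thread before the process dies.
import faulthandler

faulthandler.enable()                               # dump to stderr
# faulthandler.enable(file=open("crash.log", "w"))  # or dump to a file instead
```

Beyond that, the usual places to look are the system log (e.g. `dmesg`) and a core dump opened in gdb, since the actual crash happens in native Qt/WebKit code.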