joenano / rpscrape

Scrape horse racing results data and racecards.

lxml.etree.ParserError: Document is empty #92

Closed patem2 closed 2 years ago

patem2 commented 2 years ago

Hi,

When trying to run the racecards.py file it presents the following error:

python racecards.py today

Traceback (most recent call last):
  File "racecards.py", line 430, in <module>
    main()
  File "racecards.py", line 416, in main
    race_urls = get_race_urls(session, racecard_url)
  File "racecards.py", line 93, in get_race_urls
    doc = html.fromstring(r.content)
  File "C:\Users\markp\anaconda3\lib\site-packages\lxml\html\__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Users\markp\anaconda3\lib\site-packages\lxml\html\__init__.py", line 763, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty
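The error itself comes from lxml refusing to parse an empty document, so the response body is arriving blank. A minimal guard around the parse (a sketch; fetch_document is an illustrative helper, not part of the script) would surface the underlying empty or blocked response instead of crashing inside lxml:

import requests
from lxml import html

def fetch_document(session, url):
    # Report an empty or blocked response with a clear message,
    # rather than letting lxml fail with "Document is empty"
    r = session.get(url)
    if r.status_code != 200 or not r.content.strip():
        raise RuntimeError(f'Empty or blocked response ({r.status_code}) from {url}')
    return html.fromstring(r.content)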

Best wishes Mark

RobbieJamesBennett commented 2 years ago

Same here:

python racecards.py today
no race_type found for course scoop 6

Traceback (most recent call last):
  File "racecards.py", line 422, in <module>
    main()
  File "racecards.py", line 412, in main
    races = parse_races(session, race_docs, date, types)
  File "racecards.py", line 327, in parse_races
    runners = get_runners(profile_urls, race['race_id'])
  File "racecards.py", line 141, in get_runners
    docs = asyncio.run(get_documents(profile_urls))
  File "/Users/robbiebennett/.pyenv/versions/3.8.5/lib/python3.8/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/Users/robbiebennett/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "racecards.py", line 42, in get_documents
    ret = await asyncio.gather(*[get_document(url, session) for url in urls])
  File "racecards.py", line 50, in get_document
    return (url, html.fromstring(resp))
  File "/Users/robbiebennett/.pyenv/versions/3.8.5/lib/python3.8/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Users/robbiebennett/.pyenv/versions/3.8.5/lib/python3.8/site-packages/lxml/html/__init__.py", line 763, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty

RobbieJamesBennett commented 2 years ago

After running the scraper and getting the above error, it looks like the racecard URL returns a blank white page. I think the site just blocks it straight away.

https://www.racingpost.com/racecards
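A quick way to confirm the block (a sketch; the User-Agent value is just an example) is to request that page directly and inspect the status code and body length:

import requests

r = requests.get('https://www.racingpost.com/racecards',
                 headers={'User-Agent': 'Mozilla/5.0'})
# A 403 status or a near-empty body means the request was blocked
print(r.status_code, len(r.content))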

RobbieJamesBennett commented 2 years ago

As a temporary fix, I replaced the references to asyncio.run(get_documents(race_urls)) in racecards.py with the following:

race_docs = []
for race_url in race_urls:
    # Fetch each racecard synchronously instead of via asyncio
    resp = session.get(race_url)
    race_docs.append((race_url, html.fromstring(resp.text)))

I also had to bypass the reference to each runner's form (reducing the number of calls made to the server), which leaves out some fields from the racecard I don't need. It now works, so it must be a throttling issue with Racing Post. Hope this helps someone else.
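Putting that together, a throttled synchronous fetch loop (a sketch only; the two-second delay is a guess, and blocked pages are simply skipped) would look like:

import time
import requests
from lxml import html

def fetch_documents(session, urls, delay=2.0):
    # Fetch pages one at a time with a pause between requests,
    # skipping any that come back blocked or empty
    docs = []
    for url in urls:
        resp = session.get(url)
        if resp.status_code == 200 and resp.text.strip():
            docs.append((url, html.fromstring(resp.text)))
        time.sleep(delay)
    return docs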

joenano commented 2 years ago

The requests are now severely limited. I haven't tested to find the exact number of allowed requests, but once you hit the limit it's 403 responses for hours, maybe even up to 12 hours.

Initially the requests were limited per minute, which was fine to just wait out. The limit was then completely removed, and the project changed to async requests to take advantage of that, seeing a massive speed increase.

Against my better judgement I made the racecard scraping script public, and that appears to have been heavily abused for a long time; now the whole project is crippled by the new limits.
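Given those limits, the safest pattern for anyone still experimenting (a sketch; the exact threshold and ban duration are unconfirmed) is to abort on the first 403 rather than keep retrying and extending the ban window:

import sys
import requests

def get_or_abort(session, url):
    # Once the limit is hit the site returns 403 for hours,
    # so retrying immediately only makes things worse
    resp = session.get(url)
    if resp.status_code == 403:
        sys.exit(f'403 from {url}: rate limit hit, try again later')
    return resp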

RobbieJamesBennett commented 2 years ago

Looks like AtTheRaces is now the last bastion of freely obtainable horse racing data. For what it's worth, I doubt the use of racecards.py has really influenced things that much at RP. They display live odds on that URL and it's natural that many people will try to scrape it. If Cloudflare security is behind this, you might struggle to spoof them for any significant period of time either. You can't scrape Oddschecker anymore for the same reason, and that started about two months ago as well. I have been quoted £12k a year for Timeform's database and racecard API. It's the only addition to Betfair's own API for racecard info, and the only legitimate route I can see available for UK/IRE racing. Seems pretty extortionate as well for what it is.

joenano commented 2 years ago

Yeah, I had a library up for scraping Oddschecker, but as you say they started using Cloudflare recently, so I just took it down. It never gained much traction anyway, but I was planning to use it eventually for my main project. I started working on a new odds scraping library, but it's private right now; I haven't quite finished it and my focus has been elsewhere recently. That pricing for the data and API is ridiculous, and it's the main reason why I started this project.

patem2 commented 2 years ago

Guys, if you're using this within acceptable limits, e.g. scraping the racecard and prior day's data once a day (like I do), I've found just recycling your VPN connection eliminates this error (if you have one, that is).

I use Norton antivirus, which comes with one bundled in. I previously had Rapid VPN but canned it when I clocked that Norton comes with one anyhow.

gbettle commented 2 years ago

betfairlightweight and flumine are 2 repos worth investigating:

betfairlightweight - Python wrapper for Betfair API-NG (with streaming)

flūmine - Betfair trading framework
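For anyone evaluating them, a minimal betfairlightweight session (a sketch based on the library's documented login flow; the credentials, app key, and certs path are placeholders) looks like this:

import betfairlightweight
from betfairlightweight import filters

# Placeholders: a Betfair account, API-NG app key and SSL certs are required
trading = betfairlightweight.APIClient(
    'username', 'password', app_key='app_key', certs='/path/to/certs'
)
trading.login()

# List some horse racing markets (event type id 7 is horse racing)
catalogue = trading.betting.list_market_catalogue(
    filter=filters.market_filter(event_type_ids=['7']),
    max_results=10,
)
trading.logout()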

Regarding staying with RP and the VPN issues, I'd consider trying pythonanywhere, hosted in Frankfurt. They offer a VM for 5 euros per month that includes scheduled tasks, e.g. for web scraping.

Currently, I'm also trying out ScraperBox.com, which claims to offer an "Undetectable Web Scraping API":

Rotating proxies that never get blocked

Render JavaScript with real Chrome browsers

You won't get stopped by robot-check captchas
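A call to a service like that (a sketch with a hypothetical endpoint and parameters for illustration; check ScraperBox's own docs for the real ones) would look something like:

import requests

# Hypothetical endpoint and parameters, for illustration only
r = requests.get('https://api.scraperbox.com/scrape',
                 params={'token': 'YOUR_API_TOKEN',
                         'url': 'https://www.racingpost.com/racecards'})
print(r.status_code)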

joenano commented 2 years ago

Closing this as any discussion on this topic should be in the Access Denied issue.