mawa00006 / Doping-Detection-Based-on-Publicly-Available-Competition-Data-in-Professional-Road-Cycling


[bug] Scraper stopped with exceptions #13

Closed tony-hong closed 2 years ago

tony-hong commented 2 years ago

I ran python main.py, but the process stopped with the following exceptions. Please have a look.

Console output:

...

  self._context.run(self._callback, *self._args)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
/miniconda/envs/DSAI_data_scraping/lib/python3.10/asyncio/events.py:80: RuntimeWarning: coroutine 'launch' was never awaited
  self._context.run(self._callback, *self._args)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
daniele-colli in dict
blazej-janiaczyk
Traceback (most recent call last):
  File "/miniconda/envs/DSAI_data_scraping/lib/python3.10/site-packages/pyquery/pyquery.py", line 57, in fromstring
    result = getattr(etree, meth)(context)
  File "src/lxml/etree.pyx", line 3252, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1800, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/xhong/DSAI/DSAI_Project/Cycling/Data Scraping/main.py", line 206, in <module>
    main(years)
  File "/root/xhong/DSAI/DSAI_Project/Cycling/Data Scraping/main.py", line 108, in main
    df_oneday, stats_per_season_df = details_sps(df_oneday, stats_per_season_df)
  File "/root/xhong/DSAI/DSAI_Project/Cycling/Data Scraping/main.py", line 153, in details_sps
    details = scrape_rider_details('https://www.procyclingstats.com/rider/{}'.format(rider))
  File "/root/xhong/DSAI/DSAI_Project/Cycling/Data Scraping/scraping.py", line 540, in scrape_rider_details
    response.html.render()
  File "/root/xhong/DSAI/DSAI_Project/Cycling/Data Scraping/requests_html.py", line 655, in html
    self._html = HTML(session=self.session, url=self.url, html=self.content, default_encoding=self.encoding)
  File "/root/xhong/DSAI/DSAI_Project/Cycling/Data Scraping/requests_html.py", line 421, in __init__
    element=PyQuery(html)('html') or PyQuery(f'<html>{html}</html>')('html'),
  File "/miniconda/envs/DSAI_data_scraping/lib/python3.10/site-packages/pyquery/pyquery.py", line 217, in __init__
    elements = fromstring(context, self.parser)
  File "/miniconda/envs/DSAI_data_scraping/lib/python3.10/site-packages/pyquery/pyquery.py", line 61, in fromstring
    result = getattr(lxml.html, meth)(context)
  File "/miniconda/envs/DSAI_data_scraping/lib/python3.10/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/miniconda/envs/DSAI_data_scraping/lib/python3.10/site-packages/lxml/html/__init__.py", line 763, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty
mawa00006 commented 2 years ago

@tony-hong Were you able to extract output files for the races before the error? If so, please upload them to the drive and I will fix everything tomorrow :)

tony-hong commented 2 years ago

> @tony-hong Were you able to extract output files for the races before the error? If so, please upload them to the drive and I will fix everything tomorrow :)

Done. They're here.

mawa00006 commented 2 years ago

I fixed the bug and pushed the updated code to GitHub, along with some files I scraped during testing. When running the script on the cluster again, it is important to keep the 'dict.csv' file in the riderdetaildict folder. The rest of the files do not have to be included; just copy the folder structure. If there are any questions, just message me.

mawa00006 commented 2 years ago
tony-hong commented 2 years ago

Here is the newest output from before the crash.

mawa00006 commented 2 years ago

@tony-hong I added more try and except statements and also a loop to keep the main function running even if some error slips through.
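
Roughly, the pattern looks like the sketch below. It is a simplified illustration under assumptions, not the exact code in the repository: `scrape_all`, `fetch_details`, `max_passes`, and `delay` are hypothetical names, and `fetch_details` only stands in for a function like `scrape_rider_details` from the traceback. Each rider is scraped inside a try/except so that one bad page (for example an empty response that makes the parser raise "Document is empty") is skipped, and an outer loop retries whatever failed.

```python
import time


def scrape_all(riders, fetch_details, max_passes=3, delay=0.0):
    """Scrape every rider, skipping failures and retrying them in later passes.

    Hypothetical helper for illustration only. Returns a tuple of
    (results, still_failing), where results maps rider -> details.
    """
    results = {}
    pending = list(riders)
    for _ in range(max_passes):
        if not pending:
            break  # everything succeeded, nothing left to retry
        failed = []
        for rider in pending:
            try:
                results[rider] = fetch_details(rider)
            except Exception as exc:  # e.g. a parser error on an empty page
                print(f"{rider}: {exc!r}, will retry")
                failed.append(rider)
        pending = failed
        if pending and delay:
            time.sleep(delay)  # brief pause before the next pass
    return results, pending
```

Catching a broad `Exception` keeps a long scraping run alive, at the cost of hiding real bugs, so printing or logging each failure and returning the riders that never succeeded makes it possible to inspect them afterwards.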