larsaars / KIP_EinfachErklaert

AI Project - Group Einfach Erklärt
Apache License 2.0

Invalid characters from the MDR scraper cause problems with pandas DataFrames #4

Closed FelixDigitalis closed 5 months ago

FelixDigitalis commented 5 months ago
2024-06-01 12:30:56,376 - INFO - Saving https://mdr.de/nachrichten/deutschland/panorama/regen-starkregen-wetter-hochwasserwarnung-sachsen-anhalt-thueringen-100.html
Traceback (most recent call last):
  File "c:\Users\felix\OneDrive\Vorlesungen\6_Semester\KIP\KIP_EinfachErklaert\scrapers\mdr\current_news_scraper.py", line 93, in <module>
    MDRCurrentScraper().scrape()
  File "c:\Users\felix\OneDrive\Vorlesungen\6_Semester\KIP\KIP_EinfachErklaert\scrapers\mdr\current_news_scraper.py", line 87, in scrape
    self.matcher.match_by_hand(easy_article_url, hard_article_url)
  File "C:\Users/felix/OneDrive/Vorlesungen/6_Semester/KIP/KIP_EinfachErklaert\matchers\SimpleMatcher.py", line 29, in match_by_hand
    hard = self.data_handler.search_by("h", "url", hard)
  File "C:\Users/felix/OneDrive/Vorlesungen/6_Semester/KIP/KIP_EinfachErklaert\datahandler\DataHandler.py", line 122, in search_by
    return self.helper._search_url_in_lookup(dir, attribute_value)
  File "C:\Users/felix/OneDrive/Vorlesungen/6_Semester/KIP/KIP_EinfachErklaert\datahandler\DataHandler.py", line 296, in _search_url_in_lookup
    df = pd.read_csv(table)
  File "C:\Users\felix\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\felix\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 611, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Users\felix\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 1448, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "C:\Users\felix\anaconda3\lib\site-packages\pandas\io\parsers\readers.py", line 1723, in _make_engine
    return mapping[engine](f, **self.options)
  File "C:\Users\felix\anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "parsers.pyx", line 579, in pandas._libs.parsers.TextReader.__cinit__
  File "parsers.pyx", line 668, in pandas._libs.parsers.TextReader._get_header
  File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2050, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 1190: invalid start byte

This may be a Windows-specific error, but I'm not sure.
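
The byte in the traceback fits the Windows suspicion: 0x96 is the en dash (U+2013) in the Windows-1252 code page, but it is not a valid start byte in UTF-8, which `pd.read_csv` uses by default. A minimal sketch reproducing the decode failure without pandas, assuming the lookup table was written with a cp1252 default encoding (the sample string is shortened for illustration):

```python
# Encode a title fragment containing an en dash (U+2013) the way a
# cp1252-defaulting Windows process would write it to disk.
raw = "Regen_am_Wochenende_\u2013_Wetterdienst".encode("cp1252")

# The en dash becomes the single byte 0x96.
has_0x96 = b"\x96" in raw

# Decoding those bytes as UTF-8 fails, as in the traceback above.
error_message = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    error_message = str(err)

# Decoding with the encoding that was used for writing works.
roundtrip = raw.decode("cp1252")
```

If this assumption holds, the file is intact and only the encoding used for reading it is wrong.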

FelixDigitalis commented 5 months ago

This is the lookup table file that causes the error. The invalid character is the � in the path on the last line.

path,url
C:/Users/felix/OneDrive/Vorlesungen/6_Semester/KIP/KIP_EinfachErklaert\data\mdr\hard\2024-05-31-Brand_in_Suhler_Zentralklinikum_Sieben_Verletzte_-_150000_Euro_Schaden, https://mdr.de/nachrichten/thueringen/sued-thueringen/suhl/brand-zentralklinikum-evakuierung-verletzte-100.html
C:/Users/felix/OneDrive/Vorlesungen/6_Semester/KIP/KIP_EinfachErklaert\data\mdr\hard\2024-05-31-Melt-Festival_in_Ferropolis_kuendigt_Aus_an, https://mdr.de/nachrichten/sachsen-anhalt/dessau/dessau-rosslau/ferropolis-melt-festival-hoert-auf-kultur-news-102.html
C:/Users/felix/OneDrive/Vorlesungen/6_Semester/KIP/KIP_EinfachErklaert\data\mdr\hard\2024-05-30-SC_Magdeburg_zurrt_Meisterschaft_bei_Gensheimer-Abschied_felsenfest, https://mdr.de/sport/handball/bericht-rhein-neckar-loewen-sc-magdeburg-100.html
C:/Users/felix/OneDrive/Vorlesungen/6_Semester/KIP/KIP_EinfachErklaert\data\mdr\hard\2024-05-23-Wohin_am_Wochenende_Tipps_fuer_Bitterfeld-Wolfen_Jena_und_Meissen, https://mdr.de/kultur/ausflug-tipps/bitterfeld-wolfen-jena-meissen-wohin-am-wochenende-tipps-100.html
C:/Users/felix/OneDrive/Vorlesungen/6_Semester/KIP/KIP_EinfachErklaert\data\mdr\hard\2024-06-01-Heftiger_Regen_am_Wochenende_�_Wetterdienst_praezisiert_Vorhersage, https://mdr.de/nachrichten/deutschland/panorama/regen-starkregen-wetter-hochwasserwarnung-sachsen-anhalt-thueringen-100.html
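
One possible way to avoid this kind of mismatch is to pass an explicit encoding both when the lookup table is written and when it is read. The sketch below uses only the standard library. The helper names `write_lookup` and `read_lookup`, the file name `lookup.csv`, and the sample row are hypothetical and not taken from the repository. `pandas.read_csv` accepts an `encoding` argument as well.

```python
import csv
import os
import tempfile


def write_lookup(table_path, rows):
    # Explicit UTF-8, so the result does not depend on the platform default
    # (often cp1252 on Windows).
    with open(table_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "url"])
        writer.writerows(rows)


def read_lookup(table_path):
    # Read with the same encoding that was used for writing.
    with open(table_path, "r", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical row whose path contains an en dash (U+2013).
rows = [
    ("data/mdr/hard/2024-06-01-Regen_\u2013_Vorhersage",
     "https://mdr.de/example.html"),
]

with tempfile.TemporaryDirectory() as tmp:
    table = os.path.join(tmp, "lookup.csv")
    write_lookup(table, rows)
    loaded = read_lookup(table)
```

With matching encodings on both sides, the en dash survives the round trip and the lookup can be read on any platform.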
FelixDigitalis commented 5 months ago

Workaround: disallow the en dash (\u2013) in paths.
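
A minimal sketch of what such a workaround could look like, assuming the en dash is replaced with an ASCII hyphen before the title is used as a directory name. The function name `sanitize_path_component` is hypothetical, and the repository's actual fix is not shown in this thread:

```python
def sanitize_path_component(name: str) -> str:
    """Replace the en dash (U+2013) with a plain ASCII hyphen."""
    return name.replace("\u2013", "-")


# Title of the article that triggered the error, with the en dash restored.
title = "Heftiger_Regen_am_Wochenende_\u2013_Wetterdienst_praezisiert_Vorhersage"
safe_title = sanitize_path_component(title)
```

This only covers the one character reported here. Other non-ASCII characters in titles would still depend on the file encoding.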