hynky1999 / CmonCrawl

Common crawl extractor
https://hynky1999.github.io/CmonCrawl/
MIT License
69 stars 11 forks

RegExp not works #107

Open ZeusFSX opened 6 months ago

ZeusFSX commented 6 months ago

Hi, I followed all the steps, but when I run extract, the URL regexes I wrote in the config file do not match the URLs. The logs show an error, but when I match the same URL manually in Python, everything works.

My config file:

{
    "extractors_path": "./extractors",
    "routes": [
        {
            "regexes": ["^https://\\w*\\.{0,1}rozetka\\.com\\.ua/[^/]+/p\\d+/$", "^https://\\w*\\.{0,1}rozetka\\.com\\.ua/ua/[^/]+/p\\d+/$"],
            "extractors": [{
                "name": "rozetka_extractor",
                "since": "2023-01-01"
            }]
        }
    ]
}
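One way to rule out a JSON escaping problem in a config like the one above is to load it with Python's json module and test each route regex against the failing URL. This is only a minimal sketch: the helper name matching_regexes is made up for illustration, and it does not reproduce how CmonCrawl itself loads or applies the config.

```python
import json
import re

# Config copied from the post above. A raw string keeps the JSON escapes
# (e.g. \\w) intact so that json.loads decodes them to \w.
CONFIG_TEXT = r"""
{
    "extractors_path": "./extractors",
    "routes": [
        {
            "regexes": ["^https://\\w*\\.{0,1}rozetka\\.com\\.ua/[^/]+/p\\d+/$", "^https://\\w*\\.{0,1}rozetka\\.com\\.ua/ua/[^/]+/p\\d+/$"],
            "extractors": [{
                "name": "rozetka_extractor",
                "since": "2023-01-01"
            }]
        }
    ]
}
"""


def matching_regexes(config_text, url):
    """Return every route regex in the config that matches the URL.

    Hypothetical helper for debugging; not part of CmonCrawl.
    """
    config = json.loads(config_text)
    return [
        pattern
        for route in config["routes"]
        for pattern in route["regexes"]
        if re.match(pattern, url)
    ]


matches = matching_regexes(CONFIG_TEXT, "https://rozetka.com.ua/88779405/p88779405/")
print(matches)
```

If the list is non-empty, the regexes themselves are fine and the cause of "No route found" has to be something else in the route definition.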

Here are the logs:

2024-04-27 14:13:23,071 - synchronized.py:64 : ERROR - Failed to process https://rozetka.com.ua/88779405/p88779405/ with No route found for url: https://rozetka.com.ua/88779405/p88779405/ -> ADD_INFO: filename='crawl-data/CC-MAIN-2022-33/segments/1659882572043.2/warc/CC-MAIN-20220814143522-20220814173522-00500.warc.gz' url='https://rozetka.com.ua/88779405/p88779405/' offset=448027058 length=51078 digest='6VJW4LQ4VNDCUXRSKSYATPGJDRNHBJG' encoding='UTF-8' timestamp=datetime.datetime(2022, 8, 14, 15, 29, 3)

But when I test it manually in Python, everything matches:

>>> re.match(r"^https://\w*\.{0,1}rozetka\.com\.ua/[^/]+/p\d+/$", "https://rozetka.com.ua/88779405/p88779405/")
<re.Match object; span=(0, 42), match='https://rozetka.com.ua/88779405/p88779405/'>

ZeusFSX commented 6 months ago

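Oh, I see my mistake: the URL did not match because of the date. The record was crawled on 2022-08-14, but the extractor's "since" is 2023-01-01. Maybe you could update the log message for this case, because the current one is not informative?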
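To illustrate the kind of message that would have helped here, the sketch below separates "no regex matched" from "a regex matched, but the crawl date is before since". It assumes a route is selected only when both conditions hold, which is what this thread suggests. The function diagnose_route is hypothetical and is not CmonCrawl's actual routing code.

```python
import re
from datetime import datetime


def diagnose_route(url, crawled_at, regexes, since=None):
    """Explain why a URL would or would not be routed (hypothetical helper)."""
    # Step 1: the URL has to match at least one of the route's regexes.
    if not any(re.match(pattern, url) for pattern in regexes):
        return "no regex matched the URL"
    # Step 2: the crawl timestamp must not be earlier than the extractor's since date.
    if since is not None and crawled_at < since:
        return (
            f"regex matched, but crawl date {crawled_at.date()} "
            f"is before since {since.date()}"
        )
    return "route found"


# Values taken from the config and the log line in this thread.
regexes = [r"^https://\w*\.{0,1}rozetka\.com\.ua/[^/]+/p\d+/$"]
result = diagnose_route(
    "https://rozetka.com.ua/88779405/p88779405/",
    datetime(2022, 8, 14, 15, 29, 3),
    regexes,
    since=datetime(2023, 1, 1),
)
print(result)  # regex matched, but crawl date 2022-08-14 is before since 2023-01-01
```

A message like this points at the date filter, whereas a bare "No route found" makes the regex look like the culprit.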

hynky1999 commented 6 months ago

Great that you managed to resolve your issue. I will take a look at the logging and see what can be done to prevent this problem from happening :)