laster04 / Idealista-scraper


URLs in startUrl array aren't recognized as valid #2

Closed elreymon closed 1 year ago

elreymon commented 2 years ago

Example:

run_input = {
    "district": "Fuencarral, Madrid",
    "maxItems": 3,
    "startUrl": ["https://www.idealista.com/venta-viviendas/madrid/barrio-de-salamanca/castellana/con-precio-hasta_500000,con-solo-pisos,apartamentos,aticos,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,ascensor,ultimas-plantas,plantas-intermedias/"],
    "proxy": {
        "useApifyProxy": True,
        "apifyProxyGroups": [
            "RESIDENTIAL"
        ],
        "apifyProxyCountry": "ES"
    },
}

Obtained: "apify_client._errors.ApifyApiError: Input is not valid: Items in input.startUrl at positions [0] do not contain valid URLs"

laster04 commented 2 years ago

@elreymon Yes, that's right: I use the requestListSources input schema for this field, so each item has to be an object with a "url" key rather than a plain string. The correct property looks like this:

"startUrl": [
        {
            "url": "https://www.idealista.com/venta-viviendas/madrid/barrio-de-salamanca/castellana/con-precio-hasta_500000,con-solo-pisos,apartamentos,aticos,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,ascensor,ultimas-plantas,plantas-intermedias/"
        }
    ]
}

But when I open your URL in a browser, Idealista doesn't return any items.
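For anyone building the input from Python, the shape described above can be produced with a small helper. This is only a sketch: `to_request_list_sources` is a hypothetical name, not part of the actor or of `apify_client`, and it assumes the only requirement is that each `startUrl` item is an object with a `url` key, as in the example above.

```python
def to_request_list_sources(urls):
    """Wrap plain URL strings as {"url": ...} objects.

    Hypothetical helper: items that are already dicts with a "url" key
    are passed through unchanged; anything else is rejected early.
    """
    sources = []
    for item in urls:
        if isinstance(item, str):
            sources.append({"url": item})
        elif isinstance(item, dict) and "url" in item:
            sources.append(item)
        else:
            raise ValueError(f"Unsupported startUrl item: {item!r}")
    return sources


# Input values copied from the thread; only the startUrl shape changes.
run_input = {
    "maxItems": 3,
    "startUrl": to_request_list_sources([
        "https://www.idealista.com/venta-viviendas/madrid/barrio-de-salamanca/castellana/con-precio-hasta_500000/",
    ]),
    "proxy": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
        "apifyProxyCountry": "ES",
    },
}
```

The resulting `run_input` can then be passed to the actor call in place of the original dictionary.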

elreymon commented 2 years ago

Used the input schema as you said:

{
    "maxItems": 3,
    "proxy": {
        "useApifyProxy": true,
        "apifyProxyGroups": [
            "RESIDENTIAL"
        ],
        "apifyProxyCountry": "ES"
    },
    "startUrl": [
        {
            "url": "https://www.idealista.com/venta-viviendas/madrid/barrio-de-salamanca/castellana/con-precio-hasta_500000,con-solo-pisos,apartamentos,aticos,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,ascensor,ultimas-plantas,plantas-intermedias/"
        }
    ]
}

But no results were obtained:

2022-11-03T15:21:50.868Z INFO Starting the crawl.
2022-11-03T15:21:50.961Z INFO CheerioCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
2022-11-03T15:21:52.534Z INFO CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
2022-11-03T15:21:52.771Z INFO CheerioCrawler: Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":1903}
2022-11-03T15:21:52.772Z INFO Crawl finished.

laster04 commented 2 years ago

Hello, I tried opening this URL in the browser and the Idealista site gives me zero results, so getting no results from the actor is expected.

elreymon commented 2 years ago

That's true. Sorry for the invalid example.

Here is a JSON input with a URL that returns 35 elements, but the crawler doesn't retrieve any results.

{
    "maxItems": 33,
    "proxy": {
        "useApifyProxy": true,
        "apifyProxyGroups": [
            "RESIDENTIAL"
        ],
        "apifyProxyCountry": "ES"
    },
    "startUrl": [
        {
            "url": "https://www.idealista.com/venta-viviendas/madrid/barrio-de-salamanca/castellana/con-precio-hasta_500000/"
        }
    ]
}

Additionally, the actor can't browse paginated results, can it? I mean, to scrape 10 pages, should I provide the actor with an array of 10 URLs?
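The thread does not say whether the actor follows pagination by itself. If it turns out that one URL per page is needed, an array like the one asked about could be generated rather than typed by hand. The sketch below is illustrative only: `paginated_start_urls` is a hypothetical helper, and the `pagina-{n}.htm` suffix is an assumption about how Idealista names result pages that should be checked in a browser before use.

```python
def paginated_start_urls(base_url, pages, page_suffix="pagina-{n}.htm"):
    """Build a startUrl array with one {"url": ...} object per result page.

    Assumption: page 1 is the base URL itself and page n (n >= 2) is the
    base URL followed by page_suffix. Adjust page_suffix if the site differs.
    """
    base = base_url if base_url.endswith("/") else base_url + "/"
    sources = [{"url": base}]
    for n in range(2, pages + 1):
        sources.append({"url": base + page_suffix.format(n=n)})
    return sources


start_urls = paginated_start_urls(
    "https://www.idealista.com/venta-viviendas/madrid/barrio-de-salamanca/castellana/con-precio-hasta_500000/",
    pages=10,
)
```

The `start_urls` list uses the same object-with-`url` shape as the JSON input above, so it could be assigned to the `startUrl` key directly.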

laster04 commented 2 years ago

Hello, I'm sorry, it was a typo in my code. It is working now. Here is an example run: https://console.apify.com/view/runs/QaZsf3saQRDpql0SH

laster04 commented 2 years ago

Let me know if you want to add some more features or if you have any ideas for upgrades. Thanks