Closed oscard0m closed 4 years ago
Describe the bug
The script should be able to scrape Idealista past their reCAPTCHA check. Example URL being scraped:
https://www.idealista.com/alquiler-viviendas/barcelona/ciutat-vella/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/
There is a captcha:
Fixes #13
To Reproduce
Run the script:
npm start
Expected behavior
The output lists all the available flats from Idealista for the given URLs.
Screenshots
I ran it in debug mode:
npm run debug
and got the following output:
> phantomjs --ssl-protocol=any --ignore-ssl-errors=true --debug=true index.js
2018-07-09T22:12:21 [DEBUG] CookieJar - Created but will not store cookies (use option '--cookies-file=<filename>' to enable persistent cookie storage)
2018-07-09T22:12:22 [DEBUG] Set "http" proxy to: "" : 1080
2018-07-09T22:12:22 [DEBUG] Phantom - execute: Configuration
2018-07-09T22:12:22 [DEBUG] 0 objectName : ""
2018-07-09T22:12:22 [DEBUG] 1 cookiesFile : ""
2018-07-09T22:12:22 [DEBUG] 2 diskCacheEnabled : "false"
2018-07-09T22:12:22 [DEBUG] 3 maxDiskCacheSize : "-1"
2018-07-09T22:12:22 [DEBUG] 4 diskCachePath : ""
2018-07-09T22:12:22 [DEBUG] 5 ignoreSslErrors : "true"
2018-07-09T22:12:22 [DEBUG] 6 localUrlAccessEnabled : "true"
2018-07-09T22:12:22 [DEBUG] 7 localToRemoteUrlAccessEnabled : "false"
2018-07-09T22:12:22 [DEBUG] 8 outputEncoding : "UTF-8"
2018-07-09T22:12:22 [DEBUG] 9 proxyType : "http"
2018-07-09T22:12:22 [DEBUG] 10 proxy : ":1080"
2018-07-09T22:12:22 [DEBUG] 11 proxyAuth : ":"
2018-07-09T22:12:22 [DEBUG] 12 scriptEncoding : "UTF-8"
2018-07-09T22:12:22 [DEBUG] 13 webSecurityEnabled : "true"
2018-07-09T22:12:22 [DEBUG] 14 offlineStoragePath : ""
2018-07-09T22:12:22 [DEBUG] 15 localStoragePath : ""
2018-07-09T22:12:22 [DEBUG] 16 localStorageDefaultQuota : "-1"
2018-07-09T22:12:22 [DEBUG] 17 offlineStorageDefaultQuota : "-1"
2018-07-09T22:12:22 [DEBUG] 18 printDebugMessages : "true"
2018-07-09T22:12:22 [DEBUG] 19 javascriptCanOpenWindows : "true"
2018-07-09T22:12:22 [DEBUG] 20 javascriptCanCloseWindows : "true"
2018-07-09T22:12:22 [DEBUG] 21 sslProtocol : "any"
2018-07-09T22:12:22 [DEBUG] 22 sslCiphers : "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-RC4-SHA:ECDHE-RSA-RC4-SHA:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:RC4-SHA:RC4-MD5"
2018-07-09T22:12:22 [DEBUG] 23 sslCertificatesPath : ""
2018-07-09T22:12:22 [DEBUG] 24 sslClientCertificateFile : ""
2018-07-09T22:12:22 [DEBUG] 25 sslClientKeyFile : ""
2018-07-09T22:12:22 [DEBUG] 26 sslClientKeyPassphrase : ""
2018-07-09T22:12:22 [DEBUG] 27 webdriver : ":"
2018-07-09T22:12:22 [DEBUG] 28 webdriverLogFile : ""
2018-07-09T22:12:22 [DEBUG] 29 webdriverLogLevel : "INFO"
2018-07-09T22:12:22 [DEBUG] 30 webdriverSeleniumGridHub : ""
2018-07-09T22:12:22 [DEBUG] Phantom - execute: Script & Arguments
2018-07-09T22:12:22 [DEBUG] script: "index.js"
2018-07-09T22:12:22 [DEBUG] Phantom - execute: Starting normal mode
2018-07-09T22:12:22 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:22 [DEBUG] FileSystem - _open: ":/modules/fs.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:22 [DEBUG] FileSystem - _open: ":/modules/system.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:22 [DEBUG] FileSystem - _open: ":/modules/webpage.js" QMap(("mode", QVariant(QString, "r")))
-------------------------------------------------------------------
IDEALISTA
-------------------------------------------------------------------
2018-07-09T22:12:22 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:22 [DEBUG] WebPage - updateLoadingProgress: 50
2018-07-09T22:12:22 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(ContentOperationNotPermittedError) ( "Error downloading https://www.idealista.com/alquiler-viviendas/barcelona/ciutat-vella/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/ - server replied: Unauthorized" ) URL: "https://www.idealista.com/alquiler-viviendas/barcelona/ciutat-vella/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/"
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 50
2018-07-09T22:12:23 [DEBUG] WebPage - setupFrame ""
TypeError: undefined is not an object (evaluating 's')
phantomjs://code/index.js:71 in onError
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 51
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 53
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 55
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 57
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 61
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 63
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 65
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 68
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 71
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 73
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 73
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 75
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 78
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 80
2018-07-09T22:12:23 [DEBUG] WebPage - setupFrame "<!--framePath //<!--frame0-->-->"
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 81
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 84
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 88
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 88
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 88
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 89
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 100
****Status: success****
2018-07-09T22:12:24 [DEBUG] WebPage - evaluateJavaScript "(function() { return (function getDataApartmentsIdealista() {\n var items = document.querySelectorAll(\".item\");\n var recentApartments = [];\n for(var i = 0; i < items.length; i++) {\n var apartment = items[i].querySelector(\".item-link\");\n var price = items[i].querySelector(\".price-row\");\n var redText = items[i].querySelector(\".txt-highlight-red\");\n var url = items[i].querySelector(\".txt-highlight-red\");\n \n recentApartments.push({\n title: apartment && apartment.textContent,\n price: price && price.textContent,\n time: redText && redText.textContent,\n url: apartment && apartment.href\n });\n }\n return recentApartments;\n})(); })()"
2018-07-09T22:12:24 [DEBUG] WebPage - evaluateJavaScript result QVariant(QVariantList, ())
-------------------------------------------------------------------
CIUTAT VELLA: No new apartments
-------------------------------------------------------------------
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 35
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 50
TypeError: undefined is not an object (evaluating 's')
phantomjs://code/index.js:71 in onError
2018-07-09T22:12:24 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(ContentOperationNotPermittedError) ( "Error downloading https://www.idealista.com/alquiler-viviendas/barcelona/eixample/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/ - server replied: Unauthorized" ) URL: "https://www.idealista.com/alquiler-viviendas/barcelona/eixample/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/"
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 100
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame "<!--framePath //<!--frame0-->-->"
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 10
****Status: success****
2018-07-09T22:12:24 [DEBUG] WebPage - evaluateJavaScript "(function() { return (function getDataApartmentsIdealista() {\n var items = document.querySelectorAll(\".item\");\n var recentApartments = [];\n for(var i = 0; i < items.length; i++) {\n var apartment = items[i].querySelector(\".item-link\");\n var price = items[i].querySelector(\".price-row\");\n var redText = items[i].querySelector(\".txt-highlight-red\");\n var url = items[i].querySelector(\".txt-highlight-red\");\n \n recentApartments.push({\n title: apartment && apartment.textContent,\n price: price && price.textContent,\n time: redText && redText.textContent,\n url: apartment && apartment.href\n });\n }\n return recentApartments;\n})(); })()"
2018-07-09T22:12:24 [DEBUG] WebPage - evaluateJavaScript result QVariant(QVariantList, ())
-------------------------------------------------------------------
EIXAMPLE: No new apartments
-------------------------------------------------------------------
2018-07-09T22:12:24 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(OperationCanceledError) ( "Operation canceled" ) URL: "https://www.google.com/recaptcha/api/fallback?k=6Lcj-R8TAAAAABs3FrRPuQhLMbp5QrHsHufzLf7b&hl=en&v=v1529908317173&t=1&ff=true"
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 100
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame ""
****Status: fail****
Error extracting GRACIA
phantomjs://code/index.js:100 in scrapPage
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:25 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(OperationCanceledError) ( "Operation canceled" ) URL: "https://www.idealista.com/alquiler-viviendas/barcelona/gracia/vila-de-gracia/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,amueblado_amueblados,publicado_ultimas-24-horas/"
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 100
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:25 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 100
2018-07-09T22:12:25 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/fs.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/system.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/webpage.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:25 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/fs.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/system.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/webpage.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 100
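The `TypeError` reported at `index.js:71 in onError` suggests that the error handler itself dereferences something that is undefined. The source of `index.js` is not shown here, so the following is only a sketch of a defensive formatter, assuming the handler receives `(msg, trace)` the way PhantomJS's `page.onError` does. `formatPageError` is a hypothetical name, not part of the script.

```javascript
// Hypothetical helper (not from index.js): build a readable message from the
// (msg, trace) pair that PhantomJS passes to page.onError, without assuming
// that trace or any of its frames is defined.
function formatPageError(msg, trace) {
  var lines = ['ERROR: ' + String(msg)];
  var frames = Array.isArray(trace) ? trace : [];
  for (var i = 0; i < frames.length; i++) {
    var frame = frames[i] || {};
    var where = (frame.file || '<unknown>') + ':' + (frame.line || '?');
    if (frame.function) {
      where += ' in ' + frame.function;
    }
    lines.push('  ' + where);
  }
  return lines.join('\n');
}

// Possible wiring inside the PhantomJS script (assumption):
// page.onError = function (msg, trace) {
//   console.log(formatPageError(msg, trace));
// };
```

With a handler like this, a missing or partial stack trace would be logged as `<unknown>:?` instead of raising a second error inside `onError`.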
It looks like a captcha is being served partway through the run, but I still need to investigate further. The relevant log entry:
2018-07-09T22:12:24 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(OperationCanceledError) ( "Operation canceled" ) URL: "https://www.google.com/recaptcha/api/fallback?k=6Lcj-R8TAAAAABs3FrRPuQhLMbp5QrHsHufzLf7b&hl=en&v=v1529908317173&t=1&ff=true"
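To tell these cases apart inside the script, resource errors could be classified as they arrive. This is a minimal sketch, assuming the error object carries the `url` and `errorString` fields that PhantomJS passes to `page.onResourceError`; `classifyResourceError` and its labels are made up for illustration. It only detects the captcha request or the `Unauthorized` reply; it does not get past either of them.

```javascript
// Hypothetical helper (not from index.js): label a resource error using the
// two patterns visible in the debug log above.
function classifyResourceError(resourceError) {
  var url = (resourceError && resourceError.url) || '';
  var reason = (resourceError && resourceError.errorString) || '';
  // A request to Google's reCAPTCHA endpoint means a captcha page was served.
  if (url.indexOf('google.com/recaptcha/') !== -1) {
    return 'captcha';
  }
  // The listing pages in the log fail with "server replied: Unauthorized".
  if (/server replied: Unauthorized/i.test(reason)) {
    return 'blocked';
  }
  return 'other';
}

// Possible wiring inside the PhantomJS script (assumption):
// page.onResourceError = function (resourceError) {
//   var kind = classifyResourceError(resourceError);
//   if (kind !== 'other') {
//     console.log('Request ' + kind + ': ' + resourceError.url);
//   }
// };
```

Reporting `captcha` or `blocked` explicitly would make it clear that an empty result is not the same as "No new apartments".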
Desktop
OS: macOS Sierra 10.12.6
iTerm2 3.1.7
phantomjs 2.2.1
npm 5.6.0
Smartphone: No Smartphone
Additional context
Nothing to add here.
Closing due to inactivity.