oscard0m / rent-flat-scraper

Scraper with last published flats in idealista
MIT License

Avoid the captcha when scraping Idealista flats #25

Closed oscard0m closed 4 years ago

oscard0m commented 6 years ago

Describe the bug

The scraper is stopped by Idealista's reCAPTCHA check; the goal is to be able to scrape Idealista past that check. Example URL being scraped: https://www.idealista.com/alquiler-viviendas/barcelona/ciutat-vella/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/

There is a captcha (screenshot attached to the original issue).

Fixes #13

To Reproduce

Run the script: npm start

Expected behavior

The output lists all the available flats from Idealista for the given URLs.

Screenshots

I ran it in debug mode (npm run debug) and got the following output:

> phantomjs --ssl-protocol=any --ignore-ssl-errors=true --debug=true index.js

2018-07-09T22:12:21 [DEBUG] CookieJar - Created but will not store cookies (use option '--cookies-file=<filename>' to enable persistent cookie storage)
2018-07-09T22:12:22 [DEBUG] Set  "http"  proxy to:  "" : 1080
2018-07-09T22:12:22 [DEBUG] Phantom - execute: Configuration
2018-07-09T22:12:22 [DEBUG]      0 objectName : ""
2018-07-09T22:12:22 [DEBUG]      1 cookiesFile : ""
2018-07-09T22:12:22 [DEBUG]      2 diskCacheEnabled : "false"
2018-07-09T22:12:22 [DEBUG]      3 maxDiskCacheSize : "-1"
2018-07-09T22:12:22 [DEBUG]      4 diskCachePath : ""
2018-07-09T22:12:22 [DEBUG]      5 ignoreSslErrors : "true"
2018-07-09T22:12:22 [DEBUG]      6 localUrlAccessEnabled : "true"
2018-07-09T22:12:22 [DEBUG]      7 localToRemoteUrlAccessEnabled : "false"
2018-07-09T22:12:22 [DEBUG]      8 outputEncoding : "UTF-8"
2018-07-09T22:12:22 [DEBUG]      9 proxyType : "http"
2018-07-09T22:12:22 [DEBUG]      10 proxy : ":1080"
2018-07-09T22:12:22 [DEBUG]      11 proxyAuth : ":"
2018-07-09T22:12:22 [DEBUG]      12 scriptEncoding : "UTF-8"
2018-07-09T22:12:22 [DEBUG]      13 webSecurityEnabled : "true"
2018-07-09T22:12:22 [DEBUG]      14 offlineStoragePath : ""
2018-07-09T22:12:22 [DEBUG]      15 localStoragePath : ""
2018-07-09T22:12:22 [DEBUG]      16 localStorageDefaultQuota : "-1"
2018-07-09T22:12:22 [DEBUG]      17 offlineStorageDefaultQuota : "-1"
2018-07-09T22:12:22 [DEBUG]      18 printDebugMessages : "true"
2018-07-09T22:12:22 [DEBUG]      19 javascriptCanOpenWindows : "true"
2018-07-09T22:12:22 [DEBUG]      20 javascriptCanCloseWindows : "true"
2018-07-09T22:12:22 [DEBUG]      21 sslProtocol : "any"
2018-07-09T22:12:22 [DEBUG]      22 sslCiphers : "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-RC4-SHA:ECDHE-RSA-RC4-SHA:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES128-SHA:AES256-SHA:DES-CBC3-SHA:RC4-SHA:RC4-MD5"
2018-07-09T22:12:22 [DEBUG]      23 sslCertificatesPath : ""
2018-07-09T22:12:22 [DEBUG]      24 sslClientCertificateFile : ""
2018-07-09T22:12:22 [DEBUG]      25 sslClientKeyFile : ""
2018-07-09T22:12:22 [DEBUG]      26 sslClientKeyPassphrase : ""
2018-07-09T22:12:22 [DEBUG]      27 webdriver : ":"
2018-07-09T22:12:22 [DEBUG]      28 webdriverLogFile : ""
2018-07-09T22:12:22 [DEBUG]      29 webdriverLogLevel : "INFO"
2018-07-09T22:12:22 [DEBUG]      30 webdriverSeleniumGridHub : ""
2018-07-09T22:12:22 [DEBUG] Phantom - execute: Script & Arguments
2018-07-09T22:12:22 [DEBUG]      script: "index.js"
2018-07-09T22:12:22 [DEBUG] Phantom - execute: Starting normal mode
2018-07-09T22:12:22 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:22 [DEBUG] FileSystem - _open: ":/modules/fs.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:22 [DEBUG] FileSystem - _open: ":/modules/system.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:22 [DEBUG] FileSystem - _open: ":/modules/webpage.js" QMap(("mode", QVariant(QString, "r")))

-------------------------------------------------------------------
              IDEALISTA
-------------------------------------------------------------------

2018-07-09T22:12:22 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:22 [DEBUG] WebPage - updateLoadingProgress: 50
2018-07-09T22:12:22 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(ContentOperationNotPermittedError) ( "Error downloading https://www.idealista.com/alquiler-viviendas/barcelona/ciutat-vella/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/ - server replied: Unauthorized" ) URL: "https://www.idealista.com/alquiler-viviendas/barcelona/ciutat-vella/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/"
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 50
2018-07-09T22:12:23 [DEBUG] WebPage - setupFrame ""
TypeError: undefined is not an object (evaluating 's')

  phantomjs://code/index.js:71 in onError
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 51
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 53
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 55
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 57
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 61
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 63
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 65
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 68
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 71
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 73
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 73
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 75
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 78
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 80
2018-07-09T22:12:23 [DEBUG] WebPage - setupFrame "<!--framePath //<!--frame0-->-->"
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 81
2018-07-09T22:12:23 [DEBUG] WebPage - updateLoadingProgress: 84
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 88
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 88
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 88
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 89
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 100
****Status: success****

2018-07-09T22:12:24 [DEBUG] WebPage - evaluateJavaScript "(function() { return (function getDataApartmentsIdealista() {\n    var items = document.querySelectorAll(\".item\");\n    var recentApartments = [];\n    for(var i = 0; i < items.length; i++) {\n        var apartment = items[i].querySelector(\".item-link\");\n        var price = items[i].querySelector(\".price-row\");\n        var redText = items[i].querySelector(\".txt-highlight-red\");\n        var url = items[i].querySelector(\".txt-highlight-red\");\n        \n        recentApartments.push({\n            title: apartment && apartment.textContent,\n            price: price && price.textContent,\n            time: redText && redText.textContent,\n            url: apartment && apartment.href\n        });\n    }\n    return recentApartments;\n})(); })()"
2018-07-09T22:12:24 [DEBUG] WebPage - evaluateJavaScript result QVariant(QVariantList, ())

-------------------------------------------------------------------
              CIUTAT VELLA: No new apartments
-------------------------------------------------------------------

2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 35
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 50
TypeError: undefined is not an object (evaluating 's')

  phantomjs://code/index.js:71 in onError
2018-07-09T22:12:24 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(ContentOperationNotPermittedError) ( "Error downloading https://www.idealista.com/alquiler-viviendas/barcelona/eixample/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/ - server replied: Unauthorized" ) URL: "https://www.idealista.com/alquiler-viviendas/barcelona/eixample/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,publicado_ultimas-24-horas/"
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 100
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame "<!--framePath //<!--frame0-->-->"
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 10
****Status: success****

2018-07-09T22:12:24 [DEBUG] WebPage - evaluateJavaScript "(function() { return (function getDataApartmentsIdealista() {\n    var items = document.querySelectorAll(\".item\");\n    var recentApartments = [];\n    for(var i = 0; i < items.length; i++) {\n        var apartment = items[i].querySelector(\".item-link\");\n        var price = items[i].querySelector(\".price-row\");\n        var redText = items[i].querySelector(\".txt-highlight-red\");\n        var url = items[i].querySelector(\".txt-highlight-red\");\n        \n        recentApartments.push({\n            title: apartment && apartment.textContent,\n            price: price && price.textContent,\n            time: redText && redText.textContent,\n            url: apartment && apartment.href\n        });\n    }\n    return recentApartments;\n})(); })()"
2018-07-09T22:12:24 [DEBUG] WebPage - evaluateJavaScript result QVariant(QVariantList, ())

-------------------------------------------------------------------
              EIXAMPLE: No new apartments
-------------------------------------------------------------------

2018-07-09T22:12:24 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(OperationCanceledError) ( "Operation canceled" ) URL: "https://www.google.com/recaptcha/api/fallback?k=6Lcj-R8TAAAAABs3FrRPuQhLMbp5QrHsHufzLf7b&hl=en&v=v1529908317173&t=1&ff=true"
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 100
2018-07-09T22:12:24 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame ""
****Status: fail****

Error extracting GRACIA

  phantomjs://code/index.js:100 in scrapPage
2018-07-09T22:12:24 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:25 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(OperationCanceledError) ( "Operation canceled" ) URL: "https://www.idealista.com/alquiler-viviendas/barcelona/gracia/vila-de-gracia/con-precio-hasta_1300,metros-cuadrados-mas-de_80,de-dos-dormitorios,de-tres-dormitorios,de-cuatro-cinco-habitaciones-o-mas,amueblado_amueblados,publicado_ultimas-24-horas/"
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 100
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:25 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 100
2018-07-09T22:12:25 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/fs.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/system.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/webpage.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 10
2018-07-09T22:12:25 [DEBUG] WebPage - setupFrame ""
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/fs.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/system.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] FileSystem - _open: ":/modules/webpage.js" QMap(("mode", QVariant(QString, "r")))
2018-07-09T22:12:25 [DEBUG] WebPage - updateLoadingProgress: 100
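
The log above also shows a secondary problem: the script's own onError handler throws a TypeError (index.js:71) while reporting a page error. The source of index.js is not included in this issue, so the following is only a sketch of a defensive formatter, assuming the handler receives the usual (msg, trace) pair that PhantomJS passes to page.onError. The name formatPageError is hypothetical and not taken from the repository.

```javascript
// Hypothetical helper (not from the repository): formats the (msg, trace)
// arguments PhantomJS passes to page.onError, tolerating a missing or
// malformed trace so the handler itself cannot throw.
function formatPageError(msg, trace) {
  var lines = ['ERROR: ' + String(msg)];
  var frames = Array.isArray(trace) ? trace : [];
  for (var i = 0; i < frames.length; i++) {
    var frame = frames[i] || {};
    var location = (frame.file || '<unknown>') + ':' + (frame.line || '?');
    var fn = frame['function'] ? ' in ' + frame['function'] : '';
    lines.push('  -> ' + location + fn);
  }
  return lines.join('\n');
}

// Possible wiring, assuming `page` was created with require('webpage').create():
// page.onError = function (msg, trace) {
//   console.log(formatPageError(msg, trace));
// };
```

Written in ES5 so it can run inside PhantomJS as well as in Node.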

It seems a captcha is being served in the middle of the run, but I still have to dig deeper.

2018-07-09T22:12:24 [DEBUG] Network - Resource request error: QNetworkReply::NetworkError(OperationCanceledError) ( "Operation canceled" ) URL: "https://www.google.com/recaptcha/api/fallback?k=6Lcj-R8TAAAAABs3FrRPuQhLMbp5QrHsHufzLf7b&hl=en&v=v1529908317173&t=1&ff=true"
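
In the log, a blocked page is still reported as "No new apartments" because the extraction simply returns an empty list. A minimal sketch of how the script could tell the two cases apart follows. It assumes the errors are collected from page.onResourceError, whose argument carries url and errorString fields. The function names classifyResourceError and summarize are hypothetical. The sketch only detects and reports the block; it does not get past the captcha.

```javascript
// Hypothetical classifier (not from the repository) for the objects
// PhantomJS passes to page.onResourceError ({ url, errorCode, errorString }).
function classifyResourceError(resourceError) {
  var url = (resourceError && resourceError.url) || '';
  var text = (resourceError && resourceError.errorString) || '';
  if (url.indexOf('google.com/recaptcha/') !== -1) {
    return 'captcha';
  }
  if (/server replied: Unauthorized/i.test(text)) {
    return 'unauthorized';
  }
  return 'other';
}

// Summarises one scraped page: an empty result only means
// "No new apartments" when no collected error suggests a block.
function summarize(apartments, resourceErrors) {
  for (var i = 0; i < resourceErrors.length; i++) {
    var kind = classifyResourceError(resourceErrors[i]);
    if (kind !== 'other') {
      return 'blocked (' + kind + ')';
    }
  }
  return apartments.length === 0
    ? 'No new apartments'
    : apartments.length + ' new apartment(s)';
}

// Possible wiring, assuming `page` was created with require('webpage').create():
// var errors = [];
// page.onResourceError = function (resourceError) { errors.push(resourceError); };
```

The matched strings are taken from the debug output in this issue, so they may need adjusting if Idealista or PhantomJS changes its messages.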

Desktop

OS: macOS Sierra 10.12.6
iTerm2 3.1.7
phantomjs 2.2.1
npm 5.6.0

Smartphone: No Smartphone

Additional context

Nothing to add here.

oscard0m commented 4 years ago

Closing due to inactivity.