samyun / southwest-price-drop-bot

Bot that watches Southwest flights for price drops.

Mongo, improved proxies, and updated scraping logic #49

Closed iloveitaly closed 5 years ago

iloveitaly commented 5 years ago

Lots of improvements!

samyun commented 5 years ago

Awesome! I’ll look through this and merge it in by tonight.

razzamatazm commented 5 years ago

This looks great. I have tried a previously working proxy setup (with both hostname and port) and one through illuminati.io, and I am getting the following errors, along with the price not updating:

Jun 11 13:41:15 swacheck2 app/scheduler.9385: > southwest-price-drop-bot@3.1.4 task:check /app
Jun 11 13:41:15 swacheck2 app/scheduler.9385: > node --trace-warnings tasks/check.js
Jun 11 13:41:16 swacheck2 app/scheduler.9385: (node:23) UnhandledPromiseRejectionWarning: Error: Invalid "proxyUrl" option: the URL must contain both hostname and port.
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Object.anonymizeProxy (/app/node_modules/proxy-chain/build/anonymize_proxy.js:32:15)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at module.exports (/app/lib/browser.js:10:39)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at /app/tasks/check.js:12:23
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Object.<anonymous> (/app/tasks/check.js:75:3)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Module._compile (internal/modules/cjs/loader.js:774:30)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Object.Module._extensions..js (internal/modules/cjs/loader.js:785:10)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Module.load (internal/modules/cjs/loader.js:641:32)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Function.Module._load (internal/modules/cjs/loader.js:556:12)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Function.Module.runMain (internal/modules/cjs/loader.js:837:10)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at internal/main/run_main_module.js:17:11
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at emitWarning (internal/process/promises.js:120:15)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at processPromiseRejections (internal/process/promises.js:168:7)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at processTicksAndRejections (internal/process/task_queues.js:90:32)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: (node:23) Error: Invalid "proxyUrl" option: the URL must contain both hostname and port.
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Object.anonymizeProxy (/app/node_modules/proxy-chain/build/anonymize_proxy.js:32:15)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at module.exports (/app/lib/browser.js:10:39)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at /app/tasks/check.js:12:23
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Object.<anonymous> (/app/tasks/check.js:75:3)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Module._compile (internal/modules/cjs/loader.js:774:30)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Object.Module._extensions..js (internal/modules/cjs/loader.js:785:10)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Module.load (internal/modules/cjs/loader.js:641:32)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Function.Module._load (internal/modules/cjs/loader.js:556:12)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at Function.Module.runMain (internal/modules/cjs/loader.js:837:10)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at internal/main/run_main_module.js:17:11
Jun 11 13:41:16 swacheck2 app/scheduler.9385: (node:23) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at emitDeprecationWarning (internal/process/promises.js:134:13)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at emitWarning (internal/process/promises.js:127:3)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at processPromiseRejections (internal/process/promises.js:168:7)
Jun 11 13:41:16 swacheck2 app/scheduler.9385:     at processTicksAndRejections (internal/process/task_queues.js:90:32)
Jun 11 13:41:16 swacheck2 app/scheduler.9385: mongo successfully connected!
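
The trace above shows the proxy error surfacing only as an unhandled promise rejection. A minimal sketch of one way to catch it at the task's entry point follows; `runTask` is a hypothetical helper and not something taken from the repo, and the commented usage line assumes the task exposes an async `main` function.

```javascript
// Hypothetical wrapper (not from the repo): run an async task and turn any
// rejection into one logged message plus a non-zero exit code.
async function runTask(task, log = console.error) {
  try {
    await task();
    return 0;
  } catch (err) {
    log(`task failed: ${err.message}`);
    return 1;
  }
}

// Possible usage at the bottom of a task script (assumes an async main()):
// runTask(main).then((code) => { process.exitCode = code; });
```

Setting `process.exitCode` rather than calling `process.exit()` lets pending log output flush before the process ends.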

razzamatazm commented 5 years ago

I was able to get the proxy working by including http:// in front of the url. That being said, now it's having issues scraping. See logs:

Jun 11 16:12:28 swacheck2 app/scheduler.5302: mongo successfully connected!
Jun 11 16:12:29 swacheck2 app/scheduler.5302: found 1 alerts, checking...
Jun 11 16:12:29 swacheck2 app/scheduler.5302: lock has available permits: 5
Jun 11 16:12:29 swacheck2 app/scheduler.5302: Entered lock, available permits: 4
Jun 11 16:12:30 swacheck2 app/scheduler.5302: Retrieving URL: https://www.southwest.com/air/booking/select.html?originationAirportCode=LAX&destinationAirportCode=PVR&returnAirportCode=&departureDate=2019-08-22&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true
Jun 11 16:14:32 swacheck2 app/scheduler.5302: Unable to get flights - trying again
Jun 11 16:14:32 swacheck2 app/scheduler.5302:
Jun 11 16:14:32 swacheck2 app/scheduler.5302: {
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   status: '200',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   'content-type': 'text/html; charset=UTF-8',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   'x-ion-hop': '1',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   expires: '0',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   'cache-control': 'no-cache, no-store, must-revalidate',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   pragma: 'no-cache',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   'content-encoding': 'gzip',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   vary: 'Accept-Encoding',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   'x-akamai-transformed': '9 64888 0 pmb=mNONE,1',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   date: 'Tue, 11 Jun 2019 23:12:30 GMT',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   'content-length': '58822',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   'set-cookie': 'akavpau_prod_fullsite=1560294780~id=ec6648d3009d2cd2e75488f337c82749; ' +
Jun 11 16:14:32 swacheck2 app/scheduler.5302:     'Path=/',
Jun 11 16:14:32 swacheck2 app/scheduler.5302:   'strict-transport-security': 'max-age=600'
Jun 11 16:14:32 swacheck2 app/scheduler.5302: }
Jun 11 16:14:32 swacheck2 app/scheduler.5302: 200
Jun 11 16:16:32 swacheck2 app/scheduler.5302: Error: ERROR! Unknown error! Unable to find flight information on page: https://www.southwest.com/air/booking/select.html?originationAirportCode=LAX&destinationAirportCode=PVR&returnAirportCode=&departureDate=2019-08-22&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true
Jun 11 16:16:32 swacheck2 app/scheduler.5302: html:
Jun 11 16:16:32 swacheck2 app/scheduler.5302:     at getPage (/app/lib/bot/get-price.js:212:17)
Jun 11 16:16:32 swacheck2 app/scheduler.5302:     at processTicksAndRejections (internal/process/task_queues.js:89:5)
Jun 11 16:16:32 swacheck2 app/scheduler.5302:     at async getFlights (/app/lib/bot/get-price.js:47:14)
Jun 11 16:16:32 swacheck2 app/scheduler.5302:     at async getPriceForFlight (/app/lib/bot/get-price.js:8:20)
Jun 11 16:16:32 swacheck2 app/scheduler.5302:     at async Alert.getLatestPrice (/app/lib/bot/alert.js:172:19)
Jun 11 16:16:32 swacheck2 app/scheduler.5302:     at async /app/tasks/check.js:33:9
Jun 11 16:16:32 swacheck2 app/scheduler.5302:     at async Promise.all (index 0)
Jun 11 16:16:32 swacheck2 app/scheduler.5302:     at async /app/tasks/check.js:69:5
Jun 11 16:16:32 swacheck2 app/scheduler.5302: No flights found!
Jun 11 16:16:32 swacheck2 app/scheduler.5302: Min price: Infinity
Jun 11 16:16:32 swacheck2 app/scheduler.5302: Got price: 8/22/2019|LAX|PVR|110 { time: 1560294749217, price: Infinity }
Jun 11 16:16:32 swacheck2 app/scheduler.5302: 8/22/2019 #110 LAX → PVR not cheaper
Jun 11 16:16:32 swacheck2 heroku/scheduler.5302: State changed from up to complete
Jun 11 16:16:33 swacheck2 heroku/scheduler.5302: Process exited with status 0
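
The last lines of this log report `Min price: Infinity` and then "not cheaper", so a failed scrape is indistinguishable from a check that found no drop. That is consistent with taking a minimum over an empty list (in JavaScript, `Math.min()` with no arguments returns `Infinity`), though whether the repo computes the price this way is an assumption. A minimal sketch of separating the two cases, with `lowestPrice` and `describeCheck` as hypothetical names:

```javascript
// Hypothetical helper: lowest price among scraped fares, or null when the
// scrape returned nothing usable (instead of the Infinity Math.min() yields).
function lowestPrice(prices) {
  const valid = prices.filter((p) => Number.isFinite(p) && p >= 0);
  return valid.length === 0 ? null : Math.min(...valid);
}

// Hypothetical helper: report a scrape failure separately from "not cheaper".
function describeCheck(previousPrice, prices) {
  const latest = lowestPrice(prices);
  if (latest === null) return 'scrape failed';
  return latest < previousPrice ? 'cheaper' : 'not cheaper';
}
```

With this shape, an empty result can trigger a retry or an alert to the operator rather than being stored as a price.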

iloveitaly commented 5 years ago

@razzamatazm

...previously working proxy setup

How recently was this working? If you revert to your previous setup are you able to scrape successfully?

Since I posted this PR it looks like SW is blocking requests (from a proxy or my local connection). It looks like they've updated their bot detection system, and it's gotten much much better.

razzamatazm commented 5 years ago

@iloveitaly It had been working before their bot detection was first implemented. I was able to get past the error from my first post by including "http://" in the proxy var, but now the app is having trouble scraping the price. I was initially searching an international flight booked with points, so to test I tried a US flight booked with cash, and it's still having issues.
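
The "http://" fix described above can be sketched as a small normalizer that runs before the proxy setting reaches the proxy library. `normalizeProxyUrl` is a hypothetical helper, not part of the repo or of proxy-chain, and the host names in the usage lines are placeholders.

```javascript
// Hypothetical helper: accept "host:port" or a full URL and return a URL
// string with a scheme, rejecting values that lack a hostname or a port.
function normalizeProxyUrl(raw) {
  const value = String(raw || '').trim();
  if (value === '') throw new Error('Proxy URL is empty');

  const schemePattern = /^[a-z][a-z0-9+.-]*:\/\//i;
  const candidate = schemePattern.test(value) ? value : `http://${value}`;

  const parsed = new URL(candidate); // throws TypeError on malformed input

  // The URL parser drops a scheme's default port (80 for http), so the
  // explicit port is checked on the raw authority part instead.
  const authority = candidate.replace(schemePattern, '').split('/')[0];
  if (!parsed.hostname || !/:\d+$/.test(authority)) {
    throw new Error(`Proxy URL "${candidate}" must contain both hostname and port`);
  }
  return candidate;
}

console.log(normalizeProxyUrl('proxy.example.com:8080'));
console.log(normalizeProxyUrl('http://user:secret@proxy.example.com:8080'));
```

Failing early with a clear message avoids the unhandled rejection seen in the first log.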

samyun commented 5 years ago

I'm seeing the same thing - looks like an Akamai block.

razzamatazm commented 5 years ago

@samyun and @iloveitaly - I set up a proxy server at my homelab and still ran into the issues, with no problems accessing the Southwest site through a browser. Not sure if it's Akamai in this case.

razzamatazm commented 5 years ago

https://github.com/pyro2927/SouthwestCheckin/ <-- This is working as of now. I wonder if we can pull some of the techniques used. It uses the mobile API.

iloveitaly commented 5 years ago

@razzamatazm ah, interesting! I didn't realize there was a mobile API. Looks like the flight cost endpoint hasn't been figured out yet. Any ideas on how to hit it?

@samyun I'm pretty sure it's not an Akamai block. Here's why:

  1. curl https://www.southwest.com/air/booking/select.html. There's some analytics code, some obfuscated code, and then a snippet that hits a unique token URL on the root SW domain once the page has loaded and then reloads the page. swa-common is loaded on this page as well, but I'm not sure if it's a duplicate of the inline JS or not (my hunch is it is).
  2. If you run the obfuscated code through jsnice.org you'll see they are doing some really fancy obfuscation. Find the var assigned to Object.create(null), find where it's actively used, and add a debugger call next to it. You'll need to do some fiddling to find the right place. You can pull the code into a standalone HTML file to fiddle with it locally.
  3. If you do that, you'll see a list of the properties that are being checked. This is helpful, but it's hard to figure out exactly what is being checked and how. I think what happens is they are checked, serialized into some sort of string, and then added as a header which is then sent to the southwest.com/TOKEN URL specified in the initial page load. Fancy stuff!

I went ahead and did this one last time and realized the flags I had set to disable the WebGL/GPU stuff were causing the issue. This is now working again!
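
The thread does not name the flags that were removed. As a rough sketch only, assuming they were Chromium switches of the `--disable-gpu` kind, a launch-argument list could be filtered like this; `withoutGpuDisablingFlags` and the blocklist contents are hypothetical.

```javascript
// Assumption, not confirmed by the thread: the sort of Chromium switch that
// turns off GPU features and may change what a fingerprinting script sees.
const GPU_DISABLING_FLAGS = ['--disable-gpu'];

// Hypothetical helper: drop blocked switches, matching on the part before "=".
function withoutGpuDisablingFlags(args, blocked = GPU_DISABLING_FLAGS) {
  return args.filter((arg) => !blocked.includes(arg.split('=')[0]));
}

const launchArgs = withoutGpuDisablingFlags([
  '--no-sandbox',
  '--disable-gpu',
  '--proxy-server=http://127.0.0.1:8000',
]);
console.log(launchArgs);
```

The resulting array is the kind of value that would be passed as the `args` option when launching a headless browser.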

iloveitaly commented 5 years ago

Hmm, now it's not working for me. No idea why. Can you guys try HEAD and see if it works for you?

razzamatazm commented 5 years ago

I'm getting build errors on Heroku

info fsevents@1.2.9: The platform "linux" is incompatible with this module.
info "fsevents@1.2.9" is an optional dependency and failed compatibility check. Excluding it from installation.
error fsevents@2.0.7: The platform "linux" is incompatible with this module.
error Found incompatible module.
info Visit https://yarnpkg.com/en/docs/cli/install for documentation about this command.
-----> Build failed

       We're sorry this build is failing! You can troubleshoot common issues here:
       https://devcenter.heroku.com/articles/troubleshooting-node-deploys

       Some possible problems:

       - Dangerous semver range (>) in engines.node
         https://devcenter.heroku.com/articles/nodejs-support#specifying-a-node-js-version

       Love,
       Heroku

 !     Push rejected, failed to compile Node.js app.
 !     Push failed


razzamatazm commented 5 years ago

New, different, exciting errors :)

Jun 13 11:52:55 swacheck3 heroku/router: at=info method=GET path="/style.css" host=swacheck3.herokuapp.com request_id=1fad41d7-311b-41c0-9876-8bd743dc5526 fwd="67.53.122.46" dyno=web.1 connect=0ms service=6ms status=304 bytes=269 protocol=https
Jun 13 11:52:55 swacheck3 heroku/router: at=info method=GET path="/logo.png" host=swacheck3.herokuapp.com request_id=3af111c9-21b6-4717-b8be-3e059f01326a fwd="67.53.122.46" dyno=web.1 connect=0ms service=10ms status=304 bytes=271 protocol=https
Jun 13 11:52:55 swacheck3 app/web.1: Retrieving URL: https://www.southwest.com/air/booking/select.html?originationAirportCode=LAX&destinationAirportCode=PHX&returnAirportCode=&departureDate=2019-08-22&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true
Jun 13 11:52:58 swacheck3 app/web.1: PAGE LOG: Failed to load resource: net::ERR_FAILED
Jun 13 11:52:58 swacheck3 app/web.1: PAGE LOG: Failed to load resource: the server responded with a status of 403 ()
Jun 13 11:54:57 swacheck3 app/web.1: Unable to get flights - trying again
Jun 13 11:54:57 swacheck3 app/web.1:
Jun 13 11:54:57 swacheck3 app/web.1: {
Jun 13 11:54:57 swacheck3 app/web.1:   status: '200',
Jun 13 11:54:57 swacheck3 app/web.1:   'content-type': 'text/html; charset=UTF-8',
Jun 13 11:54:57 swacheck3 app/web.1:   'x-ion-hop': '1',
Jun 13 11:54:57 swacheck3 app/web.1:   expires: '0',
Jun 13 11:54:57 swacheck3 app/web.1:   'cache-control': 'no-cache, no-store, must-revalidate',
Jun 13 11:54:57 swacheck3 app/web.1:   pragma: 'no-cache',
Jun 13 11:54:57 swacheck3 app/web.1:   'content-encoding': 'gzip',
Jun 13 11:54:57 swacheck3 app/web.1:   vary: 'Accept-Encoding',
Jun 13 11:54:57 swacheck3 app/web.1:   'x-akamai-transformed': '9 - 0 pmb=mNONE,1',
Jun 13 11:54:57 swacheck3 app/web.1:   date: 'Thu, 13 Jun 2019 18:52:56 GMT',
Jun 13 11:54:57 swacheck3 app/web.1:   'content-length': '58937',
Jun 13 11:54:57 swacheck3 app/web.1:   'set-cookie': 'akavpau_prod_fullsite=1560452006~id=bf704764f98270f44819cda28444db01; ' +
Jun 13 11:54:57 swacheck3 app/web.1:     'Path=/',
Jun 13 11:54:57 swacheck3 app/web.1:   'strict-transport-security': 'max-age=600'
Jun 13 11:54:57 swacheck3 app/web.1: }
Jun 13 11:54:57 swacheck3 app/web.1: 200
Jun 13 11:54:58 swacheck3 app/web.1: PAGE LOG: Failed to load resource: the server responded with a status of 403 ()
Jun 13 11:56:58 swacheck3 app/web.1: Error: ERROR! Unknown error! Unable to find flight information on page: https://www.southwest.com/air/booking/select.html?originationAirportCode=LAX&destinationAirportCode=PHX&returnAirportCode=&departureDate=2019-08-22&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true
Jun 13 11:56:58 swacheck3 app/web.1: html:
Jun 13 11:56:58 swacheck3 app/web.1:     at getPage (/app/lib/bot/get-price.js:240:17)
Jun 13 11:56:58 swacheck3 app/web.1:     at processTicksAndRejections (internal/process/task_queues.js:89:5)
Jun 13 11:56:58 swacheck3 app/web.1:     at async getFlights (/app/lib/bot/get-price.js:51:14)
Jun 13 11:56:58 swacheck3 app/web.1:     at async getPriceForFlight (/app/lib/bot/get-price.js:8:20)
Jun 13 11:56:58 swacheck3 app/web.1:     at async Alert.getLatestPrice (/app/lib/bot/alert.js:172:19)
Jun 13 11:56:58 swacheck3 app/web.1:     at async /app/lib/apps/app.js:72:3
Jun 13 11:56:58 swacheck3 app/web.1: No flights found!
Jun 13 11:56:58 swacheck3 app/web.1: Min price: Infinity
Jun 13 11:56:58 swacheck3 app/web.1: Got price: 8/22/2019|LAX|PHX|1121 { time: 1560451974957, price: Infinity }

iloveitaly commented 5 years ago

@razzamatazm yup, the 403 is SW blocking us. No idea how to get around this. I think it has something to do with the IP used, but I can't be sure.

razzamatazm commented 5 years ago

I can reach the site using Chrome at my home, via the same proxy. So strange.


iloveitaly commented 5 years ago

@razzamatazm is that using this repo, or by manually accessing it via standard Chrome?

I think what's going on is SW is associating a browser fingerprint with an IP and then blocking that IP. I know somewhere in the SW code they are checking the __webdriver_script_fn var, which is not hidden by the evasions currently implemented.

https://allinonescript.com/index.php/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver?sort=creation
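
To make the fingerprinting idea concrete, here is an illustrative sketch of the kind of check a detection script could run. It is not Southwest's code. `findAutomationMarkers` is a hypothetical function, placing `__webdriver_script_fn` on `document` is an assumption based on the linked discussion, and `navigator.webdriver` is the standard WebDriver flag.

```javascript
// Marker names looked up on document. Only __webdriver_script_fn comes from
// the comment above; its location on document is an assumption.
const DOCUMENT_MARKERS = ['__webdriver_script_fn'];

// Hypothetical check: return the automation markers visible on a window-like object.
function findAutomationMarkers(win) {
  const found = [];
  if (win.navigator && win.navigator.webdriver === true) {
    found.push('navigator.webdriver');
  }
  for (const name of DOCUMENT_MARKERS) {
    if (win.document && name in win.document) {
      found.push(`document.${name}`);
    }
  }
  return found;
}

// Plain objects stand in for a real browser window here.
const automated = { navigator: { webdriver: true }, document: { __webdriver_script_fn: () => {} } };
const regular = { navigator: { webdriver: false }, document: {} };
console.log(findAutomationMarkers(automated));
console.log(findAutomationMarkers(regular));
```

Running the same function inside the scraped page (for example through the automation library's page-evaluation call) would show which markers the current evasions leave exposed.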

I think the best option is to use the mobile API, but it doesn't look like the price check endpoint has been figured out yet (and I don't have the time to tinker with it).

In any case, this is a huge improvement over what was there, although it doesn't actually work :(

samyun commented 5 years ago

I went ahead and merged this in - I found some other evasion repos I'm going to try to work in. Thanks for your help!

iloveitaly commented 5 years ago

@samyun awesome! It's worth noting that this is now working locally again. I think there is some sort of IP block triggered by repeated requests for the same flight (or something along those lines... just guessing really). Keep us posted on what you find!