scrapinghub / splash

Lightweight, scriptable browser as a service with an HTTP API
BSD 3-Clause "New" or "Revised" License

Splash not loading search results #1069

Open decolon opened 4 years ago

decolon commented 4 years ago

I am trying to use Splash to render the following page

https://www.wholefoodsmarket.com/stores

My goal is to load the page, type "tampa" into the search box, press Enter, and then get the resulting data.

Here is my Lua script

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))

  --splash:select('.NavSearch-Input--3VNY4'):focus()
  splash:select('.wfm-search-bar--input'):focus()
  splash:send_text("tampa")
  splash:send_keys("<Return>")
  assert(splash:wait(8))

  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

I'm running Splash 3.5 locally in a Docker container (screenshot: splash).

I expect something like this, where all the store data has been loaded (screenshot: wf).

Instead I get this render and this HTML (attachment: www.wholefoodsmarket.com(1).txt), where no data has been loaded.

How can I have the Lua script type in the text, and then wait for WF to update all the data?
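A fixed splash:wait(8) only helps if the results happen to arrive inside that window. A minimal sketch of an alternative, assuming the search does fire, is to poll for an element that exists only once the results are rendered. The helper name wait_for_selector and the '.store-result' selector are made up for illustration (the real class name would have to be looked up on the WF page in a browser); splash:select, splash:wait, splash:send_text and splash:send_keys are the regular Splash scripting calls.

```lua
-- Poll until `selector` matches an element or `timeout` seconds have elapsed.
-- Returns the matched element, or nil on timeout.
local function wait_for_selector(splash, selector, timeout, interval)
  interval = interval or 0.25
  local waited = 0
  while waited < timeout do
    local element = splash:select(selector)
    if element then
      return element
    end
    splash:wait(interval)
    waited = waited + interval
  end
  return nil
end

function main(splash, args)
  assert(splash:go(args.url))

  -- Wait for the search box instead of guessing a delay.
  local input = assert(
    wait_for_selector(splash, '.wfm-search-bar--input', 10),
    'search input never appeared'
  )
  input:focus()
  splash:send_text('tampa')
  splash:send_keys('<Return>')

  -- '.store-result' is a hypothetical selector for one rendered store entry.
  local result = wait_for_selector(splash, '.store-result', 15)

  return {
    html = splash:html(),
    png = splash:png(),
    results_loaded = (result ~= nil),
  }
end
```

If results_loaded comes back false, the search itself probably never ran, and a longer wait is unlikely to help.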

decolon commented 3 years ago

Small update. I was able to get product searches working on another website (https://www.target.com) using this script

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(1))
  local search = splash:select('.searchInputForm')
  assert(search:fill({searchTerm='first'}))
  assert(search:submit())
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

I assume this works because Target uses a form, so I was able to use the element:submit() method.

Unfortunately WF does not use a form; it simply executes JS as you type into the text field. That JS is not executing in Splash.

I tried again with this script

function main(splash, args)
  splash.private_mode_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(5))
  splash:select('.wfm-search-bar--input'):focus()
  splash:send_text("9430")
  splash:send_keys("<Enter>")
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

But no luck.

From the screenshots Splash is sending me, it does not look like the search is actually executed when I send the Enter key, and I don't know how else to trigger it since there is no form to submit.

decolon commented 3 years ago

Another update.

I compared the HAR file generated by Splash with the one generated by Chrome when I manually navigate to the page, type in the text, and press Enter.

The HAR files look the same except for one key difference.

When I press Enter in Chrome, it sends a POST request to "https://www.wholefoodsmarket.com/stores/search" with the query content set to

{
  "query": [
"AYAAFHPKKa5iv562J/dCZ4RKi8cALgACABFvcmlnaW5hbEZpZWxkTmFtZQAFcXVlcnkADWZyYWdtZW50SW5kZXgAATAAAQAGc2k6bWQ1ACBjNjEyNzMwZmM4NzA0NjYyMThkZjRiMjc1YWFlYTJiNQEAkhHZULKR+mEiGpm2mSZ1bxtY4fijAPYlcc6HRDPtXHI4LD3lVKVzVTp3vuPsokbTS3xsgFZmKMFnJjNflGk3Oo2kBK2f9vgx7LbFBrgNm/lAz2gIKtLxAcz4LxLLqG0oUVoPQhBxVa4AObhENoxGuXHsHoFcHh12WvxAVg6O9/3/xY+lY1QrEgevfTSUfcG3oeFDvrqtyiTMVANAWXYeh5KnKMNT893phhKAdbhLRZFQtqpQ423NrE6PH/H63WrYadZnbFb9YNClH5p+O5POVmGKwRwjPMceRerXWa9RH5+J18l015U70XYHK2qviNfHfMxWwteZySES1XcyzG1irQIAAAAADAAAACYAAAAAAAAAAAAAAAB6RSG+yMRwbWO8FvxfIxka/////wAAAAEAAAAAAAAAAAAAAAEAAAAlV1luYaYc0SiCIwi9O3fGtYu0w8I50hG9VA3fgLvZfVyXdawm9wnr8aD8qk7gaMbK0bxeFGk="
  ]
}

When I look at Splash's HAR file, the query content is left blank.

Why would Splash not send the query content? Does it stop some JS from being run after loading, even if I interact with the page through a Lua script?
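To compare the POST requests without reading the HAR by hand, they can be pulled out inside the script itself. This is only a sketch: find_posts is a hypothetical helper, and it assumes splash:har() returns the usual HAR layout (log.entries[i].request) as a Lua table. As far as I know, Splash records request bodies in the HAR only when splash.request_body_enabled is set, so the sketch switches it on first; treat that as an assumption to check against the docs for your version.

```lua
-- Collect every POST request in a HAR table whose URL contains `url_part`.
local function find_posts(har, url_part)
  local posts = {}
  for _, entry in ipairs(har.log.entries) do
    local request = entry.request
    -- The `true` flag makes find() do a plain substring search, no Lua patterns.
    if request.method == 'POST' and request.url:find(url_part, 1, true) then
      posts[#posts + 1] = {
        url = request.url,
        -- postData is absent when the body was empty or was not recorded.
        body = request.postData and request.postData.text or '',
      }
    end
  end
  return posts
end

function main(splash, args)
  -- Assumption: this property makes Splash store request bodies in the HAR.
  splash.request_body_enabled = true
  assert(splash:go(args.url))
  assert(splash:wait(5))
  splash:select('.wfm-search-bar--input'):focus()
  splash:send_text('tampa')
  splash:send_keys('<Return>')
  assert(splash:wait(5))
  return { posts = find_posts(splash:har(), '/stores/search') }
end
```

An empty body for the /stores/search entry would match what the HAR file shows; no entry at all would mean the request was never sent.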

decolon commented 3 years ago

Another update. Big shout out to @phrfpeixoto from Scrapinghub for helping me get this far.

It looks like handling cookies did not fix the issue. Looking into the JS does not reveal many leads (see the deobfuscated JS file @phrfpeixoto was able to put together below), and the query data seems to be encrypted, so we cannot simply mimic the POST request. Even trying through Splash's kernel in interactive mode did not help.

wholefoods.txt

Everything seems to point to this site not being compatible with Splash, which is surprising since it works easily in tools like Selenium.

I have also tried some other sites, and they also do not seem to run JS after the initial page load. For instance, no matter how long I wait, https://www.ip2location.com/ will not fill in the location values, while it does so right away with Selenium.

Does Splash stop executing site JS after the initial load?
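One way to test that question directly, rather than inferring it from a particular site, is to start a timer in the page and check whether it keeps firing while the script waits. A minimal sketch: count_ticks and window.__ticks are names invented here, while splash:runjs, splash:wait and splash:evaljs are standard Splash calls.

```lua
-- Start a 100 ms timer in the page, wait, and report how often it fired.
local function count_ticks(splash, seconds)
  assert(splash:runjs([[
    window.__ticks = 0;
    setInterval(function () { window.__ticks += 1; }, 100);
  ]]))
  assert(splash:wait(seconds))
  return splash:evaljs('window.__ticks')
end

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(1))
  return { ticks = count_ticks(splash, 2) }
end
```

A ticks value well above zero would mean that timers, and so page JS in general, keep running after the initial load, which would point the problem elsewhere (for example at how the site reacts to Splash's key events). A value of 0 would support the suspicion.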

manueltg89 commented 3 years ago

I have the same question. My Splash example isn't running AJAX calls after the initial load. I think Splash has that problem; it does not behave like a complete headless browser.