ulixee / hero

The web browser built for scraping
MIT License

[Question] How to intercept HTTP GET responses? #170

Closed massafilippo97 closed 1 year ago

massafilippo97 commented 1 year ago

Hi! While scraping websites, I also need to save all received text-based content to the local file system. In Puppeteer, I can do something like this to intercept every HTTP response the browser receives:

const fse = require('fs-extra'); // provides outputFileSync, which creates missing directories

// resultsPath is assumed to be defined elsewhere in the script.
page.on('response', async (response) => {
  const status = response.status();
  // Skip redirects (Error: Response body is unavailable for redirect responses)
  if (status >= 300 && status <= 399) return;
  try {
    const buffer = await response.buffer();
    // Skip binary content: only text survives a UTF-8 round trip unchanged
    if (Buffer.compare(Buffer.from(buffer.toString(), 'utf8'), buffer) !== 0) return;

    const contentType = response.headers()['content-type'];
    // Match the correct file extension before saving the file locally
    let extension;
    switch (contentType) {
      case 'application/javascript':
      case 'text/javascript':
        extension = '.js';
        break;
      case undefined:
        extension = '';
        break;
      default:
        extension = '.' + contentType.split('/')[1].split(';')[0];
        break;
    }
    const fileName = (Math.random() + 1).toString(36).substring(2) + extension;
    fse.outputFileSync(resultsPath + '/files/' + fileName, buffer);
  } catch (e) {
    console.log('Resource download failed');
  }
});

My problem is that I don't know how to do the same with Hero. I've checked the documentation, but I still can't tell whether it's possible and, if it is, how I should do it.

blakebyrnes commented 1 year ago

You have a few options.

  1. All resources are stored in your session database (https://ulixee.org/docs/hero/advanced-concepts/sessions). NOTE: they're stored in the original encoding they arrived in (e.g., they might still be compressed).
  2. You can do hero.activeTab.on('resource', resource => ... and access them that way.
  3. You can use waitForResource or waitForResources if you want it to be part of your flow.
  4. You can use findResources if you want to look them up at the end of your script.
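
Regarding the note in option 1: a body read directly from the session database may still need to be decompressed before it is readable. The following is only a minimal sketch using Node's built-in zlib module. The helper name decodeBody is hypothetical, and how the body and its content-encoding value are read from the database is assumed and not shown.

```javascript
const zlib = require('node:zlib');

// Hypothetical helper (not part of Hero): decode a stored body according to
// its Content-Encoding value. Reading the body and header from the session
// database is assumed to happen elsewhere.
function decodeBody(body, contentEncoding) {
  const encoding = (contentEncoding || '').trim().toLowerCase();
  switch (encoding) {
    case 'gzip':
    case 'x-gzip':
      return zlib.gunzipSync(body);
    case 'deflate':
      // zlib-wrapped deflate; some servers send raw deflate instead
      return zlib.inflateSync(body);
    case 'br':
      return zlib.brotliDecompressSync(body);
    case '':
    case 'identity':
      return body; // stored uncompressed
    default:
      throw new Error(`Unsupported content-encoding: ${encoding}`);
  }
}

// Usage: round-trip a gzip-compressed payload
const original = Buffer.from('console.log("hello");', 'utf8');
const stored = zlib.gzipSync(original);
console.log(decodeBody(stored, 'gzip').toString('utf8'));
```

Throwing on an unknown encoding, rather than returning the raw bytes, keeps a still-compressed body from being saved as if it were text.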

massafilippo97 commented 1 year ago

The second option you provided is the perfect solution to my problem! I'll paste the adapted code here for reference, in case someone else needs it in the future:

const fse = require('fs-extra'); // provides outputFileSync, which creates missing directories

// results_dir is assumed to be defined elsewhere in the script.
hero.activeTab.on('resource', async (resource) => {
  // Skip redirects (Error: Response body is unavailable for redirect responses)
  if (resource.isRedirect) return;
  try {
    const buffer = await resource.buffer;
    // Skip binary content: only text survives a UTF-8 round trip unchanged
    if (Buffer.compare(Buffer.from(buffer.toString(), 'utf8'), buffer) !== 0) return;

    const contentType = resource.response.headers['content-type'];
    // Match the correct file extension before saving the file locally
    let extension;
    switch (contentType) {
      case 'application/javascript':
      case 'text/javascript':
        extension = '.js';
        break;
      case undefined:
        extension = '';
        break;
      default:
        extension = '.' + contentType.split('/')[1].split(';')[0];
        break;
    }
    const fileName = (Math.random() + 1).toString(36).substring(2) + extension;
    fse.outputFileSync(results_dir + '/files/' + fileName, buffer);
  } catch (e) {
    console.log(e);
    console.log('Resource download failed');
  }
});
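
One caveat about the extension matching in the snippet above: it compares the full content-type header, so a value such as text/javascript; charset=utf-8 falls through to the default branch and is saved as .javascript. The following is a minimal sketch of one way to handle that, with the text check pulled out into its own function. The names extensionForContentType and isUtf8Text are hypothetical, and the extension table only covers the two cases from the snippet.

```javascript
// Hypothetical helpers; the names and the extension table are illustrative.
const EXTENSIONS = new Map([
  ['application/javascript', '.js'],
  ['text/javascript', '.js'],
]);

function extensionForContentType(contentType) {
  if (!contentType) return '';
  // Drop parameters such as "; charset=utf-8" and normalize case
  const mimeType = contentType.split(';')[0].trim().toLowerCase();
  if (EXTENSIONS.has(mimeType)) return EXTENSIONS.get(mimeType);
  // Fall back to the MIME subtype, e.g. "text/html" -> ".html"
  const subtype = mimeType.split('/')[1];
  return subtype ? '.' + subtype : '';
}

// Same round-trip idea as the snippet above: text survives UTF-8 decoding
// and re-encoding unchanged, while arbitrary binary data usually does not.
function isUtf8Text(buffer) {
  return Buffer.from(buffer.toString('utf8'), 'utf8').equals(buffer);
}

// Usage
console.log(extensionForContentType('text/javascript; charset=utf-8')); // ".js"
console.log(extensionForContentType('text/html; charset=UTF-8')); // ".html"
console.log(isUtf8Text(Buffer.from('plain text', 'utf8'))); // true
console.log(isUtf8Text(Buffer.from([0xff, 0xd8, 0xff]))); // false
```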