sayem314 / hooman

http interceptor to hoomanize cloudflare requests
https://www.npmjs.com/package/hooman
MIT License

How to download an image behind cloudflare? #3

Closed linaspasv closed 4 years ago

linaspasv commented 4 years ago

I am trying the following and get a 415 (Unsupported Media Type) error.

const got = require("hooman")

got('https://c.pxhere.com/images/11/49/74e4a31de6abe70227fa1cb22d37-1612083.jpg!d')
    .then(response => {
        console.log(response);
    })
    .catch(error => {
        console.error(error);
    });
sayem314 commented 4 years ago

Set responseType to 'buffer'.

Docs: https://github.com/sindresorhus/got#responsetype

Example:

const { body } = await got(url, { responseType: 'buffer' });
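
A minimal sketch of how the buffer response could be written to disk. `saveImage` is a hypothetical helper, and `client` is assumed to be any got-compatible function (for example the export of `require("hooman")`) passed in by the caller; nothing here is taken from hooman's own code.

```javascript
const fs = require("fs");

// Hypothetical helper: `client` is assumed to be a got-compatible function,
// e.g. require("hooman"). It is injected so the helper has no hard dependency.
async function saveImage(client, url, destination) {
  // Ask for the raw bytes instead of a decoded string
  const { body } = await client(url, { responseType: "buffer" });
  if (!Buffer.isBuffer(body)) {
    throw new TypeError("Expected the response body to be a Buffer");
  }
  await fs.promises.writeFile(destination, body);
  // Return the number of bytes written so the caller can sanity-check the size
  return body.length;
}
```

Usage would look like `await saveImage(require("hooman"), url, "image.jpg")`, assuming the request itself succeeds.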
sayem314 commented 4 years ago

You can also pipe the stream to a file.

const { createWriteStream } = require("fs");
const got = require("hooman");

(async () => {
  const image = createWriteStream("image.jpg");
  got
    .stream("https://c.pxhere.com/images/11/49/74e4a31de6abe70227fa1cb22d37-1612083.jpg!d")
    .pipe(image);
})();
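
The snippet above does not report failures from either stream. A hedged variant using Node's `stream.pipeline` is sketched below; `streamToFile` is a hypothetical helper, and `client` is assumed to expose a got-style `.stream(url)` method. Whether the Cloudflare challenge is solved for streams depends on the hooman version in use, as discussed later in this thread.

```javascript
const { createWriteStream } = require("fs");
const { pipeline } = require("stream");
const { promisify } = require("util");

const pipelineAsync = promisify(pipeline);

// Hypothetical helper: `client` is assumed to be a got-compatible object
// with a .stream(url) method, e.g. require("hooman").
async function streamToFile(client, url, destination) {
  // pipeline() rejects if either stream fails and destroys both streams,
  // so an HTTP error is not silently swallowed
  await pipelineAsync(client.stream(url), createWriteStream(destination));
}
```

A caller could then wrap `await streamToFile(got, url, "image.jpg")` in try/catch to see the 4xx/5xx error instead of an empty file.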
linaspasv commented 4 years ago

Hm, still getting 415 (Unsupported Media Type) :/

const got = require("hooman");

let url = "https://c.pxhere.com/images/11/49/74e4a31de6abe70227fa1cb22d37-1612083.jpg!d"

got(url, { responseType: 'buffer' }).then(response => {
    console.log(response.body)
})
sayem314 commented 4 years ago

@linaspasv can you please try the stream example I posted afterwards? Also, this library is just a wrapper around got that bypasses the Cloudflare js-challenge, so request-related issues are best tested with the got library first and reported there.

linaspasv commented 4 years ago

With the stream example I get a 503 (Service Unavailable) error.

linaspasv commented 4 years ago

The image I am trying to access is behind Cloudflare's anti-DDoS protection page. I tried your script with a regular HTML page and it works perfectly, but it fails for a direct image download. I am not sure whether the issue is with this library or with the got library.

image

sayem314 commented 4 years ago

I tested it locally and it works fine for me. In fact, I went ahead and added this case to the tests, and it passes as well. https://github.com/sayem314/hooman/commit/f9aa0e5b048bdff50cb6a0b8de31ebc18ee83110

Edit: Travis > https://travis-ci.org/github/sayem314/hooman/jobs/685101120

linaspasv commented 4 years ago

Do you get challenged by Cloudflare? When I run the same code (see below) on an image that has no Cloudflare protection, it works perfectly. pxhere.com does not serve a challenge page to residential IP addresses, but when you run this on some servers you get challenged. :-)

const got = require('hooman');
const fs = require('fs');

(async () => {

    //let resource = 'https://c.pxhere.com/images/11/49/74e4a31de6abe70227fa1cb22d37-1612083.jpg!d';
    let resource = 'https://explorecams.com/storage/photos/GdmaI9FIbe_1600.jpg';

    got.stream(resource)
        .on('error', err => console.log(err))
        .pipe(
            fs.createWriteStream('image.jpg')
        )
})();
sayem314 commented 4 years ago

I tested on a residential IP, where their main domain did not throw a js-challenge, so I went ahead and tested on my own domain with the Cloudflare challenge enabled, where js-challenges are always thrown.

Test code:

const test = require("tape");
const scrape = require("hooman");
const { writeFileSync, statSync } = require("fs");

const jsChallengePage = "https://cf-js-challenge.sayem.eu.org";

// Test image download
test("sample image download", async t => {
  console.time("image download");
  const { body } = await scrape(jsChallengePage + "/images/background.jpg", {
    responseType: "buffer"
  });
  console.timeEnd("image download");

  // Write to file
  t.ok(Buffer.isBuffer(body));
  writeFileSync("image.jpg", body);

  // Check image size
  const { size } = statSync("image.jpg");
  t.equal(size, 31001);
});

Note that I removed the other tests to keep the test result fair.

Test result, with console logging for easier debugging: image

Here are the updated test code and results: https://github.com/sayem314/hooman/commit/9776a070fc12b2c69ce6169ceaf14671363c8dde

sayem314 commented 4 years ago

Here is a possible fix. I'm not sure what is causing the issue for you, but give this a try:

const got = require('hooman');
const fs = require('fs');

(async () => {
    await got('https://explorecams.com') // init cookie

    let resource = 'https://explorecams.com/storage/photos/GdmaI9FIbe_1600.jpg';
    got.stream(resource)
        .on('error', err => console.log(err))
        .pipe(
            fs.createWriteStream('image.jpg')
        )
})();
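
The same warm-up idea can be sketched without streams, which makes the order of the two requests explicit. `primeThenDownload` is a hypothetical helper and `client` is assumed to be a got-compatible function; a cookie set by the first request only carries over if the client keeps a cookie jar between calls, which the "init cookie" comment above suggests for hooman but which is not verified here.

```javascript
// Hypothetical helper: `client` is assumed to be a got-compatible function,
// e.g. require("hooman").
async function primeThenDownload(client, resource) {
  // Derive "https://host" from the full resource URL
  const origin = new URL(resource).origin;
  // Warm-up request against the site root, so any challenge is handled on an HTML page
  await client(origin);
  // Only then fetch the binary resource as a Buffer
  const { body } = await client(resource, { responseType: "buffer" });
  return body;
}
```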
linaspasv commented 4 years ago

No luck. Also, I tried running the same code without hooman (see the source code below) and I end up with the same Response code 503 (Service Temporary Unavailable) error. image

I have also tried plain curl and I get the challenge page code, so it seems your plugin is not triggered to solve the challenge page when I request this particular URL.

image

const got = require('got');
const fs = require('fs');

(async () => {
    let resource = 'https://c.pxhere.com/images/11/49/74e4a31de6abe70227fa1cb22d37-1612083.jpg!d';

    got.stream(resource)
        .on('error', err => console.log(err))
        .pipe(
            fs.createWriteStream('image.jpg')
        )
})();
sayem314 commented 4 years ago

Can you send me the HTML of the challenge page?

linaspasv commented 4 years ago

Okay, so it seems my challenge page ends up in .on('error') and your plugin somehow does not pick it up. The challenge page for the IMAGE is the same as for a regular HTML page, and that one works perfectly with your library!

const got = require('hooman');
const fs = require('fs');

(async () => {
    let resource = 'https://c.pxhere.com/images/11/49/74e4a31de6abe70227fa1cb22d37-1612083.jpg!d';

    got.stream(resource)
        .on('error', err => console.log(err.response.body))
        .pipe(
            fs.createWriteStream('image.jpg')
        )
})();

I get the following output now: cf-challenge.txt

I am also attaching the received headers for that page. image

linaspasv commented 4 years ago

It seems this might be why your afterResponse hook is not being triggered and I am seeing the following results.

image

sayem314 commented 4 years ago

I see, .stream() is unfortunately unsupported. But did you try responseType: 'buffer', as shown in hooman's test.js? By the way, your HTML is okay; hooman should be able to solve it without issue.

https://github.com/sayem314/hooman/blob/master/test.js#L41-L48

linaspasv commented 4 years ago

First of all, to make your library work with 'buffer', one first needs to convert the buffer to a string inside the afterResponse hook.

if (
  // If site is not hosted on cloudflare skip
  response.statusCode === 503 &&
  response.headers.server === "cloudflare" &&
  response.body.includes("jschl-answer")
) {
  let body = response.body instanceof Buffer
    ? response.body.toString()
    : response.body

  const data = await solve(response.url, body);

While this part is resolved, I still get a 415 (Unsupported Media Type) error when this line runs: https://github.com/sayem314/hooman/blob/9776a070fc12b2c69ce6169ceaf14671363c8dde/index.js#L42
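
The conversion described in this comment can be sketched as a standalone check. `bodyToString` and `looksLikeJsChallenge` are hypothetical helper names, not part of hooman; the three conditions mirror the snippet quoted above, and the response shape (`statusCode`, `headers`, `body`) is assumed to follow got's response object.

```javascript
// Hypothetical helper: normalize a got response body to a string,
// whether it arrived as a Buffer (responseType: "buffer") or as text.
function bodyToString(body) {
  if (body == null) return "";
  return Buffer.isBuffer(body) ? body.toString("utf8") : String(body);
}

// Hypothetical helper: same three conditions as the quoted hook,
// but applied to the normalized body.
function looksLikeJsChallenge(response) {
  return (
    response.statusCode === 503 &&
    response.headers.server === "cloudflare" &&
    bodyToString(response.body).includes("jschl-answer")
  );
}
```

With this shape, a solver would receive `bodyToString(response.body)` regardless of the requested `responseType`.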

sayem314 commented 4 years ago

Converting is not necessary in the hook, since it should only match when the response is an HTML page. Something must be wrong on your end :( I have tested it on multiple datacenter IPs and VPNs, and it works for me every time.

As you can see, the Travis CI tests are passing as well. They run from a shared datacenter IP, and my domain uses a custom filter to throw the Cloudflare challenge regardless of how clean your IP is.

andress134 commented 4 years ago

// Fixed

sayem314 commented 4 years ago

@andress134 can you be more specific about what you are trying to achieve? By the way, I guess your question is not related to this issue; for further discussion, please open a new issue with more details. Also, your code was unreadable, so I had to edit it a little.

Here is how you use a proxy, based on your code example.

const fs = require('fs'),
  got = require('hooman'),
  path = require('path'),
  HttpsProxyAgent = require('https-proxy-agent');

const target = process.argv[2],
  time = process.argv[3],
  req_per_ip = process.argv[4];

let proxies = fs
  .readFileSync(process.argv[5], 'utf-8')
  .replace(/\r/gi, '')
  .split('\n')
  .filter(Boolean);

function send_req() {
  // Keep the proxy address separate from the agent so it can still be found in the list
  const address = proxies[Math.floor(Math.random() * proxies.length)];
  const agent = new HttpsProxyAgent('http://' + address);

  // got() already returns a promise, so no extra Promise wrapper is needed
  return got(target, {
    agent: {
      https: agent,
    },
    cloudflareRetry: 10,
  })
    .then((response) => {
      console.log(response.body);
      return response;
    })
    .catch((error) => {
      // Drop the failing proxy; guard against -1 so an unrelated entry is never removed
      const index = proxies.indexOf(address);
      if (index !== -1) proxies.splice(index, 1);
      console.log(error.message);
      throw error;
    });
}

Proxy docs: https://github.com/sindresorhus/got#proxies
Proxy module: https://www.npmjs.com/package/https-proxy-agent

andress134 commented 4 years ago

// fixed

sayem314 commented 4 years ago

@andress134 the mentioned sites work fine in tests.

image

And please don't continue this discussion in this issue; create a new issue and I'll be happy to assist you.

The site returns a Cloudflare challenge for me in the browser, and I have verified that hooman successfully bypasses it.

sayem314 commented 4 years ago

Closing this issue as I was unable to reproduce it. By the way, I was also able to get .stream() to work; I will update the instructions in the readme.

linaspasv commented 4 years ago

@sayem314 thank you for your help. Looking forward to trying this with .stream(). :-)

sayem314 commented 4 years ago

@linaspasv docs updated for stream https://github.com/sayem314/hooman#pipe-stream