laurengarcia / url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.
https://www.npmjs.com/package/url-metadata
MIT License
166 stars · 43 forks

`npm run test` fails on checked out master #77

Closed: ajmas closed this issue 7 months ago

ajmas commented 7 months ago

I ran into an issue while testing my code (trying to make a commit for issue #76) and noticed it seems to be a problem on master, at the current head (d6aa7ba110ddfe4e20464ecae9baa03f35a756c8).

Running `npm run test` I get the following output:

» npm run test                                                                                     ajmas@ghostwalker-echo

> url-metadata@3.5.2 test
> jest --testPathIgnorePatterns=/test-debug/ && standard

 PASS  test/robots.test.js
 PASS  test/fail.test.js
 PASS  test/citations.test.js
 PASS  test/og.test.js
 FAIL  test/basic.test.js
  ● favicons

    expect(received).toBe(expected) // Object.is equality

    Expected: undefined
    Received: [Error: response code 403]

      59 |     expect(metadata.favicons[4].color).toBe('#000000')
      60 |   } catch (err) {
    > 61 |     expect(err).toBe(undefined)
         |                 ^
      62 |   }
      63 | })
      64 |

      at Object.toBe (test/basic.test.js:61:17)

 FAIL  test/options.test.js
  ● option: `ensureSecureImageRequest` edge cases

    expect(received).toBe(expected) // Object.is equality

    Expected: undefined
    Received: [Error: response code 403]

      41 |     })
      42 |   } catch (err) {
    > 43 |     expect(err).toBe(undefined)
         |                 ^
      44 |   }
      45 | })
      46 |

      at Object.toBe (test/options.test.js:43:17)

 PASS  test/json-ld.test.js
 PASS  test/decode.test.js

Test Suites: 2 failed, 6 passed, 8 total
Tests:       2 failed, 21 passed, 23 total
Snapshots:   0 total
Time:        2.98 s, estimated 5 s
Ran all test suites.

I looked into this and noticed that while the code gets a 403 response during the tests, the same page works fine when I test it in Chrome. I am wondering whether it comes down to a header the server is expecting, or something else?

Environment:

ajmas commented 7 months ago

I'm really not sure how to change the behaviour, since I am running both Node.js and Chrome on the same machine. Some sites suggest this may be an anti-bot mechanism, but I have tried passing in the same headers that Safari sends (based on www.whatismybrowser.com):

const userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.3.1 Safari/605.1.15'
const requestHeaders = {
  'User-Agent': userAgent,
  Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en-GB,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  Connection: 'keep-alive',
  'Upgrade-Insecure-Requests': '1',
  'Sec-Fetch-Dest': 'document',
  'Sec-Fetch-Mode': 'navigate',
  'Sec-Fetch-Site': 'none',
  'Cache-Control': 'max-age=0',
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE, OPTIONS'
}
test('no error when favicons missing from page', async () => {
  const url = 'https://www.crypto51.app/'
  try {
    const metadata = await urlMetadata(url, { requestHeaders })
    expect(metadata.favicons.length).toBe(0)
  } catch (err) {
    expect(err).toBe(undefined)
  }
})

test('favicons', async () => {
  const url = 'https://www.bbc.com/news/uk-england-somerset-68179350'
  try {
    const metadata = await urlMetadata(url, { requestHeaders })
    console.log(metadata)
    expect(metadata.favicons.length).toBe(5)
    expect(metadata.favicons[0].rel).toBe('apple-touch-icon')
    // Safari pinned tab 'mask-icons' can have 'color' attribute:
    expect(metadata.favicons[4].rel).toBe('mask-icon')
    expect(metadata.favicons[4].color).toBe('#000000')
  } catch (err) {
    expect(err).toBe(undefined)
  }
})

BTW, in the meantime I skipped the Husky pre-commit checks (`git commit index.d.ts -n -m "..."`) for my PR in the other issue, since the error is likely just local?

laurengarcia commented 7 months ago

Yeah, this is probably because the test URLs think you're botting. It works for me locally, but I have noticed intermittent issues as the test suite has gotten larger.

One thing I can do is consolidate the tests a bit so the suite isn't querying the same URL multiple times. I noticed the tests got flaky for me too once the suite got bigger, but I never trigger bot detection (403/404 errors) when running from my local CLI.
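For what it's worth, one way to consolidate could be a small helper that memoizes metadata lookups per URL, so the whole suite only hits each remote URL once. This is just a sketch, and `cachedMetadata` and its `fetcher` parameter are hypothetical, not part of url-metadata's API:

```javascript
// Hypothetical helper: cache metadata lookups per URL so the test suite
// only hits each remote URL once, even across concurrent tests.
const cache = new Map()

async function cachedMetadata (url, fetcher) {
  if (!cache.has(url)) {
    // Store the promise itself so concurrent callers share one in-flight request.
    cache.set(url, fetcher(url))
  }
  return cache.get(url)
}
```

Tests would then call something like `cachedMetadata(url, urlMetadata)` instead of calling `urlMetadata(url)` directly.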

laurengarcia commented 7 months ago

I can't seem to reproduce the errors you're seeing; the problem seems limited to the news sites linked in the tests, which think you're botting. Your IP address might have been jailed if you were running the puppeteer scripts you showed earlier against those URLs, but I have no idea really. If you have more details on reproducing this that you think could help, feel free to reopen.

ajmas commented 7 months ago

I wasn't touching those URLs beyond the test suite, and I have seen similar issues with some sites that work locally but not in our Kubernetes cluster.

But given the nature of the issue, closing this ticket is fair.

laurengarcia commented 7 months ago

My top guess would be #1 on this list: https://dev.to/princepeterhansen/7-ways-to-avoid-getting-blocked-or-blacklisted-when-web-scraping-45ii
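A common mitigation from lists like that one is simply spacing requests out so scrapers look less like bots. A minimal sketch, where the `fetchMetadata` callback and the `delayMs` default are assumptions rather than url-metadata API:

```javascript
// Hypothetical sketch: fetch metadata for several URLs sequentially with a
// delay between requests, to avoid hammering a server and tripping bot detection.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function fetchAllWithDelay (urls, fetchMetadata, delayMs = 1000) {
  const results = []
  for (const url of urls) {
    results.push(await fetchMetadata(url))
    await sleep(delayMs) // pause before the next request
  }
  return results
}
```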