postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

mercury times out trying to access www.nasdaq.com (likely user-agent-ish related) #600

Open thoraxe opened 3 years ago

thoraxe commented 3 years ago

~/node_modules/.bin/mercury-parser https://www.nasdaq.com/articles/voyager-digital-reports-75-revenue-rise-in-q4-cites-increased-crypto-adoption-2021-01-05

Expected Behavior

Mercury Parser should extract the content of the web page.

Current Behavior

Mercury Parser encountered a problem trying to parse that resource.

Error: ESOCKETTIMEDOUT
    at ClientRequest.<anonymous> (/home/thoraxe/node_modules/postman-request/request.js:1094:19)
    at Object.onceWrapper (events.js:421:28)
    at ClientRequest.emit (events.js:315:20)
    at TLSSocket.emitRequestTimeout (_http_client.js:784:9)
    at Object.onceWrapper (events.js:421:28)
    at TLSSocket.emit (events.js:327:22)
    at TLSSocket.Socket._onTimeout (net.js:483:8)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7) {
  code: 'ESOCKETTIMEDOUT',
  connect: false
}
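For reading the error above: in the `request` library (which `postman-request` is forked from), `connect === true` on a timeout error marks a connection timeout, while `false` or `undefined` means the socket was opened but no response arrived in time. The following is a minimal sketch of that check, assuming `postman-request` keeps the same convention; `describeTimeout` is a hypothetical helper, not part of Mercury Parser.

```javascript
// Hypothetical helper: classify a timeout error raised by request/postman-request.
function describeTimeout(err) {
  if (!err || (err.code !== 'ETIMEDOUT' && err.code !== 'ESOCKETTIMEDOUT')) {
    return 'not a timeout';
  }
  // `connect === true` marks a connection timeout; anything else means the
  // connection was established but the response did not arrive in time.
  return err.connect === true ? 'connection timeout' : 'read timeout';
}

// Error shape copied from the trace in this report.
const reported = Object.assign(new Error('ESOCKETTIMEDOUT'), {
  code: 'ESOCKETTIMEDOUT',
  connect: false,
});

console.log(describeTimeout(reported));
```

Under that assumption, the reported error is a read timeout: the server accepted the connection and then never answered, which fits the suspicion that it stalls clients it does not recognize as browsers.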

Steps to Reproduce

  1. ~/node_modules/.bin/mercury-parser https://www.nasdaq.com/articles/voyager-digital-reports-75-revenue-rise-in-q4-cites-increased-crypto-adoption-2021-01-05

Detailed Description

It appears that nasdaq may be identifying us as a "non-browser" client and refusing to serve the page. curl gets an "Access denied" response, and Lynx never loads the page. The link definitely works.

Possible Solution

I'm not sure what the nasdaq server does to "identify" that we're not a real browser, but the request definitely does not go through. I also tried spoofing a user agent:

~/node_modules/.bin/mercury-parser --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1" https://www.nasdaq.com/articles/voyager-digital-reports-75-revenue-rise-in-q4-cites-increased-crypto-adoption-2021-01-05

I got the same timeout.
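A possible next experiment is to send a fuller set of browser-like headers rather than only `User-Agent`. The sketch below builds the CLI arguments for that, following the `--header.Name=value` pattern from the command above. `toHeaderFlags` and `buildArgs` are hypothetical helpers, the `Accept` and `Accept-Language` values are illustrative assumptions, and it is not known whether any header combination changes the server's behavior.

```javascript
// Hypothetical helper: turn a headers object into mercury-parser CLI flags,
// using the --header.Name=value form shown in this report.
function toHeaderFlags(headers) {
  return Object.entries(headers).map(
    ([name, value]) => `--header.${name}=${value}`
  );
}

// Hypothetical helper: full argument list, with the header flags before the URL.
function buildArgs(url, headers) {
  return [...toHeaderFlags(headers), url];
}

// Illustrative header values (assumptions, not a known-good combination).
const headers = {
  'User-Agent':
    'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1',
  Accept: 'text/html,application/xhtml+xml',
  'Accept-Language': 'en-US,en;q=0.9',
};

const url =
  'https://www.nasdaq.com/articles/voyager-digital-reports-75-revenue-rise-in-q4-cites-increased-crypto-adoption-2021-01-05';

console.log(buildArgs(url, headers));
```

Passing the resulting array to something like `child_process.execFile` rather than pasting it into a shell avoids quoting problems with the spaces and parentheses in the user-agent string.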