Closed fappelman closed 5 years ago
@fappelman I wasn't able to reproduce this — it parsed as expected when I tried it. Is there any other information you can give me that might be able to narrow it down? Or... is anyone else able to reproduce this?
I tried this out, both with the yarn global executable and a locally built version. Worked with both:
$ mercury-parser https://tweakers.net/nieuws/149154/nederlandse-supercomputerbouwer-clustervision-is-failliet.html
{
"title": "Nederlandse supercomputerbouwer ClusterVision is failliet",
"author": null,
"date_published": null,
"dek": null,
"lead_image_url": "https://tweakers.net/i/7XYVzx_PYqxvpvf-IPUZKSIu-lY=/fit-in/67x67/i/1340023128.png?f=fpa",
"content": "<div class=\"article largeWidth\"><p class=\"lead\">De Nederlandse supercomputerbouwer ClusterVision is door de rechter failliet verklaard. Het bedrijf heeft meer dan twintig keer een notering gehaald in de Top500 van 's werelds krachtigste computers.</p><p>In het insolventieregister is te zien dat de rechter zowel de <a href=\"https://insolventies.rechtspraak.nl/#!/details/13.ams.19.55.F.1300.1.19\">besloten vennootschap ClusterVision</a> als de <a href=\"https://insolventies.rechtspraak.nl/#!/details/13.ams.19.54.F.1300.1.19\">gelijknamige holding</a> op 12 februari failliet heeft verklaard. De rechter heeft ook het faillissement uitgesproken over het op hetzelfde adres gevestigde bedrijf <a href=\"https://insolventies.rechtspraak.nl/#!/details/13.ams.19.56.F.1300.1.19\">MPCI Group</a>, een groothandel in computers, randapparatuur en software. De door de rechtbank aangestelde curator was onbereikbaar voor commentaar.</p><p>ClusterVision begon in 2002 met het bouwen van supercomputers, <a href=\"https://www.sprout.nl/lijsten/challenger-50-van-2012/clustervision\">schrijft Sprout</a>. Het bedrijf koppelde pc's of servers aan elkaar om de supercomputers te bouwen. Het hoofdkantoor <a href=\"https://www.clustervision.com/about-the-company/\">van ClusterVision</a> staat in Amsterdam en het bedrijf heeft kantoren in het Verenigd Koninkrijk, Frankrijk en Duitsland. Volgens het bedrijf hebben hun supercomputers ook in de top twintig gestaan.</p><p>Ongeveer acht jaar geleden had het bedrijf zeventig werknemers en een omzet van tussen de tien en vijftig miljoen euro. Deze cijfers zijn wel inclusief zusterbedrijf Bright Computing, gespecialiseerd in software voor de supercomputers. Bright Computing is voor zover bekend niet failliet verklaard. 
Volgens <a href=\"https://nl.linkedin.com/in/alex-ninaber-20b9331\">de LinkedIn-pagina van CEO en medeoprichter Alex Ninaber</a> heeft het bedrijf vierhonderd installaties gebouwd bij tweehonderd klanten.</p><p>In november <a href=\"https://www.clustervision.com/nsc-linkoping-top500/\">kondigde het bedrijf aan</a> te beginnen aan de volgende fase van Tetralith, een supercomputer van de Zweedse Universiteit van Linköping. Na de volledige installatie had het de krachtigste supercomputer van de Noordse landen moeten worden, met een theoretische maximum kracht van vier petaflops. De bouw had in januari moeten beginnen. Het is onduidelijk wat het faillissement voor gevolgen heeft voor de bouw van de supercomputer van de universiteit.</p></div>",
"next_page_url": null,
"url": "https://tweakers.net/nieuws/149154/nederlandse-supercomputerbouwer-clustervision-is-failliet.html",
"domain": "tweakers.net",
"excerpt": "De Nederlandse supercomputerbouwer ClusterVision is door de rechter failliet verklaard. Het bedrijf heeft meer dan twintig keer een notering gehaald in de Top500 van 's werelds krachtigste computers.",
"word_count": 267,
"direction": "ltr",
"total_pages": 1,
"rendered_pages": 1
}
Thanks for getting back to me. This is weird. I have created a small Docker example that shows the problem.
$ tree .
.
├── docker-compose.yml
└── mercury-test
└── Dockerfile
The docker-compose file has the following content:
$ cat docker-compose.yml
version: '3'
services:
mercury-test:
build: mercury-test
container_name: mercury-test
hostname: mercury-test
And the Dockerfile has the following content:
$ cat mercury-test/Dockerfile
FROM node:8
RUN yarn add @postlight/mercury-parser
CMD [ "/node_modules/.bin/mercury-parser", \
"https://tweakers.net/nieuws/149154/nederlandse-supercomputerbouwer-clustervision-is-failliet.html" ]
And then if I run it:
$ docker-compose stop mercury-test && docker-compose rm -f mercury-test && docker-compose build && docker-compose up
Going to remove mercury-test
Removing mercury-test ... done
Building mercury-test
Step 1/3 : FROM node:8
---> 4f01e5319662
Step 2/3 : RUN yarn add @postlight/mercury-parser
---> Using cache
---> c4da85118659
Step 3/3 : CMD [ "/node_modules/.bin/mercury-parser", "https://tweakers.net/nieuws/149154/nederlandse-supercomputerbouwer-clustervision-is-failliet.html" ]
---> Using cache
---> 6b612b77e3c6
Successfully built 6b612b77e3c6
Successfully tagged mercury-test_mercury-test:latest
Creating mercury-test ... done
Attaching to mercury-test
mercury-test |
mercury-test | Mercury Parser encountered a problem trying to parse that resource.
mercury-test |
mercury-test | TypeError: $ is not a function
mercury-test | at /node_modules/@postlight/mercury-parser/dist/mercury.js:6079:12
mercury-test | at Array.find (native)
mercury-test | at detectByHtml (/node_modules/@postlight/mercury-parser/dist/mercury.js:6078:46)
mercury-test | at getExtractor (/node_modules/@postlight/mercury-parser/dist/mercury.js:6090:60)
mercury-test | at Object._callee$ (/node_modules/@postlight/mercury-parser/dist/mercury.js:6458:27)
mercury-test | at tryCatch (/node_modules/regenerator-runtime/runtime.js:62:40)
mercury-test | at Generator.invoke [as _invoke] (/node_modules/regenerator-runtime/runtime.js:288:22)
mercury-test | at Generator.prototype.(anonymous function) [as next] (/node_modules/regenerator-runtime/runtime.js:114:21)
mercury-test | at asyncGeneratorStep (/node_modules/@babel/runtime-corejs2/helpers/asyncToGenerator.js:5:24)
mercury-test | at _next (/node_modules/@babel/runtime-corejs2/helpers/asyncToGenerator.js:27:9)
mercury-test |
mercury-test | If you believe this was an error, please file an issue at:
mercury-test |
mercury-test | https://github.com/postlight/mercury-parser/issues/new
mercury-test |
mercury-test exited with code 1
which confirms the problem. I have attached a ZIP file with the above source code.
@adampash I've started parsing a lot of content with self-hosted mercury and am now seeing this error frequently. Here's one that triggers it in production and locally for me:
mercury-parser https://glo.bo/2Gv3o3I
@fappelman I see the same error. It seems to happen because the server returns HTTP 202 instead of HTTP 200. I tried a quick-and-dirty hack to make validateResponse accept 202 as well. That fixed the error, but now the content is just a message in Dutch about cookies.
@benubois Huh, that link isn't triggering it for me. Are you or @fappelman seeing the same thing that @prgm767 found here? Any chance your error is also cropping up due to a non-200 response?
Fwiw, @fappelman, I did try the docker setup you shared above and somehow it's working fine for me.
Creating network "test-merc_default" with the default driver
Creating mercury-test ... done
Attaching to mercury-test
mercury-test | {
mercury-test | "title": "Nederlandse supercomputerbouwer ClusterVision is failliet",
mercury-test | "author": null,
mercury-test | "date_published": null,
mercury-test | "dek": null,
mercury-test | "lead_image_url": "https://tweakers.net/i/7XYVzx_PYqxvpvf-IPUZKSIu-lY=/fit-in/67x67/i/1340023128.png?f=fpa",
mercury-test | "content": "<div class=\"article largeWidth\"><p class=\"lead\">De Nederlandse supercomputerbouwer ClusterVision is door de rechter failliet verklaard. Het bedrijf heeft meer dan twintig keer een notering gehaald in de Top500 van 's werelds krachtigste computers.</p><p>In het insolventieregister is te zien dat de rechter zowel de <a href=\"https://insolventies.rechtspraak.nl/#!/details/13.ams.19.55.F.1300.1.19\">besloten vennootschap ClusterVision</a> als de <a href=\"https://insolventies.rechtspraak.nl/#!/details/13.ams.19.54.F.1300.1.19\">gelijknamige holding</a> op 12 februari failliet heeft verklaard. De rechter heeft ook het faillissement uitgesproken over het op hetzelfde adres gevestigde bedrijf <a href=\"https://insolventies.rechtspraak.nl/#!/details/13.ams.19.56.F.1300.1.19\">MPCI Group</a>, een groothandel in computers, randapparatuur en software. De door de rechtbank aangestelde curator was onbereikbaar voor commentaar.</p><p>ClusterVision begon in 2002 met het bouwen van supercomputers, <a href=\"https://www.sprout.nl/lijsten/challenger-50-van-2012/clustervision\">schrijft Sprout</a>. Het bedrijf koppelde pc's of servers aan elkaar om de supercomputers te bouwen. Het hoofdkantoor <a href=\"https://www.clustervision.com/about-the-company/\">van ClusterVision</a> staat in Amsterdam en het bedrijf heeft kantoren in het Verenigd Koninkrijk, Frankrijk en Duitsland. Volgens het bedrijf hebben hun supercomputers ook in de top twintig gestaan.</p><p>Ongeveer acht jaar geleden had het bedrijf zeventig werknemers en een omzet van tussen de tien en vijftig miljoen euro. Deze cijfers zijn wel inclusief zusterbedrijf Bright Computing, gespecialiseerd in software voor de supercomputers. Bright Computing is voor zover bekend niet failliet verklaard. 
Volgens <a href=\"https://nl.linkedin.com/in/alex-ninaber-20b9331\">de LinkedIn-pagina van CEO en medeoprichter Alex Ninaber</a> heeft het bedrijf vierhonderd installaties gebouwd bij tweehonderd klanten.</p><p>In november <a href=\"https://www.clustervision.com/nsc-linkoping-top500/\">kondigde het bedrijf aan</a> te beginnen aan de volgende fase van Tetralith, een supercomputer van de Zweedse Universiteit van Linköping. Na de volledige installatie had het de krachtigste supercomputer van de Noordse landen moeten worden, met een theoretische maximum kracht van vier petaflops. De bouw had in januari moeten beginnen. Het is onduidelijk wat het faillissement voor gevolgen heeft voor de bouw van de supercomputer van de universiteit.</p></div>",
mercury-test | "next_page_url": null,
mercury-test | "url": "https://tweakers.net/nieuws/149154/nederlandse-supercomputerbouwer-clustervision-is-failliet.html",
mercury-test | "domain": "tweakers.net",
mercury-test | "excerpt": "De Nederlandse supercomputerbouwer ClusterVision is door de rechter failliet verklaard. Het bedrijf heeft meer dan twintig keer een notering gehaald in de Top500 van 's werelds krachtigste computers.",
mercury-test | "word_count": 267,
mercury-test | "direction": "ltr",
mercury-test | "total_pages": 1,
mercury-test | "rendered_pages": 1
mercury-test | }
mercury-test exited with code 0
Whatever is causing this, it'd be nice to narrow it down and if nothing else handle the error more gracefully.
If I check the URL with curl I get a normal 200 and not 202. It is really weird that the docker setup didn't work out. Maybe somehow the docker host makes the difference here? I am puzzled about that one.
I'm seeing inconsistent results. curl always says 200. wget sometimes says 200 and sometimes 202. Maybe there are multiple servers behind a load-balancer.
Just to be certain, I ran wget 10 times and got 200 in all cases.
@adampash,
Looks like the error does come and go. The example I gave works for me now too. I'll add some logging to production to see if there's a pattern with the response code.
Yes, indeed it comes and goes on that webserver...
I can however reproduce it reliably by deliberately returning a 202:
#!/usr/bin/env perl
use v5.24;
use strict;
use warnings;
use utf8;
use Dancer;
use autodie;
get '/fail' => sub {
status 'Accepted';
return <<EOF;
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Title</h1>
body
<address>address</address>
</body></html>
EOF
};
get '/ok' => sub {
status 'ok';
return <<EOF;
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Title</h1>
body
<address>address</address>
</body></html>
EOF
};
dance;
In my case it does not seem to be a 202 problem. But maybe fixing that problem will fix my case as well. I will also try running this on a different Docker host to see whether that changes anything for me.
Looks like this is related to non-200 status codes. In production, every time I see the error, it's preceded by a 4xx or 5xx status code. A 202 status code might include valid data? If so it could be added here.
202 ACCEPTED The request has been accepted for processing, but the processing has not been completed. The request might or might not eventually be acted upon, as it might be disallowed when processing actually takes place.
There's nothing mercury can do about 4xx-5xx errors, but there is something wrong with the error handling.
I'm not familiar with how babel works, but I'm guessing the try/catch block is being mangled in some way so it doesn't actually return the error.
I have a failing test case here that I'll look into a bit more.
This would also explain why the error is intermittent/machine dependent. I see a lot of 403 errors pop in and out depending on where the request is coming from. For example my home machine might get a 200 but a Digital Ocean droplet with a flagged IP might get a 403.
Thanks @benubois, really appreciate you looking into this.
Turns out it was way simpler than that. Mercury was proceeding even if the http request resulted in an error. The function needed to return earlier.
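For anyone following along, here is a minimal sketch of the failure mode described above. It is not Mercury's actual source: `fetchResult`, `loadDocument`, `parseWithoutGuard` and `parseWithGuard` are hypothetical names made up for illustration. It only shows how skipping the early return on a failed request can surface as `TypeError: $ is not a function`.

```javascript
// Hypothetical sketch, not Mercury's actual code. All names are illustrative.

// Stand-in for a fetch step that reports failures as a plain object
// instead of throwing.
function fetchResult(status, body) {
  if (status !== 200) {
    return { error: true, message: `Unexpected status code ${status}` };
  }
  return { error: false, body };
}

// Stand-in for an HTML loader that returns a query function.
function loadDocument(body) {
  return (selector) => (body.includes(`<${selector}>`) ? selector : null);
}

// Buggy shape: the error result is not returned early, so `$` stays
// undefined and calling it throws a TypeError.
function parseWithoutGuard(result) {
  const $ = result.error ? undefined : loadDocument(result.body);
  return { title: $('title') };
}

// Fixed shape: return the error as soon as the request is known to have failed.
function parseWithGuard(result) {
  if (result.error) {
    return { error: true, message: result.message, failed: true };
  }
  const $ = loadDocument(result.body);
  return { title: $('title') };
}
```

With the guard in place, a 403 or 5xx response produces an error object rather than a stack trace from deeper in the extractor.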
It appears that #285 was merged into postlight:master 7 hours ago. I just rebuilt my Docker image by cloning the repository:
RUN git clone https://github.com/postlight/mercury-parser.git
RUN yarn cache clean
RUN yarn add file:./mercury-parser
That should have given me that updated version, shouldn't it?
The build still ends in the same error. Judging from a visual inspection of the stack trace, nothing has changed.
I don’t think a new release has been cut though.
Maybe try running yarn build after cloning?
If the error still happens after this, what you’re seeing is probably unrelated to what I found.
The last statement, yarn add file:./mercury-parser, actually does a build. So I believe this is indeed not the same problem, just the same error message.
@fappelman you would still need to create a build locally. Because a new release has not been made yet, the committed distributable doesn't contain this fix.
I can confirm that building it using the steps:
RUN yarn cache clean
RUN git clone https://github.com/postlight/mercury-parser.git
RUN cd mercury-parser && yarn
RUN cd mercury-parser && yarn install
RUN cd mercury-parser && yarn build
RUN cd mercury-parser && yarn add file:.
Does indeed change the error message:
mercury-test | {
mercury-test | "error": true,
mercury-test | "messages": "The url parameter passed does not look like a valid URL. Please check your data and try again.",
mercury-test | "failed": true
mercury-test | }
mercury-test exited with code 0
The URL that I use https://tweakers.net/nieuws/149154/nederlandse-supercomputerbouwer-clustervision-is-failliet.html is valid as far as I can tell. Curl has no issue with it.
@fappelman are you consistently getting this error, with no successful parse attempts? I wasn't able to reproduce this issue on the URL that you provided.
I have run it about 10 times and they all failed consistently.
I checked the response code using Curl which is 200.
Should I reopen a ticket? This ticket is now closed but not resolved.
@fappelman is it just pages on this site? Are there any other URLs that are consistently failing and ones that are consistently successfully parsed? I'm not able to reproduce this
I just ran a test that included 1111 URLs. Of those, 178 failed, and 175 of the failures were on the same website. So somehow the website seems to be the issue.
The other 3 failing URLs all ended in a 404.
I have not seen a single succeeding URL from that domain. I manually tested a few using curl and they all returned a 200 with data.
Here are a few other failing URLs from that domain:
https://tweakers.net/nieuws/150158/htc-belooft-android-9-updates-voor-zijn-u11-u11+-en-u12+-smartphones.html
https://tweakers.net/nieuws/150156/kpn-krijgt-last-onder-dwangsom-voor-onterechte-toeslagen-bij-gespreksafgifte.html
https://tweakers.net/nieuws/150162/vn-expert-uit-forse-kritiek-op-nieuwe-eu-auteursrechtrichtlijn-en-uploadfilters.html
https://tweakers.net/nieuws/150160/belgische-werkgeversorganisatie-vbo-dient-klacht-in-tegen-privacyregels.html
https://tweakers.net/reviews/6926/straf-voor-te-langzame-updates-interview-met-google-over-android-one.html
https://tweakers.net/nieuws/150166/lek-in-zwitsers-stemsysteem-maakt-heimelijk-wijzigen-van-stemmen-mogelijk.html
https://tweakers.net/nieuws/150170/xiaomi-komt-officieel-naar-de-benelux.html
https://tweakers.net/geek/150164/het-wereldwijde-web-is-dertig-jaar-geleden-bedacht.html
https://tweakers.net/geek/150168/bugatti-onthult-3d-geprinte-elektrische-versie-van-klassieke-raceauto-uit-1924.html
https://tweakers.net/nieuws/150178/imec-ontwikkelt-3d-nanokippengaas-voor-efficientere-accus.html
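A per-domain tally like the one described above can be produced with a few lines of Node. This is only a sketch: `countFailuresByDomain` is a hypothetical helper, and it assumes you already have the failing URLs collected in an array.

```javascript
// Hypothetical helper for grouping failed URLs by hostname.
function countFailuresByDomain(failedUrls) {
  const counts = new Map();
  for (const rawUrl of failedUrls) {
    let host;
    try {
      // The WHATWG URL class is a global in current Node versions.
      host = new URL(rawUrl).hostname;
    } catch (err) {
      host = '(invalid URL)';
    }
    counts.set(host, (counts.get(host) || 0) + 1);
  }
  // Sort so the domain with the most failures comes first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

Feeding it the list of failed URLs makes it easy to see whether failures cluster on one site, as they did here.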
This webserver is sending different responses to the same request, depending on some unknown factor, perhaps cookie, user-agent or just random based on which backend the load-balancer picked.
How about changing mercury-parser to return the actual error it got from the server?
This would make it possible to debug what the problem is.
Preferably return status code, status message, response-body, headers. Ideally also return the exact request that was used.
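As a rough illustration of that suggestion, an error result could carry the request and response details along. The sketch below is an assumption about what such an object might look like: `describeFailedFetch`, `MAX_BODY_CHARS` and all field names are hypothetical and are not the shape Mercury Parser returns.

```javascript
// Hypothetical helper: the field names are illustrative assumptions,
// not Mercury Parser's actual error format.
const MAX_BODY_CHARS = 500;

function describeFailedFetch(request, response) {
  const body = String(response.body || '');
  return {
    error: true,
    failed: true,
    message: `Resource returned status ${response.statusCode} ${response.statusMessage}`,
    // The exact request that was sent, for reproducing the failure.
    request: {
      method: request.method,
      url: request.url,
      headers: request.headers,
    },
    response: {
      statusCode: response.statusCode,
      statusMessage: response.statusMessage,
      headers: response.headers,
      // Truncate so a large error page does not flood the logs.
      body: body.length > MAX_BODY_CHARS ? `${body.slice(0, MAX_BODY_CHARS)}...` : body,
    },
  };
}
```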
I think this would be a great idea in general. However, looking at the error message "The url parameter passed does not look like a valid URL.", would a message like this be the result of a website response? I would have expected it to be generated by the package itself rather than derived from a server response.
Hi guys, I'm facing the same error with the Node version of the parser. I don't think this issue is fixed. Some URLs from my log that may help:
TypeError: $ is not a function - http://jp.wazap.com/find/forumMessage.wz?id=18199155
TypeError: $ is not a function - https://www.dealgott.de/2019/24-zoll-full-hd-led-monitor-dell-s2419h-fuer-12990euro-vergleich-14871euro/
TypeError: $ is not a function - https://www.dealgott.de/2019/weinvorteil-10-extra-rabatt-auf-mehr-als-500-weine/
TypeError: $ is not a function - https://www.dealgott.de/2019/playmobil-action-polar-ranger-hauptquartier-9055-fuer-2194euro-vergleich-2995euro/
TypeError: $ is not a function - https://www.dealgott.de/2019/hot-wheels-fabrik-rennbahn-fuer-2094euro-vergleich-2994euro/
In truth, thousands of URLs failed with the same error. I think the expected response would be a 4xx and an error message explaining why the page can't be parsed.
Thanks for the thorough tests @fappelman ! I agree with @prgm767 that this is in fact a web server issue, and I feel like the country and ISP are possible factors here. The user agent on its own is not a sufficient factor since all Mercury parse requests are sent with the same user agent string.
I'm working on a change for returning more specific error responses, which would in fact help in debugging failed parse attempts in the future.
@fappelman , the error message "The url parameter passed does not look like a valid URL." is in fact being returned for failed parse attempts as well (not only invalid input URLs, such as "foo.x"), which is also the case for broken links, invalid HTTP status codes and other server errors: https://github.com/postlight/mercury-parser/blob/master/src/resource/utils/fetch-resource.js#L122 , so that's what I'll be changing.
@yuri-karkh the mentioned fix for the TypeError: $ is not a function issue has not been released yet, though it has been merged into master. Are you using the current release of the package ( v2.0.0 https://github.com/postlight/mercury-parser/releases/tag/v2.0.0 ) or did you clone the repository and create your own build locally?
I am running my tests using a clone from master. I just retested one of the problem URLs and I don't see anything different. The response is identical. Should I have seen something different?
@fappelman the change that I had mentioned in my comment to you is not done yet and hasn't been merged. I will post an update here once that is done.
My last comment, to yuri-karkh, was related to the first fix that has already been merged, which handles the first type of error, TypeError: $ is not a function, that was being returned (a reference to my previous comment: https://github.com/postlight/mercury-parser/issues/279#issuecomment-469162788 ), since, technically, that specific error isn't supposed to be returned anymore.
Appreciated.
I have v2.0. @toufic-m , just checked.
"dependencies": { "@postlight/mercury-parser": "^2.0.0", "babel-preset-env": "^1.7.0", "babel-register": "^6.26.0", "request": "^2.88.0" }
The change for returning more specific errors on failed parse attempts has been merged into master and will be included in the next release.
Thanks. I just reran my test against master, which indeed returns a much clearer error message:
mercury-test | {
mercury-test | "error": true,
mercury-test | "message": "Resource returned a response status code of 202 and resource was instructed to reject non-2xx level status codes.",
mercury-test | "failed": true
mercury-test | }
mercury-test exited with code 0
Which leads to the question: shouldn't that have passed? If not, is there a way to instruct the parser to accept 202?
Should I open a new ticket for this follow-up question? The return code is 202, and the message says the resource was instructed to reject non-2xx codes. Since 202 is a 2xx code, I think it should not have been rejected. Is that correct?
@fappelman I just updated the error message to indicate that the "resource was instructed to reject non-200 status codes" (#342).
The 202 status code is normally used on async operations that are typically requested using POST or DELETE, and is used to indicate that the processing for your request is not done/fulfilled. The body for 202 responses is expected to contain:
the request's current status and point to (or embed) a status monitor that can provide the user with an estimate of when the request will be fulfilled
as opposed to the actual content.
In the cases where a 202 status is returned on GET requests, it would be that the server perhaps needs some time to finish processing, and has a limited time to respond to your request. This could be a result of using a load balancer for example.
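To make the wording difference concrete, here is a small sketch contrasting a "non-2xx" check with a "non-200" check. Both predicates are illustrative only and are not taken from Mercury Parser's source. They just show why a 202 would pass one check and fail the other.

```javascript
// Illustrative predicates only; neither is Mercury Parser's actual check.

// Accepts any success-class status code (200-299), so 202 passes.
function isSuccessClass(statusCode) {
  return statusCode >= 200 && statusCode < 300;
}

// Accepts only 200 OK, so 202 is rejected.
function isExactlyOk(statusCode) {
  return statusCode === 200;
}

// Print how each check treats a few status codes seen in this thread.
for (const code of [200, 202, 403]) {
  console.log(code, isSuccessClass(code), isExactlyOk(code));
}
```

An error message that says "non-2xx" describes the first predicate, while the behavior reported in this thread matches the second.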
Thanks. Appreciated. I think this closes the topic for me.
Just want to point out that the non-public version of Mercury, the one that will soon go out of production, does in fact work with the example URL. So apparently we are on a different branch.
Platform:
Darwin Mac-Pro.local 18.2.0 Darwin Kernel Version 18.2.0: Thu Dec 20 20:46:53 PST 2018; root:xnu-4903.241.1~1/RELEASE_X86_64 x86_64
Linux rss 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 x86_64 x86_64 GNU/Linux
Mercury Parser Version: Latest. Installed with
yarn global add @postlight/mercury-parser
Node Version (if a Node bug): v11.10.0 (I don't think this is a node bug)
Also shows in serverless
Browser Version (if a browser bug): n.a.
Expected Behavior
A parsed page and not an error
Current Behavior
Steps to Reproduce
Also in API
Not certain if this is helpful or not. This problem also happens when using the API variant of mercury. As far as I know pages from this site DID work for the online version.