Response {
size: 0,
timeout: 0,
[Symbol(Body internals)]: {
body: Gunzip {
_writeState: [Uint32Array],
_readableState: [ReadableState],
_events: [Object: null prototype],
_eventsCount: 5,
_maxListeners: undefined,
_writableState: [WritableState],
allowHalfOpen: true,
bytesWritten: 0,
_handle: [Zlib],
_outBuffer: <Buffer 00 00 00 00 00 00 00 00 e0 00 00 00 00 00 00 00 00 82 00 ff ea 7f 00 00 44 00 04 00 04 00 04 00 e0 e5 05 ff ea 7f 00 00 20 e0 05 ff ea 7f 00 00 00 00 ... 16334 more bytes>,
_outOffset: 0,
_chunkSize: 16384,
_defaultFlushFlag: 2,
_finishFlushFlag: 2,
_defaultFullFlushFlag: 3,
_info: undefined,
_maxOutputLength: 4294967296,
_level: -1,
_strategy: 0,
[Symbol(kCapture)]: false,
[Symbol(kCallback)]: null,
[Symbol(kError)]: null
},
disturbed: false,
error: null
},
[Symbol(Response internals)]: {
url: 'https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1',
status: 200,
statusText: 'OK',
headers: Headers { [Symbol(map)]: [Object: null prototype] },
counter: 0
}
}
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]> <html lang="en-us" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="en-us"><!--<![endif]--><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title dir="ltr">Amazon.com</title>
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">
<script>
if (true === true) {
var ue_t0 = (+ new Date()),
ue_csm = window,
ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
ue_furl = "fls-na.amazon.com",
ue_mid = "ATVPDKIKX0DER",
ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
ue_sn = "opfcaptcha.amazon.com",
ue_id = 'RAQNHNVQQJA9RJJ72HGH';
}
</script>
</head>
<body>
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
<!--
Correios.DoNotSend
-->
<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">
<div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">
<div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>
<div class="a-box a-alert a-alert-info a-spacing-base">
<div class="a-box-inner">
<i class="a-icon a-icon-alert"></i>
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
<div class="a-section">
<div class="a-box a-color-offset-background">
<div class="a-box-inner a-padding-extra-large">
<form method="get" action="/errors/validateCaptcha" name="">
<input type=hidden name="amzn" value="4KRs5TchISf0cuUZ5onZdw==" /><input type=hidden name="amzn-r" value="/gp/product/1732265178/ref=ox_sc_act_image_1" />
<div class="a-row a-spacing-large">
<div class="a-box">
<div class="a-box-inner">
<h4>Type the characters you see in this image:</h4>
<div class="a-row a-text-center">
<img src="https://images-na.ssl-images-amazon.com/captcha/bcxmjlko/Captcha_ygofoltlqs.jpg">
</div>
<div class="a-row a-spacing-base">
<div class="a-row">
<div class="a-column a-span6">
</div>
<div class="a-column a-span6 a-span-last a-text-right">
<a onclick="window.location.reload()">Try different image</a>
</div>
</div>
<input autocomplete="off" spellcheck="false" placeholder="Type characters" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text">
</div>
</div>
</div>
</div>
<div class="a-section a-spacing-extra-large">
<div class="a-row">
<span class="a-button a-button-primary a-span12">
<span class="a-button-inner">
<button type="submit" class="a-button-text">Continue shopping</button>
</span>
</span>
</div>
</div>
</form>
</div>
</div>
</div>
</div>
<div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>
<div class="a-text-center a-spacing-small a-size-mini">
<a href="https://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&nodeId=508088">Conditions of Use</a>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<a href="https://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=468496">Privacy Policy</a>
</div>
<div class="a-text-center a-size-mini a-color-secondary">
© 1996-2014, Amazon.com, Inc. or its affiliates
<script>
if (true === true) {
document.write('<img src="https://fls-na.amaz'+'on.com/'+'1/oc-csi/1/OP/requestId=RAQNHNVQQJA9RJJ72HGH&js=1" />');
};
</script>
<noscript>
<img src="https://fls-na.amazon.com/1/oc-csi/1/OP/requestId=RAQNHNVQQJA9RJJ72HGH&js=0" />
</noscript>
</div>
</div>
<script>
if (true === true) {
var head = document.getElementsByTagName('head')[0],
prefix = "https://images-na.ssl-images-amazon.com/images/G/01/csminstrumentation/",
elem = document.createElement("script");
elem.src = prefix + "csm-captcha-instrumentation.min.js";
head.appendChild(elem);
elem = document.createElement("script");
elem.src = prefix + "rd-script-6d68177fa6061598e9509dc4b5bdd08d.js";
head.appendChild(elem);
}
</script>
</body></html>
{ title: 'Amazon.com', favicon: 'https://www.amazon.com/favicon.ico' }
Looks like Amazon is sometimes serving this (or "Something went wrong" etc.) instead of the real page..
Investigating further...
Maybe I need to try with some browser user agent...
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
:)
Seems to work better:
import { unfurl } from './unfurl/dist'; // local build of the forked library
const userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36';
const result = unfurl('https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1', { userAgent });
result.then(console.log);
Results:
{
description: 'The Art of Doing Science and Engineering: Learning to Learn [Richard W. Hamming, Bret Victor] on Amazon.com. *FREE* shipping on qualifying offers. The Art of Doing Science and Engineering: Learning to Learn',
keywords: [
'Richard W. Hamming',
'Bret Victor',
'The Art of Doing Science and Engineering: Learning to Learn',
'Stripe Press',
'1732265178',
'Science / Technology',
'Science / Technology'
],
title: 'The Art of Doing Science and Engineering: Learning to Learn: Richard W. Hamming, Bret Victor: 9781732265172: Amazon.com: Books',
favicon: 'https://www.amazon.com/favicon.ico'
}
But it's still strange that the actual complete data got downloaded / parsed out on one occasion...
{
"open_graph": {
"url": "https://www.amazon.com/dp/1732265178/ref=tsm_1_fb_lk",
"title": "The Art of Doing Science and Engineering: Learning to Learn",
"description": "The Art of Doing Science and Engineering: Learning to Learn",
"images": [
{
"url": "https://images-na.ssl-images-amazon.com/images/I/21leVtAEhAL._SR600%2c315_PIWhiteStrip%2cBottomLeft%2c0%2c35_PIStarRatingFOURANDHALF%2cBottomLeft%2c360%2c-6_SR600%2c315_ZA520%2c445%2c290%2c400%2c400%2cAmazonEmberBold%2c12%2c4%2c0%2c0%2c5_SCLZZZZZZZ_FMpng_BG255%2c255%2c255.jpg",
"width": 600,
"height": 315
}
],
"type": "book",
"site_name": "Amazon.com"
},
"description": "The Art of Doing Science and Engineering: Learning to Learn [Richard W. Hamming, Bret Victor] on Amazon.com. *FREE* shipping on qualifying offers. The Art of Doing Science and Engineering: Learning to Learn",
"keywords": [
"Richard W. Hamming",
"Bret Victor",
"The Art of Doing Science and Engineering: Learning to Learn",
"Stripe Press",
"1732265178",
"Science / Technology",
"Science / Technology"
],
"title": "The Art of Doing Science and Engineering: Learning to Learn: Richard W. Hamming, Bret Victor: 9781732265172: Amazon.com: Books",
"favicon": "https://www.amazon.com/favicon.ico"
}
I hope Amazon is really returning such different data... if not, I'll report further; maybe unfurl fails to parse correct Open Graph metadata in some cases...
What do you think about this: https://github.com/davidhq/unfurl/commit/36aeda9dcdb91d163f255806d4f3d599dc1195cf
{
httpStatus: 503,
title: 'Sorry! Something went wrong!',
favicon: 'https://www.amazon.com/favicon.ico'
}
Or maybe don't parse the HTML except when the status is 200 (return {} when the status is anything other than 200)?
I noticed that Amazon is very unpredictable: it returns 503 with "Sorry!", or it can return 200 with
{ title: 'Amazon.com', favicon: 'https://www.amazon.com/favicon.ico' }
so I guess part of the upgrade for this lib would indeed be not to return metadata except on 200,
and then the client of the library can decide further whether that is true metadata or not (for example, title: 'Amazon.com' is not).. I'd probably decide based on whether 'open_graph' was present... but that is already another story.
The point is that some kind of update is needed for this to be useful in my case.
Which one would you choose? Based on your suggestion I'll update the code in my fork..
if you want this upstream I can send the pull request afterwards. Thanks!
Interesting. I think we should throw if we get a 4xx or 5xx and add the status code to the error object. If you could make a PR for that it'd be great! 👍
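For illustration, a minimal sketch of what that could look like, assuming a node-fetch-style client (the HttpStatusError name is hypothetical, not unfurl's actual internals):

import fetch from 'node-fetch';

// Hypothetical error type carrying the HTTP status, per the suggestion above.
class HttpStatusError extends Error {
  constructor(readonly statusCode: number, url: string) {
    super(`Unexpected status ${statusCode} for ${url}`);
  }
}

async function fetchHtmlOrThrow(url: string): Promise<string> {
  const res = await fetch(url);
  // Throw on 4xx/5xx instead of parsing the error page; callers can inspect
  // err.statusCode and decide whether to retry or skip the URL.
  if (res.status >= 400) throw new HttpStatusError(res.status, url);
  return res.text();
}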
Is this suitable? https://github.com/davidhq/unfurl/commit/1e40869e15438b8880b97c828953191a9db3c641
You are more likely to see crawling errors on large sites like amazon, google, wikipedia, etc. A few things you can try to get around this:
- If you are using AWS/GCP/Azure then your IP will be in an ASN owned by those organisations. An organisation may choose to block IP ranges in cloud provider ASNs to reduce bot/crawling traffic. You can use a residential IP proxy to get around this, but I'd suggest only retrying failed crawls on the proxy, else it will be cost prohibitive. See https://brightdata.com/
- Bot traffic may also be detected from a lack of any user interaction on the page. This is a less common scenario and also harder to work around. There are probably purpose-built tools for this, but you could try to script puppeteer to fake some interactions, e.g. mouse clicks and movement (see the sketch below).
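A rough sketch of the second idea, assuming puppeteer (coordinates and timings are arbitrary placeholders):

import puppeteer from 'puppeteer';

// Load the page in a headless browser and fake a little user interaction
// before reading the markup.
async function fetchWithFakeInteraction(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.mouse.move(200, 300); // simulate mouse movement
    await page.mouse.move(450, 120);
    await page.evaluate(() => window.scrollBy(0, 400)); // and a scroll
    return await page.content();
  } finally {
    await browser.close();
  }
}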
Thank you for this info. I suspected this might be the case: I was scraping a few Amazon links from Digital Ocean and it all failed immediately, while scraping from my home IP worked... at first... but it looks like they added it to the list since yesterday :) now I get 403, 429 & 503!
Also from other sites:
{
httpStatus: 403,
title: 'Attention Required! | Cloudflare',
favicon: 'https://internetcomputer.org/favicon.ico'
}
{
httpStatus: 429,
title: 'Too Many Requests',
favicon: 'https://www.reddit.com/favicon.ico'
}
I have around 700 links in my personal testing database and they are all over the place, not just Amazon, Reddit, etc. It looks like Cloudflare is protecting a few of the sites, and pings to these count as a "single attacker".
Also: even yesterday Amazon was returning sporadic data, but when I used the Google spider userAgent it immediately returned complete data... until the next day (today).
I wasn't aware of brightdata.com... it might be useful, but I'm not sure exactly what to do now. Here is a little background about our project, maybe you have some input: ZetaSeek is a decentralized search engine, with ZetaSeek.com being just one of the nodes. Anyone can run their own node and collect their own links. The idea was for these links to be scraped for basic metadata (using your library already)... so people that have hundreds or thousands of links to store (but not millions) could do so. Zeta nodes connect between themselves with permanent websocket connections (if someone decides to 'follow' another node then this ws connection gets established). Each search request to a node also goes in parallel to all the nodes that node is following.
Now the idea of having a perpetual autoscan for links in the background will fail unless people running their search nodes also get a subscription to brightdata (?). Another idea I had, if scanning (scraping) from residential IPs worked, is that a public node could offload work to someone's laptop.. or even to other friendly Zeta nodes in the interconnected cluster...
I wonder if they will unblock my home IP.. if not, then such a dynamic scenario, where nodes help each other do work while some of them are not permitted to scan, would not work.
Quick update: after switching the userAgent from 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' to 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36', Amazon again returns 200 from my local IP... other services as well... BUT Amazon does not return the complete data even with 200: open_graph and twitter_card are missing.
In any case, by permuting user agents, going slower in scraping, and sharing the work inside a network, there might be a chance that for thousands of links per node (over time) scraping can mostly succeed.
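A sketch of that idea, assuming node-fetch (the user-agent pool and delay are arbitrary):

import fetch from 'node-fetch';

// Arbitrary pool of user agents to permute across requests.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
];

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Crawl slowly, rotating user agents between requests.
async function crawlSlowly(urls: string[], delayMs = 30_000): Promise<void> {
  for (const [i, url] of urls.entries()) {
    const userAgent = USER_AGENTS[i % USER_AGENTS.length];
    const res = await fetch(url, { headers: { 'User-Agent': userAgent } });
    console.log(url, res.status);
    await sleep(delayMs); // go slower to avoid tripping rate limits
  }
}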
So for this project it seems that using a third-party paid service like brightdata, or smartly working around the blocks, are the two possible options... I'm still not sure how to more or less reliably fetch the entire metadata for Amazon; this might be a problem... possibly for others as well... too bad they decide not to return it once they figure out that the request is legit. I wonder if they allowlisted the Twitter and Facebook ASNs so that social previews can be generated for sharing to those networks?
I hope this empirical info is somewhat useful... I'm posting because it's interesting to me and also to collect some input, of course, but don't feel obliged to reply, you already helped a lot with the explanation above.
I also hope that this addition to unfurl (rejecting unsuccessful HTTP status responses) is good for the project.
Amazon does not return the complete data even with 200: open_graph and twitter_card are missing.
If you can create a minimal reproducible test case I'd be happy to take a look!
Also, we should now be throwing when we see non-200 response codes (see https://github.com/jacktuck/unfurl/pull/78). This is available in unfurl.js@5.3.0.
Here it is: https://gist.github.com/davidhq/dc097bf6eeeaefee47443cdf5dde9cfa
but it will probably behave differently from your IP (?)
I saved the retrieved .html: https://uniqpath.com/temp/result_unfurl_amazon_test.html
and it doesn't seem to contain open_graph metadata, so there is probably nothing to be done..
You can try running the script with the other user agent (mimicking the Google spider):
const userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
and either get a non-200 (error) or, at first, the full response.
Possibilities according to my tests:
I've not been able to reproduce getting open graph data locally
curl https://www.amazon.co.uk/Life-After-Google-Blockchain-Economy/dp/1621575764 -H "User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" | grep "<meta"
Result:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="viewport" content="width=device-width">
https://dashboard.opengraph.io/debug (you need an account, but it's free) also confirms there are no Open Graph tags (their API does infer them, though, I think).
But anyhow, I expect you will need to play with the User-Agent header and also use a proxy (ideally residential IPs).
Might be worth giving this a read too: https://blog.hartleybrody.com/scrape-amazon/ . I'll see if there's anything more we can do in the unfurl library to help with it, but I suspect there isn't.
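For illustration, a sketch of combining a proxy with a custom user agent, assuming node-fetch plus the https-proxy-agent package (the proxy URL is a placeholder, and the import style varies by package version):

import fetch from 'node-fetch';
import { HttpsProxyAgent } from 'https-proxy-agent';

// Placeholder credentials; a residential proxy provider would supply these.
const agent = new HttpsProxyAgent('http://user:pass@proxy.example.com:8080');

async function fetchViaProxy(url: string): Promise<string> {
  const res = await fetch(url, {
    agent, // route the request through the proxy
    headers: { 'User-Agent': 'facebookexternalhit/1.1' },
  });
  return res.text();
}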
I have no issues when the user agent is "facebookexternalhit" (which is what unfurl.js will default to using):
curl https://www.amazon.co.uk/Life-After-Google-Blockchain-Economy/dp/1621575764 -H "User-Agent: facebookexternalhit/1.1" | grep "<meta"
<meta property="og:url" content="https://www.amazon.co.uk/dp/1621575764/ref=tsm_1_fb_lk" xmlns:og="http://opengraphprotocol.org/schema/" />
<meta property="og:title" content="Life After Google: The Fall of Big Data and the Rise of the Blockchain Economy" xmlns:og="http://opengraphprotocol.org/schema/" />
<meta property="og:description" content="Life After Google: The Fall of Big Data and the Rise of the Blockchain Economy" xmlns:og="http://opengraphprotocol.org/schema/" />
<meta property="og:image" content="https://images-eu.ssl-images-amazon.com/images/I/51Xy6y7I-JL._SR600%2c315_PIWhiteStrip%2cBottomLeft%2c0%2c35_PIStarRatingFOURANDHALF%2cBottomLeft%2c360%2c-6_SR600%2c315_ZA1%252C746%2c445%2c290%2c400%2c400%2cAmazonEmberBold%2c12%2c4%2c0%2c0%2c5_SCLZZZZZZZ_FMpng_BG255%2c255%2c255.jpg" xmlns:og="http://opengraphprotocol.org/schema/" />
<meta property="og:type" content="book" xmlns:og="http://opengraphprotocol.org/schema/" />
<meta property="og:site_name" content="Amazon.co.uk" xmlns:og="http://opengraphprotocol.org/schema/" />
<meta property="fb:app_id" content="465632727431967" xmlns:fb="http://www.facebook.com/2008/fbml" />
<meta property="og:image:width" content="600" xmlns:og="http://opengraphprotocol.org/schema/" />
<meta property="og:image:height" content="315" xmlns:og="http://opengraphprotocol.org/schema/" />
... and so on
❯ npx ts-node -e "require('./src/index.ts').unfurl('https://www.amazon.co.uk/Life-After-Google-Blockchain-Economy/dp/1621575764').then(console.log, console.log)"
{ open_graph:
{ url: 'https://www.amazon.co.uk/dp/1621575764/ref=tsm_1_fb_lk',
title:
'Life After Google: The Fall of Big Data and the Rise of the Blockchain Economy',
description:
'Life After Google: The Fall of Big Data and the Rise of the Blockchain Economy',
images: [ [Object] ],
type: 'book',
site_name: 'Amazon.co.uk' },
So I think once you have a proxy in place everything should work.
Wow, ok, so this is also new info. I think I started messing with userAgent because of problems doing this from Digital Ocean.. and then I mixed everything up; now it's clear what was happening. So I think userAgent is best left as it was, and I'll focus on residential IPs and then ship the metadata to the Digital Ocean server instance dynamically. Later I'll play with proxies if needed.
Amazon did indeed sporadically return open_graph results with custom user agents, but it looks unpredictable.
Yes, I believe nothing more can / should be done inside unfurl... I'll experiment further and just hope residential IPs don't casually get blocked with very non-intensive scanning.
What I still see as problematic is this:
Suppose I have a "decentralized social network", maybe something like Mastodon, and I have a publish box like on Twitter. When I paste something in there, I won't be able to get a social card, because Reddit, Amazon and others block any requests from the Digital Ocean ASN... so is this expected and normal? So even in this case there needs to be a proxy of some sort, just to get a preview a few times per day or even week.
Suppose I have a "decentralized social network", maybe something like Mastodon, and I have a publish box like on Twitter. When I paste something in there, I won't be able to get a social card, because Reddit, Amazon and others block any requests from the Digital Ocean ASN... so is this expected and normal?
Yes, that is expected. FWIW, Slack also doesn't do unfurling for Amazon links, as far as I can tell.
So even in this case there needs to be a proxy of some sort, just to get a preview a few times per day or even week.
Yes, if you want to support unfurling sites like Amazon, Reddit, etc. I'd recommend approaching this from a progressive-enhancement perspective instead, though, and not worrying about sites that are restricting crawling.
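A sketch of the progressive-enhancement approach: try to unfurl, and fall back to a bare link when the crawl is blocked or fails:

import { unfurl } from 'unfurl.js';

type Preview = { url: string; title?: string; description?: string };

// A failed unfurl (blocked crawl, non-200, timeout) just means the pasted
// link renders without a social card.
async function previewOrBareLink(url: string): Promise<Preview> {
  try {
    const meta = await unfurl(url);
    return { url, title: meta.title, description: meta.description };
  } catch {
    return { url }; // no card; show the plain link
  }
}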
👌
Hello again!
unfurl.js returned the output above on one occasion when scraping https://www.amazon.com/gp/product/1732265178/ref=ox_sc_act_image_1; now it works better and returns correct metadata.
My question is: did all of that come from Amazon? I don't think that title is something that comes from unfurl or any associated libraries... so if that's the case, nothing can be done except to retry, right?
The problem is that I cannot know (automatically) which data to retry... is there a good way to detect such cases based on something else, perhaps the HTTP status code coming from the server?
I'd like to check for that and not save any metadata like the example above to the database.
Thank you!
UPDATE:
Amazon returns 503 with HTML that contains the above "Sorry! ..." title.
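A sketch of the save-time check discussed in this thread: only persist metadata when the status was 200 and open_graph is present (the Metadata shape is simplified from the dumps above; shouldSave is hypothetical glue, not part of unfurl):

// Simplified shape of the metadata objects shown in the dumps above.
interface Metadata {
  httpStatus?: number;
  title?: string;
  open_graph?: { title?: string; url?: string };
}

// Heuristic from this thread: a 200 without open_graph (e.g. the bare
// { title: 'Amazon.com' } result) is treated as a blocked or partial crawl
// and not saved, so the link can be retried later.
function shouldSave(meta: Metadata): boolean {
  if (meta.httpStatus !== undefined && meta.httpStatus !== 200) return false;
  return meta.open_graph !== undefined;
}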