google / physical-web

The Physical Web: walk up and use anything
http://physical-web.org
Apache License 2.0

Physical web and server requests #871

Open ajitam opened 7 years ago

ajitam commented 7 years ago

Hello,

we've set up a small project here in the city using physical web.

Setup is:

So the requests look something like this:

{google ip} - - [{date}] "GET /123/some-message HTTP/1.1" 200 339 "-" "Mozilla/5.0 (Google-PhysicalWeb)"
{user ip} - - [{date}] "GET /123/some-message HTTP/1.1" 200 500 "-" "Mozilla/5.0 (Linux; Android 6.0.1; SM-G935F ..."

A day after launch we noticed something strange: every message "received" about 4 hits within a span of roughly 1 hour. We checked the log, and these were the requests:

{google ip} - - [{date}] "GET /123/some-message HTTP/1.1" 200 500 "http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0DQQASAA&url=https%3A%2F%2Fplatform.com%2Fb%2Fabc&ei=sf7JsWLfyMrmylQHVOA&usg=ssd&jsAKDDAQfa8SdEV730vcPvKvrQ" "Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG SM-G930P ..."

Things to notice:

Can somebody shed some light on this thing?

p.s.: we have a similar project but with only two beacons which is running for three weeks now and we never got such requests.

adriancretu commented 7 years ago

I think there was a similar issue about hits from Google IPs with a browser User-Agent. I don't remember whether it got a definitive answer, but I suspect that Google crawls (and periodically reindexes) your website with a mobile user-agent to determine whether the page is truly mobile-optimized. As we should all know by now, Google ranks down sites that are not mobile-oriented, and I really can't see how they could do that without fetching the page somehow, e.g. with a mobile browser user-agent, so that you can't distinguish it server-side and do special processing just for Google scans.

mmocny commented 7 years ago

Great observations. I think this question is similar to #785, and you can look at my answer here.

Summarizing:

Most likely, this is just an expected side effect of some of our backend systems. We do periodic scans to make sure we have up to date content, and to help filter out shady stuff. All of the behaviours you describe are not unexpected.

Specifically, we use the Google-PhysicalWeb User agent for periodic scans which come from bots, but periodically we index from other sources. We may do more of that in the future.
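
As a rough illustration of separating the declared scans from everything else in an access log, here is a minimal Python sketch. It assumes the server writes Combined Log Format lines like the ones quoted earlier in the thread; the names `LOG_LINE`, `classify`, and `summarize` are made up for this example. It only recognises the Google-PhysicalWeb user agent mentioned above, so fetches that arrive with an ordinary browser user agent will still land in "other".

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [date] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]*)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def classify(line):
    """Label one log line as 'pws-scan', 'other', or 'unparsed'."""
    match = LOG_LINE.match(line)
    if match is None:
        return "unparsed"
    # The only signal used here is the user agent named in the thread.
    if "Google-PhysicalWeb" in match.group("agent"):
        return "pws-scan"
    # 'other' is NOT proof of a human visitor; see the discussion above.
    return "other"


def summarize(lines):
    """Count how many lines fall into each label."""
    return Counter(classify(line) for line in lines)
```

A call such as `summarize(open("access.log"))` (file name assumed) returns a `Counter` keyed by those three labels.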

ajitam commented 7 years ago

Hello,

thank you for the reply. We completely support scanning pages to test for mobile-friendliness, response time, which technologies are used, and so on. After all, we all want to build a better, faster, safer web.

I just wanted to confirm that this is Google's bot/script/process and not some third-party user app (or maybe PW in the Chrome application).

I would still love it if somebody could explain the referrer part of the request, just for general understanding.

thx. matija

p.s.: maybe you can mark this issue as 'question'

adriancretu commented 7 years ago

There seems to be a lot of "noise" coming from Google's bots after a beacon is detected, which may give the impression that people are actively observing and interacting with it, when in reality nobody actually opens the link. Let's take a simple example:

  • advertise a URL (let's call it originalURL) that redirects to a secret finalURL
  • wait for the first PWS hit
  • stop the beacon and disable originalURL so it doesn't work any longer

Results:

  • PWS will try to access originalURL from time to time for days to come, even if the return code is 404
  • finalURL will start being crawled a lot by AdsBot-Google and GoogleBot
  • more critically, finalURL will be periodically crawled by all kinds of mobile user-agents: iOS Safari, Samsung Internet, Chrome Mobile, with varying device model names, browser versions, Android versions, etc. The only thing in common is Google's large IP range.

Conclusion: from an analytics perspective this would falsely give the impression that the beacon is somehow popular, when in fact zero interactions occurred.

scottjenson commented 7 years ago

Thanks Adrian, I'm passing this along to others who might know more.

Scott

scottjenson commented 7 years ago

Adrian,

It has taken me a while to track down the correct information; sorry for the delay. It appears that once the Google crawlers notice your site, a few services check it for a range of issues (e.g. malware, page weight). This is true for any site in the Google index and, according to my contact, shouldn't be a heavy burden on your site. If you are seeing heavy traffic, please let us know and we can investigate further.

Scott


aakashjain commented 7 years ago

Scott, this issue is affecting me as well.

As Adrian said, it causes traffic to a beacon's URL, and there's no definitive way to figure out whether that traffic is from a crawler or from users getting in range of the beacon.

Even though it's a small number of hits per day, it makes a lot of difference on beacons that experience a small number of actual users getting in range. This is messing with my team's analytics.

It would be really great if there was some documented way to tell if a hit is coming from Google's crawlers.
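
One partial approach, sketched here under assumptions: Google's published guidance for verifying Googlebot is a reverse DNS lookup on the client IP, a check that the hostname ends in `googlebot.com` or `google.com`, and then a forward lookup to confirm the hostname maps back to the same IP. Nothing in this thread confirms that the mobile-user-agent fetches described above resolve to those hostnames, so treat this as an experiment rather than a definitive test. The function name `is_google_host` and the injectable `reverse`/`forward` parameters are hypothetical, added only so the logic can be exercised without network access.

```python
import socket

# Hostname suffixes taken from Google's Googlebot verification guidance.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_host(ip, reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Return True if ip passes a reverse-then-forward DNS check."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, IPv4 addresses).
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup must map back to the original IP; otherwise
    # the PTR record could simply be spoofed.
    return ip in addresses
```

Note that `socket.gethostbyname_ex` only resolves IPv4 addresses, so IPv6 clients would need a different forward lookup.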

adriancretu commented 7 years ago

I guess from Google's perspective it's just a feature you'd have to live with ("this is how the internet should work") or find workarounds that make sure an actual browser is visiting the page (and pray that the Googlebot doesn't also simulate this). For tighter validation you'd have to mess around with some BLE 2-way communication (if the phone is still anywhere near the beacon). From a developer perspective, yes it sucks that we have no idea what happens to a beacon's URL after it ends up in Google's index. From a business perspective, it downgrades the beacon to just a channel to pass over a public URL, as if you were to type it in a browser or scan it via a QR code... The bottom line is that once a beacon's URL is known, it's no longer a beacon URL, but a publicly known cached website...
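
Since the one common trait reported earlier in the thread is the source IP range, one workaround is to drop hits from known ranges before counting. A minimal sketch, assuming you maintain your own list of CIDR blocks to exclude: the blocks used in the usage note are documentation placeholders (RFC 5737), not Google's actual ranges, and the function names are hypothetical.

```python
import ipaddress


def parse_networks(cidrs):
    """Turn CIDR strings into network objects once, up front."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_excluded(ip, networks):
    """Return True if ip falls inside any of the excluded networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


def count_visitor_hits(hit_ips, networks):
    """Count hits whose source IP is outside every excluded network."""
    return sum(1 for ip in hit_ips if not is_excluded(ip, networks))
```

For example, with `networks = parse_networks(["192.0.2.0/24"])`, the call `count_visitor_hits(["192.0.2.7", "198.51.100.3"], networks)` counts only the second address. This filters by origin only; it cannot tell a crawler outside the listed ranges from a real visitor.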