google / physical-web

The Physical Web: walk up and use anything
http://physical-web.org
Apache License 2.0

LinkedIn Browser Bug #580

Closed rochforp closed 8 years ago

rochforp commented 8 years ago

I've noticed that when broadcasting an Eddystone-URL for some LinkedIn profiles through a URL shortener like bit.ly or goo.gl, the title, description and favicon are not populated in some Physical Web browsers such as Chrome and PhyWeb. I'm trying to figure out why, but the behavior appears to differ a little depending on the mobile OS and the browser. I looked a little further and thought it could be occurring somewhere in PWMetadataRequest.m, but then happened upon this Stack Overflow post (https://stackoverflow.com/questions/32450569/linkedin-isnt-letting-me-google-users-anymore-sentinel-org-block#) about the problems with the sentinel org tool they are using. Has anyone else noticed this, or does anyone have ideas about fixing it?

mmocny commented 8 years ago

Can you produce a sample URL which you have on a beacon?

Then I can give you a curl command which uses our Physical Web Service (PWS) to resolve the url into metadata. All the PW clients use a PWS to actually get page metadata so the errors shouldn't be client specific.

If LinkedIn is doing something specific to block google-bot from crawling its pages, this could explain the issue.

Jerren34 commented 8 years ago

Can the source code be loaded to my google app engine? @mmocny

mmocny commented 8 years ago

@Jerren34 you can find a sample PWS (https://github.com/google/physical-web/tree/master/web-service) hosted right in this project.

It is written using python GAE, which you can run yourself locally or publish to your GAE account.

The Physical Web standalone apps use a version of this sample PWS hosted by us on GAE, but the Physical Web feature of Chrome uses a different PWS backend.

Jerren34 commented 8 years ago

I'm trying to upload it to App Engine right now but I'm having trouble. Do you mind helping me? I can't get it to load and don't know what I'm doing wrong.

Jerren34 commented 8 years ago

I have a Google appspot account I'm using now. I have downloaded the App Engine Launcher, Python 2.7, and the App Engine SDK, but I can't get it up and running. Mind helping me?

rochforp commented 8 years ago

@mmocny sure, this is a shortened LinkedIn profile page: goo.gl/v6QFk9. As you can see through the Physical Web app, it doesn't render the metadata the way other broadcast, shortened site URLs do. On iOS, PhyWeb shows the correct URL but does not pull any metadata (screenshot: ios_linkedin_bug). Android doesn't even show the link in the list of Physical Web beacons, and Chrome enabled on the iOS Today screen does the same thing: it doesn't render any of the data.

kevinahuber commented 8 years ago

@rochforp Have you seen examples of this with anything other than LinkedIn?

tolson2000 commented 8 years ago

That particular landing page for goo.gl/v6QFk9 does not have the required metadata. There is no description meta tag on the landing page (meta name="description" content="...), and without it the main thing, the description, is not displayed. The page does have the other tag needed, but it is buried among a lot of other metadata and scripts, and the scraper seems to get stuck on it.

Observations: PW list on iOS shows shortened URL, yet sometimes shows LANDING URL. Don't know why. PW list on Android shows LANDING URL normally, but on this one only shortened URL with loading... Seems stuck.

It seems the LinkedIn site isn't optimized for PW queries, yet.
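A rough illustration of the point above, not the actual PWS scraper: the Python sketch below pulls the title and the description meta tag out of an HTML string using only the standard library. The names `PageMetadataParser` and `extract_metadata` are hypothetical, and the real service may look at other tags or apply fallbacks that are not modeled here.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Collects the first <title> text and <meta name="description"> content.

    Hypothetical helper for illustration; the real PWS scraper may differ.
    """

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            # Keep only the first description tag that appears.
            if name == "description" and self.description is None:
                self.description = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            if self.title is None:
                self.title = "".join(self._title_parts).strip()

    def handle_data(self, data):
        # Script and style bodies also arrive here, but are ignored
        # because we only record text while inside <title>.
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html):
    """Return a dict with 'title' and 'description' (None when missing)."""
    parser = PageMetadataParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "description": parser.description}
```

A page that carries neither tag comes back with both fields set to `None`, which is the situation a client then has to decide how to display.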

kevinahuber commented 8 years ago

@tolson2000 How would that differ from a bot coming along? Could we implement the same methods?

If LinkedIn is not optimized for a title/description/favicon PW query, we should probably augment the PW to include older query methods. I would suspect that LinkedIn is not unique.

rochforp commented 8 years ago

@kahjav So far it's been isolated to just LinkedIn, but I'm concerned that it may indicate a problem that exists on other sites as well.

mmocny commented 8 years ago

First, just a quick aside: Chrome for Android will only show results which link to https pages. It looks like your short url does redirect to https://www.linkedin.com/in/tlytle so this should be fine. Chrome for iOS still supports non-https but we will be making a switch soon. The Physical Web app will likely continue to show all results.

Second, some of our clients will show the raw URL if we fail to fetch page metadata. The intention was that local-intranet-only or local-development-machine-only URLs, which our PWS could not fetch, should still be available. However, our direction these days is to just filter these results out and we will add some "advanced" views where you will be able to see all results. This is less developer friendly but more user friendly. This may explain why iOS Physical Web app shows just the green link, while other clients show nothing.

Finally, the root of the problem appears to be that the PWS is not resolving your url:

$ curl -k -s -H "Content-Type: application/json" -d '{"objects":[{"url":"http://goo.gl/v6QFk9"}]}' http://url-caster.appspot.com/resolve-scan | python -m json.tool
{
    "metadata": []
}

I will attempt to debug and resolve this issue. Thank you for raising it!
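For anyone who prefers Python to curl, here is a minimal sketch of the same resolve-scan call. The endpoint and the request body shape are taken from the curl command above; the helper names `build_resolve_scan_request` and `metadata_entries` are hypothetical, and the hosted service may not be reachable, so the sketch only builds the request and parses a response string.

```python
import json
import urllib.request

# Endpoint copied from the curl command in this thread.
RESOLVE_SCAN_URL = "http://url-caster.appspot.com/resolve-scan"


def build_resolve_scan_request(urls, endpoint=RESOLVE_SCAN_URL):
    """Build (but do not send) a POST request equivalent to the curl call."""
    body = json.dumps({"objects": [{"url": u} for u in urls]}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def metadata_entries(response_text):
    """Return the 'metadata' list from a resolve-scan response body.

    An empty list means the service resolved none of the submitted URLs.
    """
    return json.loads(response_text).get("metadata", [])


# To actually send the request (requires network access and a live service):
#   with urllib.request.urlopen(build_resolve_scan_request(["http://goo.gl/v6QFk9"])) as r:
#       print(metadata_entries(r.read().decode("utf-8")))
```

With the response shown above, `metadata_entries` returns an empty list, which matches the "not resolving" diagnosis.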

mmocny commented 8 years ago

This may not be the only bug, but it appears that LinkedIn returns a different page depending on the requesting user agent. The page being served to PWS right now may be an empty page with only inline javascript and no semantic page information at all.

$ curl -L http://goo.gl/v6QFk9
<html><head>
<script type="text/javascript">
window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "sentinel_org_block";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
      trk = cookies[i].substring(8);
    }
  }

  // Get the protocol for the redirect url.
  var protocol = "http:";
  if (window.location.protocol == "https:") {
    protocol = "https:";
  } else {
    // If "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }

  // Get the new domain. For touch.www.linkedin.com or tablet.www.linkedin.com
  // we strip "touch." and "tablet.". For international domains such as
  // fr.linkedin.com, we convert it to www.linkedin.com
  var domain = location.host;
  if (domain.substr(0, 6) == "touch.") {
    domain = domain.substr(6);
  } else if (domain.substr(0, 7) == "tablet.") {
    domain = domain.substr(7);
  } else if (domain.charAt(2) == ".") {
    domain = "www" + domain.substr(2);
  }

  // D8E90337EA is the tracking code proposed by Harsh, representing guest request redirected
  // to login.
  window.location.href = protocol + "//" + domain + "/uas/login?trk=" + trk + "&session_redirect=" +
      encodeURIComponent(protocol + "//" + domain +
      window.location.href.substr(window.location.href.search(window.location.host) +
                                  window.location.host.length));
}
</script>
</head></html>

Our current PWS does not evaluate JavaScript. We do not run the page through a headless browser.

However, once I change the User-Agent header I get the real data:

$ curl -L -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "user-agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.48 Safari/537.36" https://www.linkedin.com/in/tlytle
<...Lots of HTML...>

We already set a custom user agent, but I will see about adjusting it slightly.
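The user-agent experiment above can be reproduced in Python along these lines. This is a sketch under assumptions: `build_request`, `looks_like_script_stub` and `fetch` are hypothetical names, the stub check is a crude heuristic rather than anything the PWS does, and the User-Agent string is the one from the curl command.

```python
import urllib.request

# User-Agent string copied from the curl command in this thread.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/48.0.2564.48 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header.

    When no User-Agent is set here, urllib adds its own default
    ("Python-urllib/x.y") at the time the request is opened.
    """
    headers = {}
    if user_agent is not None:
        headers["User-Agent"] = user_agent
    return urllib.request.Request(url, headers=headers)


def looks_like_script_stub(html):
    """Crude heuristic: a page with a script but no title and no body."""
    lowered = html.lower()
    return (
        "<script" in lowered
        and "<title" not in lowered
        and "<body" not in lowered
    )


def fetch(request, timeout=10):
    """Send the request (redirects are followed) and return decoded HTML."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch` once with and once without `BROWSER_UA`, then passing both results to `looks_like_script_stub`, is one way to see whether a site serves a JavaScript-only page to non-browser clients.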

mmocny commented 8 years ago

FYI: Our current User Agent

kevinahuber commented 8 years ago

@mmocny When does/will Chrome check for https? Would non-https URL shorteners still be able to be utilized if the destination is https?

tolson2000 commented 8 years ago

@kahjav The PW is specifically looking for the "description" for a page. Places like LinkedIn and Facebook don't bother making a "description" for a user's info page. In the case of Facebook, when they are hit by a bot their server will substitute a generic description for their users, effectively just saying JoeBlow is on Facebook, and then throw a commercial at you: Join now to connect. Kind of useless for a PW concept. Companies like realtors don't always have a "description" field for their employee pages, and for bots will substitute maybe something like JoeBlow is a realtor for ... If you want a page for your resume, you need to make one specifically with the tags and metadata fields needed to work with PW. Probably what we need to do is ask LinkedIn to provide meaningful "description" metadata.

Here's my test page. It's a shortened URL with http which redirects to an https page on another server. Here again, though, on my iPod it shows the shortened URL and on Android it shows the landing page URL. http://io.ivt.com/to9

EDIT: Forgot to mention on that test page there is no "description" meta data. Yet the PWS made a description from my h1 and other hl links. Not sure how that works.

mmocny commented 8 years ago

@kahjav Coincidentally I just got this question! Yes, at the moment you can use an http redirector to link to an https destination. The https requirement applies only to the final URL, and it is there to protect the user (so that a compromised network cannot replace the page contents, effectively having Chrome send users to the wrong destination). Since the redirects are followed on the PWS, we are less concerned about compromised networks.
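The policy described above, where only the final URL must be https, can be expressed as a small check. This is a sketch of the rule as stated in this comment, not Chrome's actual code; `final_url_is_https` is a hypothetical name, and the redirect chain is assumed to be a list of URLs in the order they were followed.

```python
from urllib.parse import urlsplit


def final_url_is_https(redirect_chain):
    """Return True when the last URL in the chain uses the https scheme.

    Earlier hops (for example an http URL shortener) are not checked,
    mirroring the policy that only the destination must be https.
    """
    if not redirect_chain:
        return False
    return urlsplit(redirect_chain[-1]).scheme == "https"
```

For the beacon in this thread, the chain from http://goo.gl/v6QFk9 to https://www.linkedin.com/in/tlytle passes the check even though the first hop is plain http.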

tolson2000 commented 8 years ago

Yeah, I don't know if I agree with requiring the landing page to be https. That isn't always possible. Will you be allowing self-signed certs?

mmocny commented 8 years ago

Alright, it looks like we are failing to resolve LinkedIn because they are explicitly blocking crawling: HTTP/1.1 999 Request denied. I haven't pursued, nor do I intend to pursue, ways to circumvent this.

mmocny commented 8 years ago

@tolson2000 We don't currently do much certificate verification, but if you send a user to such a page the browser will likely put up a big red flag, so it is unlikely to be a good idea outside of local testing. And for local testing we hope to make it easier to just show all results.

To be clear: the https requirement is not a big change to physical web or the Eddystone-URL BLE frame format. It is only a policy decision by the developers of Chrome browser for its integration of the Physical Web feature, and meant to safeguard its users.