logsearch / logsearch-filters-seo

Apache License 2.0

Parse Googlebot logs from MaxCDN #7

Open mrdavidlaing opened 10 years ago

mrdavidlaing commented 10 years ago

It's possible to get logs of Googlebot traffic to MaxCDN via the MaxCDN API. This gives source logs in the following format:

{  "bytes": 46953, "client_asn": "AS15169 Google Inc.", "client_city": "Mountain View", "client_continent": "NA", "client_country": "US", "client_dma": "0", "client_ip": "66.249.67.220", "client_latitude": 37.38600158691406, "client_longitude": -122.08380126953125, "client_state": "CA", "company_id": 85, "cache_status": "MISS", "hostname": "cdn.yoast.com", "method": "GET", "origin_time": 0.024, "pop": "vir", "protocol": "HTTP/1.1", "query_string": "", "referer": "-", "scheme": "https", "status": 200, "time": "2014-06-30T08:40:45.159Z", "uri": "/wp-content/uploads/2009/10/apple-404.png", "user_agent": "Googlebot-Image/1.0", "zone_id": 33008     }

These should be parsed into a format that makes them easy to analyse.
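As a rough illustration of that parsing step, here is a minimal Python sketch. It is not the logsearch filter itself: the function name `parse_maxcdn_line` is hypothetical, the sample line is a shortened copy of the record above, and copying `time` into an `@timestamp` field is an assumption about the target shape.

```python
import json
from datetime import datetime, timezone


def parse_maxcdn_line(line):
    """Decode one MaxCDN JSON log line and attach a timezone-aware timestamp."""
    event = json.loads(line)
    # The "time" field in the sample is ISO 8601 in UTC with millisecond precision.
    parsed = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%S.%fZ")
    # Assumption: downstream analysis wants the event time under "@timestamp".
    event["@timestamp"] = parsed.replace(tzinfo=timezone.utc)
    return event


# Shortened copy of the sample record, keeping only a few fields.
sample = (
    '{"client_ip": "66.249.67.220", "status": 200, '
    '"time": "2014-06-30T08:40:45.159Z", '
    '"uri": "/wp-content/uploads/2009/10/apple-404.png", '
    '"user_agent": "Googlebot-Image/1.0"}'
)

event = parse_maxcdn_line(sample)
print(event["@timestamp"].isoformat())
```

Numeric fields such as `status` and `bytes` already arrive as JSON numbers, so only the timestamp needs explicit conversion here.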

mrdavidlaing commented 10 years ago

A very basic json filter gives the following:

  '@type': googlebot-maxcdn
  '@message': '{"bytes":0,"client_asn":"AS16509 Amazon.com, Inc.","client_city":"-","client_continent":"EU","client_country":"IE","client_dma":"0","client_ip":"54.247.60.162","client_latitude":53,"client_longitude":-8,"client_state":"-","company_id":85,"cache_status":"MISS","hostname":"cdn.yoast.com","method":"HEAD","origin_time":0.471,"pop":"lhr","protocol":"HTTP\/1.1","query_string":"","referer":"-","scheme":"https","status":200,"time":"2014-07-01T05:10:50.388Z","uri":"\/wp-content\/uploads\/2007\/12\/blogmetrics02.png","user_agent":"Googlebot\/2.1
    (+http:\/\/www.google.com\/bot.html)","zone_id":33008}'
  '@version': '1'
  '@timestamp': 2014-07-01 06:10:50.388000000 +01:00
  bytes: 0
  client_asn: AS16509 Amazon.com, Inc.
  client_city: '-'
  client_continent: EU
  client_country: IE
  client_dma: '0'
  client_ip: 54.247.60.162
  client_latitude: 53
  client_longitude: -8
  client_state: '-'
  company_id: 85
  cache_status: MISS
  hostname: cdn.yoast.com
  method: HEAD
  origin_time: 0.471
  pop: lhr
  protocol: HTTP/1.1
  query_string: ''
  referer: '-'
  scheme: https
  status: 200
  time: '2014-07-01T05:10:50.388Z'
  uri: /wp-content/uploads/2007/12/blogmetrics02.png
  user_agent: Googlebot/2.1 (+http://www.google.com/bot.html)
  zone_id: 33008

Compare this to @type:googlebot, which has the following shape:

  '@type': googlebot
  '@message': '{ "content_type": "text/xml; charset=UTF-8", "@timestamp": "2014-06-19T21:54:20-07:00",
    "remote_addr": "66.249.69.45", "body_bytes_sent": 38704, "request_time": 1.539,
    "status": 200, "robots": "noindex,follow", "redirect_location": "-", "request_method":
    "GET", "scheme": "https", "server_name": "yoast.com", "request_uri": "/cat/wordpress/feed/",
    "document_uri": "/index.php", "http_user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1;
    +http://www.google.com/bot.html)" }'
  '@version': '1'
  '@timestamp': 2014-06-20 04:54:20.000000000 Z
  content_type:
    charset: utf-8
    type: text/xml
  remote_addr: 66.249.69.45
  body_bytes_sent: 38704
  request_time: 1.539
  status: 200
  robots: noindex,follow
  redirect_location: '-'
  request_method: GET
  scheme: https
  server_name: yoast.com
  request_uri: /cat/wordpress/feed/
  document_uri: /index.php
  http_user_agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  remote_addr_dns: crawl-66-249-69-45.googlebot.com

I think we should rename the @type:googlebot-maxcdn fields to match those of @type:googlebot.
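To make the proposal concrete, a possible mapping is sketched below in Python. The pairs are my reading of the two event shapes above, not an agreed schema, and `FIELD_RENAMES` and `rename_fields` are hypothetical names. `origin_time` is deliberately left out, since it may not measure the same thing as `request_time`; `status` and `scheme` already match.

```python
# Assumed mapping from googlebot-maxcdn field names to googlebot field names.
FIELD_RENAMES = {
    "client_ip": "remote_addr",
    "bytes": "body_bytes_sent",
    "method": "request_method",
    "hostname": "server_name",
    "uri": "request_uri",
    "user_agent": "http_user_agent",
}


def rename_fields(event, renames=FIELD_RENAMES):
    """Return a copy of event with mapped keys renamed; other keys are kept."""
    return {renames.get(key, key): value for key, value in event.items()}


# A few fields taken from the googlebot-maxcdn event above.
maxcdn_event = {
    "client_ip": "54.247.60.162",
    "bytes": 0,
    "method": "HEAD",
    "hostname": "cdn.yoast.com",
    "uri": "/wp-content/uploads/2007/12/blogmetrics02.png",
    "status": 200,
    "pop": "lhr",
}

renamed = rename_fields(maxcdn_event)
print(sorted(renamed))
```

Fields with no counterpart in @type:googlebot, such as `pop`, pass through unchanged, so nothing from the MaxCDN record is lost.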

@jdevalk - do you agree?