logsearch / logsearch-filters-seo


Split charset out of content-type header #1

Closed: mrdavidlaing closed this issue 10 years ago

mrdavidlaing commented 10 years ago

The content-type header could do with some server-side filtering; right now it's stored as-is, which causes this: [screenshot]. Just lowercasing it all and then stripping out the charset and trailing semicolon would be cool. Perhaps storing the charset in a separate field would be nice too :)

dpb587 commented 10 years ago

Something like this may be of interest...

grok {
  # split the raw header into type, subtype, and an optional parameter
  # string, all nested under m_content_type
  match => [ 'content_type', '(?<m_content_type[type]>[^/]+)/(?<m_content_type[subtype]>[^\s;]+)(;\s+%{GREEDYDATA:m_content_type[parameter]})?' ]
}

kv {
  # parse the parameter string (e.g. charset="utf8"; param1=other) into
  # key/value pairs, replacing the field in place
  source => "m_content_type[parameter]"
  target => "m_content_type[parameter]"
}

Which would result in...

{
  "content_type": "text/html; charset=\"utf8\"; param1=other",
  "m_content_type": {
    "type": "text",
    "subtype": "html",
    "parameter": {
      "charset": "utf8",
      "param1": "other"
    }
  }
}
mrdavidlaing commented 10 years ago

With the change above, we're attempting to insert parsed data like this:

{
             "@version" => "1",
           "@timestamp" => "2014-06-20T12:03:05.000Z",
         "@source.host" => "yoast.wsynth.net",
         "@source.path" => "/var/log/nginx/yoast.com-googlebot-access.json",
       "@source.offset" => "32672994",
                "@type" => "googlebot",
             "@message" => "{ \"@timestamp\": \"2014-06-20T05:03:05-07:00\", \"remote_addr\": \"66.220.152.115\", \"body_bytes_sent\": 19403, \"request_time\": 0.857, \"status\": 200, \"robots\": \"-\", \"redirect_location\": \"-\", \"request_method\": \"GET\", \"scheme\": \"https\", \"server_name\": \"yoast.com\", \"request_uri\": \"/google-panda-robots-css-js/?utm_content=buffer80caa&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer\", \"content_type\": \"text/html; charset=UTF-8\", \"document_uri\": \"/index.php\", \"http_user_agent\": \"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)\" }",
          "remote_addr" => "66.220.152.115",
      "body_bytes_sent" => 19403,
         "request_time" => 0.857,
               "status" => 200,
               "robots" => "-",
    "redirect_location" => "-",
       "request_method" => "GET",
               "scheme" => "https",
          "server_name" => "yoast.com",
          "request_uri" => "/google-panda-robots-css-js/?utm_content=buffer80caa&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer",
         "content_type" => {
        "charset" => "utf-8",
           "type" => "text/html"
    },
         "document_uri" => "/index.php",
      "http_user_agent" => "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
      "remote_addr_dns" => "66.220.152.115"
}

But ES is throwing this error:

[2014-06-20 12:02:28,786][DEBUG][action.bulk              ] [Jack Flag] [logstash-2014.06.20][3] failed to execute bulk item (index) index {[logstash-2014.06.20][googlebot][5pdGMnXyR8a09-fBmiSYTQ], source[{"@version":"1","@timestamp":"2014-06-20T12:03:07.000Z","@source.host":"yoast.wsynth.net","@source.path":"/var/log/nginx/yoast.com-googlebot-access.json","@source.offset":"32673553","@type":"googlebot","@message":"{ \"@timestamp\": \"2014-06-20T05:03:07-07:00\", \"remote_addr\": \"66.220.152.117\", \"body_bytes_sent\": 19381, \"request_time\": 0.879, \"status\": 200, \"robots\": \"-\", \"redirect_location\": \"-\", \"request_method\": \"GET\", \"scheme\": \"https\", \"server_name\": \"yoast.com\", \"request_uri\": \"/google-panda-robots-css-js/?utm_content=buffer80caa&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer\", \"content_type\": \"text/html; charset=UTF-8\", \"document_uri\": \"/index.php\", \"http_user_agent\": \"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)\" }","remote_addr":"66.220.152.117","body_bytes_sent":19381,"request_time":0.879,"status":200,"robots":"-","redirect_location":"-","request_method":"GET","scheme":"https","server_name":"yoast.com","request_uri":"/google-panda-robots-css-js/?utm_content=buffer80caa&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer","content_type":{"charset":"utf-8","type":"text/html"},"document_uri":"/index.php","http_user_agent":"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)","remote_addr_dns":"66.220.152.117"}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [content_type]
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:418)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:517)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:459)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:515)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:462)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:371)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:400)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:153)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:556)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:426)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: unknown property [charset]
    at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateFieldForString(StringFieldMapper.java:331)
    at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateField(StringFieldMapper.java:277)
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:408)
    ... 12 more
mrdavidlaing commented 10 years ago

Which is because we changed the field type of content_type from a string to an object. Elasticsearch fixes a field's mapping the first time it indexes that field, so today's index, which already mapped content_type as a string, rejects documents where it's an object.

This won't happen on new indexes, since they'll map content_type as an object from the first document they receive.

mrdavidlaing commented 10 years ago

Content-Type is now parsed into a hash like this:

 "content_type" => {
        "charset" => "utf-8",
           "type" => "text/html"
    }

I've gone through and updated the affected dashboards to display content_type.type.

Note that this means those dashboards / dashboard panels won't display any data from before the change (i.e., before 2014-06-20).
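
For reference, a minimal sketch of a filter that could produce a hash of this shape (the actual filter as committed may differ; the temporary ct_type / ct_charset names are just placeholders):

filter {
  mutate {
    lowercase => [ 'content_type' ]
  }
  grok {
    # capture the MIME type and an optional charset into temporary fields
    match => [ 'content_type', '^(?<ct_type>[^;]+?)(?:;\s*charset="?(?<ct_charset>[^";]+)"?)?$' ]
  }
  mutate {
    # drop the raw string so the same field name can be recreated as a hash
    remove_field => [ 'content_type' ]
  }
  mutate {
    rename => [ 'ct_type', '[content_type][type]', 'ct_charset', '[content_type][charset]' ]
  }
}

The raw string has to be removed before the rename, since subfields can't be hung off a field that already holds a string.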

mrdavidlaing commented 10 years ago

@jdevalk - Does this update address your original issue sufficiently?

jdevalk commented 10 years ago

Yeah looks like it :) Makes it a hell of a lot easier to read ;)

Almost makes me sad I use a CDN for serving images ;)

mrdavidlaing commented 10 years ago

@jdevalk - Does your CDN give you logs?

It should be fairly straightforward to write a script that downloads & imports those (say, every hour?)

jdevalk commented 10 years ago

I use MaxCDN, they have a raw logs API: https://docs.maxcdn.com/#raw-logs-api

jdevalk commented 10 years ago

(it can even match on user-agent, it seems)

mrdavidlaing commented 10 years ago

@jdevalk - could you write a script that gets the MaxCDN logs periodically and writes them to a file on yoast.com?

Since the MaxCDN logs look like they come in JSON format (yay!), shipping them into Logsearch should be very straightforward.
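
As a rough sketch, assuming the logs end up as one JSON object per line somewhere Logstash can read them (the path and the maxcdn type below are assumptions), the input side could be as simple as:

input {
  file {
    path => "/var/log/maxcdn/googlebot-logs-*.json"
    # one JSON object per line decodes straight into event fields
    codec => "json"
    type => "maxcdn"
  }
}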

jdevalk commented 10 years ago

I've created a script that grabs the MaxCDN logs every 15 minutes; they're in /var/log/maxcdn/.

It already filters for Googlebot, which should save some overhead.

mrdavidlaing commented 10 years ago

@jdevalk - just taking a look at /var/log/maxcdn/googlebot-logs-2014-06-30.json, e.g.:

 {
      "bytes": 29803,
      "client_asn": "AS15169 Google Inc.",
      "client_city": "Mountain View",
      "client_continent": "NA",
      "client_country": "US",
      "client_dma": "0",
      "client_ip": "66.249.67.220",
      "client_latitude": 37.38600158691406,
      "client_longitude": -122.08380126953125,
      "client_state": "CA",
      "company_id": 85,
      "cache_status": "HIT",
      "hostname": "cdn.yoast.com",
      "method": "GET",
      "origin_time": 0,
      "pop": "vir",
      "protocol": "HTTP/1.1",
      "query_string": "",
      "referer": "-",
      "scheme": "https",
      "status": 200,
      "time": "2014-06-30T08:50:26.189Z",
      "uri": "/wp-content/uploads/2008/04/Permalink-Settings.jpg",
      "user_agent": "Googlebot-Image/1.0",
      "zone_id": 33008
    },
    {
      "bytes": 46953,
      "client_asn": "AS15169 Google Inc.",
      "client_city": "Mountain View",
      "client_continent": "NA",
      "client_country": "US",
      "client_dma": "0",
      "client_ip": "66.249.67.220",
      "client_latitude": 37.38600158691406,
      "client_longitude": -122.08380126953125,
      "client_state": "CA",
      "company_id": 85,
      "cache_status": "MISS",
      "hostname": "cdn.yoast.com",
      "method": "GET",
      "origin_time": 0.024,
      "pop": "vir",
      "protocol": "HTTP/1.1",
      "query_string": "",
      "referer": "-",
      "scheme": "https",
      "status": 200,
      "time": "2014-06-30T08:40:45.159Z",
      "uri": "/wp-content/uploads/2009/10/apple-404.png",
      "user_agent": "Googlebot-Image/1.0",
      "zone_id": 33008
    },

Is there any way we could get the script to output each log entry on a separate line and strip the trailing comma from each line?

E.g., the above would become:

{  "bytes": 29803, "client_asn": "AS15169 Google Inc.", "client_city": "Mountain View", "client_continent": "NA", "client_country": "US", "client_dma": "0", "client_ip": "66.249.67.220", "client_latitude": 37.38600158691406, "client_longitude": -122.08380126953125, "client_state": "CA", "company_id": 85, "cache_status": "HIT", "hostname": "cdn.yoast.com", "method": "GET", "origin_time": 0, "pop": "vir", "protocol": "HTTP/1.1", "query_string": "", "referer": "-", "scheme": "https", "status": 200, "time": "2014-06-30T08:50:26.189Z", "uri": "/wp-content/uploads/2008/04/Permalink-Settings.jpg", "user_agent": "Googlebot-Image/1.0", "zone_id": 33008 }
{  "bytes": 46953, "client_asn": "AS15169 Google Inc.", "client_city": "Mountain View", "client_continent": "NA", "client_country": "US", "client_dma": "0", "client_ip": "66.249.67.220", "client_latitude": 37.38600158691406, "client_longitude": -122.08380126953125, "client_state": "CA", "company_id": 85, "cache_status": "MISS", "hostname": "cdn.yoast.com", "method": "GET", "origin_time": 0.024, "pop": "vir", "protocol": "HTTP/1.1", "query_string": "", "referer": "-", "scheme": "https", "status": 200, "time": "2014-06-30T08:40:45.159Z", "uri": "/wp-content/uploads/2009/10/apple-404.png", "user_agent": "Googlebot-Image/1.0", "zone_id": 33008     }
mrdavidlaing commented 10 years ago

Or even better: point me at the script, and I'll do it :smile:

jdevalk commented 10 years ago

Do a crontab -l and you'll see where it is ;-)


mrdavidlaing commented 10 years ago

Aha!

I'll make changes to that, and then get the data shipped in.

And I see:

0 */4 * * * /etc/init.d/logstash-forwarder restart >> /root/crontab.log 2>&1

Which explains why logstash-forwarder has been more "stable" recently. :) Were you a Windows Sysadmin in a past life?

jdevalk commented 10 years ago

No, but I'm very much a "simple solutions" kinda guy ;-)


mrdavidlaing commented 10 years ago

@jdevalk I made the following changes to the script (after turning the folder into a git repo):

root@yoast.wsynth.net:/var/www/yoast.com/research/maxcdn# git diff HEAD^ HEAD

diff --git a/logs.php b/logs.php
index ca47e4b..5d0a223 100644
--- a/logs.php
+++ b/logs.php
@@ -13,10 +13,14 @@ $params = array(
        'status' => '200,301,302,303,307,404,503'
 );

-$logs = $api->get('/v3/reporting/logs.json', $params);
+$logs_text = $api->get('/v3/reporting/logs.json', $params);
+$logs_json = json_decode ( $logs_text );

 $file = fopen( '/var/log/maxcdn/googlebot-logs-'.date('Y-m-d').'.json', 'a' );
-fwrite( $file, $logs . "\n" );
+foreach ($logs_json->records as $hit_number => $hit) {
+  fwrite( $file, json_encode ($hit ) . "\n" );
+}
+

 fclose( $file );

I've added /var/log/maxcdn/googlebot-logs-*.json to the files being shipped.

And I've made a very basic dashboard: MaxCDN.

Currently, the data isn't being parsed correctly, but at least we have some ;)
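
If it helps, a first guess at a parsing step would be pointing a date filter at MaxCDN's time field, so @timestamp reflects the hit rather than the import (the maxcdn type name is an assumption):

filter {
  if [type] == "maxcdn" {
    date {
      # MaxCDN timestamps are ISO8601, e.g. 2014-06-30T08:50:26.189Z
      match => [ "time", "ISO8601" ]
    }
  }
}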

jdevalk commented 10 years ago

Can I do anything on the parsing side? :)

mrdavidlaing commented 10 years ago

@jdevalk - I'm relocating this conversation to https://github.com/logsearch/logsearch-filters-seo/pull/7

You can definitely help: feel free to jump in and contribute code, or just provide direction.