Something like this may be of interest...
grok {
  match => [ 'content_type', '(?<m_content_type[type]>[^/]+)/(?<m_content_type[subtype]>[^\s;]+)(;\s+%{GREEDYDATA:m_content_type[parameter]})?' ]
}
kv {
  source => "m_content_type[parameter]"
  target => "m_content_type[parameter]"
}
Which would result in...
{
  "content_type": "text/html; charset=\"utf8\"; param1=other",
  "m_content_type": {
    "type": "text",
    "subtype": "html",
    "parameter": {
      "charset": "utf8",
      "param1": "other"
    }
  }
}
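One caveat: the kv filter splits key-value pairs on whitespace by default, while Content-Type parameters are semicolon-delimited and often quoted. A sketch of the extra options that may be needed (untested; trim_key / trim_value are the option names in more recent versions of the kv filter):
kv {
  source => "m_content_type[parameter]"
  target => "m_content_type[parameter]"
  # pairs are separated by ";" rather than by whitespace
  field_split => ";"
  # drop the space that follows each ";"
  trim_key => " "
  # strip surrounding quotes, e.g. charset="utf8" -> utf8
  trim_value => "\" "
}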
With the change above, we're attempting to insert parsed data like this:
{
  "@version" => "1",
  "@timestamp" => "2014-06-20T12:03:05.000Z",
  "@source.host" => "yoast.wsynth.net",
  "@source.path" => "/var/log/nginx/yoast.com-googlebot-access.json",
  "@source.offset" => "32672994",
  "@type" => "googlebot",
  "@message" => "{ \"@timestamp\": \"2014-06-20T05:03:05-07:00\", \"remote_addr\": \"66.220.152.115\", \"body_bytes_sent\": 19403, \"request_time\": 0.857, \"status\": 200, \"robots\": \"-\", \"redirect_location\": \"-\", \"request_method\": \"GET\", \"scheme\": \"https\", \"server_name\": \"yoast.com\", \"request_uri\": \"/google-panda-robots-css-js/?utm_content=buffer80caa&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer\", \"content_type\": \"text/html; charset=UTF-8\", \"document_uri\": \"/index.php\", \"http_user_agent\": \"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)\" }",
  "remote_addr" => "66.220.152.115",
  "body_bytes_sent" => 19403,
  "request_time" => 0.857,
  "status" => 200,
  "robots" => "-",
  "redirect_location" => "-",
  "request_method" => "GET",
  "scheme" => "https",
  "server_name" => "yoast.com",
  "request_uri" => "/google-panda-robots-css-js/?utm_content=buffer80caa&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer",
  "content_type" => {
    "charset" => "utf-8",
    "type" => "text/html"
  },
  "document_uri" => "/index.php",
  "http_user_agent" => "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
  "remote_addr_dns" => "66.220.152.115"
}
But ES is throwing this error:
[2014-06-20 12:02:28,786][DEBUG][action.bulk ] [Jack Flag] [logstash-2014.06.20][3] failed to execute bulk item (index) index {[logstash-2014.06.20][googlebot][5pdGMnXyR8a09-fBmiSYTQ], source[{"@version":"1","@timestamp":"2014-06-20T12:03:07.000Z","@source.host":"yoast.wsynth.net","@source.path":"/var/log/nginx/yoast.com-googlebot-access.json","@source.offset":"32673553","@type":"googlebot","@message":"{ \"@timestamp\": \"2014-06-20T05:03:07-07:00\", \"remote_addr\": \"66.220.152.117\", \"body_bytes_sent\": 19381, \"request_time\": 0.879, \"status\": 200, \"robots\": \"-\", \"redirect_location\": \"-\", \"request_method\": \"GET\", \"scheme\": \"https\", \"server_name\": \"yoast.com\", \"request_uri\": \"/google-panda-robots-css-js/?utm_content=buffer80caa&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer\", \"content_type\": \"text/html; charset=UTF-8\", \"document_uri\": \"/index.php\", \"http_user_agent\": \"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)\" }","remote_addr":"66.220.152.117","body_bytes_sent":19381,"request_time":0.879,"status":200,"robots":"-","redirect_location":"-","request_method":"GET","scheme":"https","server_name":"yoast.com","request_uri":"/google-panda-robots-css-js/?utm_content=buffer80caa&utm_medium=social&utm_source=facebook.com&utm_campaign=buffer","content_type":{"charset":"utf-8","type":"text/html"},"document_uri":"/index.php","http_user_agent":"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)","remote_addr_dns":"66.220.152.117"}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [content_type]
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:418)
    at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:517)
    at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:459)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:515)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:462)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareCreate(InternalIndexShard.java:371)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:400)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:153)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:556)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:426)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: unknown property [charset]
    at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateFieldForString(StringFieldMapper.java:331)
    at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateField(StringFieldMapper.java:277)
    at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:408)
    ... 12 more
Which is because we changed the field type of content_type from a String to an Object.
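Until the pre-change indices roll off, one workaround (just a sketch, not what's deployed here; content_type_parsed is a hypothetical field name) would be to expose the parsed hash under a fresh name, so documents written to the old, string-mapped indices never carry an object called content_type:
mutate {
  # move the parsed hash to a field that has no existing string
  # mapping, leaving old indices free of the type conflict
  rename => [ "content_type", "content_type_parsed" ]
}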
This won't happen on new indexes (Logstash starts a fresh logstash-YYYY.MM.DD index each day).
Content-Type is now parsed into a hash like this:
"content_type" => {
"charset" => "utf-8",
"type" => "text/html"
}
I've gone through and updated the affected dashboards to display content_type.type.
Note that this means those dashboards / dashboard panels won't display any data from before the change (i.e., before 2014-06-20).
@jdevalk - Does this update address your original issue sufficiently?
Yeah looks like it :) Makes it a hell of a lot easier to read ;)
Almost makes me sad I use a CDN for serving images ;)
@jdevalk - Does your CDN give you logs?
It should be fairly straightforward to write a script that downloads & imports those (say, every hour?)
I use MaxCDN, they have a raw logs API: https://docs.maxcdn.com/#raw-logs-api
(it can even match on user-agent it seems)
@jdevalk - could you write a script that gets the MaxCDN logs periodically and writes them to a file on yoast.com?
Since the MaxCDN logs look like they come in JSON format (yay!), shipping them into Logsearch should be very straightforward.
I've created a script that grabs the MaxCDN logs every 15 minutes; they're in /var/log/maxcdn/
It already filters for Googlebot, which should save some overhead.
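For reference, a 15-minute schedule in cron would look something like this (the exact script path is an assumption, based on the repo location mentioned further down):
# m h dom mon dow  command
*/15 * * * * php /var/www/yoast.com/research/maxcdn/logs.php >> /root/crontab.log 2>&1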
@jdevalk - just taking a look at /var/log/maxcdn/googlebot-logs-2014-06-30.json
e.g.:
{
  "bytes": 29803,
  "client_asn": "AS15169 Google Inc.",
  "client_city": "Mountain View",
  "client_continent": "NA",
  "client_country": "US",
  "client_dma": "0",
  "client_ip": "66.249.67.220",
  "client_latitude": 37.38600158691406,
  "client_longitude": -122.08380126953125,
  "client_state": "CA",
  "company_id": 85,
  "cache_status": "HIT",
  "hostname": "cdn.yoast.com",
  "method": "GET",
  "origin_time": 0,
  "pop": "vir",
  "protocol": "HTTP/1.1",
  "query_string": "",
  "referer": "-",
  "scheme": "https",
  "status": 200,
  "time": "2014-06-30T08:50:26.189Z",
  "uri": "/wp-content/uploads/2008/04/Permalink-Settings.jpg",
  "user_agent": "Googlebot-Image/1.0",
  "zone_id": 33008
},
{
  "bytes": 46953,
  "client_asn": "AS15169 Google Inc.",
  "client_city": "Mountain View",
  "client_continent": "NA",
  "client_country": "US",
  "client_dma": "0",
  "client_ip": "66.249.67.220",
  "client_latitude": 37.38600158691406,
  "client_longitude": -122.08380126953125,
  "client_state": "CA",
  "company_id": 85,
  "cache_status": "MISS",
  "hostname": "cdn.yoast.com",
  "method": "GET",
  "origin_time": 0.024,
  "pop": "vir",
  "protocol": "HTTP/1.1",
  "query_string": "",
  "referer": "-",
  "scheme": "https",
  "status": 200,
  "time": "2014-06-30T08:40:45.159Z",
  "uri": "/wp-content/uploads/2009/10/apple-404.png",
  "user_agent": "Googlebot-Image/1.0",
  "zone_id": 33008
},
Is there any way we could get the script to output each log entry on a separate line, and strip the trailing comma from each line?
E.g., the above would become:
{ "bytes": 29803, "client_asn": "AS15169 Google Inc.", "client_city": "Mountain View", "client_continent": "NA", "client_country": "US", "client_dma": "0", "client_ip": "66.249.67.220", "client_latitude": 37.38600158691406, "client_longitude": -122.08380126953125, "client_state": "CA", "company_id": 85, "cache_status": "HIT", "hostname": "cdn.yoast.com", "method": "GET", "origin_time": 0, "pop": "vir", "protocol": "HTTP/1.1", "query_string": "", "referer": "-", "scheme": "https", "status": 200, "time": "2014-06-30T08:50:26.189Z", "uri": "/wp-content/uploads/2008/04/Permalink-Settings.jpg", "user_agent": "Googlebot-Image/1.0", "zone_id": 33008 }
{ "bytes": 46953, "client_asn": "AS15169 Google Inc.", "client_city": "Mountain View", "client_continent": "NA", "client_country": "US", "client_dma": "0", "client_ip": "66.249.67.220", "client_latitude": 37.38600158691406, "client_longitude": -122.08380126953125, "client_state": "CA", "company_id": 85, "cache_status": "MISS", "hostname": "cdn.yoast.com", "method": "GET", "origin_time": 0.024, "pop": "vir", "protocol": "HTTP/1.1", "query_string": "", "referer": "-", "scheme": "https", "status": 200, "time": "2014-06-30T08:40:45.159Z", "uri": "/wp-content/uploads/2009/10/apple-404.png", "user_agent": "Googlebot-Image/1.0", "zone_id": 33008 }
Or even better; point me at the script, and I'll do it :smile:
Do a crontab -l and you'll see where it is ;-)
Aha!
I'll make changes to that, and then get the data shipped in.
And, I see,
0 */4 * * * /etc/init.d/logstash-forwarder restart >> /root/crontab.log 2>&1
Which explains why logstash-forwarder has been more "stable" recently. :) Were you a Windows Sysadmin in a past life?
No but I'm very much a "simple solutions" kinda guy ;-)
@jdevalk I made the following changes to the script (after turning the folder into a git repo):
root@yoast.wsynth.net:/var/www/yoast.com/research/maxcdn# git diff HEAD^ HEAD
diff --git a/logs.php b/logs.php
index ca47e4b..5d0a223 100644
--- a/logs.php
+++ b/logs.php
@@ -13,10 +13,14 @@ $params = array(
     'status' => '200,301,302,303,307,404,503'
 );
-$logs = $api->get('/v3/reporting/logs.json', $params);
+$logs_text = $api->get('/v3/reporting/logs.json', $params);
+$logs_json = json_decode( $logs_text );
 $file = fopen( '/var/log/maxcdn/googlebot-logs-'.date('Y-m-d').'.json', 'a' );
-fwrite( $file, $logs . "\n" );
+foreach ($logs_json->records as $hit_number => $hit) {
+    fwrite( $file, json_encode( $hit ) . "\n" );
+}
+
 fclose();
I've added /var/log/maxcdn/googlebot-logs-*.json to the files being shipped, and made a very basic dashboard - MaxCDN.
Currently, the data isn't being parsed correctly, but at least we have some ;)
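The parsing itself should be simple now that each line is a self-contained JSON object; a minimal sketch (untested, and assuming the shipped events arrive with a hypothetical type of maxcdn and the raw line in the message field):
filter {
  if [type] == "maxcdn" {
    json {
      # each log line is one complete JSON object
      source => "message"
    }
    date {
      # index on MaxCDN's own timestamp rather than the ingest time
      match => [ "time", "ISO8601" ]
    }
  }
}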
Can I do anything on the parsing side? :)
@jdevalk - I'm relocating this conversation to https://github.com/logsearch/logsearch-filters-seo/pull/7
You can definitely help - feel free to jump in and contribute code; or just provide direction.
The content-type header could do with some server-side filtering; right now it's stored as-is, which causes this: just lowercasing it all and then stripping out the charset and the trailing semicolon would be cool. Perhaps storing the charset in a separate variable would be nice too :)
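Something along these lines might do it on the Logstash side (a sketch, untested; media_type and content_type_charset are hypothetical field names):
filter {
  mutate {
    # normalise case so "Text/HTML" and "text/html" collapse together
    lowercase => [ "content_type" ]
  }
  grok {
    # capture the bare type/subtype, plus an optional charset parameter
    match => [ 'content_type', '^(?<media_type>[^;\s]+)(;\s*charset="?(?<content_type_charset>[^";]+)"?)?' ]
  }
  mutate {
    # keep only the stripped media type in content_type itself
    replace => [ "content_type", "%{media_type}" ]
    remove_field => [ "media_type" ]
  }
}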