
Amazon CloudFront Log Parser

Changelog

1.4

1.3

1.2

1.1

License

This library is released under the Apache 2 License. See LICENSE.

Description

CloudFront Log Parser is a Java library offering a low-complexity, high-performance, adaptive CloudFront log parser for both the Download and Streaming CloudFront log formats.

The library is "adaptive" in the sense that it determines the format of the log files at parse time; you don't need to tell it. It is additionally resilient in that it can skip unknown Field names and values while parsing instead of throwing an exception. This is important if the library is deployed in a production setting and Amazon changes the CloudFront log format before a new version of the library is released.

CloudFront Log Parser's API was designed with the existing AWS Java SDK in mind, ensuring that integration fits naturally. More specifically, the API can directly consume the raw InputStream for the stored .gz log files on S3, as returned by the S3Object.getObjectContent() method.

That being said, the API is coded to accept ANY InputStream, whether you are processing local copies (passing FileInputStreams) or remote ones; it just happens to be particularly easy to integrate with the AWS Java SDK.
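Below is a minimal sketch of that S3 integration, assuming the AWS SDK for Java v1. The LogParser, ILogParserCallback and ILogEntry types are the library's own (described under Design below), but the parse(...) signature and the logEntryParsed callback method name are assumptions used for illustration; check the Javadoc for the exact API. The bucket and key names are placeholders.

    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;

    // Plus the library's own imports (package omitted here).
    public class S3ParseExample {
        public static void main(String[] args) throws Exception {
            AmazonS3Client s3 = new AmazonS3Client();

            // Bucket and key are placeholders.
            S3Object log = s3.getObject("my-log-bucket", "distribution.2011-01-01-00.abcd1234.gz");

            // CloudFront stores its logs gzipped; getObjectContent() returns the
            // raw InputStream, so wrap it for decompression before parsing.
            InputStream in = new GZIPInputStream(log.getObjectContent());

            try {
                // parse(...) signature and callback method name are assumed.
                new LogParser().parse(in, new ILogParserCallback() {
                    public void logEntryParsed(ILogEntry entry) {
                        // Work with the entry here; see the Design section for
                        // lifecycle rules before storing anything.
                    }
                });
            } finally {
                in.close();
            }
        }
    }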

CloudFront Log Parser is intended to be used in any deployment scenario from a desktop analysis app to a long-running server process environment.

Design

CloudFront Log Parser was designed, first and foremost, to be as fast as possible with as little memory allocation as possible so as to work smoothly in a long-running server log-processing usage scenario.

Object creation is kept to a minimum during parsing (no matter how big the parse job) by re-using a single ILogEntry instance, per LogParser, to wrap parsed line values and report those to the given ILogParserCallback.

The ILogEntry instance received by the callback is ephemeral: it is only valid for the scope of the callback's method. Once that method returns, the instance is reused and the values it holds are swapped out.

CALLBACKS SHOULD NEVER HOLD ONTO ILOGENTRY INSTANCES!

However, the values reported by the callbacks are copies and can be safely stored.
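The safe pattern, sketched below, is to copy the values you need inside the callback rather than keeping the entry itself. The logEntryParsed method name, the Field constant and the getValue(...) accessor are assumptions used for illustration; check the ILogEntry Javadoc for the real accessors.

    import java.util.ArrayList;
    import java.util.List;

    // Plus the library's own imports (package omitted here).
    public class ClientIPCollector implements ILogParserCallback {
        private final List<String> clientIPs = new ArrayList<String>();

        // Method name assumed; see the library's ILogParserCallback definition.
        public void logEntryParsed(ILogEntry entry) {
            // WRONG: storing the entry itself -- the instance is reused and its
            // values are swapped out as soon as this method returns.
            // entries.add(entry);

            // RIGHT: copy out the value you need; reported values are copies
            // and safe to keep. Field.CLIENT_IP / getValue(...) are hypothetical.
            clientIPs.add(String.valueOf(entry.getValue(Field.CLIENT_IP)));
        }

        public List<String> getClientIPs() {
            return clientIPs;
        }
    }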

This design was chosen because, over large parse jobs for busy sites where millions of log entries are not uncommon, the heap memory savings and performance improvement are noticeable. The VM is not thrown into longer GC cycles cleaning up millions of useless, short-lived objects.

Performance

Benchmarks can be found in the /src/test/java folder and can be run directly from the command line (no need to set up JUnit).

[Platform]

[Benchmark Results]

  Parsed 100 Log Entries in 27ms (3703 entries / sec)
  Parsed 100,000 Log Entries in 864ms (115740 entries / sec)
  Parsed 174,200 Log Entries in 1341ms (129903 entries / sec)
  Parsed 1,000,000 Log Entries in 7520ms (132978 entries / sec)

The Amazon CloudFront docs say log files are truncated at a maximum size of 50MB (uncompressed) before they are written out to the log directory. The 3rd test, parsing 174,200 log entries, uses a file that is exactly 50MB uncompressed and matches this worst-case scenario.

This means that the largest log file CloudFront will write out to your S3 bucket can be parsed in a little over a second on equivalent hardware.

If you are running on beefier server hardware you can increase that rate, and if you are parsing logs in a multi-threaded environment (one thread per LogParser, as sketched below) you can increase it by orders of magnitude.
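A minimal sketch of that multi-threaded setup, assuming local .gz copies of the logs and the same assumed parse(InputStream, ILogParserCallback) signature as above; each task creates its own LogParser so no parser state is shared across threads.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.zip.GZIPInputStream;

    // Plus the library's own imports (package omitted here).
    public class ParallelParseExample {
        public static void parseAll(List<String> gzLogFiles, final ILogParserCallback callback)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());

            for (final String file : gzLogFiles) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // One LogParser per task; parser instances (and their
                            // reused ILogEntry) are never shared across threads.
                            LogParser parser = new LogParser();
                            InputStream in = new GZIPInputStream(new FileInputStream(file));
                            try {
                                // The shared callback must be thread-safe, or give
                                // each task its own callback instance.
                                parser.parse(in, callback);
                            } finally {
                                in.close();
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }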

CloudFront Log Parser is fast.

Runtime Requirements

  1. The Buzz Media common-lib (tbm-common-lib-*.jar)

  2. The Buzz Media common-parser-lib (tbm-common-parser-lib-*.jar)

History

After deploying apps that utilized CloudFront for content delivery, I had the need to parse the resulting access logs to get an idea of what kind of traffic, bandwidth and access patterns the content was receiving.

Amazon promotes the use of their Map/Reduce Hadoop-based log parser multiple times on their site, but that requires additional EC2 charges to run.

After about a week of prototyping and engineering I had initial versions of the CloudFront Log Parser written and running. Initial "does it work" prototypes took an afternoon, but I toyed with a multitude of different API designs and approaches trying to best balance an easy-to-use API with runtime performance.

Eventually settling on what was by far the cleanest and simplest API, I used HPROF to tighten up the runtime performance and cut object creation down to the bare minimum.

I hope this helps folks out there.