Closed by mwengren 7 years ago
This is good. Please keep @robragsdale and me posted on what you decide.
@lukecampbell Can you make sure the web server for data.ioos.us is configured to capture the Apache-style 'combined' log format, with monthly log rotation, starting in March if possible and by April if not?
We actually use nginx for the HTTP server for data.ioos.us. I'll take a look.
123.123.123.123 - - -@@@@ /usr/share/nginx/html/api/i18n/en [06/Mar/2017:17:40:32 +0000] "GET /api/i18n/en HTTP/1.1" 200 2 "https://data.ioos.us/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" "-"
123.123.123.123 - - -@@@@ /usr/share/nginx/html/images/icons/ckan.ico [06/Mar/2017:17:40:33 +0000] "GET /images/icons/ckan.ico HTTP/1.1" 404 12671 "https://data.ioos.us/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" "-"
I just changed the rotation period from 10 days to 40 days. I'll add a monthly cron job to archive these to s3.
I knew you guys used nginx, but the point was to make sure the logs follow the 'combined' format as specified by Apache HTTPD. The referrer is especially important, and I don't think I see it in the above log examples, only UserAgent info.
Can nginx log referrers?
This is with nginx set to "combined"
123.123.123.123 - - [06/Mar/2017:19:16:21 +0000] "GET / HTTP/1.1" 200 4018 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
123.123.123.123 - - [06/Mar/2017:19:16:21 +0000] "GET /api/i18n/en HTTP/1.1" 200 2 "https://data.ioos.us/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
123.123.123.123 - - [06/Mar/2017:19:16:25 +0000] "GET /robots.txt HTTP/1.1" 200 137 "-" "ltx71 - (http://ltx71.com/)"
123.123.123.123 - - [06/Mar/2017:19:16:26 +0000] "GET /dataset HTTP/1.1" 200 8144 "https://data.ioos.us/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
123.123.123.123 - - [06/Mar/2017:19:16:26 +0000] "GET /api/i18n/en HTTP/1.1" 200 2 "https://data.ioos.us/dataset" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
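To check that the referrer really is present in lines like the ones above, here is a minimal Python sketch that splits a 'combined' line into named fields. The regular expression and the name `parse_combined` are illustrative assumptions, not part of any existing tooling for data.ioos.us, and the pattern does not handle escaped quotes inside a field.

```python
import re

# Layout of the "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of fields for one combined-format line, or None."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# One of the sample lines from this thread.
sample = (
    '123.123.123.123 - - [06/Mar/2017:19:16:26 +0000] '
    '"GET /dataset HTTP/1.1" 200 8144 "https://data.ioos.us/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"'
)
fields = parse_combined(sample)
print(fields["referer"])  # https://data.ioos.us/
```

A referrer of "-" (as in the robots.txt request) means the client sent no Referer header.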
Is there a way to set nginx to rotate logs monthly automatically? In Apache, you can do this with the cronolog program. Monthly files would be easier for us to submit to our logfile analysis POC, who expects monthly logs.
Luke,
I'm going to just paste the NOS' requirements in this ticket, so you can decide what the best way to go is. They can be posted automatically via ftp to NOS for analysis, but archiving them on S3 as well would be a good idea (although they may need to be restricted for privacy reasons, not sure).
NOS weblog site:
Web Log File Format
Web logs come in a number of formats. Each month (or at other times, depending on circumstances), the NOS Web Admin team (nos.webadmin@noaa.gov) collects Web log files from all NOS and NOS-partner sites, and processes them in a uniform manner.
In order to get the most value from these logs, servers should be configured to log the following W3C tokens (or their NCSA or IIS equivalents):
DATE
TIME: time can be either UTC or local, as long as it doesn't change
CS-URI: the URI portion of a request (the portion after the host name)
C-IP: client (visitor) IP address
BYTES: bytes sent to visitor.
SC-STATUS: status code (i.e., 200, 404, etc.)
CS(REFERER): site and page visitor clicked on to link to your page
CS(USER-AGENT): browser or robot name and version
CS-USERNAME: of use for sites requiring authentication
TIME-TAKEN: time required to send response
CS(HOST): name of host site requested by the visitor
CS-URI-QUERY: CGI (Common Gateway Interface) arguments
CS-METHOD: request method (GET, PUT, etc.)
CS-VERSION: HTTP protocol version number.
Since the Web statistics are taken directly from the logs, not logging Web browser type, for example, will result in blank reports on Web browser usage.
Log Naming Conventions
Virtually all Web servers save their logs in a file named access.log. In order to distinguish one server from another, and sort the logs in a consistent order, please use the following naming convention:
nameofsite+year+month
as in:
oceanservice.noaa.gov201001.gz
oceanservice.noaa.gov201002.gz
oceanservice.noaa.gov201003.gz
which would indicate log files from oceanservice.noaa.gov for January, February and March of 2010.
Compression Format
If you wish to submit log files as a zipped directory of individual days or weeks, that is fine. But ideally logs should be gzipped. Gzip does not support directories (it is a compression format, not an archiving format) but it does support decompression on the fly in memory, which means that terabytes of log files can be analyzed without using terabytes of drive space. Concatenating daily files into a file for an entire month, and gzipping the file, speeds the process of generating Web metrics.
Submitting Logs
If you don't have a directory where your logs can be snatched automatically after the first of the month, please submit your logs here:
ftp://ftpnos.woc.noaa.gov/incoming/charters/
You can submit them using anonymous FTP or, with an account, via WebDAV.
This will take some work, I'll throw it on our backlog.
Just a heads up early on, too: data.ioos.us is home to a lot more than just CKAN. It also hosts the compliance-checker and GliderDAC products.
Are these fields in a particular order? Apache by default logs the majority of these in the "combined" format, which can be specified in the configuration.
See: https://httpd.apache.org/docs/1.3/logs.html#combined
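To see which of the NOS tokens 'combined' leaves out, here is a rough Python sketch. The format string is nginx's predefined "combined" format; the token-to-variable mapping is my own assumption about which standard nginx variable carries each token (the method, URI, query string and protocol version all arrive together in `$request`).

```python
# nginx's predefined "combined" log format.
COMBINED = ('$remote_addr - $remote_user [$time_local] "$request" '
            '$status $body_bytes_sent "$http_referer" "$http_user_agent"')

# Assumed mapping from each NOS token to an nginx variable that carries it.
TOKEN_TO_VAR = {
    "DATE": "$time_local",
    "TIME": "$time_local",
    "CS-URI": "$request",
    "C-IP": "$remote_addr",
    "BYTES": "$body_bytes_sent",
    "SC-STATUS": "$status",
    "CS(REFERER)": "$http_referer",
    "CS(USER-AGENT)": "$http_user_agent",
    "CS-USERNAME": "$remote_user",
    "TIME-TAKEN": "$request_time",
    "CS(HOST)": "$host",
    "CS-URI-QUERY": "$request",
    "CS-METHOD": "$request",
    "CS-VERSION": "$request",
}

# Tokens whose variable does not appear anywhere in the combined format.
missing = [tok for tok, var in TOKEN_TO_VAR.items() if var not in COMBINED]
print(missing)  # ['TIME-TAKEN', 'CS(HOST)']
```

Under this mapping, only the response time and the requested host name are absent, which matches "the majority of these".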
Edit: did see the earlier comment regarding combined logs, but got a bit confused because the application server for Catalog/CKAN is serving via Apache, and so can output logs. I haven't looked at the production config in a while, but IIRC, we're also reverse proxying several services through Nginx, which I presume is what is being discussed here.
@mwengren, with respect to rotating logs monthly, usually it is the responsibility of another application to inspect the timestamps, i.e. logrotate for pure *nix, and Java applications have a number of logging frameworks which I believe can also take care of this for those cases. From that point, there are usually some compression and renaming facilities.
@benjwadams The Apache 'combined' format is fine. I just wanted to list the full NOS page that describes their requirements/recommendations. We've sent combined format logs to them for years and it's sufficient.
The important parts are the monthly rotation and log file naming. We should call our logs by domain name 'data.ioos.us.201703.log.gz' etc, and then post them on S3 for archive.
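A minimal Python sketch of the naming and compression steps, assuming the rotated logs are plain-text files on local disk. The helper names `monthly_name` and `concat_to_gzip` are hypothetical. The name pattern follows the 'data.ioos.us.201703.log.gz' example above, which differs slightly from the dot-free NOS examples.

```python
import gzip
import shutil
from pathlib import Path


def monthly_name(site, year, month):
    """Build a name like 'data.ioos.us.201703.log.gz'."""
    return f"{site}.{year:04d}{month:02d}.log.gz"


def concat_to_gzip(log_paths, dest):
    """Concatenate plain-text log files, in the order given, into one gzip file."""
    with gzip.open(dest, "wb") as out:
        for path in log_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(dest)


print(monthly_name("data.ioos.us", 2017, 3))  # data.ioos.us.201703.log.gz
```

The caller is responsible for passing the input files in chronological order so the monthly file stays sorted by time.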
Ideally, if you guys could script the upload step to ftp://ftpnos.woc.noaa.gov/incoming/charters/ on the first of the month, that would be great. For that to be automated, we need to make sure the logs are named distinctly.
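The upload step could be sketched in Python with the standard-library `ftplib`, assuming anonymous FTP is still accepted at that host and path. The function names `upload_log` and `submit` are hypothetical, and `submit` is untested against the real server.

```python
from ftplib import FTP
from pathlib import Path


def upload_log(ftp, local_path, remote_dir="/incoming/charters/"):
    """Upload one file in binary mode over an already-connected FTP session."""
    local_path = Path(local_path)
    ftp.cwd(remote_dir)
    with open(local_path, "rb") as fh:
        ftp.storbinary(f"STOR {local_path.name}", fh)


def submit(local_path, host="ftpnos.woc.noaa.gov"):
    """Log in anonymously and upload; intended to run from a monthly cron job."""
    with FTP(host) as ftp:
        ftp.login()  # no arguments means anonymous login
        upload_log(ftp, local_path)
```

Because the remote file keeps the local file name, distinct monthly names prevent one upload from overwriting another.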
I'm assuming these are primarily access logs for the time being. Are error logs desired as well?
We're mostly interested in the access logs for archive and analysis. Not sure whether there's benefit to archive the error logs. You can probably skip those for now.
I'm going to close this one as I think we've pretty much wrapped it up.
We should rotate web server logs on a monthly basis and archive either in place or on an external server (if available).