ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

Archive monthly data.ioos.us web logs #133

Closed mwengren closed 7 years ago

mwengren commented 7 years ago

We should rotate web server logs on a monthly basis and archive either in place or on an external server (if available).

jbosch-noaa commented 7 years ago

This is good. Please keep @robragsdale and me posted on what you decide.

mwengren commented 7 years ago

@lukecampbell Can you make sure the web server for data.ioos.us is configured to capture Apache-style 'combined' log format for March (if possible), and if not by April? With monthly log rotation?

lukecampbell commented 7 years ago

We actually use nginx for the HTTP server for data.ioos.us. I'll take a look.

lukecampbell commented 7 years ago
123.123.123.123 - - -@@@@ /usr/share/nginx/html/api/i18n/en [06/Mar/2017:17:40:32 +0000] "GET /api/i18n/en HTTP/1.1" 200 2 "https://data.ioos.us/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" "-"
123.123.123.123 - - -@@@@ /usr/share/nginx/html/images/icons/ckan.ico [06/Mar/2017:17:40:33 +0000] "GET /images/icons/ckan.ico HTTP/1.1" 404 12671 "https://data.ioos.us/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" "-"
lukecampbell commented 7 years ago

I just changed the rotation period from 10 days to 40 days. I'll add a monthly cron job to archive these to s3.

mwengren commented 7 years ago

I knew you guys used nginx, but the point was to make sure the logs follow the 'combined' format as specified by Apache HTTPD. Specifically, the referrer is important; I don't think I see it in the log examples above, only User-Agent info.

Can nginx log referrers?

lukecampbell commented 7 years ago

This is with nginx set to "combined"

123.123.123.123 - - [06/Mar/2017:19:16:21 +0000] "GET / HTTP/1.1" 200 4018 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
123.123.123.123 - - [06/Mar/2017:19:16:21 +0000] "GET /api/i18n/en HTTP/1.1" 200 2 "https://data.ioos.us/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
123.123.123.123 - - [06/Mar/2017:19:16:25 +0000] "GET /robots.txt HTTP/1.1" 200 137 "-" "ltx71 - (http://ltx71.com/)"
123.123.123.123 - - [06/Mar/2017:19:16:26 +0000] "GET /dataset HTTP/1.1" 200 8144 "https://data.ioos.us/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
123.123.123.123 - - [06/Mar/2017:19:16:26 +0000] "GET /api/i18n/en HTTP/1.1" 200 2 "https://data.ioos.us/dataset" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
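
For anyone reading these lines later: in the 'combined' layout the referrer is the second-to-last quoted field and the user agent is the last. A minimal Python sketch of pulling the fields apart is below. The regex and the `parse_combined` name are illustrative assumptions, not something from the nginx or Apache tooling, the user-agent string in the sample is shortened, and escaped quotes inside a field are not handled.

```python
import re

# Pattern for the NCSA/Apache "combined" layout, which the nginx
# "combined" format shown above also follows.
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of fields for one combined-format line, or None."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

# Sample based on the /dataset request above (user agent shortened).
sample = ('123.123.123.123 - - [06/Mar/2017:19:16:26 +0000] '
          '"GET /dataset HTTP/1.1" 200 8144 "https://data.ioos.us/" '
          '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)"')
fields = parse_combined(sample)
print(fields["referer"])  # https://data.ioos.us/
```
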
mwengren commented 7 years ago

Is there a way to set nginx to rotate logs monthly automatically? In Apache, you can do this with the cronolog program. That would make it easier for us to submit to our logfile analysis POC, who expects monthly logs.

mwengren commented 7 years ago

Luke,

I'm going to just paste the NOS' requirements in this ticket, so you can decide what the best way to go is. They can be posted automatically via ftp to NOS for analysis, but archiving them on S3 as well would be a good idea (although they may need to be restricted for privacy reasons, not sure).

NOS weblog site:

Web Log File Format

Web logs come in a number of formats. Each month (or at other times, depending on circumstances), the NOS Web Admin team (nos.webadmin@noaa.gov) collects Web log files from all NOS and NOS-partner sites, and processes them in a uniform manner.

In order to get the most value from these logs, servers should be configured to log the following W3C tokens (or their NCSA or IIS equivalents):

DATE
TIME: time can be either UTC or local, as long as it doesn't change
CS-URI: the URI portion of a request (the portion after the host name)
C-IP: client (visitor) IP address
BYTES: bytes sent to visitor.
SC-STATUS: status code (i.e., 200, 404, etc.)
CS(REFERER): site and page visitor clicked on to link to your page
CS(USER-AGENT): browser or robot name and version
CS-USERNAME: of use for sites requiring authentication
TIME-TAKEN: time required to send response
CS(HOST): name of host site requested by the visitor
CS-URI-QUERY: CGI (Common Gateway Interface) arguments
CS-METHOD: request method (GET, PUT, etc.)
CS-VERSION: HTTP protocol version number.

Since the Web statistics are taken directly from the logs, not logging Web browser type, for example, will result in blank reports on Web browser usage.
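
As a rough cross-check of the token list against the stock 'combined' format, here is a small Python sketch. The field labels on the right-hand side (`time`, `request`, and so on) are hypothetical names for the parts of a combined-format line, and the mapping reflects the standard Apache/nginx 'combined' definition rather than the production config, which may differ.

```python
# NOS tokens mapped to the combined-format field that carries them.
# None marks tokens that the stock combined format does not log.
NOS_TOKEN_SOURCES = {
    "DATE": "time",
    "TIME": "time",
    "CS-URI": "request",
    "C-IP": "ip",
    "BYTES": "bytes",
    "SC-STATUS": "status",
    "CS(REFERER)": "referer",
    "CS(USER-AGENT)": "agent",
    "CS-USERNAME": "user",
    "TIME-TAKEN": None,
    "CS(HOST)": None,
    "CS-URI-QUERY": "request",  # the query string is part of the request line
    "CS-METHOD": "request",
    "CS-VERSION": "request",
}

def missing_tokens(sources=NOS_TOKEN_SOURCES):
    """List the NOS tokens that have no combined-format source."""
    return sorted(token for token, field in sources.items() if field is None)

print(missing_tokens())  # ['CS(HOST)', 'TIME-TAKEN']
```

Under these assumptions only the host name and the time taken are absent, so reports that depend on those two tokens would come out blank unless the log format is extended.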

Log Naming Conventions

Virtually all Web servers save their logs in a file named access.log. In order to distinguish one server from another, and sort the logs in a consistent order, please use the following naming convention:

nameofsite+year+month

as in:

oceanservice.noaa.gov201001.gz
oceanservice.noaa.gov201002.gz
oceanservice.noaa.gov201003.gz

which would indicate log files from oceanservice.noaa.gov for January, February and March of 2010.
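
The naming convention above could be generated along these lines. This is only a sketch; the `nos_log_name` helper is a hypothetical name introduced for illustration.

```python
def nos_log_name(site, year, month):
    """Build a NOS-style name: site name, 4-digit year, 2-digit month, .gz."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{site}{year:04d}{month:02d}.gz"

print(nos_log_name("oceanservice.noaa.gov", 2010, 1))  # oceanservice.noaa.gov201001.gz
```

Zero-padding the month is what keeps the files sorting in calendar order.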

Compression Format

If you wish to submit log files as a zipped directory of individual days or weeks, that is fine. But ideally logs should be gzipped. Gzip does not support directories (it is a compression format, not an archiving format) but it does support decompression on the fly in memory, which means that terabytes of log files can be analyzed without using terabytes of drive space. Concatenating daily files into a file for an entire month, and gzipping the file, speeds the process of generating Web metrics.
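
A minimal Python sketch of the concatenate-then-gzip step, and of reading the result back on the fly, might look like the following. It assumes the daily files are plain (uncompressed) text and are passed in chronological order; both function names are hypothetical.

```python
import gzip
import shutil

def build_monthly_archive(daily_logs, archive_path):
    """Concatenate plain-text daily logs, in the order given, into one .gz file."""
    with gzip.open(archive_path, "wb") as out:
        for daily in daily_logs:
            with open(daily, "rb") as src:
                shutil.copyfileobj(src, out)

def count_lines(archive_path):
    """Stream the archive line by line without unpacking it to disk."""
    with gzip.open(archive_path, "rt", encoding="utf-8", errors="replace") as fh:
        return sum(1 for _ in fh)
```

`count_lines` only stands in for a real analysis pass; the point is that the archive is decompressed in memory as it is read.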

Submitting Logs

If you don't have a directory where your logs can be snatched automatically after the first of the month, please submit your logs here:

ftp://ftpnos.woc.noaa.gov/incoming/charters/

You can submit them using anonymous FTP or, with an account, via WebDAV.

lukecampbell commented 7 years ago

This will take some work, I'll throw it on our backlog.

lukecampbell commented 7 years ago

Just a heads-up early on, too: data.ioos.us is home to a lot more than just CKAN. It also hosts the compliance-checker and GliderDAC products.

benjwadams commented 7 years ago

Are these fields in a particular order? Apache by default logs the majority of these in the "combined" format, which can be specified in the configuration.

See: https://httpd.apache.org/docs/1.3/logs.html#combined

Edit: did see the earlier comment regarding combined logs, but got a bit confused because the application server for Catalog/CKAN is serving via Apache, and so can output logs. I haven't looked at the production config in a while, but IIRC, we're also reverse proxying several services through Nginx, which I presume is what is being discussed here.

benjwadams commented 7 years ago

@mwengren, with respect to rotating logs monthly, it is usually the responsibility of another application to inspect the timestamps, e.g. logrotate for pure *nix; Java applications have a number of logging frameworks which I believe can also take care of this for those cases. From that point, there are usually some compression and renaming facilities.

mwengren commented 7 years ago

@benjwadams The Apache 'combined' format is fine. I just wanted to list the full NOS page that describes their requirements/recommendations. We've sent combined format logs to them for years and it's sufficient.

The important parts are the monthly rotation and log file naming. We should call our logs by domain name 'data.ioos.us.201703.log.gz' etc, and then post them on S3 for archive.

Ideally, if you guys could script the upload step to ftp://ftpnos.woc.noaa.gov/incoming/charters/ on the first of the month, that would be great. For that to be automated, we need to make sure the logs are named distinctly.
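
A rough Python sketch of what a first-of-the-month job could do is below. It is an assumption-laden outline, not the deployed script: the helper names are hypothetical, the archive is assumed to already exist on disk, and the upload function takes an already-connected `ftplib.FTP`-style session so the connection details stay outside it.

```python
import datetime

def previous_month_stamp(today):
    """Return YYYYMM for the month before `today` (a datetime.date)."""
    last_day_prev = today.replace(day=1) - datetime.timedelta(days=1)
    return last_day_prev.strftime("%Y%m")

def archive_name(today, site="data.ioos.us"):
    """Name the archive for the month that just ended, e.g. data.ioos.us.201703.log.gz."""
    return f"{site}.{previous_month_stamp(today)}.log.gz"

def upload_archive(ftp, local_path, remote_dir, remote_name):
    """Send one file over an already-connected ftplib.FTP-like session."""
    ftp.cwd(remote_dir)
    with open(local_path, "rb") as fh:
        ftp.storbinary(f"STOR {remote_name}", fh)

# Example wiring (not executed here; needs network access):
#   from ftplib import FTP
#   with FTP("ftpnos.woc.noaa.gov") as ftp:
#       ftp.login()  # no arguments means anonymous login
#       name = archive_name(datetime.date.today())
#       upload_archive(ftp, name, "/incoming/charters", name)
```

Run on 1 April 2017, `archive_name` would yield `data.ioos.us.201703.log.gz`, matching the naming proposed above.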

benjwadams commented 7 years ago

I'm assuming these are primarily access logs for the time being. Are error logs desired as well?

mwengren commented 7 years ago

We're mostly interested in the access logs for archive and analysis. Not sure whether there's benefit to archive the error logs. You can probably skip those for now.

mwengren commented 7 years ago

I'm going to close this one as I think we've pretty much wrapped it up.