ioos / ioos-code-sprint

Information about IOOS Code Sprint activities.
https://ioos.github.io/ioos-code-sprint/
MIT License

[Project Proposal]: ERDDAP web logs analysis #34

Open callumrollo opened 7 months ago

callumrollo commented 7 months ago

Project Description

Develop a tool that reads in the web logs of an ERDDAP server to analyse how the server is being used. This would include:

Expected Outcomes

A Python-based tool that ERDDAP admins can use to quickly and easily establish how data from their server are being used.

Skills required

Expertise

Novice

Topic Lead(s)

@callumrollo

Relevant links

Work in progress here https://github.com/callumrollo/erddaplogs

callumrollo commented 7 months ago

@ocefpaf

abkfenris commented 7 months ago

Another possible take on this is to build something that can be run as a sidecar to a deployment and use the logs for metrics that can be consumed by a tool like Prometheus.

Axiom has done some work in https://github.com/axiom-data-science/erddap-metrics which uses the outward facing status page, but I think it could be extended (or more likely re-written with a framework that is still being developed) to work with logs as a sidecar.
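As a rough illustration of the sidecar idea, the sketch below turns per-dataset request counts into the Prometheus text exposition format using only the standard library. The metric name `erddap_dataset_requests_total`, the `dataset_id` label and the example dataset IDs are hypothetical, and this is not how erddap-metrics works; a real sidecar would more likely use an existing Prometheus client library and serve the text over HTTP.

```python
from collections import Counter


def render_prometheus(counts, name="erddap_dataset_requests_total"):
    """Render {dataset_id: count} as Prometheus text exposition format.

    The metric name and the dataset_id label are illustrative assumptions.
    """
    lines = [
        f"# HELP {name} Requests seen in the access log, per dataset.",
        f"# TYPE {name} counter",
    ]
    for dataset_id, n in sorted(counts.items()):
        # Escape backslashes and double quotes inside the label value.
        label = dataset_id.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{name}{{dataset_id="{label}"}} {n}')
    return "\n".join(lines) + "\n"


# Hypothetical counts, e.g. accumulated while tailing a web server log.
example_counts = Counter({"glider_01": 3, "mooring_a": 1})
metrics_text = render_prometheus(example_counts)
print(metrics_text)
```

Keeping the rendering separate from the log parsing means the same counts could feed either a scrape endpoint or a periodic report.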

callumrollo commented 6 months ago

Looks like a very promising Docker sidecar-like project here https://github.com/dfo-meds/erddaputil

Will check it out properly next week

jcermauwedu commented 6 months ago

@callumrollo We can also reach into the erddap code to improve/optimize some of the logging messages. I have been exercising the maven docker erddap development container as of late (a bit of a kerfuffle over netcdfAll.jar artifacts no longer being published, yada yada...). So, there is an opportunity to examine what could be added/optimized and passed back to erddap as a PR (if desired). Part of the build process exercises the current test suite, so we can ensure we don't break things too badly.

fgayanilo commented 6 months ago

Interested!

ChrisJohnNOAA commented 6 months ago

I'd love for the ERDDAP logs to be more useful for both admins and ERDDAP developers (me and others). One thought is there are a number of logging/analytics services that will likely have support for many of the requested features above (and more). Would it make sense to use one of those existing services?

From the ERDDAP developer point of view, there's a lot of data that could be useful to me, in particular error reports (so I can fix errors without relying on users/admins reporting them to me) and feature usage (to inform prioritization of work). Getting that data from a running ERDDAP to a central point where I can access it is the biggest hurdle.

jenseva commented 5 months ago

@ChrisJohnNOAA are you familiar with the log reporting/analysis that Roy and Dale (@rmendels and @dhr-sc) were testing with the SWFSC ERDDAP? Dale showed a demo where he had the output of the logs as an ERDDAP dataset. It was pretty brilliant. This sort of system may offer a solution for providing logs back to you via ERDDAP.

rmendels commented 5 months ago

@ChrisJohnNOAA @jenseva @dhr-sc All credit to Dale. It is entirely his work.

ChrisJohnNOAA commented 5 months ago

> @ChrisJohnNOAA are you familiar with the log reporting/analysis that Roy and Dale (@rmendels and @dhr-sc) were testing with the SWFSC ERDDAP? Dale showed a demo where he had the output of the logs as an ERDDAP dataset. It was pretty brilliant. This sort of system may offer a solution for providing logs back to you via ERDDAP.

I haven't seen that yet. I'll take a look and see if it can solve my needs.

That said I still want to support improvements to logs for admins.

7yl4r commented 5 months ago

As much as I love Python, I agree that digging into the ERDDAP source (Java) is the cleaner approach here. The logs could be improved, but it may also be possible to add this information to the status.html page (or similar).

rmendels commented 5 months ago

Please look at the output in the daily emails. A lot of the information may already be there. It summarizes in great detail all the requests.

7yl4r commented 5 months ago

I don't have that configured on my ERDDAP; I'll look into it. If the project is about parsing that output we should start by collecting together some example outputs from ERDDAP that this project will be using as input.

rmendels commented 5 months ago

Even if it is not mailed, I believe (but could be wrong) that the file is created in the log directory. Either way, I can provide an example of what we get, but would prefer not to do so publicly, so if you contact me by email I can send you a sample. It is really quite extensive, breaking access down into all sorts of categories.

callumrollo commented 5 months ago

As @rmendels posted, there are daily emails that get logged to erddapData/logs/emailLogyyyy-mm-dd.txt. They look like this:

https://pastebin.com/NpfSwWTe

I have tried parsing information from this, but it lacks some of the details that my manager has requested. I have been asked to crunch the numbers on a monthly basis to answer questions like:

  1. What datasets are being accessed most?
  2. Which variables are people requesting?
  3. Are they sub-setting the data? If so, how?
  4. What datatypes are they requesting?
  5. What is the geographical/temporal spread of visitors?

So far, I've found it easier to analyse the nginx/apache logs of incoming HTTP requests than to get this information from ERDDAP's status page, daily emails or logs. Looking at requests is nice because you have very granular, raw data, not the summarised/binned data that goes into e.g. the daily email report.
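To make the questions above concrete, here is a minimal sketch (not the erddaplogs implementation) of how the request path from a web server log can be split into dataset ID, file type, requested variables and constraints. It assumes ERDDAP is served under `/erddap` and only handles `tabledap` and `griddap` data requests; the function name `parse_erddap_request` and the example dataset IDs are hypothetical.

```python
from collections import Counter
from urllib.parse import unquote, urlsplit


def parse_erddap_request(path):
    """Split an ERDDAP data request path into its parts.

    Returns None for anything that is not a tabledap/griddap data request.
    Assumes the deployment is served under the /erddap prefix.
    """
    parts = urlsplit(path)
    segments = parts.path.strip("/").split("/")
    # Expected shape: erddap/<protocol>/<datasetID>.<fileType>
    if len(segments) != 3 or segments[0] != "erddap":
        return None
    protocol, resource = segments[1], segments[2]
    if protocol not in ("tabledap", "griddap") or "." not in resource:
        return None
    dataset_id, file_type = resource.rsplit(".", 1)

    raw = parts.query.split("&")
    first = unquote(raw[0])
    items = [item for item in first.split(",") if item]
    if protocol == "tabledap":
        # ?var1,var2&constraint1&constraint2
        variables = items
        constraints = [unquote(p) for p in raw[1:] if p]
    else:
        # griddap: ?var[(start):stride:(stop)][...],var2[...]
        variables = [item.split("[")[0] for item in items]
        constraints = [item for item in items if "[" in item]
    return {
        "protocol": protocol,
        "dataset_id": dataset_id,
        "file_type": file_type,
        "variables": variables,
        "constraints": constraints,
    }


# Hypothetical request paths, as they might appear in an access log.
paths = [
    "/erddap/tabledap/glider_01.csv?time,temperature&time%3E=2024-01-01",
    "/erddap/tabledap/glider_01.nc",
    "/erddap/griddap/sst_daily.nc?sst[(2024-01-01)][0:10][0:10]",
    "/erddap/index.html",
]
requests = [r for r in map(parse_erddap_request, paths) if r is not None]
datasets = Counter(r["dataset_id"] for r in requests)
file_types = Counter(r["file_type"] for r in requests)
subset_count = sum(1 for r in requests if r["constraints"])
```

Counting `dataset_id`, `file_type` and `variables` over a month of log lines addresses questions 1, 2 and 4; a non-empty `constraints` list indicates that the user subset the data (question 3). Question 5 would additionally need the client IP address and timestamp from each log line.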

MathewBiddle commented 5 months ago

Thank you for taking the time to propose this topic! From the Code Sprint topic survey, this has garnered a lot of interest.

Following the contributing guidelines on selecting a code sprint topic I have assigned this topic to @callumrollo. Unless indicated otherwise, the assignee will be responsible for identifying a plan for the code sprint topic, establishing a team, and taking the lead on executing said plan. The first action for the lead is to:

joe-smithe-glos commented 4 months ago

In response to a merged PR question: happy to be a participant @callumrollo !

MathewBiddle commented 4 months ago

Webpage https://ioos.github.io/ioos-code-sprint/2024/topics/03-ERDDAP_web_logs_analysis.html

joe-smithe-glos commented 4 months ago

Noting here that I volunteer to be on-the-ground scrum master as a last resort, unless someone else takes the lead.

jcermauwedu commented 4 months ago

Also able to help co-lead on this topic as needed.

apkrelling commented 4 months ago

I would also like to contribute to this project, if possible. It would definitely be nice to get some example outputs from ERDDAP that this project will be using as input. I'll email @rmendels about it.

jcermauwedu commented 4 months ago

@apkrelling That sounds like a good place to start: an existing ERDDAP log to parse with the above-mentioned tools, to see if we can come up with answers to these questions. If the log does not currently provide those details, we can work out how to add them so that this can be done. Three things to also examine are: not doing harm to the test suite; determining whether these two tools can be used to conjure up the requested information; and deciding whether to add to them or start a different tool altogether. @callumrollo What might be useful is a good example of an nginx/apache log and a script that you use to answer some of those questions. I will also look for the logs on our server. We do not have a lot of traffic, but I have an Apache/ERDDAP installation for which log information might be available.

I can at least set up a text-based Docker ERDDAP development container for hacking. At a minimum, participants will want to install Docker Desktop to begin climbing through the Java code. Interactive containers for ERDDAP and Ubuntu can be found here. Build scripts. The Vim editor is included. Use the apt system to install your favorite.

Items to review:

ChrisJohnNOAA commented 4 months ago

Just a note that for running a local ERDDAP you can use Jetty, which doesn't require installing anything manually (after Maven).

https://github.com/ERDDAP/erddap/tree/main/development/jetty

mwengren commented 4 months ago

@ChrisJohnNOAA I am not sure how Jetty does web server logging, but I'd be interested to know if it can produce logs compatible with Apache or nginx. I will try to look around tomorrow.

ChrisJohnNOAA commented 4 months ago

It looks like Jetty request logging needs to be turned on:

java -jar $JETTY_HOME/start.jar --add-modules=http,requestlog

The log lines look like this (from the Jetty documentation, not a real example):

192.168.0.100 - - [31/Jan/2020:20:30:40 +0000] "GET / HTTP/1.1" 200 6789 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"

This looks compatible with nginx, but I haven't tried parsing one yet.

aalloilla commented 4 months ago

I tested with that Jetty example log line you provided, and it does seem to work with the nginx log parser.
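For reference, a minimal sketch of parsing that kind of line with the standard library. The regular expression targets the common "combined" access log layout that the Jetty example above follows; it is an illustration, not the parser used in erddaplogs, and the names `COMBINED` and `parse_line` are hypothetical.

```python
import re
from datetime import datetime

# Combined log format: ip ident user [time] "request" status size "referer" "user agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was logged.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    # Month abbreviations are parsed using the current locale (English assumed).
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# The example line from the Jetty documentation quoted above.
line = (
    '192.168.0.100 - - [31/Jan/2020:20:30:40 +0000] "GET / HTTP/1.1" 200 6789 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"'
)
record = parse_line(line)
print(record["ip"], record["time"].isoformat(), record["path"], record["status"])
```

Lines that do not match (for example a custom log format with extra fields) return None, so they can be counted and inspected rather than silently dropped.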