hbz / lobid

Linking Open Bibliographic Data
https://lobid.org/
Eclipse Public License 2.0

Fix stats #303

Closed acka47 closed 7 years ago

acka47 commented 8 years ago

Since the beginning of the year we haven't had web access stats anymore. They would come in handy, especially for nwbib.de.

acka47 commented 8 years ago

I just wrote in an email that I could provide some statistics for a forthcoming article on NWBib in Prolibris. The submission deadline is 15 August. It would be great to have the stats fixed within the next two weeks. (I will be on holidays in the first half of August.)

dr0i commented 7 years ago

Cause: when changing the NFS location used to store the logs (see hbz/nwbib#93), I forgot to update the location in /etc/logrotate.d/apache2. As a result, the file access.log wasn't rotated, and the statistics script couldn't work on the expected (rotated) files. Solution:

Will have a look in August to see whether rotation and creation of statistics are happening as expected.

dr0i commented 7 years ago

@acka47 @fsteeg could it be that some URL paths were changed between 2016-02 and 2016-04? I ask because the facet usage decreased extremely, from > 100k hits to < 1000.

These are the grep patterns applied to the URLs stored in access.log:

grep 'GET /nwbib' > nwbib/nwbib-access_log-${DATE}.html
grep "t=Raumsystematik" > nwbib/issue_93/${DATE}_nwbibspatial.html
grep "t=Sachsystematik" > nwbib/issue_93/${DATE}_nwbibsubject.html
grep "org/nwbib/advanced" > nwbib/issue_93/${DATE}_advanced.html
grep "GET /nwbib/topics" > nwbib/issue_93/${DATE}_topics.html
grep "GET /nwbib/facets" > nwbib/issue_93/${DATE}_nwbibfacets.html
grep "GET /nwbib/search?location=&" > nwbib/issue_93/${DATE}_simple.html
grep "GET /nwbib/search?location=[A-Z]" > nwbib/issue_93/${DATE}_map.html
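To make it easier to see what each of these patterns selects, here is a minimal sketch that runs a few of them against a handful of log lines. The sample lines, the temporary file and the `count_hits` helper are hypothetical and only for illustration; the real script writes the matches to per-category files rather than counting them.

```shell
# Hypothetical sample of access-log lines (not taken from the real logs).
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - - [01/Apr/2016:10:00:00 +0200] "GET /nwbib/topics?q=koeln HTTP/1.1" 200 512 "-" "Mozilla/5.0"
1.2.3.4 - - [01/Apr/2016:10:00:05 +0200] "GET /nwbib/facets?q=koeln HTTP/1.1" 200 512 "-" "Mozilla/5.0"
5.6.7.8 - - [01/Apr/2016:10:01:00 +0200] "GET /nwbib/search?location=&q=koeln HTTP/1.1" 200 512 "-" "Mozilla/5.0"
5.6.7.8 - - [01/Apr/2016:10:02:00 +0200] "GET /nwbib/search?location=Koeln HTTP/1.1" 200 512 "-" "Mozilla/5.0"
9.9.9.9 - - [01/Apr/2016:10:03:00 +0200] "GET /resource/123 HTTP/1.1" 200 512 "-" "Mozilla/5.0"
EOF

# count_hits PATTERN: number of sample lines matching a plain (basic regex) grep pattern.
# In a basic regex "?" and "&" are literal characters, so the patterns can be used as they are.
count_hits() {
  grep -c -- "$1" "$log" || true   # grep exits non-zero on zero matches; the count is still printed
}

all=$(count_hits 'GET /nwbib')
topics=$(count_hits 'GET /nwbib/topics')
facets=$(count_hits 'GET /nwbib/facets')
simple=$(count_hits 'GET /nwbib/search?location=&')
map=$(count_hits 'GET /nwbib/search?location=[A-Z]')

echo "all=$all topics=$topics facets=$facets simple=$simple map=$map"
rm -f "$log"
```

Every pattern here depends on the /nwbib path segment, so a request whose path does not start with /nwbib falls into none of the categories.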

fsteeg commented 7 years ago

I don't know about URL changes, but we did some performance tweaking in that time, so maybe the large usage numbers were actually internal calls that no longer happen after the optimization.

acka47 commented 7 years ago

Looks good. +1

acka47 commented 7 years ago

Closing

acka47 commented 7 years ago

Reopening because nwbib.de hits aren't covered.

fsteeg commented 7 years ago

After some more discussion with @acka47 we found the probable cause (and I now understand @dr0i's question in https://github.com/hbz/lobid/issues/303#issuecomment-232694519). When we moved the domain to nwbib.de, all URL paths changed, since we dropped the nwbib segment: https://github.com/hbz/nwbib/commit/908d8898e16f4a8f905cf4afd7fb2739d4a646b3. Current stats probably only cover access via the lobid.org/nwbib redirect.

dr0i commented 7 years ago

Ok, changed the regex accordingly, e.g. bzgrep "t=Raumsystematik" $SRC | grep -E "GET /nwbib/search|GET /search.*/nwbib.de" .... Now, e.g., the facet hits are comparable.
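A small sketch of what the changed pattern does, assuming a few made-up log lines: the query strings, client addresses and referrers are hypothetical, and plain grep stands in for bzgrep so that no compressed file is needed.

```shell
# Hypothetical sample lines covering the old and the new URL style.
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - - [01/Jun/2016:12:00:00 +0200] "GET /nwbib/search?t=Raumsystematik HTTP/1.1" 200 100 "-" "Mozilla/5.0"
1.2.3.4 - - [01/Jun/2016:12:00:10 +0200] "GET /search?t=Raumsystematik HTTP/1.1" 200 100 "http://nwbib.de/search?q=x" "Mozilla/5.0"
1.2.3.4 - - [01/Jun/2016:12:00:20 +0200] "GET /search?t=Raumsystematik HTTP/1.1" 200 100 "http://lobid.org/resources" "Mozilla/5.0"
1.2.3.4 - - [01/Jun/2016:12:00:30 +0200] "GET /nwbib/search?t=Sachsystematik HTTP/1.1" 200 100 "-" "Mozilla/5.0"
EOF

# Old pattern: only requests with the /nwbib path segment are counted.
old_only=$(grep "t=Raumsystematik" "$log" | grep -c "GET /nwbib/search")

# Changed pattern: also counts /search requests whose line mentions /nwbib.de
# (in these samples that is the referrer).
both=$(grep "t=Raumsystematik" "$log" | grep -c -E "GET /nwbib/search|GET /search.*/nwbib.de")

echo "old_only=$old_only both=$both"
rm -f "$log"
```

In this sample the third line is a new-style request without an nwbib.de referrer, and it is not counted by either pattern.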

@acka47 please have a look again.

One note concerning the location query (aka the *_map statistic, see e.g. http://stats.lobid.org/web-access/nwbib/issue_93/2016-06-01_map.html): the log entries below seem to be the result of a single query. Command used: bzgrep -E "nwbib.de/search\?location=[A-Z]" /files/weywot4/logs_apache_emphytos/access_log-20160701.bz2 | grep -E "GET /nwbib/search?location=[A-Z]|GET /search.*nwbib.de/search\?location=[A-Z]"

[01/Jun/2016:12:07:01 +0200] "GET /search?location=51.29999994300306%2C6.8499998934566975 HTTP/1.1" 200 12660 "http://nwbib.de/search?location=D%C3%BCsseldorf|u1hu47+...
[01/Jun/2016:12:07:53 +0200] "GET /search?location=D%C3%BCsseldorf%7Cu1hu47+...
[01/Jun/2016:12:08:14 +0200] "GET /search?location=D%C3%BCsseldorf%7Cu1hu47+...
[01/Jun/2016:12:08:43 +0200] "GET /search?location=D%C3%BCsseldorf%7Cu1hu47+...
[01/Jun/2016:12:08:56 +0200] "GET /search?subject=http%3A%2F%2Fd-nb.info%2Fgnd%2F118548018&location=D%C3%BCsseldorf%7Cu1hu47+...

However, they are counted as 5 hits (although the three in the middle may be the result of 3 clicks, since they are identical, which is not a common pattern compared to other log entries). To examine this, you can also check the Visitors/Hits ratio under "Top requests (URLs)" in the statistics file mentioned above: e.g. a query with one visitor but 5 hits indicates duplicated hits (which may also be the result of clicking multiple times on the map ...).

So, again, use these statistics wisely. They are mostly good for relative interpretation (e.g. what the trend is) and less suited to absolute statements (like "100 hits per month using the location query"), because for the latter you really have to know the data, determine what exactly is going on, and define exactly what is calculated.
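The hits-versus-visitors comparison can also be approximated directly on the log, independent of the generated report. The following is a rough sketch on hypothetical sample lines; it treats every distinct client address as one visitor, which is an assumption for illustration and cruder than what a log analyzer reports.

```shell
# Hypothetical sample: one client repeating a request, plus a second client.
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 - - [01/Jun/2016:12:07:01 +0200] "GET /search?location=A HTTP/1.1" 200 100 "-" "Mozilla/5.0"
1.2.3.4 - - [01/Jun/2016:12:07:53 +0200] "GET /search?location=B HTTP/1.1" 200 100 "-" "Mozilla/5.0"
1.2.3.4 - - [01/Jun/2016:12:08:14 +0200] "GET /search?location=B HTTP/1.1" 200 100 "-" "Mozilla/5.0"
1.2.3.4 - - [01/Jun/2016:12:08:43 +0200] "GET /search?location=B HTTP/1.1" 200 100 "-" "Mozilla/5.0"
5.6.7.8 - - [01/Jun/2016:12:09:00 +0200] "GET /search?location=A HTTP/1.1" 200 100 "-" "Mozilla/5.0"
EOF

# With whitespace-separated fields, $1 is the client address and $7 the request URL.
# hits[url] counts every line; visitors[url] counts each (url, client) pair once.
report=$(awk '
  {
    hits[$7]++
    if (!(($7, $1) in seen)) { seen[$7, $1] = 1; visitors[$7]++ }
  }
  END {
    for (url in hits) printf "%s hits=%d visitors=%d\n", url, hits[url], visitors[url]
  }
' "$log" | sort)

echo "$report"
rm -f "$log"
```

A URL with many hits but a single visitor is a candidate for the repeated-click effect described above.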

dr0i commented 7 years ago

Also changed the time when logs are rotated: no longer on the first day of the month at 14:00, but at 00:00.

acka47 commented 7 years ago

The total number of hits looks good now. But something still isn't ok. I am interested in the referrer stats. While there were >5k hits via google.de until April (see e.g. http://stats.lobid.org/web-access/nwbib/nwbib-access_log-2016-03-01.html#referring_sites), in July there were fewer than a hundred, see http://stats.lobid.org/web-access/nwbib/nwbib-access_log-2016-08-01.html#referring_sites. I guess only hits on lobid.org/nwbib are counted there...

dr0i commented 7 years ago

You are right! In fact, the vhost wasn't configured to be logged. I modified /etc/sysconfig/apache2 to use APACHE_ACCESS_LOG="/files/weywot4/logs_apache_emphytos/access_log vhost_pchbz_combined", where the format is defined as LogFormat "%h %v %u %t \"%r\" %>s %b \ \"%{Referer}i\" \"%{User-Agent}i\"" vhost_pchbz_combined. This is a slight modification that keeps the syntax compatible, so the existing goaccess scripts can be reused: the old %l (the remote logname, which was never used and appears in the log files as a dash (-)) is replaced with the vhost. Now a grep like grep -E "GET /nwbib| nwbib.de " makes sure to get both the nwbib hits in the old URL-path style and the new ones using the nwbib.de vhost. (Sorry for the confusion. The other stats are probably close to what was asked for, since they used the referrer to decide whether something is an nwbib.de hit, which obviously doesn't work when the Google referrers are to be computed as well.)

The referrer workaround is now deconfigured. Since vhost logging was only just activated, this month may show fewer hits for nwbib than expected. If this is the case, the workaround can be reconfigured (see the git logs) to recalculate this August.

Assigning this to myself for review at the beginning of next month.
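As an illustration of the vhost-based selection, here is a short sketch on made-up lines in the new log format, with the vhost in the second field. The addresses, paths and referrers are hypothetical, as is the assumption that non-nwbib requests are logged under a lobid.org vhost.

```shell
# Hypothetical sample lines: %h %v %u %t "%r" %>s %b "Referer" "User-Agent"
log=$(mktemp)
cat > "$log" <<'EOF'
1.2.3.4 lobid.org - [01/Sep/2016:09:00:00 +0200] "GET /nwbib/search?q=x HTTP/1.1" 200 100 "-" "Mozilla/5.0"
1.2.3.4 nwbib.de - [01/Sep/2016:09:00:10 +0200] "GET /search?q=x HTTP/1.1" 200 100 "https://www.google.de/" "Mozilla/5.0"
1.2.3.4 lobid.org - [01/Sep/2016:09:00:20 +0200] "GET /resource/1 HTTP/1.1" 200 100 "-" "Mozilla/5.0"
5.6.7.8 nwbib.de - [01/Sep/2016:09:00:30 +0200] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0"
EOF

# Same pattern as in the comment above: old URL-path style OR the nwbib.de vhost field.
nwbib_hits=$(grep -c -E "GET /nwbib| nwbib.de " "$log")

# Among the selected hits, count those referred by google.de. The referrer no longer
# has to mention nwbib.de for the hit to be selected.
google_hits=$(grep -E "GET /nwbib| nwbib.de " "$log" | grep -c "google.de")

echo "nwbib_hits=$nwbib_hits google_hits=$google_hits"
rm -f "$log"
```

Because the selection relies on the vhost field rather than on the referrer, the referrer stays free to show where a visit actually came from.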

dr0i commented 7 years ago

logrotate didn't work because one file didn't have the right file permissions. The logrotate script therefore exited, and so the stats script generated no stats on the website. I fixed the file permissions, regenerated the log files and created the stats for September.

acka47 commented 7 years ago

Wow, referrers from google.de doubled compared to August (for NWBib).

dr0i commented 7 years ago

Checked that logrotate works now. If you discover problems or need some feature, note that we now have a new repo where the configs and scripts are hosted: hbz/lobid-webserver.

dr0i commented 7 years ago

Closing.