Closed NateWr closed 3 years ago
@ctgraham if you have any thoughts on this, that'd be much appreciated!
This may be a bit broad for this ticket, but I think it is strongly relevant....
James and I chatted with Paul Needham of IRUS-UK last month about some of the considerations in log processing. This includes:
There might be opportunities to collaborate with IRUS-UK to architect a common library which can process logs into statistics (or perhaps just perform the bot-exclusions). It also made me wonder why there wouldn't be a shared library for representing statistics on top of COUNTER-SUSHI R5 (or why this wouldn't be outsourced to an Electronic Resource Management system), but that didn't seem to get much uptake in our conversation.
I can definitely check in on what the interpretation of "double clicks" of versioned items is from a COUNTER perspective.
To my surprise, I think we are ok-ish on the versioning front without any changes. However, there is one possible issue I uncovered while looking into it.
From my investigations, here's how the URLs are parsed:
article/view is considered ASSOC_TYPE_SUBMISSION.
article/download is considered ASSOC_TYPE_SUBMISSION_FILE.
This works for PDF/HTML galleys because even though they load at article/view/<submission-id>/<galley-id>, the PDF or HTML file is loaded in an iframe at article/download/<submission-id>/<galley-id>.
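A minimal sketch of that mapping, written in Python for illustration only (the actual loader is PHP). The function name `classify_url` and the regular expression are hypothetical; only the two constant names come from the description above. It shows why a galley page at article/view/... ends up classified as a submission hit.

```python
import re
from typing import Optional

# Labels mirror the constant names mentioned above; numeric values are left out on purpose.
OP_TO_ASSOC_TYPE = {
    "view": "ASSOC_TYPE_SUBMISSION",
    "download": "ASSOC_TYPE_SUBMISSION_FILE",
}

# Hypothetical pattern: only the article page and its operation are inspected.
ARTICLE_OP = re.compile(r"/article/(view|download)(?:/|$)")


def classify_url(url: str) -> Optional[str]:
    """Return the assoc type label for an article URL, or None for other URLs."""
    match = ARTICLE_OP.search(url)
    return OP_TO_ASSOC_TYPE[match.group(1)] if match else None


# A galley view page is classified like an abstract page, because only the
# operation ("view") is looked at, not the trailing galley id.
print(classify_url("http://localhost:8000/publicknowledge/article/view/5/1"))
print(classify_url("http://localhost:8000/publicknowledge/article/download/5/1/15"))
```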
Our versioned URLs look like article/view/<submission-id>/version/<publication-id>[/<galley-id>], but the underlying PDF or HTML file is still loaded in article/download/<submission-id>/<galley-id>. So visits to the landing page and galleys of old versions are counted correctly.
However, this means that when someone directly visits the page of a PDF or HTML galley, without going through the article landing page, they record two entries in the logs: article/view/<submission-id>/<galley-id> and article/download/<submission-id>/<galley-id>.
UsageStatsLoader::_getUrlMatches() returns both lines as a match, one for the submission and one for the file (what we call in the backend stats area an "abstract" hit and a "files" hit). However, once the file is parsed, it appears in the usage_stats_temporary_records table as a single hit to the submission (not the file):
mysql> select * from usage_stats_temporary_records where day="20191219";
+----------+------------+----------+------------+--------+------------+--------+------+---------------------------+-----------+
| assoc_id | assoc_type | day | entry_time | metric | country_id | region | city | load_id | file_type |
+----------+------------+----------+------------+--------+------------+--------+------+---------------------------+-----------+
| 5 | 1048585 | 20191219 | 1576762274 | 1 | NULL | NULL | NULL | usage_events_20191219.log | 0 |
+----------+------------+----------+------------+--------+------------+--------+------+---------------------------+-----------+
This is not a new issue with versioning, but I thought I would double-check with you @asmecher and @ctgraham to see if this is intended behaviour. It seems to me like this should count as a single view of the file, not the submission.
To test this I constructed the following fake log to parse, which includes only two URL hits: one to the URL to view a PDF (article/view/5/1) and one to the URL that actually loads the PDF (article/download/5/1/15). I believe that this simulates what a single request to view a PDF would generate in a real usage log.
127.0.0.1 administrative 1 "2019-12-19 14:32:08" http://localhost:8000/publicknowledge/article/view/5/1 200 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"
127.0.0.1 administrative 1 "2019-12-19 14:32:08" http://localhost:8000/publicknowledge/article/download/5/1/15 200 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"
(To use this for your own testing you would need to update the submission, galley and file ids to ones that exist in your system.)
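For anyone who wants to inspect such a log outside OJS, here is a rough Python sketch that splits one of the lines above into fields. The `LogEntry` type and `parse_line` function are hypothetical, and the second and third fields ("administrative" and "1") are deliberately not interpreted because their meaning is not stated here. It assumes the quoted fields contain no stray quote characters.

```python
import shlex
from typing import NamedTuple


class LogEntry(NamedTuple):
    ip: str
    timestamp: str
    url: str
    status: int
    user_agent: str


def parse_line(line: str) -> LogEntry:
    """Split one usage log line; quoted timestamp and user agent stay single fields."""
    fields = shlex.split(line)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    # The second and third fields are skipped: their meaning is not described above.
    ip, _second, _third, timestamp, url, status, user_agent = fields
    return LogEntry(ip, timestamp, url, int(status), user_agent)


sample = (
    '127.0.0.1 administrative 1 "2019-12-19 14:32:08" '
    "http://localhost:8000/publicknowledge/article/download/5/1/15 200 "
    '"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"'
)
entry = parse_line(sample)
print(entry.url, entry.status)
```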
I think this is probably tied up with COUNTER, which I'm not very familiar with -- it has specifications for things like debouncing. And I'm not sure whether COUNTER business rules are applied when metrics are recorded, or when they're processed. @ctgraham, are you familiar with this aspect? If not, maybe I could follow up with Bozana.
I don't think COUNTER cares about abstract views, nor has it in my memory. I reviewed COUNTER R5, R4, and R3 and each focuses on fulltext downloads/views. Back in R3 there was a distinction between fulltext HTML, fulltext PDF, and fulltext "other", but that distinction goes away in later releases.
If a user views fulltext via HTML and then fulltext via PDF, or downloads the fulltext via the same medium twice, this only counts as one view.
If I were more helpful I would check to make sure the COUNTER reports are using only ASSOC_TYPE_SUBMISSION_FILE for calculations (I think this is the case), check for legacy reports in OJS which describe abstract views separately from fulltext views (hopefully not), and would verify that the inline display of HTML fulltext is registered correctly (I think I recall fancy jiggering in a plugin for this). But first I want to get that Crossref testing out the door, and even that simple task is eluding me right now.
Thanks @ctgraham. I think this is not an urgent question to address for 3.2, because it is not something newly introduced. I'll defer it from 3.2 for now but would like to keep this conversation open.
I'm not sure if I understand you correctly, but perhaps there is some divergence here between how OJS keeps statistics and how you describe COUNTER expectations.
First, OJS does make a distinction between abstract and file views. It tracks both separately in the metrics type by the assoc_type and we display them independently to editors in the new article stats UI.
Second, from my brief test, it appears that OJS is treating a single visit to the view PDF page as a single visit to the article abstract page. If COUNTER ignores abstract views entirely, usage may be under-reported. (This was an isolated test. More investigation would be necessary to see what happens in different scenarios and how the rows in a metrics table get compiled into COUNTER reports.)
Maybe I can take a further look at the problems described here, i.e. what needs to be solved for the next release... First I will test a few things to get the current status. I will do it incrementally because I can only work a little bit every day, will write the (partial) results here, and then at the end come back to you all. Thus, please ignore the text until my testing is finished.
OJS:
1. Galley view and download:
"However, this means that when someone directly visits the page of a PDF or HTML galley, without going through the article landing page, they record two entries in the logs: article/view/<submission-id>/<galley-id> and article/download/<submission-id>/<galley-id>."
When visiting http://.../index.php/publicknowledge/article/view/1/1 I only see one log entry: ... http://.../index.php/publicknowledge/article/download/1/1/2 200 ...
i.e. the galley landing page 'http://.../index.php/publicknowledge/article/view/1/1' is never logged.
This conforms to the code in the UsageEventPlugin: https://github.com/pkp/ojs/blob/master/plugins/generic/usageEvent/UsageEventPlugin.inc.php#L57-L71.
However, a journal might use Apache logs, where the galley view page is logged, so I will also check how the log file is processed. It should actually follow the same logic, but... it does not:
Here https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L418-L427 the logic from https://github.com/pkp/ojs/blob/master/plugins/generic/usageEvent/UsageEventPlugin.inc.php#L67-L71 is not considered. So yes, the galley view page will count as submission abstract page :-( This should be corrected so that the galley view page is not counted.
I do not understand why Nate's example from above would count only as an abstract view. Looking at the code, and given the bug above, it should count both (abstract view and file download). I will test that.
2. Abstract pages and downloads: We log both, i.e. journal landing page, issue landing page, article landing/abstract page, as well as file downloads (article and issue galleys). The following file types are considered: pdf, html, doc, and other. How are they considered in the current report tools?
3. Considering versioning: The 'version' URL paths do not seem to be considered. It works for file downloads because the same download hook is used, but an article abstract page of an old version (http://.../index.php/publicknowledge/article/view/2/version/3) is not logged and thus not counted. This should be corrected. Currently we do not differentiate the statistics for different article versions. If needed, the file downloads can be aggregated, because we probably know to which version/publication a file ID (that we store) belongs. This is not possible for article abstract pages -- once the problem above is corrected there will be just a total number for article abstracts generally, i.e. without knowing for which version. Can this stay so or should we maybe add a new column to the DB table metrics that will store e.g. the publication ID? (The question in No. 5b -- whether a version is a unique item -- is only relevant when implementing Release 5.)
4. Processing rules for COUNTER Release 4 reports, s. https://www.projectcounter.org/code-of-practice-sections/data-processing/
a) Double click filtering (s. section 7.2):
"When two requests are made for one and the same article within the above time limits (10 seconds for HTML, 30 seconds for PDF), the first request should be removed and the second retained."
This is also how we implement it. It is implemented here: https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L194-L224.
b) Protocol for internet robots and crawlers
COUNTER maintains the current list of internet robots and crawlers at https://github.com/atmire/COUNTER-Robots.
We use it as a module in lib/pkp/lib/counterBots, assign the file to the variable COUNTER_USER_AGENTS_FILE (https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L23) and implement the function isUserAgentBot in https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L100. The function is then used when the log files are processed (https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L170).
We should define a strategy for when we update to the most recent version of the list.
Thus, everything seems to be OK for now, for the COUNTER Release 4 that we currently support.
5. COUNTER Release 5:
It seems that Release 5, with lots of changes, is out there. Here is a guide for journals: https://www.projectcounter.org/wp-content/uploads/2020/08/Module_2_Journal_Usage_20200811.pdf.
Thus, maybe we should only fix the problems Nate encountered here in this release and then implement the support for the new R5.
5.1. Processing rules for COUNTER 5 reports, s. https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/:
a) Double click filtering (s. section 7.2):
This is implemented here: https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L194-L224. We differentiate between the access of HTML, PDF and other. This no longer seems to be needed. Could we change it to consider 30 seconds for any link, i.e. any file?
b) Unique Items (s. section 7.3):
In our case an Item is an article. The matching report is AR1. And the rule is: "If multiple transactions qualifying for the Metric_Type in question represent the same item and occur in the same user-sessions, only one unique activity MUST be counted for that item." Where a user-session seems to be defined as an hour, as far as I understand it?
This is what @ctgraham wrote above: "If a user views fulltext via HTML and then fulltext via PDF, or downloads the fulltext via the same medium twice, this only counts as one view."
We will need to implement it. At the moment I do not know how best to do it -- I believe that the journal managers and editors would like to have separate counts for each file, to see how they are used.
The question of whether the article versions belong to the same Item is still open. Due to the way we represent them internally I would say they do belong to the same Item.
c) Unique Titles (s. section 7.4):
In the case of a journal Title = a journal and the report = Title Master Report. Similar to the rule for the unique item above, the rule here is: "If multiple transactions qualifying for the Metric_Type in question represent the same title and occur in the same user-session only one unique activity MUST be counted for that title.". Where the user-session seems to be defined for an hour? I.e. here, if a user accesses one article and then another in the same session, it would only count once.
This rule i.e. report seems not to be used for single journals -- introduced mostly for books.
d) Internet Robots and Crawlers (s. section 7.8):
Same as for Release 4.
COUNTER maintains the current list of internet robots and crawlers at https://github.com/atmire/COUNTER-Robots.
We use it as a module in lib/pkp/lib/counterBots, assign the file to the variable COUNTER_USER_AGENTS_FILE (https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L23) and implement the function isUserAgentBot in https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L100. The function is then used when the log files are processed (https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L170).
We should define a strategy for when we update to the most recent version of the list.
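The bot exclusion described for both releases boils down to matching the user agent against a list of patterns. Below is a rough Python sketch of that idea, not the code in Core.inc.php. It assumes the list is available as plain text with one regular expression per line and that lines starting with "#" are comments; the three sample patterns are made up for the example and are not taken from the COUNTER list.

```python
import re
from typing import Iterable, List


def compile_patterns(lines: Iterable[str]) -> List[re.Pattern]:
    """Compile one case-insensitive regex per non-empty, non-comment line."""
    patterns = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):  # comment handling is an assumption
            patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def is_user_agent_bot(user_agent: str, patterns: List[re.Pattern]) -> bool:
    """True if any pattern matches anywhere in the user agent string."""
    return any(pattern.search(user_agent) for pattern in patterns)


# Hypothetical stand-in for the real list file.
SAMPLE_LIST = ["# illustrative patterns only", "bot", "crawl", "spider"]
PATTERNS = compile_patterns(SAMPLE_LIST)

firefox = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"
print(is_user_agent_bot("ExampleBot/1.0", PATTERNS))
print(is_user_agent_bot(firefox, PATTERNS))
```

Because the patterns are compiled once and reused, refreshing the list only means recompiling after the file changes, which is independent of how often the list itself is updated.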
The other points @ctgraham mentioned above in https://github.com/pkp/pkp-lib/issues/4904#issuecomment-509689881 are surely worth considering. I would first need to know more about them. If there is something we could consider 'quickly' with these changes, please tell me. Else, I would leave them for some other time.
Hi @NateWr and @ctgraham, there are some things to correct here: the problems that Nate found (No. 1 and 3 above) should be fixed. The changes in the log processing (No. 5b and 5c above) -- above all the immediate access to HTML and then PDF, or to different article versions -- are related to the new COUNTER Release 5. So we will need to implement them when we implement the support for R5. I could not find anything specific about different article versions (in R5) -- whether they are considered a unique item. I could try to write to the email address on the COUNTER website, if you have no better idea. Or should/can we maybe just define them as one item for now? Any other comments/thoughts are of course very welcome. Thanks a lot!!!
Hi @NateWr and @ctgraham, reading the document https://www.projectcounter.org/wp-content/uploads/2020/08/Module_2_Journal_Usage_20200811.pdf: it seems that the unique item and title (from 5b and 5c above) were first introduced with COUNTER Release 5. Release 4, which we currently support, I think, still counts HTML and PDF separately. Also, R5 considers abstract views as Investigations. Also, SUSHI support is mandatory for compliance with COUNTER Release 5 (s. https://www.projectcounter.org/wp-content/uploads/2019/05/Release_5_TechNotes_PDFX_20190509-Revised.pdf).
Thus, I would maybe suggest that we only fix the problems Nate encountered for this issue (eventually also for this OJS/OMP/OPS release?) and then think and re-implement the statistics so that we support that release 5. What do you think?
Can one still use the R4 reports?
Note about the unique title from 5c above: The Journal Report changes -- e.g. the JR1 seems to exclude the OA-Gold -- it seems we should then provide Title Master Report (and eventually TR_J3, by Access_Type?)... s. https://www.projectcounter.org/wp-content/uploads/2019/05/Release_5_Providers_20190509-Revised.pdf. Those journal reports seem not to have that unique title numbers from 5c above? -- I only see them in the document for librarians (https://www.projectcounter.org/wp-content/uploads/2019/05/Release_5_Librarians_20190509-Revised-Edition.pdf) and it seems they are above all meant to be for (some) books. So I think we could ignore that 5c above for journals? (some examples of reports https://www.projectcounter.org/appendix-i-sample-counter-repor/)
Hmmm... This Release 5 seems to have been out there for almost 3 years now, so I suppose we should support it very soon... Do we know journals that use the COUNTER reports? It was nice to be able to say that our (internal) statistics are COUNTER compatible, so this will be good to keep -- although possibly not many journals are actually using the COUNTER reports... Having a look at the Release 5 reviews (https://www.projectcounter.org/counter-release-5-an-independent-review/) it seems it is still not implemented by most of the publishers/members. And it seems that OA is a little bit 'neglected' in this release.
COUNTER R4 is still fairly widely used around Libraries, though it ought to be phased out in favor of R5; Pitt ULS uses COUNTER in communicating usage statistics to Plum Analytics.
@shanu17 from Pitt ULS is working on SUSHI/COUNTER R5 for PKP. I owe him better definitions of each report so that he can map the report requirements against our Statistics Service / MetricsDAO. There remain some gaps in our internal statistics harvesting for non-OA usages, e.g. mapping access against institutional subscription and counting access denied requests.
Hi @ctgraham and @shanu17, that is great to hear -- that you are working on the support for R5! :-))) I would then only fix the processing problems @NateWr encountered here, for R4, now. If I can somehow help for the support of R5, just let me know. (Maybe we can open a new issue for that and ... )
The coming PRs consider the following issues:
1) URLs of versioned article abstract pages were not logged. Now they are logged, e.g. .../article/view/1/version/1.
2) If there is a log entry of a galley view page, it will not be counted. The log file example above:
127.0.0.1 administrative 1 "2019-12-19 14:32:08" http://localhost:8000/publicknowledge/article/view/5/1 200 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"
127.0.0.1 administrative 1 "2019-12-19 14:32:08" http://localhost:8000/publicknowledge/article/download/5/1/15 200 "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0"
counts only the last, file download.
3) The representation ID was calculated using file->getAssocId(). Because a file can be associated with several representations in OJS now, this could lead to saving the wrong representation ID in the table metrics. This is corrected -- now the representation ID is passed through, from the log file URL processing via the table usage_stats_temporary_records to the table metrics.
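To illustrate point 3, here is a small Python sketch of taking the representation ID straight from the logged URL instead of looking it up via the file. It assumes the download URL layout used in the test log earlier in this thread (submission, galley and file IDs in that order); `parse_download_url` and the regular expression are hypothetical and do not handle query strings.

```python
import re
from typing import Dict, Optional

# Assumed layout: .../article/download/<submission-id>/<galley-id>[/<file-id>]
DOWNLOAD_URL = re.compile(r"/article/download/(\d+)/(\d+)(?:/(\d+))?/?$")


def parse_download_url(url: str) -> Optional[Dict[str, Optional[int]]]:
    """Extract the IDs from a download URL, or return None if it does not match."""
    match = DOWNLOAD_URL.search(url)
    if match is None:
        return None
    submission_id, galley_id, file_id = match.groups()
    return {
        "submission_id": int(submission_id),
        # The galley ID in the URL is the representation that was actually requested.
        "representation_id": int(galley_id),
        "file_id": int(file_id) if file_id is not None else None,
    }


print(parse_download_url("http://localhost:8000/publicknowledge/article/download/5/1/15"))
```

With this approach the two URLs article/download/2/3/4 and article/download/2/4/4 keep their distinct representation IDs (3 and 4) even though they share file ID 4.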
NOTE about the double click processing for versions: The double click processing uses assocType and assocId -- if they are equal, the double click rule is applied. Earlier those two variables were defining also the unique URL, now with versioning this is not the case any more. The R4 says here https://www.projectcounter.org/code-of-practice-sections/data-processing/:
All users’ double-clicks on an http-link should be counted as only 1 request.
When two requests are made for one and the same article within the above time limits (10 seconds for HTML, 30 seconds for PDF), the first request should be removed and the second retained. Any additional requests for the same article within these time limits should be treated identically: always remove the first and retain the second. (For further information on the implementation of this protocol, see Appendix D: Guidelines for Implementation)
Thus, I am not sure if this applies to the same URLs or content objects.
(The R5 is more precise, I believe, and mentions only URLs, s. https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/#doubleclick. -- The uniqueness is handled extra.)
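The R4 rule quoted above ("remove the first and retain the second") can be sketched roughly as follows. This is an illustration in Python, not the UsageStatsLoader code. The tuple layout, the function name and the 10 second fallback for file types other than HTML and PDF are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Time limits in seconds, as quoted from the R4 text above.
WINDOW = {"html": 10, "pdf": 30}
FALLBACK_WINDOW = 10  # assumption: R4 text quoted here names no limit for other types

Hit = Tuple[str, str, str, int]  # (user, item, file_type, unix_time)


def filter_double_clicks(hits: List[Hit]) -> List[Hit]:
    """Keep only the later request of each pair that falls inside the time limit."""
    last_kept: Dict[Tuple[str, str], Hit] = {}  # latest retained hit per (user, item)
    result: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h[3]):
        user, item, file_type, when = hit
        previous = last_kept.get((user, item))
        limit = WINDOW.get(file_type, FALLBACK_WINDOW)
        if previous is not None and when - previous[3] <= limit:
            result.remove(previous)  # drop the first request, retain the second
        result.append(hit)
        last_kept[(user, item)] = hit
    return result


hits = [("u1", "file-4", "pdf", 0), ("u1", "file-4", "pdf", 20), ("u1", "file-4", "pdf", 100)]
print(filter_double_clicks(hits))
```

What goes into the `item` part of the key is exactly the open question here: a content object (assoc type and ID) or the URL.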
Currently, that means for versioning: For the file download, e.g. the following log file entries when two different versions have the same file:
article/download/2/3/4
article/download/2/4/4
are counted only 1.
If file changes in a new version, e.g. the following log file entries
article/download/1/1/2
article/download/1/10/18
are counted = 2.
This seems to be OK -- if the file does not change and the new version contains the same file (with the same file ID) it is considered for double click. But if the abstract changes (which we do not know) and we have these two URLs in the log file:
article/view/1/version/1
article/view/1/
they are counted = 1.
Is this all OK for now? (For R5 we would then change the way double click processing works, to consider only identical URLs.)
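The three cases above can be reproduced with a toy comparison of the two possible double-click keys. This Python sketch assumes all requests come from one user within the time limit, and the helper names are hypothetical. The content-object key stands in for the current assocType/assocId comparison, the URL key for the stricter reading of R5.

```python
from typing import Callable, Hashable, List


def by_content_object(url: str) -> Hashable:
    """Key on the file ID for downloads and on the submission ID for views."""
    parts = url.strip("/").split("/")
    if parts[1] == "download":
        return ("file", parts[-1])
    return ("submission", parts[2])


def by_url(url: str) -> Hashable:
    """Key on the URL itself, ignoring a trailing slash."""
    return url.rstrip("/")


def count_after_double_click(urls: List[str], key: Callable[[str], Hashable]) -> int:
    """Requests sharing a key collapse into one; all are assumed to fall in the window."""
    return len({key(url) for url in urls})


same_file = ["article/download/2/3/4", "article/download/2/4/4"]
new_file = ["article/download/1/1/2", "article/download/1/10/18"]
abstracts = ["article/view/1/version/1", "article/view/1/"]

for case in (same_file, new_file, abstracts):
    print(count_after_double_click(case, by_content_object), count_after_double_click(case, by_url))
```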
PRs: pkp-lib: https://github.com/pkp/pkp-lib/pull/6787 usageStats: https://github.com/pkp/usageStats/pull/43 ojs: https://github.com/pkp/ojs/pull/3051 omp: https://github.com/pkp/omp/pull/937 ops: https://github.com/pkp/ops/pull/127
stable-3_3_0: pkp-lib: https://github.com/pkp/pkp-lib/pull/6815 usageStats: https://github.com/pkp/usageStats/pull/44 ojs: https://github.com/pkp/ojs/pull/3061 omp: https://github.com/pkp/omp/pull/939 ops: https://github.com/pkp/ops/pull/128
@NateWr, could you please take a look at the PRs, i.e. the changes above? I haven't tested them with OMP and OPS, so just a rough look to see whether the solutions are OK would be sufficient. I would then continue with OMP and OPS, and a deeper code review can happen later. Please see my comment above for what they are solving now. Also, do you know how to ensure that users get the new DB table structure (new table column) after these changes are merged? I will also take a look... Thanks a lot!
Thanks @bozana! I've had a quick look and it looks ok to me.
"Thus, I am not sure if this applies to the same URLs or content objects."
You've probably assessed the COUNTER spec closer than me, but my sense is that we should treat versions as equivalent. So a click on any version of an article will count towards that article and we don't need to track versions separately.
This is a little bit more complicated for galleys, because we track clicks to galleys independently. And there is no way to say Galleys A and B are two versions of the same Galley. But I'm not sure this matters for us, because to my knowledge we do not present statistics on clicks to individual galleys. We only use the representation tracking to determine whether it was a click on a PDF, HTML or Other representation.
Maybe in the future we would want to display the usage stats for every representation. But I can't recall any requests for that. This probably deserves a broader discussion. But for the purposes of this PR, I think (I hope!) that it will be enough to record visits to galleys as counting towards a hit on a submission, regardless of what version the galley is. In other words, I think we can disregard the versioning in the stats.
Thanks a lot @NateWr! I will then test everything with OMP and OPS and make new PRs...
Maybe just one general note here: All the abstract views, as well as all file downloads of a submission, no matter which version, are counted towards that article in total. For a submission we then usually display the numbers for abstracts, PDF downloads, HTML downloads, possibly other downloads, and a total count, so the versions do not really matter. We do however have the information about the file and representation ID, so we could differentiate between the files and even versions for file downloads, which is not the case for abstract views (which is OK, I think, as you said). When we implement R5 it will be similar, but we would additionally need to keep the information about the unique item numbers.
I am currently not sure about our double click processing -- as described above -- but I think we can leave it as it is for now. In R5 it is defined more precisely: only the same URLs should be considered. If two different abstract versions (that have different URLs) are accessed within 10 seconds, they would now count as only 1. In R5 they would count total = 2, unique item = 1. Thus, because IMO R4 does not define it very precisely, and because only one or two cases regarding our versioning (which are probably also very rare) are affected by the rule, we can leave it as is.
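The "total = 2, unique item = 1" distinction mentioned for R5 can be sketched like this in Python. The session definition used here (IP address, user agent and clock hour) is an assumption for the example, as is treating all versions of an article as the same item. The function name and tuple layout are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple

Request = Tuple[str, str, str, str]  # (ip, user_agent, item_id, "YYYY-MM-DD HH:MM:SS")


def count_item_requests(requests: Iterable[Request]) -> Dict[str, Tuple[int, int]]:
    """Return {item_id: (total, unique)}; unique counts one per item per session."""
    total: Counter = Counter()
    unique: Counter = Counter()
    seen = set()
    for ip, user_agent, item_id, timestamp in requests:
        # Assumed session key: same IP and user agent within the same clock hour.
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H")
        total[item_id] += 1
        session_item = (ip, user_agent, hour, item_id)
        if session_item not in seen:
            seen.add(session_item)
            unique[item_id] += 1
    return {item_id: (total[item_id], unique[item_id]) for item_id in total}


# Two abstract versions of article 1 requested within the same session.
requests = [
    ("127.0.0.1", "Firefox", "article-1", "2019-12-19 14:32:08"),
    ("127.0.0.1", "Firefox", "article-1", "2019-12-19 14:32:15"),
]
print(count_item_requests(requests))
```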
@NateWr, shall all the changes be ported to the stable-3_3_0 even if the DB upgrade is needed?
I don't think a 3.3.0-x release can have a migration in it. Are we sure that this is needed? I know the problem is that a single file id might track back to multiple representation ids. But if all we need to know is whether it is PDF, HTML or Other, can we get away with not knowing the representation ID? Can we get that from the file ID / path itself?
The DB table metrics already has the column representation_id and currently it could contain the wrong IDs there. It seems that that column is currently not really used -- it was used earlier for the submission usage stats (graph) represented to the reader on the submission view page -- probably to display the usage of each galley/publication format (OMP). Those usage (graph) display functions seem not to be used any more. If we only want to differentiate between PDF, HTML and Other -- no matter about galley, publication/version or publication format (OMP) -- we could ignore that column i.e. remove it later. But, in general it is maybe OK/useful to have that information too. Unfortunately I do not see any other way to get the right galley ID (in OJS) when we only know the file ID. Thus, maybe we can port the other changes to the stable branch -- that do not need the DB upgrade -- and have that change with the DB upgrade first in the next major release 3.3.1?
Yeah, I think we need to do what we can without a DB upgrade for now. We can do the DB upgrade with 3.3.1, but I think in general that we have discussed reducing the metrics table columns where possible. I can see the utility of collecting whatever we can, but I also know that some large installations have struggled with the sheer size of their metrics table. Let's discuss this further as we think about how to refactor our usage stats.
@NateWr, could you take another look at the PRs above? -- Now everything is finished for the main branch... I will back port those fixes not needing the DB upgrade to the stable-3_3_0... Thanks a lot!
:+1: I've done a code review but not yet tested it myself. When you have PRs for stable-3_3_0 I can load it up and run it through some manual tests.
Hi @NateWr, sorry to bother you again -- I've considered everything from the code review I think -- if you would like to have another look. Thanks a lot!
Looks good to me. :+1:
Thanks a lot @NateWr! I would then merge the main branch changes as soon as all tests are successfully run, ok?
I also did the PRs for stable-3_3_0 (see above). Those contain only the fixes for the problem No. 1 and No. 2 from here https://github.com/pkp/pkp-lib/issues/4904#issuecomment-784346753. (No 3. requires DB change, so this is not coming into the stable branch). Also, I did not implement PKPSubmissionDAO::exists() -- as we said -- not to change it in the stable branch, but I added the PKPPublicationDAO::exists() -- because this check is new/first now added. Would you like to take a look and eventually test it? :pray:
Looks good, go ahead and merge to stable.
Everything merged from this issue, thus closing...
The code which reads access logs and stores metrics data needs to be updated to ensure stats are calculated correctly across different versions.
The main URLs for a submission should stay the same. However, new URLs will be introduced for each version and its galleys (see #4870). Visits to these URLs should go toward a submission's total.
Also, COUNTER may have some rules regarding counting duplicate visits within a time period to the same resource. We need to figure out what these rules specify and how to correctly count visits to two versions of the same item in a short period.