COUNTER Release 5 - Githubissues

bozana commented 3 years ago

[x] Implement a new log file format that takes institutions into account and conforms to the GDPR (done in https://github.com/pkp/pkp-lib/issues/6782 and https://github.com/pkp/pkp-lib/issues/6895)
[x] Define new data model (done in https://github.com/pkp/pkp-lib/issues/6782)
[x] Implement COUNTER R5 log data processing (done in https://github.com/pkp/pkp-lib/issues/6782)
[x] Implement SUSHI API and relevant reports
[x] Adapt all for OMP
[x] Adapt all for OPS
[x] Check all with PostgreSQL
[x] How to calculate the supported begin and end dates?
[x] Check for/implement using the warning/error 'No Usage Available for Requested Dates'
[x] Opt-out of public stats
[x] Consider the first publication date of a context when calculating the SUSHI start date
[ ] Implement TSV reporting (maybe in the future?), see https://github.com/pkp/pkp-lib/issues/8248

Implement the COUNTER Release 5 for OJS/OMP/OPS usage statistics. Here we can collect everything we decide is necessary. We can have a discussion below and every time we decide something we can summarize it here.

It seems that Release 5, with lots of changes, is out. Here is a guide for journals: https://www.projectcounter.org/wp-content/uploads/2020/08/Module_2_Journal_Usage_20200811.pdf.

Processing rules for COUNTER 5 reports, see https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/:
a) Double-click filtering (see section 7.2): This is implemented here: https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L194-L224. Until now the differentiation was between the access of HTML, PDF and other. This seems not to be needed any more -- we can change it to consider 30 seconds for any link, i.e. file. We should also change our implementation so that only identical URLs are compared (and not assocType + assocID, as until now). Uniqueness is treated separately:
b) Unique Items (see section 7.3): In our case an Item is an article. The matching report is AR1. The rule is: "If multiple transactions qualifying for the Metric_Type in question represent the same item and occur in the same user-sessions, only one unique activity MUST be counted for that item." A user session seems to be defined as one hour, as far as I understand it. The question of whether article versions belong to the same Item is still open. Given the way we represent them internally, I would say they do belong to the same Item.
c) Unique Titles (see section 7.4): In the case of a journal, Title = a journal and the report = Title Master Report. Similar to the rule for the unique item above, the rule here is: "If multiple transactions qualifying for the Metric_Type in question represent the same title and occur in the same user-session only one unique activity MUST be counted for that title." Here too the user session seems to be defined as one hour. I.e. if a user accesses one article and then another in the same session, it would only count once. This rule, i.e. report, seems not to be used for single journals -- it was introduced mostly for books. Do we need it (e.g. for libraries and multi-journal installations)?
d) Internet Robots and Crawlers (see section 7.8): Same as for Release 4.
COUNTER maintains the current list of internet robots and crawlers at https://github.com/atmire/COUNTER-Robots. We use it as a module in lib/pkp/lib/counterBots, assign the file to the variable COUNTER_USER_AGENTS_FILE (https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L23) and implement the function isUserAgentBot in https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L100. The function is then used when the log files are processed (https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L170). We should define a strategy for when we pull in the most recent version of the list.
Because R5 now supports/counts abstract views (in the total views count), shall we consider the galley view pages too?
SUSHI support is mandatory for compliance with COUNTER Release 5 (see https://www.projectcounter.org/wp-content/uploads/2019/05/Release_5_TechNotes_PDFX_20190509-Revised.pdf).
Which reports do we need/want to support/provide: AR1, Journal Master Report, X?
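
The robot check described above (isUserAgentBot) amounts to matching the user agent string against a list of regular expressions. A minimal sketch of that idea follows. The patterns are illustrative placeholders, not entries copied from the COUNTER-Robots list, and the function names are hypothetical rather than the PKP implementation.

```python
import re

# Illustrative patterns only (assumption); the maintained list lives in the
# COUNTER-Robots repository and is much longer.
SAMPLE_PATTERNS = ["bot", "crawl", "spider", "^curl"]

def compile_patterns(patterns):
    """Compile each non-blank pattern case-insensitively."""
    return [re.compile(p, re.IGNORECASE) for p in patterns if p.strip()]

def is_user_agent_bot(user_agent, compiled):
    """Return True if any pattern matches somewhere in the user agent string."""
    return any(rx.search(user_agent) for rx in compiled)

COMPILED = compile_patterns(SAMPLE_PATTERNS)
```

Compiling the list once per log file, rather than once per log line, keeps the per-line cost to the matching itself.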

bozana commented 3 years ago

Hi @NateWr, @ctgraham, and @shanu17, I've opened this issue so that we can see everything that has to be done for COUNTER R5 support. I have just started with a few things I identified above, but the list is still to be filled. It would be great to also know what exactly Pitt ULS is working on, so that we can work on other things and coordinate. Closely related to these changes for R5 are some improvements discussed here: https://github.com/pkp/pkp-lib/issues/6782.

NateWr commented 3 years ago

Until now the differentiation was between the access of HTML, PDF and other. This seems not to be needed any more -- we can change it to consider 30 seconds for any link, i.e. file.

We will probably want to continue to track total views between different kinds of full text, because journals will want to know that. So we'll just need to make sure that we're counting appropriately for R5 while not losing some specificity we already have.

bozana commented 3 years ago

Until now the differentiation was between the access of HTML, PDF and other. This seems not to be needed any more -- we can change it to consider 30 seconds for any link, i.e. file.

We will probably want to continue to track total views between different kinds of full text, because journals will want to know that. So we'll just need to make sure that we're counting appropriately for R5 while not losing some specificity we already have.

:+1: (The above is about double-click processing, which was different in R4 and now it is the same -- 30 seconds -- for any files)
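
The 30-second rule discussed above can be sketched as follows. This is a sketch under assumptions, not the UsageStatsLoader code: events are plain tuples, the user key is whatever identifies a user (for example IP plus user agent), and of two clicks within the window the later one is kept, which is one reading of section 7.2.

```python
from collections import defaultdict

DOUBLE_CLICK_WINDOW = 30  # seconds, the same for every file type

def remove_double_clicks(events):
    """events: iterable of (user_key, url, unix_timestamp) tuples.

    Returns the events that survive filtering, sorted by timestamp.
    """
    by_target = defaultdict(list)
    for user_key, url, ts in events:
        by_target[(user_key, url)].append(ts)

    kept = []
    for (user_key, url), stamps in by_target.items():
        stamps.sort()
        for i, ts in enumerate(stamps):
            is_last = i == len(stamps) - 1
            # A click is dropped when the same user requests the same URL
            # again within the window; the later click is the one kept.
            if is_last or stamps[i + 1] - ts > DOUBLE_CLICK_WINDOW:
                kept.append((user_key, url, ts))
    return sorted(kept, key=lambda event: event[2])
```

Because the comparison is per URL, the surviving events can still be summed per file type afterwards, so the HTML/PDF/other breakdown is not lost.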

bozana commented 3 years ago

Hi all, above all @asmecher and @NateWr, but maybe @ctgraham (above all regarding COUNTER R5 rules) as well :-) I implemented the major part of the new UsageStatsLoader (the function processFile()), which takes COUNTER R5 into account. Could you take a look at it and tell me if you have better ideas or suggestions? Here is a short summary:

The old logic is kept:
-- extends FileLoader (that is still only used by/for usage stats)
-- moves log files through the directories: usageEventLogs -> stage -> processing -> archive or reject
-- reads line by line
-- uses temporary tables to store the log entries that count (after double-click and unique item removals). The temp DB tables are a good structure for moving the summarized data to the actual tables. Can you think of some other structure (e.g. just PHP arrays) that is clean and performs better?
COUNTER R5 (see https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/):
-- user identification is done by IP and userAgent
-- double clicks: when the same user clicks the same URL within 30 seconds
-- unique item: the day is sliced in 24 pieces --> when the same user views/uses the same submission (either abstract or files) within an hour
Here is the new UsageStatsLoader: https://github.com/bozana/pkp-lib/blob/6782/classes/task/UsageStatsLoader.inc.php

On the way from log file processing to the data in the DB tables: at which point do you think I should check whether the object with the given ID exists? -- For current log files this is surely not necessary, but it would be if someone wanted to reprocess old files. I was thinking of the moment we load the data from the temp tables into the actual ones.

Thanks a lot!

bozana commented 3 years ago

@ctgraham, earlier we had administrative and/or user name logged, but I do not think this was considered in a way for the usage stats numbers and COUNTER. As far as I could see we do not need them now, we do not need to differentiate/consider/remove such access, correct?

bozana commented 3 years ago

And maybe one more question @ctgraham: I think we do not need unique_title metric type, correct? -- We would/could consider books as submissions in OJS?

NateWr commented 3 years ago

Thanks @bozana, I've left some comments on the commit.

extends FileLoader (that is still only used by/for usage stats)

I think this is fine for now. Ideally, we would migrate this to use the new FileService and Jobs Queue to handle the staging and processing of files. We would probably benefit from breaking this down into several smaller jobs, but that can be done another time.

unique item: the day is sliced in 24 pieces

Is that really how the COUNTER spec works!? :open_mouth: So if I view something at 7:59 and 8:01 these are considered unique, but not if I view it at 8:01 and 8:03?

bozana commented 3 years ago

Thanks a lot @NateWr! Yes, we would need to adopt scheduling from Laravel, and also the jobs queue, but I agree to do that later, when everything else is done... Yes, what you say about uniqueness is true, and maybe @ctgraham can confirm? Actually the uniqueness is connected with/based on one user session, but if no session exists (e.g. if the user is not logged in), then it is done that way, see https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/.
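
The edge case asked about above (7:59 and 8:01 versus 8:01 and 8:03) follows directly from the hour slices. A small sketch, assuming events are tuples of IP, user agent, submission ID and a datetime; the helper `view` and its sample values are made up for illustration.

```python
from datetime import datetime

def count_unique_items(events):
    """events: iterable of (ip, user_agent, submission_id, datetime).

    Counts one unique item per user, submission, date and hour slice.
    """
    seen = set()
    for ip, user_agent, submission_id, when in events:
        seen.add((ip, user_agent, submission_id, when.date(), when.hour))
    counts = {}
    for _, _, submission_id, _, _ in seen:
        counts[submission_id] = counts.get(submission_id, 0) + 1
    return counts

def view(hour, minute):
    """Hypothetical helper: one view of submission 42 by the same user."""
    return ("192.0.2.1", "Firefox", 42, datetime(2021, 7, 1, hour, minute))
```

With this, `count_unique_items([view(7, 59), view(8, 1)])` gives two unique items for submission 42, while `count_unique_items([view(8, 1), view(8, 3)])` gives one.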

ctgraham commented 3 years ago

earlier we had administrative and/or user name logged, but I do not think this was considered in a way for the usage stats numbers and COUNTER.

If we had a way to exclude administrative usage counts from COUNTER statistics, we would be responsible to do so. If we could exclude counting access generated via the Issue Preview, this would be appropriate (but might not be readily done). In general, just the fact that a user was logged in should not be a consideration for COUNTER.

I think we do not need unique_title metric type, correct? -- We would/could consider books as submissions in OJS?

The unique_title metric shouldn't be relevant for OJS. Perhaps for OMP, if individual chapters can be presented?

Actually the uniqueness is connected with/based on one user session.

Yes, the fuzzy definition of "per hour" is only relevant if the user session itself cannot be identified.

bozana commented 3 years ago

I haven't thought much about OMP yet, but: A book/submission can just contain the files or it can contain chapters. So it could be a problem if we would have different Items (per COUNTER definition) within one press -- in the first case the book/submission and in the second chapters -- correct? So my first thought was to simplify all this and say the book/submission is the item, and the chapters would be seen as just files... :thinking:

bozana commented 3 years ago

Yes, the fuzzy definition of "per hour" is only relevant if the user session itself cannot be identified.

Here as well I tended to (over)simplify it and have just considered the hour slices :sweat_smile: So I should consider/log the user session, if there is one? I will see how long our sessions last... Somehow I do not like this COUNTER 'rule' either -- different systems can have sessions of different lengths... :-P

bozana commented 3 years ago

Regarding the administrative access:

we can check if the user is admin or editor or so, but this does not necessarily mean the access is administrative i.e. we maybe should not do that, right?
the issue preview uses the same function 'view' but we could fire the usage event only when the object (issue or submission) is published, ok?

ctgraham commented 3 years ago

Regarding the administrative access:

* we can check if the user is admin or editor or so, but this does not necessarily mean the access is administrative i.e. we maybe should not do that, right?

* the issue preview uses the same function 'view' but we could fire the usage event only when the object (issue or submission) is published, ok?

Agreed, and agreed.

bozana commented 3 years ago

@asmecher, do I understand it correctly that our user sessions, depending on the setting in the config file, either 'never' expire (30 days) (and the session ID is not changed) or expire with the browser session, i.e. when the browser is closed?

asmecher commented 3 years ago

SessionManager.inc.php contains:

ini_set('session.cookie_lifetime', 0);
...
ini_set('session.gc_maxlifetime', 60 * 60);

...so by my read, sessions unused for 1 hour become eligible for garbage collection, which is stochastic.

These policies haven't been changed for a long time, and I suspect there are some best practices we could adopt. So I'm open to change on this.

bozana commented 3 years ago

That sounds good to me -- I am just trying to figure out if we can rely on our session ID for usage stats... For some reason I am always logged in (with the same session ID), even after 1 hour of not using the site (with the setting 0 in the config and 'remember me' deactivated)... Do others experience the same?

bozana commented 3 years ago

Hmmm... It seems that relying on those two settings is not reliable and we should implement the session timeout ourselves, see https://stackoverflow.com/questions/520237/how-do-i-expire-a-php-session-after-30-minutes. So maybe we should add a check for whether the last usage was a long time ago here, before these lines: https://github.com/pkp/pkp-lib/blob/main/classes/session/SessionManager.inc.php#L109-L110. Maybe somewhere else too?

bozana commented 3 years ago

But, even then, if we implement to expire the user session after 30 minutes or 1 hour of inactivity: For COUNTER usage stats: If a logged-in user uses the journal site for the whole day, it would mean only 1 unique submission access, differently to the other counts when users are not logged-in and we use the 24 day slices. Somehow I tend to always use those 24 slices for usage stats... @ctgraham and @NateWr, what do you think?

bozana commented 3 years ago

Thanks to a suggestion from @NateWr I moved the double-click and unique item processing, i.e. removal, to the database, doing it with SQL -- all log entries are inserted into the temporary tables and the removal of double clicks and repeated unique-item clicks is then done there, see https://github.com/bozana/pkp-lib/blob/6782/classes/statistics/UsageStatsTotalTemporaryRecordDAO.inc.php#L95 and https://github.com/bozana/pkp-lib/blob/6782/classes/statistics/UsageStatsUniqueTemporaryRecordDAO.inc.php#L94. Now the processing in the UsageStatsLoader is slim, see https://github.com/bozana/pkp-lib/blob/6782/classes/task/UsageStatsLoader.inc.php#L87. Maybe @asmecher and @jonasraoni could have a look at those SQL statements too?

asmecher commented 3 years ago

@bozana, I think it should be possible to formulate a query that works for both MySQL and PostgreSQL using DELETE FROM xxx WHERE yyy IN (subquery) -- but you'd need to test it against both to be sure, as I remember seeing complaints about self-joins but don't recall the conditions. I have some PostgreSQL test datasets from various versions, and could either send you those, or test potential queries, whatever's most helpful.
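
To make the DELETE FROM xxx WHERE yyy IN (subquery) shape concrete, here is a sketch that removes double clicks inside a temporary table. It runs against SQLite through Python's sqlite3 module, so it says nothing definite about MySQL or PostgreSQL behaviour; the table name usage_tmp and its columns are hypothetical, not the schema in the linked DAOs.

```python
import sqlite3

def remove_double_clicks_sql(rows, window=30):
    """rows: list of (user_key, url, unix_ts). Returns surviving rows ordered by ts."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE usage_tmp (id INTEGER PRIMARY KEY, user_key TEXT, url TEXT, ts INTEGER)"
    )
    con.executemany("INSERT INTO usage_tmp (user_key, url, ts) VALUES (?, ?, ?)", rows)
    # Delete every row that has a later row for the same user and URL
    # within the window, so the last click of each burst is kept.
    con.execute(
        """
        DELETE FROM usage_tmp WHERE id IN (
            SELECT a.id FROM usage_tmp a
            JOIN usage_tmp b
              ON a.user_key = b.user_key AND a.url = b.url
             AND (b.ts > a.ts OR (b.ts = a.ts AND b.id > a.id))
             AND b.ts - a.ts <= ?
        )
        """,
        (window,),
    )
    result = con.execute("SELECT user_key, url, ts FROM usage_tmp ORDER BY ts").fetchall()
    con.close()
    return result
```

The subquery reads the same table that the DELETE targets. MySQL is known to reject that form unless the subquery is wrapped in a derived table, which may be the complaint remembered above, so the statement would still need testing on both target databases.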

bozana commented 3 years ago

@ctgraham, just to be sure: do you think we should consider user session when possible or use only 24 slices?

bozana commented 3 years ago

If we decide to use the session ID when possible, shall it expire after 1 hour of inactivity, or 1/2 hour?

bozana commented 3 years ago

@asmecher, would this code be OK for the session expiration: https://github.com/bozana/pkp-lib/commit/e16bbfa545703137f322c917886e33baf4a035dc, as said above?

NateWr commented 3 years ago

^ This is a question with broader UX implications. I will defer to Alec on the security side, but I do think from a UX perspective it's becoming more and more common to leave the user logged in for longer. If a user has logged in, we don't want to force them to log in again if we can help it, even if they haven't logged in in a few days.

ctgraham commented 3 years ago

do you think we should consider user session when possible or use only 24 slices?

I read the COUNTER specification as considering user sessions to be relatively short lived (that is: minutes rather than days), so I think slicing the day by hour is appropriate, even though it introduces edge cases. I am basing this on:

COPr5 Processing 7.3

A user session is defined any of the following ways: by a logged session ID + transaction date, by a logged user ID (if users log in with personal accounts) + transaction date + hour of day (day is divided into 24 one-hour slices), by a logged user cookie + transaction date + hour of day, or by a combination of IP address + user agent + transaction date + hour of day.

To allow for simplicity in calculating session IDs, when a session ID is not explicitly tracked, the day will be divided into 24 one-hour slices and a surrogate session ID will be generated by combining the transaction date + hour time slice + one of the following: user ID, cookie ID, or IP address + user agent.
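
The four session definitions quoted above can be read as a precedence order. A sketch of that reading, assuming a log record is a dict with a `time` entry and optional identifiers; the field names are made up for illustration.

```python
from datetime import datetime

def session_key(record):
    """Return a tuple identifying the user session for one log record.

    record: dict with 'time' (datetime) and any of 'session_id', 'user_id',
    'cookie_id', 'ip', 'user_agent' (hypothetical field names).
    """
    when = record["time"]
    date = when.date().isoformat()
    if record.get("session_id"):
        # A logged session ID needs only the transaction date.
        return ("session", record["session_id"], date)
    hour = when.hour  # one of the 24 one-hour slices
    if record.get("user_id"):
        return ("user", record["user_id"], date, hour)
    if record.get("cookie_id"):
        return ("cookie", record["cookie_id"], date, hour)
    return ("ip+ua", record["ip"], record["user_agent"], date, hour)
```

The last branch is the IP address + user agent variant discussed in this thread; a loader that never logs session, user or cookie IDs would always end up there.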

bozana commented 3 years ago

Yes, all fine with me :-) The settings we have:

ini_set('session.cookie_lifetime', 0);
...
ini_set('session.gc_maxlifetime', 60 * 60);

would do it too, but not reliably. The solution above would do it reliably: when the user is inactive for an hour, and only if the 'remember me' option is not selected (which it is by default).

bozana commented 3 years ago

The comment above refers to Nate's last comment :-) Otherwise, I agree with you @ctgraham. @NateWr, is it also OK for you to then only use the 24 day slices for usage stats?

NateWr commented 3 years ago

Sure! Let's keep it simple.


asmecher commented 3 years ago

(Late to the dance, but I agree -- we'd probably receive pushback if we enforced more rapid expiry of idle sessions!)

bozana commented 3 years ago

OK, I will not change anything in the session management and will use only 24 day slices for usage stats... Thanks all!

bozana commented 3 years ago

Hi @ctgraham, I would now like to start working on the COUNTER R5 reports. I still need to read/understand a lot, but I would like to double-check a few things with you already: "R5 reports can be delivered in tabular form, or as machine-readable data (JSON) via the COUNTER_SUSHI API." (from https://cop5.projectcounter.org/en/5.0.2/03-specifications/02-formats-for-counter-reports.html). So we could concentrate only on the COUNTER_SUSHI API, correct? Would you recommend that? I was thinking of keeping the currently supported V4 reports as well, at least for some time. What do you think? Would they need to stay 'forever', for the old statistics? We would be able to deliver R5 only beginning with release 3.4, when we collect the data in the required way, correct? You were already working on COUNTER_SUSHI. Could I use that and build on top of it? Later I will surely have lots of questions... but for now I have to get started... Thanks a lot!

ctgraham commented 3 years ago

Yes, I think the JSON delivery via COUNTER_SUSHI should be preferred. We can add transforms to convert this into tabular or XML representations if need arises.

We kept the R3 reports active for a while when we added R4 reports, and can do the same when adding R5. My gauge would be: "keep them as long as we don't have to do any work to support them". As soon as support time is involved, drop them: ultimately COUNTER has moved on, and we should too. (But I know the community may depend on the existing reports.)

Our existing work is largely a proof of concept in extending the API in a plugin. We didn't make it to the report definitions before shanu17's classwork hijacked his time. The repo https://github.com/ulsdevteam/pkp-counterSushi is available for you to build on anything that is valuable. Looking at that repo, I think one of the next challenges which we needed to address was to allow an anonymous role access to the API endpoint.

bozana commented 3 years ago

Thanks a lot @ctgraham! I will take a look at the repo... Maybe another question, while I am adapting the R4 reports: Do you remember what the metric type ojs::legacyCounterPlugin was? :thinking: I've forgotten... :-\ Funny: I do not have that metric type in the DB but I can still create R3 reports... The counts in R4 and R3 reports seem to be the same in my installation... Now that we do not have different metric types, shall I create R3 using the same entries as for R4 (it seems to be so in the current ojs stable too) or shall I remove R3 reports entirely?

bozana commented 3 years ago

Ah, I think I know: ojs::legacyCounterPlugin were the counts before ojs::counter was introduced... Hmmmm.... Now, that we will not migrate those counts and will not have any metric type those numbers will not exist any more... Hmmm... I will have to think... Maybe to just export and save those different metric types as CSV... just for journals to have them in a form... And thus, those years will not appear on the Counter Plugin index page...

bozana commented 3 years ago

Hmmm... or were those ojs::legacyCounterPlugin counts according to the COUNTER standard? Maybe we can then also migrate them -- in that case they would mean same as ojs::counter counts...

bozana commented 3 years ago

Hmmm... Me again :-D I took a look into the old OJS code. The different metric types were aggregated in release 2.4.3. In that release ojs::counter was introduced. Earlier there were:

OJS default views counts: OJS_METRIC_TYPE_LEGACY_DEFAULT
TimedViews plugin counts: OJS_METRIC_TYPE_TIMED_VIEWS
Counter plugin counts: OJS_METRIC_TYPE_LEGACY_COUNTER.

So, if possible, during the upgrade process I would save the entries of the DB table metrics for each one in a separate CSV file. For the journals to have them. But, the problem is that we will not be able to create COUNTER R3 reports for those old years, that used OJS_METRIC_TYPE_LEGACY_COUNTER. We cannot migrate them because they seem to be summarized monthly PDF, HTML and other counts per journal, having no submission or file/galley ID (s. https://github.com/pkp/ojs/blob/ojs-2_4_3-0/classes/install/Upgrade.inc.php#L1036). Thus I would prefer to totally remove the R3 reporting from the current COUNTER plugin. If that is a big problem a) I could leave R3 reporting for the years with ojs::counter metric type and b) I could see if we could automatically create the R3 reports for those old years and save them too during the upgrade process. @ctgraham, what do you think? How would you estimate the need for those R3 reports and R3 reports for those old years?

I would also double check everything with other team members in the next dev call...

bozana commented 3 years ago

Me again :-) From the current Code of Practice I understand that reports have to be available for the last 2 years, see https://cop5.projectcounter.org/en/5.0.2/13-transitioning/01-transitioning-to-a-new-reporting-service.html. Thus I think those old R3 reports, prior to the ojs::counter metric type, are definitely not needed any more. Thus I will remove R3 reporting from the current Counter plugin. (As I said, I will try to save the data from the DB table metrics that has a metric type other than ojs::counter to a CSV file, so that the data is not lost.)

bozana commented 3 years ago

Another note: "For clarity, a provider MUST NOT perform the transition mid-month such that the customer is required to run reports on both the old and new reporting services for the same month and merge and sum the results to obtain actual monthly usage." (s. https://cop5.projectcounter.org/en/5.0.2/13-transitioning/01-transitioning-to-a-new-reporting-service.html). Thus, the counts from the DB table metrics will not be migrated to the new DB table metrics_counter_submission_daily/monthly -- only total investigations and total requests could be calculated, which is not enough (the unique investigations and requests are missing) for R5. Thus R5 reports would start with the installation of the release 3.4. If this is in the middle of a month, I would maybe need to check the 3.4 installation date in the DB table versions and provide the R5 reports from the next month of the installation. The R4 reports will be there and continue to be there (at least for some time).
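
The start-date rule described above (no mid-month transition, so R5 reporting begins with the first full month after the 3.4 installation) can be sketched as below. Treating an installation on the first day of a month as a full month is an assumption of this sketch, not something decided in the thread, and the function name is hypothetical.

```python
from datetime import date

def first_r5_month(install_date):
    """Return the first day of the first month that R5 reports could cover."""
    if install_date.day == 1:
        # Assumption: installing on the 1st means the whole month was
        # collected in the new format.
        return install_date
    if install_date.month == 12:
        return date(install_date.year + 1, 1, 1)
    return date(install_date.year, install_date.month + 1, 1)
```

For an installation on 15 March 2022 this gives 1 April 2022 as the earliest begin date a report request could use.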

I would surely have lots of such comments/notes/decisions (to be made). I would be more than happy if someone would track/read them here and react in case something is not clear or wrong or...

bozana commented 3 years ago

Hmmm... From here https://cop5.projectcounter.org/en/5.0.2/05-delivery/index.html I understand that both 1) tabular form, as either an Excel or a tab-separated-value (TSV) file, and 2) JSON formatted in accordance with the COUNTER_SUSHI API Specification must be provided. However, I will first start with SUSHI and then see/think further...

bozana commented 3 years ago

This:

Publishers must provide COUNTER reports on a per-customer ID basis. That is, if a business school has a separate customer ID from its parent university, the school and the university should be sent separate COUNTER reports. This applies whether authentication is through IP address recognition, Shibboleth, or other mechanisms. To follow the example above, if the business school shares the parent university’s IP range and relies on IP recognition for authentication, it will not be possible to distinguish usage from the school from that of the university and therefore only the university should receive a COUNTER report.

from here https://www.projectcounter.org/wp-content/uploads/2021/09/Release5.0.2_FG_Tech_v3.pdf, chapter "Delivering COUNTER Reports" sounds opposite to what we originally thought https://github.com/pkp/pkp-lib/issues/6782#issuecomment-872363201. I will stick with our original idea/plan -- it sounds more logical to me :-)

bozana commented 3 years ago

Hmmmm... As far as I understand it, only the customer_id (provided in the SUSHI URL) is needed to get the report for that institution. We said that customer_id would be the OJS DB table institution ID. According to the documentation https://www.projectcounter.org/counter-sushi/ this should be enough. They do not recommend to use other authentication options, like IP authentication. Hmmm... :thinking: Theoretically I could just try different numbers for the customer_id to get different reports, correct? Any problem with that? :thinking:
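
For illustration, a report request carrying only customer_id and a date range could be built as below. The `/reports/{report_id}` path and the customer_id, begin_date and end_date parameters follow the COUNTER_SUSHI API; the base URL in the usage note and the function name are hypothetical.

```python
from urllib.parse import urlencode

def sushi_report_url(base_url, report_id, customer_id, begin_date, end_date):
    """Build a COUNTER_SUSHI report URL from a base URL and the query parameters."""
    query = urlencode(
        {"customer_id": customer_id, "begin_date": begin_date, "end_date": end_date}
    )
    return f"{base_url.rstrip('/')}/reports/{report_id.lower()}?{query}"
```

For example, `sushi_report_url("https://example.org/api/sushi", "TR", 3, "2022-01-01", "2022-03-31")` yields a URL ending in `/reports/tr?customer_id=3&begin_date=2022-01-01&end_date=2022-03-31`. With sequential institution IDs as customer_id, changing the 3 to another number is all it takes to request a different institution's report, which is the concern raised above.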

ctgraham commented 3 years ago

How would you estimate the need for those R3 reports and R3 reports for those old years?

Certainly these reports are obsolete from a COUNTER perspective, but I can see how there might be concerns with removing them. At what point do we lose the ability to reprocess old logs to produce updated COUNTER reports? Can usage statistics logs from 2.x be reprocessed in current code to generate COUNTER R5 reports?

Reprocessing the logs may also address the problem of mid-month conversions.

As far as I understand it, only the customer_id (provided in the SUSHI URL) is needed to get the report for that institution

Yes, this is the minimal implementation for per-customer reports. We can make per-customer reports public, or, at our option, require authentication. I am not opposed to making them public by default; if a service provider wants to add in authentication, this could be a contribution or sponsored development.

bozana commented 3 years ago

@ctgraham, R4 would stay, I would remove R3. The only thing R3 provided beyond R4 was for those old years where OJS_METRIC_TYPE_LEGACY_COUNTER was used, i.e. before 2.4.3. We cannot migrate those OJS_METRIC_TYPE_LEGACY_COUNTER counts adequately. Thus, the only remaining solution would be to automatically create the R3 reports for those old years during the migration and save them somewhere in the file system. But is this worth it for such old years? (Well... depending also on when the journal upgraded to 2.4.3...) And, yes, we are planning to have a 'rough' mechanism for reprocessing the old log files, but I still don't have a full concept... I also don't know how the log file format changed over the years. I assume we have had the same format since the 2.4.3 release... I am not sure if log files existed before 2.4.3, I don't think so... If not, even with a reprocessing mechanism we would not have the counts from the years earlier than 2.4.3... Do I see it correctly? (In connection with log file reprocessing I have to think about how the R5 reporting start date will change when someone reprocesses the old log files. We have the table metrics_submission that would allow R4 reporting, but we could definitely then reprocess the log files also for R5 reporting.)

bozana commented 3 years ago

Hi @ctgraham, on the dev call just now we decided to ask users to export the old data by themselves before they upgrade to 3.4. We could/would leave the old metric types data in a "metrics_old" table (at least for some time), so that they still have those old counts (and would not automatically export/save the data as a CSV file). Does this sound good for you? Do you think we can then remove the R3 reporting/code from the counter plugin?

ctgraham commented 3 years ago

I am OK with asking users to export the R3 data if they care about it.

Can the R4 data be expressed in a way that is compatible with R5 without reprocessing logs? As an example, for one journal I have statistics in ojs::counter going back to 2010, but my usage statistics log files only go back to 2014.

bozana commented 3 years ago

No, unfortunately not -- R4 data, from 3.4. being in the DB table metrics_submission (that contains almost the same data as metrics -- just for the submissions (abstract + files) and metric_type ojs::counter, but without Geo data) cannot be used to get the unique (investigations and requests) metric types required in R5. We can get total (investigations and requests) from there, if we would need, but those are not enough for R5. (For R5 we will collect/save the data in the table metrics_counter_submission and metrics_counter_submission_institution). We use the metrics_submission table for the internal statistics, so the data there will further exist, grow and be used.

Regarding the log file reprocessing: If the IP addresses are hashed we will not be able to get institution and Geo data, but we can get those unique metric types. In this case there is no loss of data, because we did not have institution data till now and there could not be Geo data if the privacy option was selected and IPs were hashed. Just that the COUNTER reports per institution can not be provided. Would this be a problem?

bozana commented 3 years ago

Hi @ctgraham, I was working on another issue, but would now like to figure out what else I need to do for OMP and OPS in order to support COUNTER R5. This depends a lot on the reports we have to provide/support. I would need your help here :-) @NateWr, I will also include you -- you can always help greatly with decisions :-)))

I understand all applications should support/provide Platform Report. PR and PR_P1?

To be honest, I don't know what host_type would OPS be i.e. what report should it support? Does OPS need COUNTER R5 reporting? Would it be a kind of a Repository? Maybe it could, additionally to PR, only support Item Report (IR and IR_A1)?

Regarding OMP: In COUNTER R5 a Book is a Title and chapters (our submission files) would be Items. This all is slightly different from OJS -- in OJS Items -- the smallest COUNTER R5 level -- are the articles/submissions. The question here is if we need the COUNTER R5 metrics for chapters. In that case we would need an extra DB table that would collect the data per chapter. (Also some extra calculation and ...). If we would need to support the COUNTER R5 metrics on submission file level, we would then provide all metrics: total/unique_item_investigation, total/unique_item_requests, and unique_title_requests. We would support TR, TR_B3 and IR report, correct? If we would not need them, the submissions (COUNTER Book object) could have Total_Item_Requests and Unique_Title_Requests. It seems those are enough for TR_B1, but maybe also for TR_B3? (The documentation for TR_B3 says "...all applicable Metric_Types". What does that mean?). We could support all the metric types (total/unique_item_investigation, total/unique_item_requests, and unique_title_requests) but as a total for a book. In this case, if we would only have the metrics on submission/book level, we would/could not support IR report. So, I wanted to check with you if you think the COUNTER R5 item (chapter) level metrics and reporting are necessary for OMP. (EDIT: Somehow I believe we have to support item/chapter level metrics in OMP, but 1% of me is hoping you will maybe say we don't have to :-D). (EDIT: I think we would very much like not to have too much extra work for OMP, so maybe the question is if we could get by with providing only the metrics on the submission/Book level (having, if possible, only total_item_requests and unique_title_requests, or, if needed, all the metric types: total/unique_item_investigation, total/unique_item_requests, and unique_title_requests)?)

Maybe also a few links here:

https://connect.ebsco.com/s/article/COUNTER-5-and-eBook-Reporting-Frequently-Asked-Questions?language=en_US
https://cop5.projectcounter.org/en/5.0.2/03-specifications/01-counter-reports-for-libraries.html#master-reports
https://cop5.projectcounter.org/en/5.0.2/03-specifications/01-counter-reports-for-libraries.html#title-usage-standard-views

bozana commented 2 years ago

Hi all (above all @ctgraham and @NateWr :-)), as far as I could see from the COUNTER R5 documentation and from some hosts that provide COUNTER R5 reports, it seems that we do not have to provide the Item Report (IR) for Host_Type = eBook. We would only provide TR and TR_B3, i.e. we don't need to collect metrics for each chapter. However, I spoke with @withanage yesterday and it seems that metrics on the chapter level would be good to have. At least the University of Heidelberg needs them and is using them, in their own implementation (that is not part of OMP). Chapters are always used when there is an edited volume, and it seems that users would like to see their usage...

Another thing: There is an attribute Section_Type (= Chapter or = Book) in the Title Master Report (TR) that further describes the item level metrics (total and unique item investigations and requests). It says whether the book is delivered in chapters or as a single file, i.e. whether the item is a chapter or a book. The attribute seems to be optional, but it is used in every TR example in the COUNTER R5 documentation, and it seems useful because it explains why a book delivered in chapters would have many more item investigations and requests than a book delivered as a single file. The unique title (book level) investigations and requests, however, are the same for those two types of books. Thus, I am now not sure whether we should differentiate the item metrics by Section_Type, i.e. for those 2 different kinds of book access :thinking: @ctgraham, do you maybe have a suggestion here?

If we differentiate the chapter and the single file item metrics, we would have something like this: chapter_total_item_investigations, chapter_total_item_requests, chapter_unique_item_investigations, chapter_unique_item_requests, and book_total_item_investigations, book_total_item_requests, book_unique_item_investigations, book_unique_item_requests. Those are all numbers on the book/submission level (not the chapter level). Additionally there are unique_title_requests and unique_title_investigations on the book level. In any case we have to log the chapter ID in order to be able to calculate the unique item investigations and requests... Thus, I will have to think about it all a little bit more, e.g. to see what exactly it means for our implementation and storage, and will then come back to you again...

I also figured out that we would need all 6 metric types for our TR_B3: total+unique_item_investigation, total+unique_item_requests, and unique_title_requests+investigations. (Unlike for TR_B1, the two metric types total_item_requests and unique_title_requests alone are not enough.)

And one more thing: OPS would be a Repository host type and would thus need 'only' the IR and IR_A1 reports.

bozana commented 2 years ago

Hmmm... The Section_Type is in every example, so I think we should/would need to consider it. I will then think further about the implementation and will come back with some options...

bozana commented 2 years ago

In OMP we could theoretically have 3 ways of book presentation:

in chapters
as single file/whole book
in chapters + single file/whole book

Thus this could be mapped to COUNTER R5 this way: Access to chapters (landing page + files) would belong to Chapter items (item metrics). Access to the book landing page and to files other than chapters would belong to Book items (item metrics). Then there are also unique_title_investigations and unique_title_requests, which consider unique accesses to a book object/title (regardless of whether it is presented in chapters or as a single file).

Thus, for the log file and processing:

we would consider the new chapter landing page for usage stats event logging
we would consider chapterId in our usage event log format
we would use submission_id, chapter_id, assoc_type, assoc_id/file_id to differentiate between Chapter and Book items, as well as between investigations and requests:

assoc_type would be one of: the new ASSOC_TYPE_CHAPTER (the chapter landing page), ASSOC_TYPE_SUBMISSION_FILE (genre=Chapter and genre=Manuscript, i.e. the files that count as COUNTER Requests), ASSOC_TYPE_SUBMISSION_FILE_COUNTER_OTHER (other, supplementary genres, counted only within investigations), or ASSOC_TYPE_SUBMISSION (the book landing page). The chapter_id would be NULL for Book items. (The assoc_id/file_id would be NULL for chapter and book landing pages.)

A Chapter item would be defined with the same submission_id and chapter_id. Chapter item investigations would be chapter log events (with the same submission_id and chapter_id) with assoc_types ASSOC_TYPE_CHAPTER, ASSOC_TYPE_SUBMISSION_FILE and ASSOC_TYPE_SUBMISSION_FILE_COUNTER_OTHER. Chapter item requests would be chapter log events (with the same submission_id and chapter_id) with assoc_type = ASSOC_TYPE_SUBMISSION_FILE. There are total and unique Chapter item investigations and requests.

A Book item would be defined with the same submission_id and chapter_id = NULL. Book item investigations would be log events with all assoc_types ASSOC_TYPE_SUBMISSION, ASSOC_TYPE_SUBMISSION_FILE and ASSOC_TYPE_SUBMISSION_FILE_COUNTER_OTHER (chapter_id = NULL). Book item requests would be log events with assoc_type = ASSOC_TYPE_SUBMISSION_FILE (chapter_id = NULL).
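The Chapter/Book item rules above can be sketched in Python as follows. This is only an illustration of the proposed mapping, not the actual implementation: the `UsageEvent` class, the `classify` function and the string values of the assoc_type constants are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Placeholder values; the real PKP constants are not given in this thread.
ASSOC_TYPE_SUBMISSION = "submission"            # book landing page
ASSOC_TYPE_CHAPTER = "chapter"                  # chapter landing page
ASSOC_TYPE_SUBMISSION_FILE = "submission_file"  # counts as a COUNTER request
ASSOC_TYPE_SUBMISSION_FILE_COUNTER_OTHER = "submission_file_counter_other"


@dataclass(frozen=True)
class UsageEvent:
    """Hypothetical shape of one usage log entry."""
    submission_id: int
    chapter_id: Optional[int]  # None (NULL) for Book items
    assoc_type: str
    assoc_id: Optional[int]    # None (NULL) for landing pages


def classify(event: UsageEvent) -> Tuple[str, Tuple[int, Optional[int]], bool]:
    """Return (section_type, item_key, is_request) for a log event.

    Every event counts as an investigation of its item; only
    ASSOC_TYPE_SUBMISSION_FILE events additionally count as requests.
    """
    section_type = "Book" if event.chapter_id is None else "Chapter"
    item_key = (event.submission_id, event.chapter_id)
    is_request = event.assoc_type == ASSOC_TYPE_SUBMISSION_FILE
    return section_type, item_key, is_request
```

For example, a chapter landing page view and a download of that chapter's file share the same item key, but only the download is flagged as a request.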

Unique Title metrics would be keyed on the submission_id only. For Unique Title investigations, all log events for a submission_id within an hour would together count as 1. For Unique Title requests, all log events for a submission_id with assoc_type = ASSOC_TYPE_SUBMISSION_FILE within an hour would together count as 1.
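A rough Python sketch of the Unique Title counting, under two assumptions that are not settled in this thread: a user is identified by some `user_key` (for example a hash of IP and user agent), and "within an hour" is approximated by the clock hour of the event. The function name and the event tuple layout are made up for illustration.

```python
from collections import Counter
from datetime import datetime

ASSOC_TYPE_SUBMISSION_FILE = "submission_file"  # placeholder value


def unique_title_counts(events):
    """Count unique title investigations and requests per submission.

    events: iterable of (user_key, timestamp, submission_id, assoc_type).
    Returns two Counters keyed by submission_id.
    """
    investigations = set()
    requests = set()
    for user_key, timestamp, submission_id, assoc_type in events:
        # Assumed session definition: same user, same date, same hour.
        session = (user_key, timestamp.date(), timestamp.hour)
        investigations.add((session, submission_id))
        if assoc_type == ASSOC_TYPE_SUBMISSION_FILE:
            requests.add((session, submission_id))
    return (
        Counter(submission_id for _, submission_id in investigations),
        Counter(submission_id for _, submission_id in requests),
    )
```

Because the sets are keyed on session and submission_id only, it does not matter whether the user accessed a chapter or the whole book; each title counts at most once per session.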

If we don't want to save statistics on the chapter level, the above metric types would be summarized per submission, and all the COUNTER metrics DB tables would need 6 additional metrics columns: chapter_investigations, chapter_unique_investigations, chapter_requests, chapter_unique_requests, title_unique_investigations, title_unique_requests.
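As an illustration of that per-submission summary, here is a small Python sketch that fills the four chapter_* columns from already classified events. The input tuple layout and the function name are assumptions, the session value is treated as an opaque key, and double-click filtering is left out.

```python
from collections import defaultdict


def chapter_columns(events):
    """Summarize Chapter item metrics per submission.

    events: iterable of (session, submission_id, chapter_id, is_request).
    Events with chapter_id None are Book items and are skipped here.
    """
    rows = defaultdict(lambda: {
        "chapter_investigations": 0,
        "chapter_unique_investigations": 0,
        "chapter_requests": 0,
        "chapter_unique_requests": 0,
    })
    seen_investigations = set()
    seen_requests = set()
    for session, submission_id, chapter_id, is_request in events:
        if chapter_id is None:
            continue
        row = rows[submission_id]
        item_in_session = (session, submission_id, chapter_id)
        # Every chapter event is an investigation.
        row["chapter_investigations"] += 1
        if item_in_session not in seen_investigations:
            seen_investigations.add(item_in_session)
            row["chapter_unique_investigations"] += 1
        # File downloads are additionally requests.
        if is_request:
            row["chapter_requests"] += 1
            if item_in_session not in seen_requests:
                seen_requests.add(item_in_session)
                row["chapter_unique_requests"] += 1
    return dict(rows)
```

The chapter_id is still needed while processing the log, because uniqueness is decided per chapter, even though the stored numbers are totals for the whole book.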

pkp / pkp-lib

COUNTER Release 5 #6781