COUNTER Release 5

bozana commented 3 years ago

[x] Implement a new log file format that will consider institutions and conform to the GDPR (done in https://github.com/pkp/pkp-lib/issues/6782 and https://github.com/pkp/pkp-lib/issues/6895)
[x] Define new data model (done in https://github.com/pkp/pkp-lib/issues/6782)
[x] Implement COUNTER R5 log data processing (done in https://github.com/pkp/pkp-lib/issues/6782)
[x] Implement SUSHI API and relevant reports
[x] Adapt all for OMP
[x] Adapt all for OPS
[x] Check all with PostgreSQL
[x] How to calculate the supported begin and end dates?
[x] Check for/implement using the warning/error 'No Usage Available for Requested Dates'
[x] Opt-out of public stats
[x] Consider the first publication date of a context when calculating the SUSHI start date
[ ] Implement TSV reporting (maybe in the future?), see https://github.com/pkp/pkp-lib/issues/8248

Implement the COUNTER Release 5 for OJS/OMP/OPS usage statistics. Here we can collect everything we decide is necessary. We can have a discussion below and every time we decide something we can summarize it here.

It seems Release 5, with lots of changes, is out. Here is a guide for journals: https://www.projectcounter.org/wp-content/uploads/2020/08/Module_2_Journal_Usage_20200811.pdf.

Processing rules for COUNTER 5 reports, see https://www.projectcounter.org/code-of-practice-five-sections/7-processing-rules-underlying-counter-reporting-data/:
a) Double-click filtering (see section 7.2): This is implemented here: https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L194-L224. Until now we differentiated between access to HTML, PDF and other files. This no longer seems to be needed -- we can change it to consider 30 seconds for any link, i.e. file. We should also change our implementation so that only identical URLs are considered (and not the assocType + assocID as until now). Uniqueness is treated differently, see b) and c).
b) Unique Items (see section 7.3): In our case an Item is an article. The matching report is AR1. The rule is: "If multiple transactions qualifying for the Metric_Type in question represent the same item and occur in the same user-sessions, only one unique activity MUST be counted for that item." A user-session seems to be defined as one hour, as far as I understand it. The question of whether article versions belong to the same Item is still open. Given the way we represent them internally, I would say they do belong to the same Item.
c) Unique Titles (see section 7.4): For a journal, Title = a journal and the report = Title Master Report. Similar to the rule for unique items above, the rule here is: "If multiple transactions qualifying for the Metric_Type in question represent the same title and occur in the same user-session only one unique activity MUST be counted for that title." Here, too, the user-session seems to be defined as one hour. I.e. if a user accesses one article and then another in the same session, it would only count once. This rule, i.e. report, seems not to be used for single journals -- it was introduced mostly for books. Do we need it (e.g. for libraries and multi-journal installations)?
d) Internet Robots and Crawlers (see section 7.8): Same as for Release 4. COUNTER maintains the current list of internet robots and crawlers at https://github.com/atmire/COUNTER-Robots. We use it as a module in lib/pkp/lib/counterBots, assign the file to the variable COUNTER_USER_AGENTS_FILE (https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L23) and implement the function isUserAgentBot in https://github.com/pkp/pkp-lib/blob/master/classes/core/Core.inc.php#L100. The function is then used when the log files are processed (https://github.com/pkp/usageStats/blob/master/UsageStatsLoader.inc.php#L170). We should define a strategy for when we fetch the most recent version of the list.
Because R5 now supports/counts abstract views (in the total views count), shall we consider the galley view pages too?
SUSHI support is mandatory for compliance with COUNTER Release 5 (see https://www.projectcounter.org/wp-content/uploads/2019/05/Release_5_TechNotes_PDFX_20190509-Revised.pdf).
Which reports would we need/like to support/provide: AR1, Journal Master Report, X?
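
To make the proposed double-click rule from a) concrete, here is a minimal Python sketch (not the usageStats plugin code). It assumes a hypothetical `Hit` record keyed by user and URL, treats hits at most 30 seconds apart as one chain, and keeps the later hit of each chain, which is my reading of section 7.2; the boundary handling (`<=`) is also an assumption.

```python
from dataclasses import dataclass

DOUBLE_CLICK_WINDOW = 30  # seconds, applied to every file type


@dataclass(frozen=True)
class Hit:
    user: str       # hypothetical user key, e.g. IP + user agent
    url: str        # the exact URL requested (not assocType + assocId)
    timestamp: int  # seconds since epoch


def filter_double_clicks(hits, window=DOUBLE_CLICK_WINDOW):
    """Collapse repeated hits on the same URL by the same user."""
    kept = []
    last_index = {}  # (user, url) -> position of the latest kept hit
    for hit in sorted(hits, key=lambda h: h.timestamp):
        key = (hit.user, hit.url)
        idx = last_index.get(key)
        if idx is not None and hit.timestamp - kept[idx].timestamp <= window:
            # Double click: drop the earlier hit, retain the later one.
            kept[idx] = hit
        else:
            last_index[key] = len(kept)
            kept.append(hit)
    return sorted(kept, key=lambda h: h.timestamp)
```

With hits on /a at 0, 10 and 35 seconds, only the one at 35 survives, while a hit on a different URL in between is unaffected.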

NateWr commented 2 years ago

Thanks @bozana. This is really hard to get my head around, but it sounds complicated. I didn't fully understand the issue with Section_Type. Does this mean that COUNTER wants different statistics for monographs vs edited volumes? I think this will be difficult for us to deliver reliably. We can use the workType property to make a distinction, but some monographs are distributed with separate chapter landing pages/files and some edited volumes are distributed with just the book file. In other words, the workType property will not produce accurate statistics according to Section_Type, if I understand you correctly.

If it's important that we implement Section_Type accurately then I think we need to defer OMP's chapter-level statistics until we can figure out what to do on the UX side to encourage the correct handling of this. That would be after 3.4.

Hopefully that's not the case and we can continue anyway. If I understand correctly, we have the following possible statistics:

Object	Landing Page	File
Book	request	investigation
Chapter	request	investigation

We are really trying hard to move away from the use of assoc_type and assoc_id columns. What about using two different tables for book and chapter statistics? The book statistics could use the same table as OJS. For chapter statistics, OMP can add a new table.

bozana commented 2 years ago

I have just spoken with @NateWr, so to summarize it here: COUNTER R5, as far as I understand:

does not require the statistics on the chapter level (for each chapter) -- it does not require the IR report for the Host_Type = eBook,
it needs a distinction (the Section_Type attribute above) if a book contains chapter statistics altogether/in total.

Thus we would keep the COUNTER DB tables on submission level and would provide 6 additional metric types (4 for chapter items metrics and 2 for title metrics) in those tables (and on submission level). The chapter_id will be added to the DB table metrics_submission that is used for our internal statistics. So some statistics on the chapter level (for each chapter) would be possible. This DB table would contain only the total numbers (the unique numbers are not there). @NateWr, have I forgotten something? :thinking:

NateWr commented 2 years ago

Just to clarify with an example, because this was confusing for me. Take the following metrics for a book and its chapters:

Book: 5
Chapter 1: 5
Chapter 2: 5

The COUNTER report we return will have statistics that show 15 for the total book metric, 10 for the total chapter metric, and 5 for only the book metric (ie - the book landing page and book-wide files, not including any requests to chapters).
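
The arithmetic in this example can be written out as a small Python sketch. The function and key names are hypothetical and only illustrate how the three numbers relate; they are not the COUNTER metric names.

```python
def counter_totals(book_only, chapters):
    """Derive the three figures from the example above.

    book_only: hits on the book landing page and book-wide files
    chapters:  mapping of chapter label -> hits on that chapter
    """
    chapter_total = sum(chapters.values())
    return {
        "total_book": book_only + chapter_total,  # book plus all chapters
        "total_chapter": chapter_total,           # chapters only
        "book_only": book_only,                   # excludes chapter requests
    }


print(counter_totals(5, {"Chapter 1": 5, "Chapter 2": 5}))
```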

We will still support per-chapter statistics, but not in COUNTER reports. These statistics will only be available in the default OJS statistics.

bozana commented 2 years ago

Hi all, above all @NateWr and @ctgraham :-) We will implement the SUSHI API in the core, not as a plugin. The namespace would be stats/sushi/. The API always requires customer_id parameter/filter. This would be our institution_id, as we said, and in that case we would provide the stats for that institution. But what about when the report request is 'globally scoped' (e.g. for OA stats)? What would customer_id be then? -- I cannot find anything in the documentation :-P Shall we define it to be 'null' or 'oa' or 'anonymous' or 'world' (as the corresponding name used in the reports) or ? Or just make it optional, so that if it is omitted the request and reports would be 'globally scoped'? How would the user (e.g. an institution) know which customer_id they should use? Would the journal manager need to tell them? We said above that the API (and thus the reports too) could be public, would you agree? Otherwise, who would be able to get the reports? It could be a user to whom we would provide an additional ID/API key... In that case the institutions and COUNTER should get a user for that... Thanks!!!

ctgraham commented 2 years ago

The open access customer_id can be anything we like. Presuming our institution_id will always be numeric, we could even say something like "any non-numeric value" will be interpreted as the anonymous open access request.

Alternately, my preferred values would be "none", "false", or "anonymous".

The institution would need to get the customer_id from the journal manager, unless we have some self-service UI for an institution to manage or pay for subscriptions in the future.

I support building the COUNTER/SUSHI API without authentication, at least at first.

bozana commented 2 years ago

Maybe one more comment/question: I am thinking that "platform" would be a journal i.e. a press i.e. a server. I.e. the SUSHI API endpoint is on the context (journal, press, server) level. @ctgraham, do you think this is OK? Maybe later and if needed we can see if we want to implement it all on the site level. This would maybe also mean different customer_id format -- to be 100% sure that the customer_id is unique it would maybe need to contain the context_id as well...

NateWr commented 2 years ago

The API always requires customer_id parameter/filter

I suppose the COUNTER group has defined its own specifications. But in terms of common conventions/best practice for REST APIs, there are two approaches depending on where the parameter appears. If it is a query parameter (?customer_id=), then it should be optional so that a request without this parameter is equivalent to "all". If it is a path parameter (/sushi/institutions/<customer_id>), then a request to the parent path (/sushi/institutions) is equivalent to "all".

I understand the SUSHI spec may have already decided not to follow this approach. But if this is permitted, that's the API design we would prefer.

If not, and we must have a value that represents a customer_id, then again I would suggest two options depending on whether it is a query or path parameter. If it is a query parameter, let's make 0 = all. It is easier to document, validate, read and use an API when the query parameters have a consistent type (integer). If it is a path parameter, then let's use the CONTEXT_ID_ALL constant. This is set to _ so that some endpoints can be reached at the site-wide level (eg - /_/api/v1/contexts) and we know this character is supported by the API router (a previous value, *, did not work in all cases).

the SUSHI API endpoint is on the context (journal, press, server) level

The API must work at the context level, because we need to support deployments where the contexts have no relationship to each other (ie - the host is not a publisher or anything). If a site-wide level is desired, that can be added as well. But at a minimum it needs to work at the context level.

bozana commented 2 years ago

The customer_id is a query parameter. I am not sure we can assume "all" when customer_id is omitted, because COUNTER defines an exception for that case -- so the COUNTER validator (which validates the implemented SUSHI API) might complain. On the other hand, I haven't read anything about how to proceed in the case of OA, when institutions are not tracked... I could possibly ask... And I will use "0" for now... (By the way, the COUNTER exceptions are more than terrible for me :persevere: -- how much one is required to provide... -- well, one could say "very precisely"... )
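
As a rough Python illustration of the convention settled on here (a required integer query parameter, with 0 meaning the globally scoped request), the following sketch parses customer_id from a request URL. The function name, the example URLs and the choice of `ValueError` are assumptions for illustration; the real API would answer with the matching SUSHI exception instead.

```python
from urllib.parse import parse_qs, urlparse

ALL_CUSTOMERS = 0  # "0" stands for the globally scoped (OA) request


def resolve_customer_id(url):
    """Return an institution ID, or None for the globally scoped request."""
    values = parse_qs(urlparse(url).query).get("customer_id")
    if not values:
        # Omitted parameter: treated as an error, not as "all".
        raise ValueError("customer_id is required")
    raw = values[0]
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"customer_id must be a non-negative integer: {raw!r}")
    customer_id = int(raw)
    return None if customer_id == ALL_CUSTOMERS else customer_id
```

Keeping the parameter an integer in every case follows the point above about a consistent type being easier to document and validate.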

bozana commented 2 years ago

Hi @NateWr and @ctgraham, The Title Reports consider the year of publication (YOP), for display (metrics aggregation) and filtering. What do you think would be the appropriate year/date of publication for a submission? Can I just take the date_published of the first submission version as the YOP for that submission? -- The first submission version and its date_published should be the one when the submission was first published, so somehow this sounds correct to me... at least for OJS... :thinking: I will need to see whether this is all the same in OMP and OPS...

NateWr commented 2 years ago

@bozana Yes, that's how we treat it in the UI as well. When a new version is released the date published remains the date of the first version.
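
A minimal Python sketch of that rule, assuming each version carries an optional publication date (unpublished versions have none) and that "first version" can be approximated by the earliest date; the function name is hypothetical.

```python
from datetime import date


def year_of_publication(version_dates):
    """Return the YOP: the year of the earliest date_published, if any.

    version_dates: date_published of each submission version, None when
    that version has not been published.
    """
    published = [d for d in version_dates if d is not None]
    if not published:
        return None  # never published, so no YOP
    return min(published).year


print(year_of_publication([date(2019, 11, 30), date(2021, 2, 1), None]))
```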

bozana commented 2 years ago

Hi all, (@NateWr and @ctgraham) I wanted to see a few more things and decisions with you:

SUSHI API is on the context (journal, press, server) level, as we said. In OJS we provide the Platform Master Report (PR), Title Master Report (TR) and Item Master Report (IR); in OMP, PR and TR; and in OPS, PR and IR. The TR in OJS is for a journal, in OMP for a book (a book = title).

Platform (in every report): I am not sure what would be Platform (it is used/displayed in every report), because the API works on the context level. I thought it would be the same as the context name (at least when only one journal exists on that OJS installation). We could possibly use the site name if several journals are hosted on one OJS installation, but the PR would still only show metrics for one journal. Would that make sense? For OMP we could use the context name -- in OMP title = book and usually only one press is hosted on one installation.

From the documentation (https://cop5.projectcounter.org/en/5.0.2/03-specifications/02-formats-for-counter-reports.html#report-body):

Identifies the platform/content host where the activity took place. Note that in cases where individual titles or groups of content have their own branded user experience but reside on a common host, the identity of the underlying common host MUST be used as the Platform.

Examples listed there are: EBSCOhost, ProQuest, ScienceDirect.

That sounds different, but because we provide reports only on the context level we also provide metrics only for that context, and institutions are connected to a context... I see that in the older COUNTER reports, version 4 and 3, "Open Journal Systems" is used for Platform. But I am not sure that would be correct... What do you think?

Publisher (used/displayed in TR and IR): This would be context->getData('publisherInstitution'). (I would need to double check if this is correct for OMP too). OK?

Proprietary_ID (used/displayed in TR and IR): This should be in the form: platform:ID. For an article or book (title) the ID would be the submission_id. For a journal (title) it could be the path, as it was in the old COUNTER reports (4 and 3). Or should we take the journal_id here as well?

Article_Version (used/displayed in IR): Allowed are: Accepted Manuscript, Version of Record, Corrected Version of Record, and Enhanced Version of Record. Thus I would always use VoR.

item_id (optional filter in TR and IR): This would be the submission_id for OJS IR. This would be the submission_id for OMP TR. But for OJS TR this filter (e.g. journal_id) does not make any sense, because the OJS TR contains the metrics for only one title = journal. Thus this filter could only contain one value. We would then not support that filter for OJS TR, correct? I.e. I would return the warning "Invalid ReportFilter Value" for all requested values other than the current journal.

include_component_details (optional attribute in IR): From the documentation:

Repositories often store multiple components for a given repository item. These components could take the form of multiple files or datasets, which can be identified and usage reported on separately in Item Master Reports.

Because we do not collect usage on components smaller than submission, we would not support this attribute, correct? I.e. I would return the warning "Invalid ReportAttribute Value" if the attribute is requested (set to true).
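
The two warnings proposed here could be collected roughly as in the Python sketch below. The function name and the `|` separator for multiple item_id values are assumptions. Code 3062 is named later in this thread; code 3060 for "Invalid ReportFilter Value" is my recollection of Appendix F of the Code of Practice and should be checked against it.

```python
def collect_warnings(params, current_journal_id):
    """Return SUSHI-style warnings for unsupported filter/attribute values."""
    warnings = []

    # OJS TR covers exactly one title, so any other item_id is rejected.
    item_ids = [v for v in params.get("item_id", "").split("|") if v]
    unknown = [v for v in item_ids if v != str(current_journal_id)]
    if unknown:
        warnings.append({
            "Code": 3060,  # assumed code, see lead-in
            "Severity": "Warning",
            "Message": "Invalid ReportFilter Value",
            "Data": "item_id=" + "|".join(unknown),
        })

    # Usage is not collected below the submission level.
    if params.get("include_component_details", "False") == "True":
        warnings.append({
            "Code": 3062,
            "Severity": "Warning",
            "Message": "Invalid ReportAttribute Value",
            "Data": "include_component_details=True",
        })
    return warnings
```

Returning warnings instead of failing lets the report still be produced, which matches the intent of answering with a warning rather than an error.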

I hope this is everything, at least for now :-) -- There is always something new I need to ask/double check... :-\

NateWr commented 2 years ago

what would be Platform

I would suggest making a setting in the site settings area (for admin). There they can choose to make the platform the journal (default) or the site. It should be clear that this is for the SUSHI settings. Suggested wording:

Platform for SUSHI Statistics
This software provides usage statistics according to the SUSHI protocol. This protocol requests that a platform be specified. By default, the journal will be designated as the platform for all statistics. However, if all of the journals on this site are published, owned or operated by the same provider, you may wish to designate the site as the platform.

[X] Use the journal name as the platform for each journal.

[ ] Use the site name as the platform for all journals.

Article_Version (used/displayed in IR)

If this is optional, I think we should skip it. We have an open issue to handle version types at #4860. If not, then the VoR sounds ok.

Everything else sounds good to me. :+1:

ctgraham commented 2 years ago

I was about to agree that in a multijournal installation our Platform is the Site Name, but Nate's suggestion is more flexible.

bozana commented 2 years ago

Sorry to bother you again, @NateWr and @ctgraham :-( The proprietary IDs (for institutions, titles (journal, book, server) and items (articles)) should be in the form {namespace}:{value}. From the documentation:

A proprietary ID assigned by the content provider for the item being reported on. Format as {namespace}:{value} where the namespace is the platform ID of the host which assigned the proprietary identifier.

If the user selects journal name as platform this platformID could be the journal/context path or initials or ID. But what if the user selects site name as platform? Shall it maybe then be 0? I don't know how picky COUNTER is but maybe we could always use context path/initials/ID as that namespace for the proprietary ID, and independent of what user selects for the platform? I haven't seen that platformID used/provided anywhere else, only in those proprietary IDs... :thinking:

NateWr commented 2 years ago

When the user selects the option to use the site as the platform, show an additional field for a Platform ID. You can use the showWhen property in a field to only show this field when the site option is selected. See an example and documentation of the showWhen prop.

bozana commented 2 years ago

Thanks a lot!!! :-)))

bozana commented 2 years ago

I found an explanation about PlatformID (https://cop5.projectcounter.org/en/5.0.2/11-extending/01-platform-as-a-namespace.html):

The platform ID MUST only contain ASCII letters (a–z, A–Z), digits (0–9), underscores (_), dots (.) and forward slashes (/), and the length MUST NOT exceed 17 characters. Note that the platform ID is used in various columns and therefore should be as short as possible, but still recognizable. The platform ID usually should be based on the name, a well-known unique abbreviation or the domain of the publisher or platform. A short standard identifier like GRID or ROR (without the https://) also could be used. COUNTER will assign the platform ID when adding the platform to their Registry of Compliance (content providers can suggest a value to be used for their platform ID). Other organizations providing COUNTER reports, such as consortia or ERM providers, may contact COUNTER to register a namespace if they desire to create extensions and customizations. COUNTER will maintain a list of approved namespaces.

So it sounds good to have that option to enter a platform ID...
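
The constraints quoted above are easy to check when the admin enters a platform ID. A minimal Python sketch, with a hypothetical function name, that tests only the character set and the 17-character limit from the quoted text:

```python
import re

# ASCII letters, digits, underscores, dots and forward slashes; 1-17 chars.
PLATFORM_ID_PATTERN = re.compile(r"[A-Za-z0-9_./]{1,17}")


def is_valid_platform_id(value):
    """Check a platform ID against the rules quoted from the COP."""
    return PLATFORM_ID_PATTERN.fullmatch(value) is not None


for candidate in ("pkp.sfu.ca", "my-journal", "a" * 18):
    print(candidate, is_valid_platform_id(candidate))
```

Note that a hyphen is not in the allowed set, so a context path such as my-journal would not qualify as a platform ID without modification.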

bozana commented 2 years ago

I've just found this point (https://cop5.projectcounter.org/en/5.0.2/10-compliance/03-counter-reporting-for-consortia.html):

COUNTER acknowledges that some organizations treat their usage data as sensitive and private information. Content providers may include the option for consortium members to opt-out of consortium reporting. COUNTER recommends the default setting for an organization is to opt-in to consortium reporting.

It is for a consortium and its members, but eventually we would need/like to implement the opt-out for institutions? (Because our reports are OA.) Maybe not now but in the future... :-)

bozana commented 2 years ago

And: we do need to provide TSV too, which would mean GUIs for users to select filters and attributes. Also the report elements are ordered, structured and provided differently in the TSV :-P But, I would love not to think about this TSV for now :-(

NateWr commented 2 years ago

eventually we would need/like to implement the opt-out for institutions?

I'm not sure I understand what you mean. But in my opinion we should always allow opt-out of any public stats, just like we allow opt-out of Beacon and OAI.

I would love not to think about this TSV for now

I don't know what TSV is but let's not worry about it.

bozana commented 2 years ago

I will add the task to opt out of public stats (at different levels) above and will address it later. TSV is tab-separated values -- the reports that the user can see/import in something like Excel.

bozana commented 2 years ago

I would now introduce the assoc type ASSOC_TYPE_SUBMISSION_FILE_COUNTER_OTHER (i.e. the differentiation "is primary" file) in OMP. See https://github.com/pkp/pkp-lib/issues/3541. Back then I was not sure how it all should be for OMP and COUNTER. But now that the new metric types have been introduced by COUNTER, I believe files of those supplementary genres should only be considered as investigations and not as requests, also because there are the genres manuscript and chapter in OMP that would define the requests. Do I see it correctly? @ctgraham?

ctgraham commented 2 years ago

Per this statement:

An investigation is tracked when a user performs any action in relation to a content item or title

https://www.projectcounter.org/release-5-understanding-investigations-and-requests/

I concur that access to the supplemental material (appendix, bibliography, glossary) counts as an investigation, but not a request. Such material corresponds to, for example, "View Cited References" in the graphic illustration.

bozana commented 2 years ago

Hi all, I am now adapting the OPS with changes so far and I am at the COUNTER R4 report plugin. The plugin does not function, so I am wondering if it was used at all. @ajnyga, do you maybe know?

The following item and host types are specified there:

define('COUNTER_LITERAL_PREPRINT', 'Preprint');
define('COUNTER_LITERAL_SERVER', 'Server');

However, in COUNTER R5 it should be 'Repository_Item' and 'Repository'. Thus, I will change it to fit with R5...

EDIT: Hmmm... Now I see that in R4 there was no such host type Repository :-P and appropriate reports. The current R4 OPS plugin provides Journal reports... So I will leave the item and host types as is -- and thus not functioning -- till we figure out how this should be ... Or, maybe the plugin should be removed from OPS?

bozana commented 2 years ago

Hi all, again a few comments... I contacted COUNTER support but did not get any answer :-(

IR in OPS: The Host_Type is 'Repository' and Data_Type is 'Repository_Item'. COUNTER R5 does not provide appropriate 'Article_Version' for preprints:

ALPSP/NISO code indicating the version of the work. Possible values are the codes for Accepted Manuscript, Version of Record, Corrected Version of Record, and Enhanced Version of Record.

Thus I will leave VoR here as well. The 'Parent' is not known, so I will proceed in the same way as for 'Component' in IR in OJS and will return a 3062 warning (Invalid Report Attribute Value).

OMP: If there is a mismatch of the following requested parameters section_type=Chapter&metric_type=Unique_Title_Investigations, currently an empty 'Report_Items' is returned. I am not sure what other warning I could use, maybe also 3062 (Invalid Report Attribute Value)?

Generally: The way some attributes' values are written in the SUSHI documentation and COP differ -- in COP they begin with the capital letter while in SUSHI documentation they contain all small letters, e.g. granularity (values: month/Month and totals/Totals), include_parent_details (values: true/True and false/False) and include_component_details (values: true/True and false/False). For those I will use the first capital letter, as listed in COP and used for all other parameters.

The requested 'begin_date' and 'end_date' can be in the form yyyy-mm-dd. If a date is in the middle of a month, I will consider the whole month -- because the metrics are expected to be and also saved monthly.

ctgraham commented 2 years ago

In the case of an unsupported date (e.g. date is in the middle of the month, but we only save metrics at the month level), we should either:

Respond with a 3020 SUSHI Exception (Invalid Date Arguments), or
Respond with a custom 1-999 SUSHI Exception to note that we are sending data different than what was requested.

https://www.projectcounter.org/appendix-f-handling-errors-exceptions/

bozana commented 2 years ago

I then use the custom exception code 1 as warning, so that the processing can still take place:

"Code":1,
"Severity":"Warning",
"Message":"Wrong Requested Dates",
"Data":"Requested date ({$beginDate} or {$endDate}) is in the middle of the month; however, only whole months are considered."
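
A minimal Python sketch of this behaviour, assuming both dates arrive as yyyy-mm-dd strings: the range is widened to whole months and the custom warning (code 1) is attached only when a date had to be adjusted. The function name is hypothetical; the warning fields mirror the ones above.

```python
import calendar
from datetime import date


def to_whole_months(begin_date, end_date):
    """Widen a yyyy-mm-dd range to whole months, warning if it changed."""
    begin = date.fromisoformat(begin_date)
    end = date.fromisoformat(end_date)
    first = begin.replace(day=1)
    last = end.replace(day=calendar.monthrange(end.year, end.month)[1])

    warnings = []
    if first != begin or last != end:
        warnings.append({
            "Code": 1,
            "Severity": "Warning",
            "Message": "Wrong Requested Dates",
            "Data": f"Requested date ({begin_date} or {end_date}) is in the "
                    "middle of the month; however, only whole months are "
                    "considered.",
        })
    return first, last, warnings
```

Because the severity is a warning, the report for the widened range can still be returned alongside it.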

bozana commented 2 years ago

Hi @NateWr and @ctgraham,

I have a few questions again:

I would need to somehow save/know the date when the COUNTER R5 metrics start to be available -- I would need to know the begin_date for R5 reports. For the new installations where there was no usage earlier, it could be the installation date. But for the users that upgrade to a release > 3.4 (and that have some usage earlier, that I assume everybody would have) the R5 begin_date should start with the next month of the upgrade -- because R5 reports are monthly, but also because the user should be able to only use one reporting system to get the correct/whole numbers for a month (i.e. in that case the user would get the usage of the upgrade month using R4 reports).

The next question would be: We continue to provide the R4 reporting, but with release > 3.4 we consider 30 seconds for double clicks (as defined in R5), instead of 10 seconds for double clicks for HTML files (as defined in R4). Do you think this is acceptable/ignorable?

And the next question would be: What to do when a user reprocesses some old (< 3.4 release) log files? We would again reprocess them considering 30 seconds for double clicks as defined in R5, so that the numbers in old R4 reports could possibly change slightly for HTML access. Also what to do with the R5 begin_date? -- The user could reprocess only a few old months, and there could be a gap between those reprocessed months and the actual R5 begin_date.
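
The begin_date proposal from the first question above can be sketched in Python as follows. The function name and the is_upgrade flag are hypothetical; the rule itself (installation date for new installs, first day of the month after the upgrade otherwise) is the one proposed here, not a COUNTER requirement.

```python
from datetime import date


def r5_begin_date(event_date, is_upgrade):
    """Proposed start of R5 reporting.

    event_date: installation or upgrade date
    is_upgrade: True when earlier usage exists and R4 covers the upgrade month
    """
    if not is_upgrade:
        return event_date
    if event_date.month == 12:
        return date(event_date.year + 1, 1, 1)
    return date(event_date.year, event_date.month + 1, 1)


print(r5_begin_date(date(2022, 12, 14), is_upgrade=True))
```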

So: A) I hope/think the 30 vs. 10 seconds for HTML files double clicks is ignorable (@ctgraham, what would you say?): Generally see this chapter of the COUNTER Code of Practice: https://cop5.projectcounter.org/en/5.0.2/13-transitioning/index.html. Here https://cop5.projectcounter.org/en/5.0.2/13-transitioning/02-transitioning-to-a-new-code-of-practice.html it says:

Support the prior-release reports on the new reporting service. This may involve using the metrics from the new release to produce reports formatted to the prior release; or it may involve logging additional data to the new reporting service such that the prior release reports can continue to be supported.

Here https://cop5.projectcounter.org/en/5.0.2/13-transitioning/03-transitioning-from-counter-r4-to-r5.html it also says:

Content providers may choose to meet the requirement to provide R4 reports based on R5 metrics.

So it does look to me like 30 vs. 10 seconds are not relevant?

B) Regarding the R5 begin_date and old log file reprocessing, maybe: Save the counterR5InstallDate in the site_settings and journal/press/server_settings table on installation and on upgrade. If an old log file is reprocessed, we would save the metrics in the COUNTER (R5) tables, but the R5 begin_date would not change. If the user reprocesses the old log files so that the R5 begin_date can change -- for example all months of the last year -- so that there is no hole between them and the actual R5 begin_date -- the user can edit/change the R5 begin_date in the statistics settings (I don't know how else I could for sure know that the begin_date changed). I think we cannot just consider the oldest month in the COUNTER tables as begin_date, because: if there is a hole between a reprocessed old log file and the actual R5 begin_date the report user would not know (and we also cannot determine it) if there is no usage for those months missing or if the log files are not (re)processed yet or ... -- and COUNTER requires it to be reported as warnings, e.g. No Usage Available for Requested Dates, Usage Not Ready for Requested Dates, Usage No Longer Available for Requested Dates, Partial Data Returned (s. https://cop5.projectcounter.org/en/5.0.2/appendices/f-handling-errors-and-exceptions.html)

What do you think? Thanks a lot!!!

p.s. I still haven't started to work on the tool for the old log file migration and reprocessing, but I hope/assume this will be somehow possible... :-)

NateWr commented 2 years ago

For the begin_date, we have a record of upgrade dates in the versions table that we can use. It doesn't account for reprocessed statistics but it does tell us when an upgrade occurred.

I will defer to @ctgraham about handling R4 vs R5 with reprocessed logs. However, I think it should be the sysadmin's responsibility to ensure that the dates line up with upgrades, that there are no gaps in coverage, etc. Reprocessing old log files should be a specialist task, undertaken to recover lost or malformed data.

bozana commented 2 years ago

@ctgraham, do you maybe know if there is a COUNTER rule/suggestion on how to deal with deleted publication objects? For example, if a submission is deleted, do we need to keep all the stats connected to that submission? -- If we keep the stats, we would also need to keep the other information for the reports. If we do not keep the stats, the numbers would change for the months when the submission existed. Or is it maybe enough just to inform about our deletion policy? Thanks a lot!

ctgraham commented 2 years ago

My personal reaction, not from the COUNTER COP, is that I would expect to see a deleted published object represented by a tombstone in services like COUNTER and OAI-PMH.

@bozana , I'll copy you on an email to the COUNTER folks to ask this question.

bozana commented 2 years ago

Thanks a lot @ctgraham! Let's see what they say...

bozana commented 2 years ago

Hi all, COUNTER support said that the usage stats metrics need to be retained: COUNTER Code of Practice requires providers to maintain records for the current year plus the prior 24 months of usage. That would mean that the provider needs to retain usage records for the deleted artefact, ideally linked to a unique identifier. The information should be retained for as long as the provider would typically keep their usage stats, so if they go back deeper than the current-plus-24 we require, metrics for the deleted item should similarly be retained. I'll summarize what that means for us, the solutions I can think of... so that we can decide...

bozana commented 2 years ago

Concerning the old usage stats that are in the old table metrics: Because we need to keep COUNTER R4 reports for at least the current and the last 2 years (counted from the moment the user installs the new version with COUNTER R5), we would need to keep the old data, i.e. also the old data that contains references to deleted objects for which we do not have any information. COUNTER R4 reports do not need any further information about the deleted objects, so those 'ghosts' are OK for these reports. That means either a) we would need to keep the 'dead' references to the deleted objects for which we do not have any information when migrating the stats data from the old table metrics, which would mean that we cannot use foreign key constraints, at least not in the metrics_submission table, or b) we could possibly keep the usage stats on the submission level in the old DB table metrics and use them only for the COUNTER R4 reports. Our internal reports would use the metrics from the new DB table metrics_submission, where the 'dead' references would not exist. Those numbers could then differ, but... (Then there is also the question of how and when we can say that we no longer want to support COUNTER R4 -- because it depends on when the user installs the version with COUNTER R5, but maybe this is a task the user should take care of...)

This all also means that the old log files can never be reprocessed so that they are used for COUNTER R5 reports. They could be reprocessed for the internal reports, but will never be used for R5 reports.

Concerning the new usage stats that are then used for R5 reports: here we need to consider two cases, depending on whether we want to provide log file reprocessing or not. If we want to provide log file reprocessing, we need to consider the deletion of both submission files and submissions. If we do not care about log file reprocessing, we only need to take care of the deletion of submissions. This is because we need the file downloads to calculate the statistics on the submission level, which are then kept in the DB. For the reports we need the submission authors, IDs, title, date published and YOP, i.e. this is the information we would need to keep after deleting a submission in order to provide the R5 reports. Depending on what we decide, we might not be able to use foreign keys, at least in the counter_submission... tables, but maybe also in the temporary tables where the stats calculation happens. Here, I have no idea what and how to decide... :-\ Maybe one solution is also to inform the user what deletion means for the COUNTER reports, so that a user who wants and needs to be COUNTER compliant would then decide not to delete anything, while the others (I suppose most of them) could delete the objects (knowing that the numbers in the reports would change).
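To make the list of fields concrete, here is a minimal Python sketch of a record that could be kept after a submission is deleted. The class and field names are hypothetical and do not correspond to an existing OJS/OMP/OPS table or class. YOP is derived from the publication date here, which is also an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class DeletedSubmissionSnapshot:
    """Hypothetical record of what an R5 report still needs after deletion."""

    submission_id: int
    title: str
    authors: Tuple[str, ...]
    date_published: date

    @property
    def yop(self) -> int:
        # Assumption: YOP (year of publication) is the year of date_published.
        return self.date_published.year


def report_row(snapshot: DeletedSubmissionSnapshot, total_requests: int) -> dict:
    """Combine the retained metadata with a stored usage count."""
    return {
        "Item_ID": snapshot.submission_id,
        "Item": snapshot.title,
        "Authors": "; ".join(snapshot.authors),
        "Publication_Date": snapshot.date_published.isoformat(),
        "YOP": snapshot.yop,
        "Total_Item_Requests": total_requests,
    }
```

With such a record the usage rows would reference the snapshot rather than the live submission, which is one way the foreign key question above could be approached.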

@asmecher, @NateWr and @ctgraham, I think I need help here to figure out the best/possible solutions.

NateWr commented 2 years ago

It's up to you @bozana. But in my opinion we should split things like this into a separate issue for future improvements.

bozana commented 2 years ago

Thanks a lot @NateWr! I would love to do so, because these are already such big changes. And I believe it is not a 'big deal' for the new statistics and R5. The user could first take care of that themselves (e.g. by not deleting anything already published) before we implement another solution. But what to do with the non-existing objects from the old table metrics? Alec, Nate and I agree that we could/should migrate only those entries whose objects still exist, and keep the foreign keys in the new metrics tables. We could leave the entries for non-existing objects in the old DB table metrics so that they are not lost forever. @ctgraham, do you think it would be OK for R4 reports to live without those 'ghost' numbers? -- I know that COUNTER said differently, but... the practice/how the reports are used is maybe slightly different... ? I could eventually try to somehow add those 'ghost' numbers to the R4 reports (in the COUNTER plugin) :thinking: but would love not to have to do that...
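The migration idea described here (copy only rows whose submission still exists, keep the foreign key, leave the 'ghost' rows in the old table) can be sketched with an in-memory SQLite database. This is only an illustration in Python: the table names metrics and metrics_submission come from this thread, while the submissions table and all column names are simplified assumptions and not the real schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny demo schema; columns are simplified assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE submissions (submission_id INTEGER PRIMARY KEY);
        CREATE TABLE metrics (submission_id INTEGER, day TEXT, metric INTEGER);
        CREATE TABLE metrics_submission (
            submission_id INTEGER NOT NULL
                REFERENCES submissions (submission_id),
            day TEXT NOT NULL,
            metric INTEGER NOT NULL
        );
        INSERT INTO submissions VALUES (1), (2);
        -- Submission 99 was deleted, so its row is a 'ghost'.
        INSERT INTO metrics VALUES
            (1, '2021-01-01', 5), (2, '2021-01-01', 3), (99, '2021-01-01', 7);
    """)
    return conn


def migrate_existing(conn: sqlite3.Connection) -> int:
    """Copy only rows whose submission still exists; return rows in new table."""
    conn.execute("""
        INSERT INTO metrics_submission (submission_id, day, metric)
        SELECT m.submission_id, m.day, m.metric
        FROM metrics m
        WHERE EXISTS (
            SELECT 1 FROM submissions s WHERE s.submission_id = m.submission_id
        )
    """)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM metrics_submission").fetchone()[0]


def count_ghosts(conn: sqlite3.Connection) -> int:
    """Rows left only in the old table because their submission is gone."""
    return conn.execute("""
        SELECT COUNT(*) FROM metrics m
        WHERE NOT EXISTS (
            SELECT 1 FROM submissions s WHERE s.submission_id = m.submission_id
        )
    """).fetchone()[0]
```

The old table is never modified, so the 'ghost' numbers stay available there for anything that still wants to read them.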

bozana commented 2 years ago

Hi @ctgraham, it seems that the SUSHI item_id parameter (for IR) means only one item, correct? I.e. in our case one would need to provide a single submission ID there (and not several separated with '|')... ? s. https://app.swaggerhub.com/apis/COUNTER/counter-sushi_5_0_api/5.0.2#/default/getReportsIR.

ctgraham commented 2 years ago

I agree with your reading.

bozana commented 2 years ago

The COUNTER warning 3061 sounds different, as if there could be more than one item_id (s. https://cop5.projectcounter.org/en/5.0.2/appendices/f-handling-errors-and-exceptions.html):

A filter element includes multiple values in a pipe-delimited list; however, the supplied values are not all of the same scope (e.g., item_id filter includes article level DOIs and journal level DOIs or ISSNs).

But, for now, I will leave it so that only one (first in the list in case of a list) is considered and I provide our custom warning if there are several requested:

'Code' => 3,
'Severity' => 'Warning',
'Message' => 'Wrong Item_Id Value',
'Data' => 'Parameter item_id contains more than one value, however only one is accepted. The following values are not considered: item_id={$itemIdValues}'

I also expect a numeric value for the item_id, i.e. the submission ID, and provide our custom warning if it is not a numeric value:

'Code' => 2,
'Severity' => 'Warning',
'Message' => 'Invalid Item_Id',
'Data' => 'Requested item ID ({$itemId}) is not a number. This parameter expects a submission ID.'

I could change it so that multiple item IDs are considered any time...
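The handling described above (take only the first value of a pipe-delimited item_id, warn about the rest, and warn if the value is not numeric) could be sketched as follows. This is a Python illustration, not the actual PHP implementation: the function name is hypothetical, and joining the ignored values with '|' in the warning text is an assumption. The codes and messages are the ones given in this comment.

```python
from typing import List, Optional, Tuple


def parse_item_id(raw: str) -> Tuple[Optional[int], List[dict]]:
    """Return (submission ID or None, list of warnings) for an item_id value."""
    warnings: List[dict] = []
    values = raw.split("|")
    first, rest = values[0], values[1:]

    if rest:
        # Only the first value is considered; the others are reported back.
        warnings.append({
            "Code": 3,
            "Severity": "Warning",
            "Message": "Wrong Item_Id Value",
            "Data": "Parameter item_id contains more than one value, however "
                    "only one is accepted. The following values are not "
                    "considered: item_id=" + "|".join(rest),
        })

    # The parameter is expected to be a submission ID, i.e. ASCII digits only.
    if not (first.isascii() and first.isdigit()):
        warnings.append({
            "Code": 2,
            "Severity": "Warning",
            "Message": "Invalid Item_Id",
            "Data": f"Requested item ID ({first}) is not a number. "
                    "This parameter expects a submission ID.",
        })
        return None, warnings

    return int(first), warnings
```

For example, `parse_item_id("12|13|14")` would use submission 12 and return one warning with code 3 that lists 13 and 14 as not considered.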

bozana commented 2 years ago

Hi @ctgraham, @NateWr and @asmecher, Alec asked whether access to the SUSHI reports should be restricted to managers, admins and subscription managers, which led me to want to reconsider this with you: We said the SUSHI API/reports should be available without restrictions, but with the possibility for an institution to opt out. Thus, currently, any (registered) user can access the API and all reports. (I even think we said they could be publicly available, but our API is for registered users only -- I would need to see if it can be made public, if needed).

Maybe first: would opt-out mean opt-out from having reports accessible by any user, or from tracking and collecting stats, or both options?

Then: if the institution does not want its stats to be accessible, i.e. let's say only a user connected to that institution should be able to access the stats reports, we would need some kind of user-institution relationship. This relationship exists/can be created with institutional subscriptions, I think. How to solve that for OA journals? Or is this not relevant for OA journals at all, i.e. can we leave the access open to everybody? -- OA journals would optionally manage institutions... So maybe it should be: admin, manager and subscription manager have access to everything, and the appropriate user has access to their own institution's stats? If we do restrict the access: what about COUNTER -- what would a journal need to do to allow them access? Would they need access to everything?

And finally: what about subscription journals? Are we seeing them differently than OA journals, i.e. if the access to institutional reports is not a problem for OA journals, is it a problem here? I.e. shall we restrict the access here as said above?

ctgraham commented 2 years ago

I strongly prefer a first round of:

reports are publicly available (no login) by default. (auth is an optional implementation in the SUSHI spec)
Opt-out is available to turn off the SUSHI endpoint, stats are still collected. (This aligns with the COUNTER expectation that a formal process is needed for compliance, and some may want to complete that process before making reports available).
non OA journals can turn off public access if desired, but the default should be to expose the stats. (Like the default of exposing OAI-PMH)

Future development can tackle the association of COUNTER/SUSHI access to specific Institutional or Individual subscriptions by login.

bozana commented 2 years ago

Thanks a lot @ctgraham! I've merged the work so far, but next I will need to see how to make the API public, and to implement the opt-out for institutions and the 'turn-off' for subscription journals...

bozana commented 2 years ago

Hi @ctgraham, now I am back at this task again. I have hopefully found a way to make the SUSHI API public, but I would like to double check the other two points with you:

Opt-out is available to turn off the SUSHI endpoint, stats are still collected...

Do I understand it correctly: an institution will have an opt-out possibility. If an institution has opted out, the SUSHI API will act as if that customer ID did not exist, i.e. it would return COUNTER error 1030 "Invalid customer_id."? Or?

non OA journals can turn off public access if desired...

Would this be correct/OK: If the non OA journal turns the public access off, the API can still be accessed by the admin, managers and subscription managers?

Thanks a lot!

ctgraham commented 2 years ago

Opt-out: I suggest making the SUSHI endpoint public by default, but able to be disabled in the journal configuration. That is, as a journal manager, I may not want to enable SUSHI publicly yet; I am still collecting statistics while applying for certification with COUNTER, and when my journal is certified as COUNTER compliant, I can turn on the SUSHI endpoint. (Maybe this is as easy as making it only available to admins and managers). I don't have a use-case for individual institutions (subscribers) opting out.

Public/signed-in access: Yes, I think it is good to leave the API available to logged in admins and managers if public access is turned off.
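The access rule agreed on here (public by default, and if a journal turns public access off, only logged-in admins and managers can still use the endpoint) can be sketched as a single check. This is a Python illustration with hypothetical role names and a hypothetical function name, not the real authorization code.

```python
from typing import Iterable

# Hypothetical role names; the application uses its own role constants.
PRIVILEGED_ROLES = frozenset({"admin", "manager"})


def can_access_sushi(public_enabled: bool, user_roles: Iterable[str] = ()) -> bool:
    """Decide whether a request may use the SUSHI endpoint.

    public_enabled: the journal-level setting (assumed to default to True).
    user_roles: roles of the logged-in user, empty for anonymous requests.
    """
    if public_enabled:
        return True
    return any(role in PRIVILEGED_ROLES for role in user_roles)
```

Statistics collection is independent of this check, so turning the endpoint off does not stop usage from being recorded.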

bozana commented 2 years ago

So you do not think it is necessary for an institution to be able to opt-out -- e.g. if they do not want their stats to be publicly available?

ctgraham commented 2 years ago

So you do not think it is necessary for an institution to be able to opt-out -- e.g. if they do not want their stats to be publicly available?

Not from the perspective of the ULS or our clients. And not from my personal perspective from a philosophical standpoint.

bozana commented 2 years ago

PRs for opt-out for public SUSHI API:
pkp-lib: https://github.com/pkp/pkp-lib/pull/8214
ojs: https://github.com/pkp/ojs/pull/3517
omp: https://github.com/pkp/omp/pull/1187
ops: https://github.com/pkp/ops/pull/338

bozana commented 2 years ago

@NateWr, could you please review the PRs above? They implement the general possibility for public APIs and the opt-out for the public SUSHI API as discussed here.

bozana commented 2 years ago

@NateWr, I changed the way we make the APIs public according to your and Alec's comments. Could you please take another look? @asmecher, maybe you would like to take a look too?

bozana commented 2 years ago

@ctgraham, the major functionality is in the main branch. It would be great and I would be very happy if you would like and have some time to take a look/test... but... no pressure, of course... :-)

pkp / pkp-lib

COUNTER Release 5 #6781