pkp / pkp-lib

The library used by PKP's applications OJS, OMP and OPS, open source software for scholarly publishing.
https://pkp.sfu.ca
GNU General Public License v3.0

IP location and institution service #6895

Closed: bozana closed this issue 2 years ago

bozana commented 3 years ago

This issue discusses and defines the solutions that can be used to detect user location and institution, so that they can be used in usage statistics.

IP location solution: By default we will use the free DB-IP lite city database (e.g. https://download.db-ip.com/free/dbip-city-lite-2021-05.mmdb.gz) and the MaxMind library https://github.com/maxmind/GeoIP2-php. The library will be provided via composer, i.e. together with our code. The Geo data and granularity (country and city+region) option for usage stats would need to be set up at the site level, and can be opted out of at the journal level. If Geo data is selected at the site level, the DB-IP lite city database will be downloaded (into the files/usageStats/ folder) and a monthly scheduled task provided for the DB update. The IP-to-Geo processing happens before the usage event is logged; then the anonymized IP address and the Geo data are logged (and cached). To anonymize the IP, it is hashed using a randomly created salt that changes every day. When the salt changes, the usage event log file as well as the IP-Geo data cache change too. The country and region ISO codes and the city name will be saved. See https://github.com/pkp/pkp-lib/issues/6782#issuecomment-853760292.
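
For illustration, a minimal sketch of how that lookup and anonymization could look with the GeoIP2-php reader -- the file paths, the salt storage, and the HMAC choice are assumptions here, not the final implementation:

use GeoIp2\Database\Reader;

// The DB-IP lite database is in MMDB format, so the MaxMind reader can open it.
$reader = new Reader('files/usageStats/dbip-city-lite.mmdb');

$ip = '128.101.101.101'; // example address
$record = $reader->city($ip);
$country = $record->country->isoCode;                // e.g. "US"
$region = $record->mostSpecificSubdivision->isoCode; // e.g. "MN"
$city = $record->city->name;                         // e.g. "Minneapolis"

// Anonymize the IP with a salt that is regenerated daily; only the hash
// (never the raw IP) is logged and used as the IP-Geo cache key.
$salt = file_get_contents('files/usageStats/salt'); // assumed salt location
$hashedIp = hash_hmac('sha256', $ip, $salt);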

IP institution solution: We will add new DB tables institutions and institution_settings containing the ID, ROR, and localized names. The DB table institutional_subscription_ip will be renamed to institution_ip so it can be used more widely; instead of subscription_id it will contain an institution_id column. The table institutional_subscriptions will gain a new column institution_id. The institution_id will be logged.
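
A rough migration sketch of those table changes, using the Laravel schema builder that pkp-lib already uses (column types and the use of renameColumn are assumptions, not the final migration):

use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Rename the subscription-specific IP table so it can be used more widely...
Schema::rename('institutional_subscription_ip', 'institution_ip');

// ...and point its rows at institutions instead of subscriptions.
Schema::table('institution_ip', function (Blueprint $table) {
    $table->renameColumn('subscription_id', 'institution_id');
});

// Institutional subscriptions then reference an institution.
Schema::table('institutional_subscriptions', function (Blueprint $table) {
    $table->bigInteger('institution_id')->nullable();
});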

Research: IP location data: Determining Geo data from an IP is not an easy task at all, so most services are not free. Someone who wants a 'perfect' (more accurate and extensive) service can pay for one.

API solutions (GDPR problem): Most of the solutions provide an API that a registered user can use for free up to ca. 10,000-50,000 requests per month, or for a price. But we cannot use/consider those APIs, because we would forward the user's IP address to them, which would require user consent. Databases: MaxMind data (which we were using too) is widely used (e.g. there seem to be plugins for WordPress, Drupal, ...) and apparently works well. There are also others, e.g. IP2Location, DB-IP, ... For registered users, MaxMind and DB-IP provide a free download of their lite database (which is less accurate), or a full database download for a price. IP2Location provides a commercial download. Examples: Matomo ("Google Analytics alternative that protects your data and your customers' privacy") provides 3 options for the user to choose from: 1) get the country from the user's language settings (which is not a great solution and very inaccurate), 2) MaxMind database download, and 3) DB-IP database download.

Institution data: Unfortunately the WHOIS service (including the whois command) cannot be used to get the institution and country either, because we would again send the user's IP address to those registries/services. DB-IP and MaxMind provide IP-to-ISP i.e. ASN (containing organization name) database downloads (lite as well as commercial versions). However, this cannot be used, because the data there is too general (see comment below). Examples: Looking at the PSI site (PSI is used by Knowledge Unlatched), no detailed information can be found on how this exactly works.

Conclusion: It seems we do not have many options because of the GDPR. Suggestions:

  1. Provide a default solution:
    • IP location: provide only (inaccurate) country data, either a) using the public database https://iptoasn.com/, b) from the user's language settings, c) if someone with more legal knowledge could double-check whether we could somehow use the whois or nslookup command, or d) is there any other idea? Do we want to provide that probably inaccurate, country-level-only mapping -- would it be valuable for the users?
    • Institution: a) require the subscription managers to enter the data in the right way, so that we can use our own subscription metadata (e.g. name), b) is there any other idea?
  2. Support a few other solutions that a user would need to setup/download on their own:
    • IP location: MaxMind and DB-IP databases. Any other (commercial)?
    • Institution: Maybe still look into how PSI functions?
NateWr commented 3 years ago

Is the ISP or ASN going to help us? Do universities or other institutions that journals want to track usage stats for have their own ISP/ASN?

bozana commented 3 years ago

Yes @NateWr, you are right -- I've just looked at a few IP addresses from a university in Berlin, and the entry there is general, i.e. the university is not recognizable. I could try a few more examples (maybe from other countries), but I assume it will be the same. That would mean that the only solution for the institutional subscription statistics would be to rely on the data entered by the managers in the system.

NateWr commented 3 years ago

That would mean that the only solution for the institutional subscription statistics would be to rely on the data entered by the managers in the system.

I'd recommend we think about something like a CSV import, which would let people quickly configure large data sets of IP range / institution matches and then import them. We may even want this data to be stored in a file somewhere rather than directly in the database. For example, a large service provider, like PSI, may want to maintain a central file that all of their clients could use when processing stats.
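
As a sketch of that import idea (the one-row-per-range CSV format and all names here are hypothetical, not an agreed format):

// Load a CSV of "ip_range,institution_name" rows into a lookup map.
$handle = fopen('institution-ip-ranges.csv', 'rb');
$ranges = [];
while (($row = fgetcsv($handle)) !== false) {
    [$cidr, $institution] = $row; // e.g. "142.58.0.0/16", "Simon Fraser University"
    $ranges[trim($cidr)] = trim($institution);
}
fclose($handle);

// Naive IPv4 CIDR match, for illustration only.
function ipInRange(string $ip, string $cidr): bool
{
    [$subnet, $bits] = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}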

asmecher commented 3 years ago

We may even want this data to be stored in a file somewhere

Yes, if we don't need relational integrity to be maintained and the lookup can be done without too great a performance hit, I'd favour keeping a lookup file on the filesystem. Having to sync a database table against a frequently-revised third-party data source gets painful!

bozana commented 3 years ago

Both the DB-IP and MaxMind lite databases come in either CSV or MMDB format. CSV is meant to be imported into the DB. MMDB is specified at https://maxmind.github.io/MaxMind-DB/, and the MaxMind library https://github.com/maxmind/GeoIP2-php is used to look up data in the MMDB file. Both lite databases are licensed under the Creative Commons Attribution 4.0 International License and require attribution to DB-IP.com or maxmind.com respectively (on the page that uses their data). The MaxMind attribution also needs to be on a public web page (e.g. About). To download the DB-IP lite database one does not need to be registered, while registration is required for the MaxMind lite database download. Both DB-IP and MaxMind provide a way (a package with scripts, or a wget API call) for automatic database updates (e.g. run as a cron job), which requires an ACCOUNT_KEY (obtained on registration). DB-IP lite databases are updated monthly; MaxMind lite databases are updated weekly.
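
Given the monthly URL pattern shown above, a scheduled update task could be as simple as the following sketch (the target paths, error handling, and reliance on allow_url_fopen and PHP's zlib stream wrapper are illustrative assumptions):

$month = date('Y-m'); // DB-IP lite databases are versioned by month
$url = "https://download.db-ip.com/free/dbip-city-lite-{$month}.mmdb.gz";
$gzPath = 'files/usageStats/dbip-city-lite.mmdb.gz';
$dbPath = 'files/usageStats/dbip-city-lite.mmdb';

// Download the compressed database.
$src = fopen($url, 'rb');
if ($src === false || file_put_contents($gzPath, $src) === false) {
    throw new Exception("Could not download {$url}");
}

// Decompress on the fly via the compress.zlib:// stream wrapper.
if (!copy('compress.zlib://' . $gzPath, $dbPath)) {
    throw new Exception("Could not decompress {$gzPath}");
}
unlink($gzPath);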

asmecher commented 3 years ago

Thanks, @bozana -- and thinking more about this, I don't think it's likely that we'll be able to avoid a database sync for performance reasons (unless there's a tool to e.g. compile out the database to PHP flat-files or something crazy like that). If the databases come with a toolset to sync the data locally, we should probably just follow those patterns, as long as they e.g. support both PostgreSQL and MySQL where that's needed.

bozana commented 3 years ago

@ctgraham and @mfelczak, what are the exact requirements for institution tracking, i.e. institutional usage stats? -- Would only subscription journals like to track/use that, or OA journals too? I am not sure we can rely on the current subscription data -- i.e. use the entered IP (ranges) and institution name (or later ROR) of an active institutional subscription. This would only work for subscription journals, and under the assumption that the managers enter the data correctly in the subscription form. Then there is also the question of whether institutional usage should be tracked for the OA content of a subscription journal too, e.g. landing pages and OA galleys. If institutional usage stats are useful/required for OA journals as well, then the journal managers would need to upload a file where IP (range) - institution name/ROR pairs are defined. I do not see any other possibility to get the institution name/ID from the IP...

NateWr commented 3 years ago

@ctgraham and @mfelczak in addition to the question about subscription or OA journals, we're still not clear on where the IP data comes from. For a journal that wants to track institutional usage stats, where do they get their IP ranges from (is it PSI or do they have their own IP range data)? What does this data look like (are there 3-5 institutions or hundreds/thousands)? How often is it updated? Would separate institutional IP data be managed for each journal or is this something installed/configured by the publisher/service provider?

These are important questions we need to answer before we can design a sensible approach. Registering usage activity against an IP range is easy enough but the tricky part will be how the system acquires/manages the IP-to-institution data mapping.

ctgraham commented 3 years ago

I'm imagining that for OA, the journal would need to set up no-cost IP-based subscriptions (or no-cost login-based subscriptions?) for the institutions they want to track, with IP ranges sourced from the institutions themselves. PSI / theIPregistry would be a likely source of bulk IP data.

mfelczak commented 3 years ago

Hi all, my understanding is that subscription and delayed open access journals receive IP ranges from the respective institutions, so the data can be assumed to be accurate at the time that the subscription record is created. Each journal manages its own institutional subscription records and the data is stored on a per-journal basis. Once entered, IP ranges are not typically updated or checked for accuracy. Occasional updates are made on an as-needed basis when new IPs need to be added to the subscription record, e.g. faculty/staff/students are unable to access the journal from a new IP and the editors are asked to review the subscription. A given journal may have 100+ institutional subscription records. In addition to universities and university libraries, these will also occasionally include nonprofits and other organizations that require access to the journal's content for their staff.

For OA journals that receive funding from non-subscription sources, grants, etc. my understanding is there is ongoing interest in tracking institutional usage as a basis to demonstrate institutional use, relevance, etc. as part of the journal's reporting and/or fundraising activities.

NateWr commented 3 years ago

Thanks @ctgraham and @mfelczak. So I think that we have a clear understanding of the use-case for subscription-based journals, as well as the small collection of journals piloting the subscribe-to-open business model. It sounds like the use-case for OA journals is less clear, and maybe based on assumptions rather than an actual journal or journals with a plan in mind.

This matters because subscription journals make up a tiny fraction of our community. I worry that we're designing the feature to optimize for the wrong use cases, just because those are the use cases we know the most about. I'd really like to know more about OA journals that want institutional usage stats -- what institutions they want to track, where the IP data comes from, etc.

bozana commented 3 years ago

As far as I could see, the following two issues were the original requirements for tracking institutional usage stats: https://github.com/pkp/pkp-lib/issues/3369 and https://github.com/pkp/pkp-lib/issues/2676. The first assumes an IP-institution service that does not currently exist in a form we can use -- the ISP/ASN databases do not provide the specific institution name. Thus this is similar to the problem the OA journals could have -- where journals get their IP-institution data from, and the concerns @NateWr mentioned. The second is the one we could maybe concentrate on for now (maybe keeping the first requirement in mind) -- the subscription journals already have the data, provided/entered in their OJS institutional subscriptions. Thus, we already have the IP-institution data. The next question is what kind of reports we need. As far as I could see, there is no COUNTER R5 report that contains an institution 'column/data'. @ctgraham, do you know of one? If so, do we actually want to help libraries, journal managers and institutional subscribers get an overview of the usage? If that is so, my suggestion would be to provide a solution for the subscription journals for now (keeping in mind that it could then be extended to support the OA journals too):

Later, if we find another, better source of IP-institution data that is useful for OA journals as well, we can replace the first step -- we would not look into the institutional subscription data, but into X. To make it simpler in the case of hybrid journals (having OA and closed content), we would log the institution if we find it in a subscription (or that other source X), no matter whether the content is OA or closed. (I see there are COUNTER R5 reports that contain 'Access_Type' ('Controlled' or 'OA_Gold', where OA_Gold is content that was/is/will always be OA). For that we could probably say all usage of an OA journal is Access_Type = OA_Gold and all usage of a subscription journal is Access_Type = Controlled. This might be different in OMP, and we will eventually need a column 'access_type' in the metrics table too... -- But that is rather the topic of https://github.com/pkp/pkp-lib/issues/6781.)

It seems there are some concerns about privacy when tracking (or reporting?) institutional usage at the article/file level, see https://github.com/pkp/pkp-lib/issues/6782#issuecomment-831801535. I am not able to decide this -- I just see that the requirement for it (institutional usage at the article/file level) seems to be there...

I would love to hear your comments and concerns... I may be overlooking something, or may not understand something well... It would be great if we could find/decide on a solution for now, or decide to defer it for later... Thanks a lot!

ctgraham commented 3 years ago

I don't think we'll find "institution" references in the columns or data of the COUNTER reports. Rather, the "institution" (or "customer") is part of the SUSHI filters on which the COUNTER report is based.

See: https://app.swaggerhub.com/apis/COUNTER/counter-sushi_5_0_api/1.0.0#/default/getConsortiumMembers

@bozana, I support your bullet-point suggestion. I think the OA use case is an OA journal using subscriptions to track and report access by IP (not sure whether this can be done in the current product), knowing that this is quite limited in scope.

One concrete OA use case which has been discussed previously is justification of the subscribe-to-open model, or evidence for a funding pitch. For example, a journal or journals collecting OA access by known institution IPs, and presenting that data to the institutions: "look at how you use our work, would you help to fund us?"

NateWr commented 3 years ago

Thanks @bozana and @ctgraham!

I think the OA use case is an OA journal using subscriptions to track and report access by IP

I think you may be correct. Although we might find a general interest from OA journals in knowing who is visiting them, there's not a strong use case for tracking this information outside of specific reporting requirements in which journals know the institutional IPs they want to track.

I'm wary of coupling this too closely with subscriptions for a few reasons.

  1. Subscriptions occur at the journal level. In the subscribe-to-open model, as well as other publisher models which report institutional usage to acquire funds, it will be common for stats to be collected and reported across a collection of journals. Since subscriptions are manually configured for each journal, we end up multiplying the configuration effort required. A publisher that has 30 journals and wants to track 30 institutions must manually configure 900 subscriptions. A change in the IP ranges of one institution will require 30 updates.
  2. The approach of logging the institution name in the metrics table will cause the metrics to report inaccurate counts when an institution's name is changed. In theory these won't change much, but in practice people make mistakes often, need to correct typos, etc.

We would add the additional column 'institution' to the DB table 'metrics'

I think that we're trying to push two distinct sets of data together, just to avoid creating a few more tables. We're only going to make things harder for ourselves down the road. Could we have a small institutions table like this:

Schema::create('institutions', function (Blueprint $table) {
    $table->bigIncrements('institution_id');
    $table->string('ror')->nullable(); // ROR IDs are alphanumeric, so a string column
});

And an institution_settings table for the name. We can convert the institutional_subscription_ip table to an institution_ip table so it can be used more widely. Then the institution_id can be assigned to a subscription as a foreign key. When adding or editing a subscription, we can create or update the associated institution records. We'd be able to reuse this table in the future for things like user affiliation ROR. And this data would persist even when a subscription is removed.
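
A possible shape for that companion settings table, following pkp-lib's usual *_settings conventions (the exact columns here are assumptions):

Schema::create('institution_settings', function (Blueprint $table) {
    $table->unsignedBigInteger('institution_id');
    $table->string('locale', 14)->default('');
    $table->string('setting_name', 255); // e.g. "name" for the localized institution name
    $table->mediumText('setting_value')->nullable();
    // The foreign key prevents removing institutions that still have settings/metrics.
    $table->foreign('institution_id')->references('institution_id')->on('institutions');
});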

By decoupling the subscription and institution data, we'd be able to create and manage institution data separately from subscriptions, so we wouldn't have to merge that code into pkp-lib (the whole UI structure here is complex and would need a lot of refactoring). We'd also be able to save usage stats data by institution_id, so that we don't have to put non-localized institutional names directly into the metrics table. By using a foreign key, we can prevent institutions from being removed when they have associated metrics data.

I also think it's worth considering whether we want to add these metrics to the existing metrics table, or whether we should have a separate table for institutional metrics. The metrics table is already enormous (several million rows for large installs), and this has the potential to expand it exponentially.

bozana commented 3 years ago

Thanks a lot @ctgraham and @NateWr. I see there are Institution_Name and Institution_ID -- the institution for which the usage is being reported ("The World" in the case of Gold_OA) -- in the header of the COUNTER reports, and the parameter for SUSHI, as @ctgraham said. So yes, the institution is used, and at the article level... @NateWr, I agree with your DB table model. Maybe just one thing: we would probably need to assign the institution_ip_id to an institutional subscription, because different IP ranges from one and the same institution could have different subscriptions... But now it is clear how I can proceed... Thanks a lot for all your comments, information, suggestions, and support!

And the questions and discussion about the metrics table(s) are coming soon in this issue: https://github.com/pkp/pkp-lib/issues/6782...

bozana commented 3 years ago

For the automatic update of the DB-IP city lite database, and eventually for compressing the archived usage stats log files (depending on the settings), we need the gzip function. Could I add the following line to the config template:

; gzip (used in UpdateIPGeoDB scheduled task and to eventually compress the archived usage stats log files)
gzip = /bin/gzip

@asmecher, what do you think?

ctgraham commented 3 years ago

See also: https://github.com/pkp/pkp-lib/issues/5156 for a proposed polyfill. There might be a better one out there on Composer. Probably better implemented by https://github.com/splitbrain/php-archive or https://packagist.org/packages/alchemy/zippy.

asmecher commented 3 years ago

We are trying to gradually remove calls to exec, as many shared hosts disable them; see https://github.com/pkp/pkp-lib/issues/6077, which most recently replaces tar calls for plugin installation in favour of PharData. (This hasn't been deployed yet -- I don't doubt it'll cause problems for some hosts since the variety is so huge, but it'll be a net benefit, I'm hoping.)

I think built-in gzip support will be broadly available enough that you can just rely on gzopen etc. being built into PHP. For anything more elaborate than that, I'd suggest looking into what Flysystem has to offer, since we're already adopting that -- e.g. https://packagist.org/packages/emmetog/flysystem-gzip-adapter, which relies on gzopen et al. behind the scenes.

NateWr commented 3 years ago

A warning on flysystem-gzip-adapter: it looks like there was only ever a 0.1 release and it hasn't been touched in 2-3 years. Looks dead to me: https://github.com/emmetog/flysystem-gzip-adapter

bozana commented 3 years ago

Hmmm... since I now only need to compress and decompress .gz files, could I just replace the current functions in the FileManager to use gzopen instead of exec?
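
Something along these lines, perhaps -- a minimal sketch (the function name is assumed, not the actual FileManager API, and it includes error handling on the reads and writes):

function gzCompressFile(string $filePath): string
{
    $destPath = $filePath . '.gz';
    $source = fopen($filePath, 'rb');
    $dest = gzopen($destPath, 'wb9'); // 9 = maximum compression level
    if ($source === false || $dest === false) {
        throw new Exception("Could not open {$filePath} or {$destPath}");
    }
    while (!feof($source)) {
        $data = fread($source, 8192);
        if ($data === false || gzwrite($dest, $data) === false) {
            throw new Exception("Error while compressing {$filePath}");
        }
    }
    fclose($source);
    gzclose($dest);
    unlink($filePath); // mirror gzip's behavior of replacing the original file
    return $destPath;
}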

bozana commented 3 years ago

@asmecher, do you think this is OK so: https://github.com/bozana/pkp-lib/commit/e29a25c036d8946c95d482cb06d014aa0d91ee85 ?

asmecher commented 3 years ago

@bozana, I haven't tested it, but yes, that looks fine. It's missing some error handling on e.g. the write.

@NateWr, the Flysystem gzip wrapper is so trivial that I'm not surprised it hasn't been updated in a while. But yes, noted.

nuest commented 3 years ago

I was looking around for geocoding libraries in PHP and recalled this issue when I saw that Geocoder PHP also supports several IP location providers; see https://geocoder-php.org/docs/

Maybe it's useful to have that abstraction layer and support multiple providers?
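
For reference, the GeoIP2 provider for Geocoder PHP appears to be used roughly like this (a sketch based on its documentation; the class names should be double-checked against https://geocoder-php.org/docs/):

use Geocoder\Provider\GeoIP2\GeoIP2;
use Geocoder\Provider\GeoIP2\GeoIP2Adapter;
use Geocoder\Query\GeocodeQuery;
use GeoIp2\Database\Reader;

// Wrap the same MMDB reader we would otherwise use directly...
$reader = new Reader('files/usageStats/dbip-city-lite.mmdb');
$provider = new GeoIP2(new GeoIP2Adapter($reader));

// ...behind a provider-agnostic query interface, so another supported
// IP provider could be swapped in without changing the calling code.
$result = $provider->geocodeQuery(GeocodeQuery::create('128.101.101.101'));
$countryCode = $result->first()->getCountry()->getCode();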

consultbelieve commented 2 years ago

Hi @NateWr

I see #2676 was closed in favour of this, but I don't think they are the same thing -- though perhaps inter-related.

#2676 is about the need for a COUNTER-compliant report, available at the institutional level, covering that institution's usage.

This seems to be about detecting the institution by IP -- which I guess is a step in the right direction, but in our case all relevant institutions are authenticated by IP already. It's the ability to break down COUNTER reports from the aggregate "every visitor ever" level they are currently at to a "by institution" level that we need. Should I open that as a separate issue?

NateWr commented 2 years ago

@consultbelieve, the institutional tracking that we are implementing will be in line with COUNTER R5. This issue covers the tracking of institutional stats, but you'll find more discussion on delivering the COUNTER R5 reports, including filtering by institution, at https://github.com/pkp/pkp-lib/issues/6781. Probably more than you ever wanted to read! :joy:

consultbelieve commented 2 years ago

Thanks @NateWr - that's perfect... Will make myself a cup of cocoa and have a read through that in front of the fire 😄

bozana commented 2 years ago

This issue is also closely related to all of this: https://github.com/pkp/pkp-lib/issues/6782 :-D -- it provides the data model that is then also used for the COUNTER R5 reports...

bozana commented 2 years ago

Merged with issue https://github.com/pkp/pkp-lib/issues/6782. Thus closing.