pkp / pkp-lib

The library used by PKP's applications OJS, OMP and OPS, open source software for scholarly publishing.
https://pkp.sfu.ca
GNU General Public License v3.0

Allow a journal to define a limited set of allowed keywords and reviewer interests #1550

Open mfelczak opened 8 years ago

mfelczak commented 8 years ago

Add plugin support for keyword vocabularies that JMs could define to limit the options available to both authors (indexing metadata) and reviewers (reviewer interests).

This would reduce near-duplicates and synonyms. It could also be used to provide automated suggestions when assigning reviewers, i.e. based on the submission's keywords, automatically score/suggest reviewers who include some of the same keywords in their review interests.
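As a rough illustration of the reviewer-suggestion idea, here is a minimal Python sketch that ranks reviewers by keyword overlap. The `Reviewer` class, the `score_reviewers` function and the choice of Jaccard overlap as the score are assumptions made for this example; nothing here reflects existing OJS code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Reviewer:
    name: str
    interests: frozenset[str]


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so 'Open  Access' matches 'open access'."""
    return " ".join(term.lower().split())


def score_reviewers(submission_keywords, reviewers):
    """Return (reviewer name, score) pairs, best match first.

    The score is the Jaccard overlap between the submission's keywords and
    the reviewer's interests: shared terms divided by all distinct terms.
    """
    wanted = {normalize(k) for k in submission_keywords}
    ranked = []
    for reviewer in reviewers:
        interests = {normalize(i) for i in reviewer.interests}
        union = wanted | interests
        score = len(wanted & interests) / len(union) if union else 0.0
        if score > 0:
            ranked.append((reviewer.name, round(score, 3)))
    # Sort by descending score, then by name for a stable order.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Made-up sample data.
reviewers = [
    Reviewer("alice", frozenset({"Open Access", "Bibliometrics"})),
    Reviewer("bob", frozenset({"peer review"})),
    Reviewer("carol", frozenset({"open access", "peer review", "metadata"})),
]
suggestions = score_reviewers(["open access", "peer review"], reviewers)
```

Exact matching like this only works well when both sides draw from the same limited vocabulary, which is the point of the proposal: with free-text keywords, synonyms and spelling variants would score as zero overlap.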

dennmuel commented 6 years ago

Maybe it would be possible to reuse some library work by making it possible to upload a standardised nomenclature like the Library of Congress Subject Headings or the equivalents in other languages (e.g. Gemeinsame Normdatei in Germany).

asmecher commented 6 years ago

@dennmuel, yes, if the classification can meet three conditions:

dennmuel commented 6 years ago

@asmecher , thanks for your quick reply.

I'm afraid I don't know of one that meets these conditions. At least to my knowledge, most countries have their own standard for library subject keywords. Off the top of my head, the only classification system that is not necessarily used, but at least known and translated worldwide and based on numeric identifiers, is the DDC. However, I think OCLC owns the rights to it. So a "one size fits all" solution doesn't seem possible to me, unless there is something else, maybe along the lines of the ontologies used in the context of the semantic web / linked open data, like OWL. But I don't know much about that; I just heard about it some time ago.

I was rather thinking about providing an option for the site or journal administrator to upload a file (of any origin, in a structured format) with the subjects/classification and to specify which language(s) and journal(s) these keywords are for. Then the data could be loaded (as autocompletion) or selected (via a tree view) in the keyword field for the respective language during the submission and review process. One could theoretically reuse an existing classification or write one's own if needed.
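A minimal Python sketch of the upload idea, assuming the structured file is a CSV with `locale` and `term` columns. The column names, the `load_vocabulary` and `autocomplete` helpers and the sample rows are all hypothetical, and only the flat autocompletion case is covered, not the tree view.

```python
import csv
import io
from collections import defaultdict


def load_vocabulary(csv_text):
    """Parse 'locale,term' rows into {locale: sorted list of unique terms}."""
    by_locale = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        term = row["term"].strip()
        if term:
            by_locale[row["locale"].strip()].add(term)
    return {
        locale: sorted(terms, key=str.casefold)
        for locale, terms in by_locale.items()
    }


def autocomplete(vocabulary, locale, prefix, limit=10):
    """Return up to `limit` terms in `locale` starting with `prefix` (case-insensitive)."""
    needle = prefix.casefold()
    matches = [
        term for term in vocabulary.get(locale, [])
        if term.casefold().startswith(needle)
    ]
    return matches[:limit]


# Hypothetical upload: one row per term, tagged with the locale it applies to.
UPLOAD = """locale,term
en_US,Algebra
en_US,Algebraic geometry
en_US,Topology
de_DE,Algebra
de_DE,Topologie
"""

vocab = load_vocabulary(UPLOAD)
```

Keeping the terms grouped per locale means the keyword field for each language only ever offers terms uploaded for that language.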

ajnyga commented 6 years ago

Related discussion here: https://github.com/pkp/pkp-lib/issues/1828#issuecomment-296985434 And a related plugin: https://github.com/ajnyga/finto

In general I think that instead of trying to find a vocabulary that would meet everyone's requirements, it should be made easier to connect vocabularies available through APIs to OJS/OMP, and the keywords should be saved as keyword+URI pairs if a URI is available. As @dennmuel mentions, many countries have their own vocabularies.

With LCSH you could maybe use http://fast.oclc.org/searchfast/fastsuggest?query=Osteopathic%20stu&fl=suggest50&rows=30 (https://www.oclc.org/developer/develop/web-services/fast-api.en.html). That would be fairly easy to implement by using the finto plugin as an example.

bkroll commented 6 years ago

Thanks to @NateWr for pointing me to this ticket after we talked about this at the Heidelberg sprint! I really like the approach of @ajnyga's finto plugin, because it is slim and clean. However, using the suggested OCLC API, we could run into GDPR issues as soon as non-employees use this function (because the call to the API generates connection metadata with a third-party server); at least that's what I hear from some journals in Germany. @dennmuel Upload would work pretty well with smaller databases, such as the mathematical classification someone else mentioned at the Heidelberg sprint. The GND is apparently more than 1 GB in size. That might need a dedicated external service to handle.

ajnyga commented 6 years ago

Hmm. I do not see a GDPR issue here, or to be exact, I do not see where personal data is involved in using an API. If OJS / the plugin would send the personal information (like IP) of the user to the third party, then there would be an issue. But I do not think it can do that? The calls are coming from the OJS server, so that is the only IP visible.

Personally I think making a copy of an entire Thesaurus is not an ideal practice. You would have to be sure that your copy is up to date all the time and there is of course the size issue with many of them.

But whatever the solution is, I want to emphasize that it is more important to store the identifier / URI of the keyword, because the keyword itself is basically just a dumb label.

marcbria commented 6 years ago

A long time ago I developed a plugin for tema3 (now called vocabularyserver), which includes an API to be integrated with third-party tools. I can recover this outdated code if you think it could be useful for building a new OJS 3 version. The plugin queried the thesaurus, but keywords needed to be added via tema3... and this extra step made us abandon the project. People (me included) are too lazy to keep an external DB of keywords up to date.

On the other hand, I can't say for sure how healthy tema3 is right now, but I know the developer, so I can ask him directly if you think it is worth the time. He contacted me 2-3 years ago about updating the plugin, but I was overwhelmed, so I declined. He knows OJS and he will be very interested in helping however he can.

This thread shows some different choices: https://forum.pkp.sfu.ca/t/ojs3-controlled-vocabularies/31754/4

Something with @ajnyga 's seal of approval is always a guarantee but I don't know much about finto. Is this a thesaurus tool? With a free software license? What programming language? Any info about the tech stuff to review?

Cheers, m.

ajnyga commented 6 years ago

Having my own seal of approval would be great! 😂

I remember your plugin. I think I built the first version of my plugin for OJS2 using yours.

Finto is a Finnish ontology service maintained by the national library. It has several ontologies/vocabularies that you can use with their API (https://finto.fi/en/about). My plugin uses YSO (General Finnish Ontology https://finto.fi/yso/en/) which was originally built on YSA (General Finnish Thesaurus https://finto.fi/ysa/en/). I think that they have some similarities with for example LCSH. YSO is an "official" vocabulary in Finland used in many places so it is a natural choice for us.

So I am not suggesting that everyone should use Finto/YSO. I am just suggesting that OJS/OMP should have simpler means of: 1) attaching a thesaurus API to the metadata fields (or using a local file) 2) storing the identifier/URI of the keyword besides the keyword itself

Probably many countries have their own similar thesauruses they would like to use.

I agree that people are lazy in keeping vocabularies up to date. That is why I see copying the whole thesaurus as a file to be a problem. When you use an API to fetch the words, the organization managing the vocabulary will take care of updates - they are the official source.

(Regarding Finto, I do think that you can upload your own ontologies/vocabularies there, but do not know the details. Finto has several of them as you can see here https://finto.fi/en/)

marcbria commented 6 years ago

> I remember your plugin. I think I built the first version of my plugin for OJS2 using yours.

I'm so sorry for this... If we meet again, please let me buy you a beer. ;-)

So, back to the issue: are we talking about a generic plugin to integrate with external thesauri? If that is the requirement, it looks like a tricky task, because there is no standard and each thesaurus has its own API calls... unless the plugin configuration lets you add services and map how each call needs to be made, and that would result in a plugin that is difficult for newcomers to configure.

If I understand the problem, I would suggest dividing the task into two parts: a) An OJS-native thesaurus (folksonomy or controlled) with a better UI to clean up or fix duplicated keywords for each associated field (I mean, you have article keywords, but also reviewers' interest keywords, etc.). b) A plugin that lets you interact with an external thesaurus management system (with a config flexible enough to work with Finto, but also TemaTres or whatever we get in the future).
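For part a), a clean-up UI would first need to find candidate duplicates. The Python sketch below groups spellings that differ only in case, accents, hyphens or spacing; the `canonical_key` and `find_duplicate_groups` names and the normalization rules are assumptions for illustration, and true synonyms would still need human review.

```python
import unicodedata
from collections import defaultdict


def canonical_key(keyword):
    """Reduce a keyword to a comparison key: no accents, case, hyphens or extra spaces."""
    decomposed = unicodedata.normalize("NFKD", keyword)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    return " ".join(without_accents.casefold().replace("-", " ").split())


def find_duplicate_groups(keywords):
    """Group keywords sharing a canonical key; keep only groups with 2+ spellings."""
    groups = defaultdict(set)
    for keyword in keywords:
        groups[canonical_key(keyword)].add(keyword)
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }


# Made-up sample of keywords as authors might have typed them.
entered = [
    "Open Access", "open access", "Open-access",
    "Peer review", "Café culture", "cafe culture",
]
duplicates = find_duplicate_groups(entered)
```

Each group could then be shown to the JM/editor, who picks the spelling to keep and merges the rest.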

Does it make sense?

Unfortunately I don't have the time or the knowledge to help with this new development. :-( So sorry for (as we say in Spain) "throwing the stone and hiding the hand" here.

ajnyga commented 6 years ago

I agree that different kinds of API behaviours will cause a problem, and it could be that building a plugin that can work with any API is not a good idea.

All I am really asking is that OJS make it easier to create a plugin like Finto.

Two things really:

1) having the means to add a custom tagSource to replace the one that makes suggestions from the inbuilt vocabulary (https://github.com/ajnyga/finto/blob/master/FintoPlugin.inc.php#L97) (or whatever the function is if OJS drops Tagit)

2) having the means to store a Keyword + identifier/URI pair (did I mention this already?)
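The two requests above can be sketched together in Python: a field whose suggestion source is swappable, and suggestions that carry a label plus an optional URI. `Keyword`, `KeywordField`, `replace_source` and `local_source` are hypothetical names invented for this example, not OJS or Tagit APIs, and the URIs are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Keyword:
    label: str                 # human-readable text shown to the user
    uri: Optional[str] = None  # stable identifier, when the vocabulary has one


# A suggestion source is any callable mapping the typed text to Keyword objects.
SuggestionSource = Callable[[str], list]


def local_source(entries):
    """Build a source backed by an in-memory list of Keyword objects."""
    def suggest(query):
        needle = query.casefold()
        return [k for k in entries if needle in k.label.casefold()]
    return suggest


class KeywordField:
    """Hypothetical form field whose suggestion source a plugin can replace."""

    def __init__(self, source):
        self._source = source

    def replace_source(self, source):
        self._source = source

    def suggest(self, query):
        return self._source(query)


builtin = local_source([Keyword("stars"), Keyword("starch")])
# Stand-in for a remote thesaurus; the URIs are placeholders, not real identifiers.
thesaurus = local_source([
    Keyword("stars", "https://vocab.example.org/concept/1"),
    Keyword("starfish", "https://vocab.example.org/concept/2"),
])

field = KeywordField(builtin)
before = field.suggest("star")
field.replace_source(thesaurus)
after = field.suggest("star")
```

In a real plugin the replacement source would call the thesaurus API rather than filter a list, but the field itself would not need to know the difference.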

I now have to use Smarty postfilters to get the custom tagSource to work. Basically I replace the existing tagSource. I am not even sure the postfilters will keep working the way they do now once everything is running with Vue.js, and with postfilters you have to be sure that the elements you use to inject stuff do not change in the form. AND it is so UGLY! 😂

bkroll commented 6 years ago

@ajnyga some argue that the content, timestamp and server IP of an API call make the user personally identifiable, and hence count as personal data. I am no lawyer, but we have probably become more cautious than is practical about real-time querying of third-party servers. Never mind.

And yes, I agree that the identifier of a keyword needs to be saved with higher priority than the keyword itself. Sorry I didn't make this clear above.

I'm looking at the GND for this, which is the main authority file for subject headings, persons and all sorts of other entities for libraries in Germany. But they require an access token (which is free), so it might make sense to ask them how they feel about OJS journals asking for an access token and each directly using their API. Use cases of that API I know so far mainly involved a mirror or cache, if they weren't a temporary script.

ajnyga commented 6 years ago

Ok, I see. So the logic is that if I make a query with "Star" and that is sent to the third party together with the OJS server's IP and a timestamp (and logged there) you could trace that back to the user using the logs both from the third party server and the OJS server? Because you would need the logs from both servers to accomplish that while the only IP visible for the third party server is the IP of the OJS server.

You can of course get the necessary data from the access logs, but these usually fall outside GDPR regulation and do not need consent, because they have a very limited access and they are necessary for server security. (GDPR article 6 (legitimate interest) and Recital 49 (The processing of personal data to the extent strictly necessary and proportionate for the purposes of ensuring network and information security)). Of course it is possible that an API server logs the request also in some other log, but at least I think that is not too common?

(edit: just removed my straw man, not the same thing...)

Do not get me wrong, I think it is very important to think about all aspects of GDPR, but I really think that in this case GDPR does not apply.

Could also be that I am just totally misunderstanding your argument 🤣

marcbria commented 6 years ago

Regarding GDPR, I agree with @ajnyga... nothing to add.

About the initial subject, I need to say that I misunderstood the whole thing, because I thought we were looking for a more global/universal solution:

> All I am really asking is that OJS make it easier to create a plugin like Finto. ...

This is much more specific and feasible and also makes a lot of sense to me. Let's see what main PKP developers say about it.

Cheers, m.

NateWr commented 6 years ago

A few quick thoughts from my perspective in terms of what OJS can reasonably provide out-of-the-box.

We'll probably always use our internal controlled_vocab_entries table for managing keywords/subjects. We'll need that to be the source of truth in order to do anything like browse by keyword/subject, usage stats filtering by keyword/subject, etc, in a generalized fashion. That means third-party vocab integrations should probably focus on how the user searches for and assigns new vocab entries in the UI, and how these entries are validated before being added to the database. (You may also want another table which tracks third-party IDs alongside controlled vocab entries.)
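To make the side-table suggestion concrete, here is a minimal Python/SQLite sketch. The `vocab_entries` table is a simplified stand-in and not the real `controlled_vocab_entries` schema; `vocab_entry_external_ids`, its columns and the helper functions are hypothetical, and the URI is a placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for the internal vocabulary table (not the real OJS schema).
    CREATE TABLE vocab_entries (
        entry_id INTEGER PRIMARY KEY,
        label    TEXT NOT NULL
    );
    -- Hypothetical plugin-owned table linking an entry to its external identifier.
    CREATE TABLE vocab_entry_external_ids (
        entry_id INTEGER NOT NULL REFERENCES vocab_entries(entry_id),
        source   TEXT NOT NULL,
        uri      TEXT NOT NULL,
        PRIMARY KEY (entry_id, source)
    );
""")


def add_entry(label, source=None, uri=None):
    """Insert a keyword and, if given, its identifier in the external vocabulary."""
    entry_id = conn.execute(
        "INSERT INTO vocab_entries (label) VALUES (?)", (label,)
    ).lastrowid
    if source and uri:
        conn.execute(
            "INSERT INTO vocab_entry_external_ids VALUES (?, ?, ?)",
            (entry_id, source, uri),
        )
    return entry_id


def entries_with_uris():
    """List every keyword with its external URI, or None when it has none."""
    return conn.execute("""
        SELECT e.label, x.uri
        FROM vocab_entries e
        LEFT JOIN vocab_entry_external_ids x ON x.entry_id = e.entry_id
        ORDER BY e.entry_id
    """).fetchall()


add_entry("soil science", "example-thesaurus", "https://vocab.example.org/concept/42")
add_entry("my local keyword")
rows = entries_with_uris()
```

With this layout the internal table stays the source of truth for browsing and stats, while only the keywords a journal has actually used need an external identifier stored next to them.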

This should work for most cases, even a large third-party set of keyword/subject constraints, since any particular journal would probably only need to store a subset of actually used keywords/subjects in the OJS database. If you expect your installation to require a huge list of keywords/subjects, we will probably encourage you to do a more wholesale replacement, which would require you to implement your own data entry/storage solution, integrate required frontend features in your own theme, and roll your own solution for keywords/subject browsing if desired.

In the future, it will be much easier for a plugin to hook into a form, kick out one field, and replace it with another (very easy). There will also be a more generalizable FieldAutosuggest component that should make it easier to redirect where the search requests are sent. The submission metadata form will not be converted for 3.2, though, so don't count on it right away.

In the meantime, @ajnyga's ugly hack may be the only solution. We'd probably be open to adding a hook or something to make that easier, though.

ajnyga commented 6 years ago

> You may also want another table which tracks third-party IDs alongside controlled vocab entries.

For me this is really the only "must-have" requirement in this issue (but on a second read I guess this means there is no in-built support for keyword identifiers/URIs planned, you would have to use your own db table and have the plugin interact with that)

Ugly hacking is close to my heart and since the forms are going through a change anyway, I do not think it makes sense to add hooks to the current code.

asmecher commented 5 years ago

Noting from #4264 that we won't be able to support autosuggest for keywords etc. until this is implemented -- the way they're currently stored would mean a nasty SELECT DISTINCT to provide a disambiguated list. (I'm also leery of the combo of autosuggest and uncurated lists -- I suspect authors will get a lot of nasty spam unless the JM/editor were required to approve user-supplied entries.)
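A small Python/SQLite sketch of the contrast described here, under simplified assumptions: `submission_keywords` stands in for per-submission free-text storage and `approved_keywords` for a list curated by the JM/editor. Both tables and both functions are hypothetical and do not mirror the actual OJS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in: one row per keyword per submission, as typed by authors.
    CREATE TABLE submission_keywords (submission_id INTEGER, keyword TEXT);
    -- Hypothetical curated list that a JM/editor has approved.
    CREATE TABLE approved_keywords (keyword TEXT PRIMARY KEY);
""")
conn.executemany(
    "INSERT INTO submission_keywords VALUES (?, ?)",
    [(1, "open access"), (2, "open access"), (2, "Open Access"), (3, "buy cheap pills")],
)
conn.executemany(
    "INSERT INTO approved_keywords VALUES (?)",
    [("open access",), ("peer review",)],
)


def suggest_uncurated(prefix):
    """Scan every stored keyword and de-duplicate on the fly (the costly variant)."""
    rows = conn.execute(
        "SELECT DISTINCT keyword FROM submission_keywords "
        "WHERE keyword LIKE ? ORDER BY keyword",
        (prefix + "%",),
    )
    return [r[0] for r in rows]


def suggest_curated(prefix):
    """Query only the approved list, so spam and variants never surface."""
    rows = conn.execute(
        "SELECT keyword FROM approved_keywords WHERE keyword LIKE ? ORDER BY keyword",
        (prefix + "%",),
    )
    return [r[0] for r in rows]
```

In this toy data the uncurated query still returns both "Open Access" and "open access" and happily suggests the spam entry, whereas the curated query returns one approved spelling and nothing for the spam prefix.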

heike-r commented 5 years ago

I just stumbled upon this issue and wanted to add that, as an OJS user, I would be very happy about a controlled vocabulary for keywords. We are currently constructing a data pipeline like this:

OJS -> (via SWORD, in testing) -> repository (MyCoRe) -> (in planning) -> VIVO (current research information system / semantic web)

A controlled vocabulary for standardized keywords would be very good for the final mapping and representation of research in VIVO. Our vocabulary of choice would be AGROVOC (by FAO, CC BY license). I am very much looking forward to further developments on this.

amandastevens commented 4 years ago

Another hosted client has requested a controlled vocabulary option for keywords and reviewing interests.

pmangahis commented 3 years ago

+1 for hosted client

pmangahis commented 2 years ago

+1 for hosted client

monicalpaiva commented 1 year ago

I think it would be a big step for OJS to support a controlled vocabulary chosen by the editor (here we use TemaTres), since search is growing. If a standard integration that consumes the API of a base like that is added, we can expand it to others.

I looked at the Finto plugin's structure and found it very interesting; it would be great if it could be implemented so that the editor supplies the controlled vocabulary API.

ajnyga commented 3 months ago

@amandastevens and @pmangahis, do you maybe know what vocabularies the editors have had in mind when they have asked for this feature?

mfelczak commented 2 months ago

@ajnyga, editors are interested in manually defining and managing the vocabulary of keywords that are specific to the journal and field of study. This might include an initial import from an existing vocabulary followed by additions, edits, and deletions of keywords and the ability to gradually evolve the vocabulary over time as needed.