globalwordnet / cili

The Global WordNet Association Collaborative Inter-Lingual Index

Propose new ILIs via GH pull requests? #9

Open goodmami opened 3 years ago

goodmami commented 3 years ago

As I recall (and the need to recall is because the CILI site is currently down), the process for proposing ILIs is to produce a wordnet with ili="in" on some synsets. These would, in theory, get scooped up by the OMW and added to a review queue. There are some issues with this:

  1. It requires that the wordnet is actually processed by OMW
  2. ILIs cannot be proposed, reviewed, and approved prior to producing the wordnet with them in it
  3. The scooping and reviewing all happens offline so wordnet authors cannot easily know the progress of their proposals and it's easy for the reviewers to forget that the work remains unfinished
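As a rough illustration, the `ili="in"` convention could be picked up with a small stdlib scan of a WN-LMF file. The element names (`Synset`, `ILIDefinition`) follow WN-LMF; the function name is hypothetical:

```python
# Sketch: find synsets proposed for new ILIs (ili="in") in a WN-LMF file.
import xml.etree.ElementTree as ET

def proposed_ilis(lmf_path):
    """Yield (synset_id, ili_definition) pairs for synsets with ili="in"."""
    root = ET.parse(lmf_path).getroot()
    for synset in root.iter("Synset"):
        if synset.get("ili") == "in":
            # The proposed definition lives in the <ILIDefinition> child.
            ili_def = synset.findtext("ILIDefinition", default="").strip()
            yield synset.get("id"), ili_def
```

This is the "scooping" step that OMW currently does offline; the same extraction could feed a PR-based queue instead.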

I can see (1) and (2) being intentional for quality control. E.g., (1) is because OMW acts as the confluence point of many wordnets and therefore is aware of the IDs used across projects (i.e., it puts the "C" in CILI); and (2) because the proposed IDs thereby exhibit their intended use as a prerequisite.

For (3), the situation would be better if the proposal and approval process were out in the open. I think GitHub pull requests against ili.ttl would help here a lot, as it provides a diff view for the proposed changes, a discussion space, an approval workflow, and a review queue (the list of PRs). These PRs could be generated by OMW maintainers from submitted wordnets, or if they are submitted directly by wordnet authors, the wordnet file should be attached somehow to verify that the IDs are in-use.
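For concreteness, a proposed addition in such a PR might look something like the following Turtle fragment. The prefixes, the identifier, and the property names here are purely illustrative, not taken from the actual ili.ttl:

```turtle
# Illustrative only: a hypothetical new ILI entry in a PR diff.
ili:i999999 a ili:Concept ;
    skos:definition "an example definition in English"@en ;
    dc:source <http://example.org/wordnet/synset-id> .
```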

If this proposed workflow is accepted, then this issue would be resolved by creating some documentation (in the README or elsewhere) explaining what to do.

jmccrae commented 3 years ago

I strongly support the idea of making changes to the ILI through GitHub.

I wonder what limitations there would be on this... for example, Open English WordNet has thousands of synsets that we would like to propose for ILI codes.

I also wonder how this would integrate with systems such as the duplicate detection system that I previously integrated with the CILI project.

goodmami commented 3 years ago

> I wonder what limitations there would be on this... for example, Open English WordNet has thousands of synsets that we would like to propose for ILI codes.

That is a backlog of several years, right? If we aim to do a round every year, future rounds normally wouldn't involve so many.

> I also wonder how this would integrate with systems such as the duplicate detection system that I previously integrated with the CILI project.

It would be great to integrate automated checks for, e.g., well-formedness of ili.ttl, duplicate IDs, the use of deprecated IDs, etc., to assist the reviewer. Where is the duplicate detection system? In this repository I only see the make-tsv.py script I wrote, which doesn't have any checks.
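A duplicate-ID check, for example, could be fairly lightweight. The sketch below assumes ILI subjects appear as identifiers of the form `i` + digits at the start of a line; the exact prefixes used in ili.ttl may differ:

```python
# Sketch: flag ILI identifiers that are introduced more than once in ili.ttl.
import re
from collections import Counter

ILI_ID = re.compile(r"\bi(\d+)\b")

def duplicate_ids(ttl_text):
    """Return ILI numbers that appear in subject position more than once."""
    subjects = []
    for line in ttl_text.splitlines():
        # Turtle continuation lines are indented; subjects start at column 0.
        if not line[:1].isspace():
            m = ILI_ID.search(line)
            if m:
                subjects.append(int(m.group(1)))
    return [n for n, count in Counter(subjects).items() if count > 1]
```

Checks like this (plus deprecated-ID lookups and a full Turtle parse for well-formedness) could run automatically on every PR.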

jmccrae commented 3 years ago

I think we still may make quite a few requests per year.

For the linking validation, I think automating this would be a very good idea, but it will take some effort. I have also contributed some code from Naisc to the OMW. However, this is more advisory: it is not always clear whether two similar definitions are actually the same... we should probably find a way to do this within the context of this workflow.

goodmami commented 3 years ago

Ok, so we probably need a few things to make this review pipeline work.

  1. A script that takes a wordnet and extracts the ili="in" synsets and corresponding <ILIDefinition> elements to create candidate ILI entries
  2. A first-pass quality-control filter that ensures that, e.g., the synset wasn't already granted another ILI, the candidates have a proposed definition, the definition is in English and of reasonable length, etc.
  3. Some way to batch the candidates into manageable sets for review (e.g., 100 at a time, sorted by order in the lexicon, taxonomic groups, or something else)
  4. More sophisticated quality control, e.g., comparing the taxonomic neighbors, comparing the definition similarity with others (to make sure it's not duplicating another ILI, but also maybe to make sure it's not too different from its hypernyms?)
  5. Assemble a review item for the ILIs that shows the candidates with, e.g., similarity scores with other ILIs and other potential issues.

Each batch would then turn into a PR for further discussion. How does this sound?
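Steps (2) and (3) above could be sketched roughly as follows; the candidate shape, the length thresholds, and the batch size are all assumptions for illustration:

```python
# Sketch: first-pass filter and fixed-size batching of candidate ILI entries.
def first_pass(candidates, existing_ilis, min_len=20, max_len=400):
    """Keep candidates whose synset is not already granted an ILI and
    whose proposed definition has a plausible length."""
    for synset_id, definition in candidates:
        if synset_id in existing_ilis:
            continue
        if not (min_len <= len(definition) <= max_len):
            continue
        yield synset_id, definition

def batches(items, size=100):
    """Split candidates into review batches (each would become one PR)."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]
```

A real filter would also need a language check on the definition (step 2 asks for English), which this sketch omits.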

jmccrae commented 3 years ago

Sure, that seems good. I agree that we need a mixture of tests: some would be 'hard-fail' tests (for example, if the definition already exists), and some are more quality-control checks, which could be a bit trickier to implement in practice.

For (3), the best approach is probably just to write contribution guidelines and enforce them. So, if someone submits too much in one PR, just reject it and ask the contributor to break it into smaller PRs.

For (4), we may just create an external service that analyses the PR (using our Naisc tools); the resulting report can be added as a comment to the PR before it is accepted. This allows the person approving the PR to review it and raise any issues with the contributor.
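Mechanically, posting such a report is straightforward: PR comments go through the GitHub REST API's issue-comments endpoint. A minimal sketch, with the repo coordinates and token handling as placeholders:

```python
# Sketch: build the GitHub API request that posts an analysis report
# as a comment on a PR (PR comments use the issues endpoint).
import json
import urllib.request

API = "https://api.github.com"

def comment_request(owner, repo, pr_number, report, token):
    """Build the POST request that adds `report` as a PR comment."""
    url = f"{API}/repos/{owner}/{repo}/issues/{pr_number}/comments"
    data = json.dumps({"body": report}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )

# urllib.request.urlopen(comment_request(...)) would send it; a real
# service would run this from CI after the analysis finishes.
```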