goodmami opened this issue 3 years ago
I strongly support the idea of making changes to the ILI through GitHub.
I wonder what limitations there would be on this... for example, Open English WordNet has thousands of synsets that we would like to propose for ILI codes.
I also wonder how this will integrate with systems such as the duplicate detection system that I previously integrated with the CILI project.
> I wonder what limitations there would be on this... for example, Open English WordNet has thousands of synsets that we would like to propose for ILI codes.

This is a backlog of several years, right? If we aim to do a round every year, it normally wouldn't be so many, would it?
> I also wonder how this will integrate with systems such as the duplicate detection system that I previously integrated with the CILI project.

It would be great to integrate automated checks for, e.g., well-formedness of `ili.ttl`, duplicate IDs, the use of deprecated IDs, etc., to assist the reviewer. Where is the duplicate detection system? In this repository I only see the `make-tsv.py` script I wrote, which doesn't have any checks.
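A rough sketch of what such a check could look like, assuming `ili.ttl` parses as Turtle and that definitions are reachable via `skos:definition` (that predicate, and how deprecation is recorded, are assumptions that would need to be confirmed against the published file):

```python
# Sketch of a PR check: well-formedness of ili.ttl plus duplicate definitions.
# skos:definition is an assumption, not the confirmed schema; a deprecated-ID
# check would hook in here once we know how status is actually encoded.
import sys
from collections import Counter

from rdflib import Graph
from rdflib.namespace import SKOS


def check(path: str) -> None:
    g = Graph()
    try:
        g.parse(path, format="turtle")  # hard-fail: file must be valid Turtle
    except Exception as exc:
        sys.exit(f"FAIL: {path} is not well-formed Turtle: {exc}")

    # Normalize whitespace and case so trivially re-worded copies still collide.
    defs = Counter(" ".join(str(o).lower().split())
                   for o in g.objects(predicate=SKOS.definition))
    dupes = [d for d, n in defs.items() if n > 1]
    if dupes:
        sys.exit(f"FAIL: {len(dupes)} duplicated definitions, e.g. {dupes[0]!r}")

    print(f"OK: {len(g)} triples, {sum(defs.values())} definitions")


if __name__ == "__main__":
    check(sys.argv[1] if len(sys.argv) > 1 else "ili.ttl")
```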
I think we may still make quite a few requests per year.
For the linking validation, I think automating this would be a very good idea, but it will take some effort. I have also contributed some code from Naisc to the OMW. However, this is more advisory: it is not always clear whether two similar definitions are actually the same... we should probably find a way to do this within the context of this workflow.
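To illustrate the advisory flavour of that check (this is not the Naisc code, just a naive stand-in to show the idea), near-matches would be surfaced to the reviewer rather than rejected automatically; the threshold below is an arbitrary placeholder:

```python
# Illustrative only: flag suspiciously similar definitions for human review.
# A Naisc-based linker would replace this naive string comparison; the 0.85
# threshold and the example ILI ID are placeholders.
from difflib import SequenceMatcher


def advisory_matches(new_def, existing_defs, threshold=0.85):
    """Return (ili_id, definition, score) for existing definitions that look
    close enough to the proposed one to deserve a reviewer's attention."""
    hits = []
    for ili_id, text in existing_defs.items():
        score = SequenceMatcher(None, new_def.lower(), text.lower()).ratio()
        if score >= threshold:
            hits.append((ili_id, text, score))
    return sorted(hits, key=lambda hit: -hit[2])


if __name__ == "__main__":
    # Anything returned here goes into the PR report as a warning, not a rejection.
    print(advisory_matches(
        "a domesticated carnivorous mammal kept as a pet",
        {"i12345": "a domesticated carnivorous mammal that is kept as a pet"},
    ))
```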
Ok, so we probably need a few things to make this review pipeline work, including a way to extract the `ili="in"` synsets and corresponding `<ILIDefinition>` elements to create candidate ILI entries. Each batch would then turn into a PR for further discussion. How does this sound?
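A minimal sketch of what that extraction step could look like against a WN-LMF file (the batch size and the tab-separated output are placeholders):

```python
# Sketch of the extraction step: collect ili="in" synsets and their
# <ILIDefinition> from a WN-LMF file, then split them into batches, with
# one batch intended to become one PR. Batch size is an arbitrary placeholder.
import sys
import xml.etree.ElementTree as ET


def candidates(lmf_path):
    root = ET.parse(lmf_path).getroot()
    for synset in root.iter("Synset"):
        if synset.get("ili") == "in":
            ili_def = synset.findtext("ILIDefinition")
            if ili_def:  # a candidate entry needs a definition to review
                yield synset.get("id"), " ".join(ili_def.split())


def batches(items, size=500):
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    for n, batch in enumerate(batches(candidates(sys.argv[1])), start=1):
        print(f"# batch {n}: {len(batch)} candidate entries -> one PR")
        for synset_id, definition in batch:
            print(f"{synset_id}\t{definition}")
```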
Sure, that seems good. I agree that we need a mixture of tests, some of which would be 'hard-fail' tests (for example, if the definition already exists) and some of which are more about quality control, which could be a bit trickier to implement in practice.
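One simple way to wire up that mixture, purely as an illustration (the names below are invented, not from any existing CILI tooling), is to tag each finding with a severity and let only errors block the merge:

```python
# Illustration of the hard-fail / advisory split: checks return findings
# tagged with a severity, and only "error" findings make the run fail.
import sys
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str  # "error" = hard-fail, "warning" = advisory
    message: str


def run_checks(checks, *args):
    findings = [finding for check in checks for finding in check(*args)]
    for finding in findings:
        print(f"{finding.severity.upper()}: {finding.message}")
    return not any(finding.severity == "error" for finding in findings)


if __name__ == "__main__":
    # e.g. an exact-duplicate definition would be an error; a near-duplicate a warning
    example = [lambda: [Finding("warning", "definition is very similar to i12345")]]
    sys.exit(0 if run_checks(example) else 1)
```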
For (3), the best is probably just to write contribution guidelines and enforce them. So, if someone submits too much in one PR, just reject it and ask the contributor to break it into smaller PRs.
For (4), we may just create an external service that analyses the PR (using our Naisc tools); its report can then be added as a comment on the PR before it is accepted. This allows the person approving the PR to review the report and raise any issues with the contributor.
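The report-as-comment part is already covered by the GitHub REST API; a sketch (the repository slug and token handling are placeholders):

```python
# Sketch of the external service posting its analysis back to the PR.
# PR comments go through the issues endpoint of the GitHub REST API.
import os

import requests


def post_report(pr_number: int, report: str, repo: str = "globalwordnet/cili") -> str:
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": report},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]
```

Run from CI or a webhook, this would let each PR carry its own analysis report without the reviewer having to run anything locally.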
As I recall (and the need to recall is because the CILI site is currently down), the process for proposing ILIs is to produce a wordnet with `ili="in"` on some synsets. These would, in theory, get scooped up by the OMW and added to a review queue. There are some issues with this.

I can see (1) and (2) being intentional for quality control. E.g., (1) is because the OMW acts as the confluence point of many wordnets and is therefore aware of the IDs used across projects (i.e., it puts the "C" in CILI); and (2) because the proposed IDs thereby exhibit their intended use as a prerequisite.
For (3), the situation would be better if the proposal and approval process were out in the open. I think GitHub pull requests against `ili.ttl` would help a lot here, as they provide a diff view of the proposed changes, a discussion space, an approval workflow, and a review queue (the list of PRs). These PRs could be generated by OMW maintainers from submitted wordnets, or, if they are submitted directly by wordnet authors, the wordnet file should be attached somehow to verify that the IDs are in use.

If this proposed workflow is accepted, then this issue would be resolved by creating some documentation (in the README or elsewhere) explaining what to do.