glossarist / glossarist-desktop

Glossarist Desktop
https://www.glossarist.org
GNU General Public License v3.0

Allowing comparing concept entry with source (if source is machine-readable) #151

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago

If the SOURCE is machine-readable (e.g. RDF), we should have an interface that compares a concept entry with the machine-readable source.

While the data elements provided in ISO 10241-1 are a superset of SKOS, we can still facilitate the comparison (e.g. via the CR diff interface).

The particular use case is comparing OSGeo terms with ISO/TC 211 terms and OGC definitions.

cc: @camerons @barbaricyawps @ReesePlews @Naini-K

barbaricyawps commented 4 years ago

I'm no expert on linters, but I do wonder if a linter might be a good tool for us in general. A linter might help us flag terms that don't meet TC-211's standards. It might be worth doing a spike on it since it could potentially save us from a lot of manual labor in terms of reviewing entries. (For example, the linter could do a first pass on the term and then a human could confirm it.) We have someone who knows a lot about linters in The Good Docs if you'd like me to arrange a meeting.

strogonoff commented 4 years ago

May I suggest describing the underlying use case and background in some detail before we jump to a solution?

camerons commented 4 years ago

@strogonoff You might want to have a look at our Glossary pilot manifesto which includes the user story:

strogonoff commented 4 years ago

Thanks for the links, will have a read. Am looking for a specific scenario in which comparison happens.

On 7 Nov 2020, at 2:41 AM, Cameron Shorter notifications@github.com wrote:

 @strogonoff You might want to have a look at our Glossary pilot manifesto which includes the user story:

As an application using a glossary, I want terms defined with a consistent schema which facilitates machine readability and interoperability. A bit more background in this slide deck.

ReesePlews commented 4 years ago

@strogonoff, thank you for your work on the project!

i think there are two major categories/cases where comparison happens. they can be divided, roughly, by management vs discovery.

at a management level there will always be a need for comparison when an entry (term+def) is added. consider an ISO example during an update of a document from WD to CD... a comparison of the existing (WD) entry vs the proposed (CD) entry is made. a purely automated approach is probably not sufficient, so a visual (human) comparison needs to be made. if an automated approach could flag various differences (changed words, punctuation, etc.) it would be helpful. at that point, a decision is made to replace the existing entry with the proposed entry. in TC211 we keep the existing term-id number and replace the existing contents with the proposed contents. also in TC211 we are not strictly comparing the content/existence of notes and examples; this decision may well be different within a different TC/organization. when i have discussed the comparison requirements with @ronaldtse he has mentioned that other options/levels may exist for handling versions of a concept. in TC211, because of our current repository solution, only one level of term-id is supported.
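The automated flagging of changed words described above could be sketched with Python's standard-library `difflib` (illustrative only; the function name and sample definitions are hypothetical, not taken from the TC211 repository):

```python
import difflib

def flag_differences(existing: str, proposed: str) -> list[str]:
    """Return word-level additions/removals between an existing (e.g. WD)
    entry and a proposed (e.g. CD) entry, for a human reviewer to confirm."""
    diff = difflib.ndiff(existing.split(), proposed.split())
    # Keep only changed tokens; drop unchanged ("  ") and hint ("? ") lines.
    return [token for token in diff if token.startswith(("- ", "+ "))]

changes = flag_differences(
    "a set of data elements describing a concept",
    "a set of data elements that describe a concept",
)
```

A human would then review the flagged tokens and decide whether to replace the existing entry, keeping the existing term-id as described above.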

at a discovery level, any user (manager to guest) will want to compare 2 or more entries. visible contents are determined by the user's authorization level. how this is presented within a UI would require more discussion. a big issue with the current TC211 repository is the inflexibility of visualizing the selected set; a long horizontal scroll is quite difficult to work through quickly. a selected view of specific entry attributes is critical. additional filter/sort capabilities on the attributes of the selected set would also be very useful (e.g. order by the oldest active entry).

however, the step before comparison is probably more important than the comparison itself: identification/query. is that automatic, user-assisted, or manual? within the TC211 repository, the identification of possible matching entries is performed automatically using what i suspect is a "fuzzy" type of search. identification/query is not performed on the term alone (e.g. flag an existing definition that is similar to the proposed definition even when the terms differ). a user-assisted approach could be a set of filters or the input of a query string. a manual approach would enable a user to select any entries from a list and have them in the active comparison set.
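The "fuzzy" identification step suspected above could look something like the following sketch, using `difflib.SequenceMatcher` as a stand-in for whatever similarity measure the TC211 repository actually uses (function name, threshold, and sample data are all hypothetical):

```python
import difflib

def find_similar_definitions(proposed: str, existing: dict[str, str],
                             threshold: float = 0.75) -> list[str]:
    """Return term-ids of existing entries whose definition is similar to
    the proposed one, even when the terms themselves differ."""
    hits = []
    for term_id, definition in existing.items():
        ratio = difflib.SequenceMatcher(
            None, proposed.lower(), definition.lower()).ratio()
        if ratio >= threshold:
            hits.append(term_id)
    return hits

existing = {
    "211": "portion of land or water with natural or man-made boundaries",
    "212": "unit of measurement used to express distances",
}
matches = find_similar_definitions(
    "portion of land or water having natural or artificial boundaries",
    existing,
)
```

Such a check would run at proposal time, surfacing candidate matches for the user-assisted or manual review steps mentioned above.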

happy to discuss more. just let me know.

strogonoff commented 4 years ago

@barbaricyawps, to clarify on linting/validation: we’re aware it’s important to verify register item contents, especially for items not manually entered (e.g., migrated from other data sources) which can have missing data. This functionality will be applied to concepts and terms as well.

To expand on comparing terms: the application database maintains an internal structured representation of any given register item (including concepts). This representation preserves versions of an item as it changes, and the application interface will allow showing the difference between any two versions.

If the first version of a concept came from an external source and was subsequently changed, it would be easy to compare the changed version with the initial version. Exactly how importing will be exposed in the application interface is still being worked out.

(It is not required to export an item to RDF in order to compare it with another item; moreover, due to the flexibility of the RDF format, RDF exported from a source other than Glossarist may have a different structure and give false positives when compared. Comparison between Glossarist-managed and non-Glossarist-managed concept databases may not be directly supported without caveats.)
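The point above, that comparison happens over the internal structured representation rather than exported RDF, can be sketched as a field-by-field comparison of two versions of a record (the record shape and field names here are hypothetical, not Glossarist's actual schema):

```python
def compare_concepts(a: dict, b: dict) -> dict[str, tuple]:
    """Compare two structured concept records field by field, returning
    only the fields whose values differ as (old, new) pairs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

v1 = {"term": "datum", "definition": "reference frame",
      "source": "ISO 19111"}
v2 = {"term": "datum",
      "definition": "parameter set defining a reference frame",
      "source": "ISO 19111"}
delta = compare_concepts(v1, v2)
```

Because both sides share one schema, the diff is unambiguous; comparing two RDF serializations with different structures would not give this guarantee.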

@ReesePlews, regarding authorization levels, we'll discuss this with @ronaldtse. Access levels within a single repository are not yet supported, as the application originally focused on open data; splitting a single dataset into repositories with separate access controls should be possible with some architecture/engineering effort, but it would also add overhead for register/registry management.

Regarding:

flag an existing definition that is similar to the proposed definition when each term is different

Is the use case being considered here that someone might mistakenly add a concept with a similar definition but different terms, instead of proposing a new term for an existing concept? I tend to think this should be the responsibility of (1) the author, to identify an already existing concept before suggesting to add another one that may refer to the same unit of understanding, and (2) the concept system manager at the change request review stage, before accepting the proposal.

Agreed that the user interface should assist with those responsibilities, initially by providing flexible enough search functionality, and later potentially by offering proactive hints.

select any entries from a list and have them in the active comparison set.

I believe this comparison set would, most of the time, consist of the changes contained within a change request? Showing all entries contained within a given change request (and further allowing the user to select an item and see the effect of the change) is something the graphical interface will support soon.

ronaldtse commented 4 years ago

@barbaricyawps thank you for the offer to meet with your linting expert! We'd love to discuss further with them. I believe by "linting" here you mean "verifying the content of a glossary item". You're right: linting is possible where ISO rules for terminology are defined.

For example, ISO Directives Part 2, 16.5.6, states that:

[The definition ...] shall not start with an article (“the”, “a”) nor end with a full stop.

This can be detected.

A definition shall not take the form of, or contain, a requirement.

This can also be detected, for example by checking that the definition does not contain the words "should", "shall", or "must".
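The two rules quoted above could be checked mechanically; a minimal sketch of such a lint pass (the function name and rule messages are hypothetical, and this is far from an exhaustive checker):

```python
import re

ARTICLES = ("the ", "a ", "an ")
REQUIREMENT_WORDS = re.compile(r"\b(shall|should|must)\b", re.IGNORECASE)

def lint_definition(definition: str) -> list[str]:
    """Flag violations of the ISO Directives Part 2, 16.5.6 drafting
    rules quoted above; an empty list means no issues found."""
    problems = []
    if definition.lower().startswith(ARTICLES):
        problems.append("starts with an article")
    if definition.rstrip().endswith("."):
        problems.append("ends with a full stop")
    if REQUIREMENT_WORDS.search(definition):
        problems.append("contains a requirement word")
    return problems

issues = lint_definition("The value that shall be used as reference.")
```

As suggested earlier in the thread, such a linter would only do a first pass; a human would still confirm each flagged entry.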

There are a couple of use cases being conflated here.

Comparison of concepts, their attributes (designations, definitions, etc.) and their relationships (source, etc.) between:

And these comparisons can be done for these use cases:

While they are similar, we must track these use cases separately because their functionality demands are vastly different. Hope this clarifies.