tno-terminology-design / tev2-tools

The Terminology Engine (v2) is a set of specifications and tools that caters for the creation and maintenance (i.e. curation) of terminologies. This repository contains the sources for the tools.
Apache License 2.0

allow termselection to use formphrases #11

Closed RieksJ closed 8 months ago

RieksJ commented 11 months ago

When selecting terms from eSSIF-Lab for the TEv2-documentation terminology, there is a termselection line

  - "term[manage,management,governance,objective,risk,owner,owned]@essif-lab" 

The term manage doesn't exist in the essif-lab terminology. However, there is a curated text in eSSIF-Lab that has:

term: management
formPhrases: management, manager, manage, manages, managed, managed-by, management, managing, managing-part{yies}

When working with these terms in practice, authors/curators would use all these forms, and would not necessarily know which is the one that is actually defined. Requiring curators to go look for that is perhaps a bit overkill.

This issue calls for an enhancement in which term (as well as -term) is treated differently from other fields: its value(s) should be treated as showtext(s), in the same way as is done in termrefs. The result of converting such a showtext would then be taken as the value to add to, or remove from, the mrg-under-construction.
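To make the request concrete, here is a minimal sketch of resolving the values of a term[...] instruction through form phrases. The `MrgEntry` shape and the function names are hypothetical simplifications, not the MRGT's actual code. The sample data leaves out the macro entry from the curated text above, and the `owner` entry is made up for illustration.

```typescript
// Hypothetical, simplified shape of an MRG entry; a real MRG entry has more fields.
interface MrgEntry {
  term: string;
  formPhrases: string[];
}

// Return the entry whose term or form phrases contain the given text, if any.
function findEntryByFormPhrase(entries: MrgEntry[], text: string): MrgEntry | undefined {
  return entries.find(
    (entry) => entry.term === text || entry.formPhrases.includes(text)
  );
}

// Map every value of a `term[...]` instruction to the term that is actually defined.
// Values that match no entry are silently skipped in this sketch.
function resolveTerms(entries: MrgEntry[], values: string[]): string[] {
  const resolved = new Set<string>();
  for (const value of values) {
    const entry = findEntryByFormPhrase(entries, value);
    if (entry !== undefined) {
      resolved.add(entry.term);
    }
  }
  return Array.from(resolved);
}

// Illustrative data: "management" follows the curated text quoted above (macro entry omitted);
// the "owner" entry is invented for this example.
const essifLab: MrgEntry[] = [
  {
    term: "management",
    formPhrases: ["management", "manager", "manage", "manages", "managed", "managed-by", "managing"],
  },
  { term: "owner", formPhrases: ["owner", "owners", "owned"] },
];

console.log(resolveTerms(essifLab, ["manage", "management", "owned"]));
// → ["management", "owner"]
```

With this kind of lookup, a curator can write manage or owned without knowing which form is the defined term.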

Ca5e commented 11 months ago

By refactoring the part of the MRGT that handles the terminology under construction, the formphrases are now also checked when term or -term is used. It isn't handled quite the same as showtext (yet), in the sense that the values are not converted to lowercase, the characters '() aren't removed, and remaining strange characters aren't replaced by dashes. This can, of course, still be done with something like the following.

    if (key === "term") {
      values = values.map((value) =>
        value
          .toLowerCase()
          .replace(/['()]+/g, "")
          .replace(/[^a-z0-9_-]+/g, "-")
      )
    }

RieksJ commented 11 months ago

Time is a miraculous thing - it has the ability to change one's mind.

When I read @Ca5e's comment above, it occurred to me that there is no real need for this. Curators who want to add whatever is necessary to accommodate the term manage could use formPhrases[manage] rather than term[manage]. That would work if the MRG ensured that (a) the formPhrases field of an MRG Entry does not contain macros (as is currently specified), and (b) the value of the term field is always included in the formPhrases field (even if the formPhrases field wasn't specified in the curated text).

From a usability perspective, I can see that what this issue proposes might be preferable. But there is also a case to be made that curators should know what they are doing, and changing some examples so that they point out this difference might just do it.

Ca5e commented 11 months ago

Currently, the contents of the formPhrases field (in the header of a curated text) are described as a comma-separated list of form phrases. As we will be using the formPhrases field within the TRRT and MRGT differently from now on, it may be wise to start using the YAML convention for a list. This ensures correct handling in the future, as the value can be interpreted as a list without first having to verify its type and possibly having to split a string on commas.

So

formPhrases: actors, actor's, actor(s)

should become

formPhrases: [actors, actor's, actor(s)]

with the following, of course, also being valid.

formPhrases:
- actors
- actor's
- actor(s)

I'd propose the same thing for the grouptags field, but this already seems to be the case according to the example listed in the specs here. It just doesn't seem to be honored within (some of) the curated texts (for example ict.md). This also means that existing references to grouptags, if any, may not work as expected right now: tools, quality-assurance, for example, would be interpreted as one list item instead of two.
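The type check and comma split that a proper YAML list would make unnecessary could look roughly like the sketch below. `toList` is a hypothetical helper, not a function from the TRRT or MRGT; it takes the already-parsed header value and returns a list of strings.

```typescript
// Normalize a formPhrases (or grouptags) value taken from a parsed curated-text header.
// Accepts the legacy comma-separated string as well as a YAML list.
function toList(value: unknown): string[] {
  if (Array.isArray(value)) {
    // Already a list: keep the items as they are, only trimming whitespace.
    return value.map((item) => String(item).trim()).filter((item) => item.length > 0);
  }
  if (typeof value === "string") {
    // Legacy form: a single string that has to be split on commas.
    return value.split(",").map((item) => item.trim()).filter((item) => item.length > 0);
  }
  // Field absent or of an unexpected type.
  return [];
}

console.log(toList("actors, actor's, actor(s)"));
// → ["actors", "actor's", "actor(s)"]

console.log(toList(["actors", "actor's", "actor(s)"]));
// → ["actors", "actor's", "actor(s)"]
```

A tool that skips this step and expects a list will see the plain string tools, quality-assurance as a single item, which is the grouptags problem described above.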

RieksJ commented 11 months ago

There is a problem here. I have tried to do that, but Docusaurus gives an error:

[image: screenshot of the Docusaurus error]

The error is due to the fact that { and } are used as the formphrase-macro delimiters. @Ca5e How do you suggest we solve this? Would it help to surround list-elements with quotes, as in

formPhrases: { "element{ss}" }

Ca5e commented 11 months ago

You're right. The following syntaxes should all be correct however.

formPhrases:
- actor{ss}

formPhrases: ['actor{ss}']

formPhrases: ["actor{ss}"]

Ca5e commented 10 months ago

I've changed the formPhrases in the essiflab/framework and tno-terminology-design/tev2-specifications repos to match the above formats. The MRGs will default to using the first example format.

On another note, I believe we could simplify/reduce the MRGs a bit. I believe the term (id) can only contain alphanumeric characters. In that case, there is no reason to list forms with other characters within the formPhrases, unless we want to keep them from a user-experience point of view.

- "author"
- "authors"
- "author's"
- "author(s)"

May very well become

- "author"
- "authors"
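A sketch of why the list could shrink this way, assuming form phrases are converted with the same steps as the snippet earlier in this thread (lowercase, strip '(), replace other characters with dashes). The names `regularize` and `regularizedForms` are hypothetical, and the conversion the specs settle on may differ.

```typescript
// Conversion steps taken from the snippet earlier in this thread.
function regularize(text: string): string {
  return text
    .toLowerCase()                  // "Author" and "author" become the same text
    .replace(/['()]+/g, "")         // drop apostrophes and parentheses
    .replace(/[^a-z0-9_-]+/g, "-"); // any other run of characters becomes one dash
}

// Convert every form phrase and drop duplicates, keeping first-seen order.
function regularizedForms(formPhrases: string[]): string[] {
  return Array.from(new Set(formPhrases.map(regularize)));
}

console.log(regularizedForms(["author", "authors", "author's", "author(s)"]));
// → ["author", "authors"]
```

The trade-off is that matching then needs the same conversion applied to the showtext before the lookup.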
RieksJ commented 10 months ago

I can't quite see the full consequences of this other note (yet?). Currently, showtext-matching to formPhrases is straightforward: the showtext, as it is, must match one of the formphrases. This is easy enough to explain and to grasp.

If we were to go with the proposal, that would introduce a new conversion step where the showtext is first modified before it is matched. That needs to be explained, and it complicates things for users. Also, a proper specification (and explanation) has to exist before we can start pondering any (un)wanted consequences, which is a prerequisite for deciding whether or not to do this.

My first impression is that this is not worth it, but I am open to being convinced...

RieksJ commented 10 months ago

I've given this some thought.

Here are some observations:

  1. Authors should be able to define/document semantic units that they should be able to refer to with terms such as C++, B+ tree, H2O, 4G network, file_path, user_name etc.
  2. The way in which we currently resolve TermRefs is by processing the named capturing group (ncg) term or, if it does not exist, showtext. This processing aims to turn that ncg into a text that is looked up in the formPhrases arrays of MRG entries. Note that this process identifies an MRG entry, not the term (at least not directly).

From this, I conclude that it is better to think of

This has the following consequences:

@Ca5e : please comment if you think this poses problems

Ca5e commented 9 months ago

When the versions of the TRRT and HRGT are updated, they will use the new formPhrase behaviour. The tev2-specifications and framework repo still need to be updated.

RieksJ commented 9 months ago

We have introduced 'formphrases' as regular authorizable texts with spaces, special chars etc., and 'regularized texts' (including regularized formphrases) as texts that don't contain such characters (see the definition of 'regularized text', also for the conversion process).

RieksJ commented 9 months ago

Termselection is also done by the termselection criteria of the importer. I can see the benefit of allowing curators to add terms by using texts that are treated as showtexts. However, I also think that we shouldn't make exceptions to the simple syntax we currently have, e.g., by treating the term... instruction differently than we would, say, glossaryText.

What do you think about adding an instruction ADD (possibly also REMOVE) that would take a list of strings as argument, and treat it as a showtext for finding an MRG entry to be imported?

RieksJ commented 9 months ago

Decision: if key isn't specified in an instruction, then the argument list is considered to be a list of showtexts. Thus, we can say, e.g.,

Ca5e commented 9 months ago

Functionality of termselection adding and removing without specifying a key has been added to MRGT v1.0.4. There are still some TBDs related to this at the end of the specs here. I suppose we can leave this issue open until those have been taken care of.

RieksJ commented 9 months ago

The syntax [ "showtext1", "showtext 2"]@<tid> is illegal YAML. VSCode says "YAML syntax error Unexpected scalar at node end". That is because of the @<tid>. When using MRGT 1.0.4, it says:

[image: screenshot of the MRGT 1.0.4 error message]

So we need another syntax for this.

RieksJ commented 9 months ago

Proper YAML syntax is to be used, i.e. stuff should be surrounded with quotes
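
A sketch of how a tool could read such an instruction once the whole thing is one quoted YAML string, as in the termselection line at the top of this issue. The grammar, the `Instruction` shape and `parseInstruction` are assumptions for illustration; the syntax finally adopted in the specs is not shown in this thread. A missing key means the values are showtexts, per the decision above.

```typescript
// Hypothetical result of parsing one termselection instruction.
interface Instruction {
  action: "add" | "remove";
  key: string | undefined; // undefined: the values are to be treated as showtexts
  values: string[];
  tid: string | undefined; // terminology id after "@", if present
}

// Parse strings such as "term[a,b]@tid", "-term[a]" or "[a, b c]@tid".
// Assumed grammar: optional "-", optional key, "[" comma-separated values "]", optional "@tid".
function parseInstruction(line: string): Instruction | undefined {
  const match = /^(-?)([A-Za-z]*)\[([^\]]*)\](?:@(.+))?$/.exec(line.trim());
  if (match === null) {
    return undefined; // not an instruction of the assumed form
  }
  const [, minus, key, list, tid] = match;
  return {
    action: minus === "-" ? "remove" : "add",
    key: key === "" ? undefined : key,
    values: list.split(",").map((v) => v.trim()).filter((v) => v.length > 0),
    tid: tid,
  };
}

console.log(parseInstruction("term[manage,management]@essif-lab"));
console.log(parseInstruction("[manage, managed by]@essif-lab"));
console.log(parseInstruction("-term[owner]"));
```

Because the YAML parser only ever sees a plain string, the @<tid> suffix no longer trips it up; interpreting the brackets is left to the tool.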