allow termselection to use formphrases

RieksJ commented 11 months ago

When selecting terms from eSSIF-Lab for the TEv2-documentation terminology, there is a termselection line

  - "term[manage,management,governance,objective,risk,owner,owned]@essif-lab"

The term manage doesn't exist in the essif-lab terminology. However, there is a curated text in eSSIF-Lab that has:

term: management
formPhrases: management, manager, manage, manages, managed, managed-by, management, managing, managing-part{yies}

When working with these terms in practice, authors/curators would use all these forms, and would not necessarily know which is the one that is actually defined. Requiring curators to go look for that is perhaps a bit overkill.

This issue calls for an enhancement in which term (as well as -term) is treated differently from other fields: its value(s) should be treated as showtext(s), in the same way as in termrefs, and the result of converting such a showtext would then be taken as the value to add to or remove from the MRG under construction.
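
A minimal sketch of the requested behaviour, under stated assumptions: the entry shape (`{ term, formPhrases }`), the function names `findEntry` and `applyTermSelection`, and the modelling of the MRG under construction as a `Map` are all hypothetical and not taken from the MRGT source. It also assumes that form-phrase macros such as `{yies}` have already been expanded into plain strings.

```javascript
// Hypothetical, simplified MRG entries from the source terminology.
// Assumption: form-phrase macros have already been expanded into plain strings.
const sourceEntries = [
  { term: "management", formPhrases: ["management", "manager", "manage", "manages", "managed"] },
  { term: "risk", formPhrases: ["risk", "risks"] },
];

// Find the entry whose term or form phrases contain the given value.
function findEntry(value, entries) {
  return entries.find(
    (entry) => entry.term === value || entry.formPhrases.includes(value)
  );
}

// Apply one `term[...]` (remove = false) or `-term[...]` (remove = true) instruction
// to the MRG under construction, modelled here as a Map keyed by term.
function applyTermSelection(values, entries, mrgUnderConstruction, remove = false) {
  for (const value of values) {
    const entry = findEntry(value, entries);
    if (!entry) continue; // nothing matches: skip the value
    if (remove) mrgUnderConstruction.delete(entry.term);
    else mrgUnderConstruction.set(entry.term, entry);
  }
  return mrgUnderConstruction;
}

const mrg = applyTermSelection(["manage", "risk"], sourceEntries, new Map());
console.log([...mrg.keys()]); // [ 'management', 'risk' ]
```

With this approach, `term[manage]` selects the entry for management even though manage itself is not a defined term.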

Ca5e commented 11 months ago

By refactoring the part of the MRGT that handles the terminology under construction, the formphrases are now also checked when term or -term is used. It isn't handled quite the same as showtext (yet), in the sense that the values are not converted to lowercase, the characters '() aren't removed, and remaining strange characters aren't replaced by dashes. This can, of course, still be done with something like the following.

    if (key === "term") {
      values = values.map((value) =>
        value
          .toLowerCase()
          .replace(/['()]+/g, "")
          .replace(/[^a-z0-9_-]+/g, "-")
      )
    }
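
To see what such a conversion would produce, the same three steps can be wrapped in a small function. The name `toTermId` and the sample inputs are illustrative assumptions, not part of the MRGT.

```javascript
// Hypothetical helper wrapping the conversion steps described in this comment.
function toTermId(value) {
  return value
    .toLowerCase()                  // "Actor(s)" -> "actor(s)"
    .replace(/['()]+/g, "")         // drop single quotes and parentheses
    .replace(/[^a-z0-9_-]+/g, "-"); // any other run of characters becomes one dash
}

console.log(toTermId("Actor(s)"));   // actors
console.log(toTermId("actor's"));    // actors
console.log(toTermId("Managed By")); // managed-by
```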

RieksJ commented 11 months ago

Time is a miraculous thing - it has the ability to change one's mind.

When I read @Ca5e's comment above, it occurred to me that there is no real need for this. Curators who want to select the entry that covers the term manage could use formPhrases[manage] rather than term[manage]. That would work if the MRG ensured that the formPhrases field of an MRG entry does not contain macros (as is currently specified) and that the value of the term field is always included in the formPhrases field (even if the formPhrases field wasn't specified in the curated text).

From a usability perspective, I can see that what this issue proposes might be preferable. But there is also a case to be made that curators should know what they are doing, and changing some examples so that they point out this difference might just do it.

Ca5e commented 11 months ago

Currently, the contents of the formPhrases field (in the header of a curated text) are described as a comma-separated list of form phrases. As we will be using the formPhrases field within the TRRT and MRGT differently from now on, it may be wise to start using the YAML convention for a list. This ensures correct handling in the future, as the value can be interpreted as a list without first having to verify its type and possibly having to split a string on commas.

So

formPhrases: actors, actor's, actor(s)

should become

formPhrases: [actors, actor's, actor(s)]

with the following, of course, also being valid.

formPhrases:
- actors
- actor's
- actor(s)

I'd propose the same thing for the grouptags field, but this already seems to be the case according to the example listed in the specs here. It just doesn't seem to be honored within (some of) the curated texts (for example ict.md). This also means that existing references to grouptags, if any, may not work as expected right now, as tools, quality-assurance, for example, would be interpreted as one list item instead of two.
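
To illustrate the type check and comma split that the YAML list convention would make unnecessary, here is a rough sketch. The helper name `toList` is hypothetical; the only assumption about YAML is that a line such as `grouptags: tools, quality-assurance` is parsed as a single string rather than as a list.

```javascript
// Hypothetical helper: accept either a YAML list (already an array after parsing)
// or the older comma-separated string, and always return an array of strings.
function toList(fieldValue) {
  if (Array.isArray(fieldValue)) {
    return fieldValue.map((item) => String(item).trim());
  }
  if (typeof fieldValue === "string") {
    return fieldValue
      .split(",")
      .map((item) => item.trim())
      .filter((item) => item !== "");
  }
  return []; // field missing or of an unexpected type
}

// What a YAML parser yields for `grouptags: tools, quality-assurance`: one string.
const parsedGrouptags = "tools, quality-assurance";

// Treating that value as a list without splitting gives a single item.
console.log([parsedGrouptags].length);       // 1
// Splitting on commas recovers the two intended items.
console.log(toList(parsedGrouptags).length); // 2
```

If all curated texts use the list convention, a tool can use the parsed value directly and the string branch is no longer needed.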

RieksJ commented 11 months ago

There is a problem here. I have tried to do that, but Docusaurus gives an error:

The error is due to the fact that { and } are used as the formphrase-macro delimiters. @Ca5e How do you suggest we solve this? Would it help to surround list-elements with quotes, as in

formPhrases: { "element{ss}" }

Ca5e commented 11 months ago

You're right. The following syntaxes should all be correct however.

formPhrases:
- actor{ss}
formPhrases: ['actor{ss}']
formPhrases: ["actor{ss}"]

Ca5e commented 10 months ago

formPhrases:
- actor{ss}
formPhrases: ['actor{ss}']
formPhrases: ["actor{ss}"]

I've changed the formPhrases in the essiflab/framework and tno-terminology-design/tev2-specifications repos to match the above formats. The MRGs will default to using the first example format.

On another note, I believe we could simplify/reduce the MRGs a bit. I believe the term(id) can only contain alphanumeric characters. In that case, there is no reason to list form phrases that differ only in other characters, unless we want to keep them from a user-experience point of view.

- "author"
- "authors"
- "author's"
- "author(s)"

May very well become

- "author"
- "authors"

RieksJ commented 10 months ago

This other note is something whose consequences I can't quite foresee (yet?). Currently, showtext-matching to formPhrases is straightforward: the showtext as it is must match one of the formphrases. This is easy enough to explain and to grasp.

If we were to go with the proposal, that would introduce a new conversion step where the showtext is first modified before it is matched. That needs to be explained, and it introduces a complication for users. Also, a proper specification (and explanation) has to exist before we can start pondering any (un)wanted consequences, which is a prerequisite for deciding whether or not to do this.

My first impression is that this is not worth it, but I am open to being convinced...

RieksJ commented 10 months ago

I've given this some thought.

Here are some observations:

- Authors should be able to define/document semantic units that they can refer to with terms such as C++, B+ tree, H2O, 4G network, file_path, user_name, etc.
- The way in which we currently resolve TermRefs is by processing the named capturing group (ncg) term or, if it does not exist, showtext. This processing aims to turn that ncg into a text that is looked up in the formPhrases arrays of MRG entries. Note that this process identifies an MRG entry, not the term (at least not directly).

From this, I conclude that it is better to think of

- formphrases as the set of (lowercased, human-readable) texts that are used by authors to refer to the semantic unit (concept) as documented by an MRG entry
- terms as the single text that satisfies regex [a-z_0-9-]+ and that, in combination with termTypes, is used by machines to refer to the semantic unit documented by an MRG entry.

This has the following consequences:

- [x] TermRef resolution takes the ncg (term or showtext), converts it to lowercase and trims whitespace off its edges, and then matches it with formphrases. Special characters remain as they are (so we're no longer going for the 'markdown-like-heading-ids'). This provides users with the maximum flexibility in referring to semantic units.
- [x] A formphrase is simply a (lowercased, spaces-trimmed) character string that is an element in the formPhrases field of an MRG entry. The formPhrases fields in the tev2-specifications repo and the essif-lab framework repo need to be revised (removing many of the - characters).
- [x] The above has to be accommodated in the tools (TRRT and MRGT, I would guess).
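
The matching rule described in these consequences could be sketched roughly as follows. The entry data, the term ids, and the function name `resolveTermRef` are hypothetical; this is not the TRRT implementation, only an illustration of "lowercase, trim, then match exactly against formPhrases".

```javascript
// Hypothetical MRG entries; formPhrases are stored lowercased and trimmed,
// while each term id satisfies [a-z_0-9-]+.
const mrgEntries = [
  { term: "cpp", formPhrases: ["c++"] },
  { term: "b-plus-tree", formPhrases: ["b+ tree", "b+ trees"] },
  { term: "actor", formPhrases: ["actor", "actors", "actor's", "actor(s)"] },
];

// Resolve a TermRef: prefer the `term` named capturing group, fall back to `showtext`,
// lowercase and trim it, and look for an exact match among the form phrases.
// Special characters are left untouched.
function resolveTermRef({ term, showtext }, entriesToSearch) {
  const lookup = (term || showtext || "").toLowerCase().trim();
  return entriesToSearch.find((entry) => entry.formPhrases.includes(lookup));
}

console.log(resolveTermRef({ showtext: "  C++ " }, mrgEntries)?.term);   // cpp
console.log(resolveTermRef({ showtext: "Actor(s)" }, mrgEntries)?.term); // actor
console.log(resolveTermRef({ showtext: "actor-s" }, mrgEntries));        // undefined
```

Note that the function returns an MRG entry rather than a term, in line with the observation that resolution identifies an entry and only indirectly the term.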

@Ca5e : please comment if you think this poses problems

Ca5e commented 9 months ago

When the versions of the TRRT and HRGT are updated, they will use the new formPhrase behaviour. The tev2-specifications and framework repo still need to be updated.