jastram opened this issue 10 years ago
There were some ideas here https://github.com/openETCS/toolchain/issues/191
Ok, I will first give it a try right here in the issue since this is still work in progress. Hence, I invite you to comment on anything that seems weird to you. Sorry for so much text...
Update, Oct 17, 2014: Added formal specification of the tracestring and amended example section to reflect the new spec. Update, Oct 18, 2014: Added Open Issues
Note 1: In the following I will use the term "requirement" for anything that will become traceable. You may substitute "element" if that name is more convenient for you.
Note 2: Everything in [square brackets] is not yet implemented.
Generate a hierarchical tree of all traceworthy artifacts in each chapter of subset-026. Each artifact shall be uniquely addressable via a tracestring.
I hope not to have forgotten anything essential...
`[*]` = bulleted item, ... This identifier is followed by the running number of the current artifact in square brackets. If there is no number then `*` is inserted.

Note: This covers the current implementation, i.e. everything described above which was not wrapped in square brackets.
Below is a possible version of a lexer + parser written in ANTLRv4.
```antlr
lexer grammar tracestringLexer;
// generic tokens
Delimiter : '.';
EOL : [\r\n]+ -> skip;
// List tokens in DEFAULT_MODE
fragment NoNumber : '*';
fragment Number : [1-9] | NumberGe10 ;
fragment NumberNot1 : [2-9] | NumberGe10 ;
fragment NumberGe10 : [1-9][0-9]+;
fragment LowerCaseCharacter : [a-z];
fragment Character : [A-Z] | LowerCaseCharacter;
fragment AlphaNumCharacter : (Character | Number);
fragment BracketedNumber : '[' Number ']';
String : AlphaNumCharacter+;
BulletedListID : '[*]' BracketedNumber;
ParagraphID : '[' NumberNot1 ']';
Table : '[t]' TableTraceString;
Figure : '[f]' FigureTraceString;
fragment FloatingEntityNumber : Number LowerCaseCharacter?;
fragment Caption : 'C';
mode TableMode;
TableTraceString : TableID InnerTable?;
fragment TableID : (FloatingEntityNumber | NoNumber);
fragment InnerTable : Delimiter (
RowID (Delimiter (ColumnID | String))?
| ConditionID
| Caption
);
fragment RowID : '[r]' BracketedNumber;
fragment ColumnID : '[c]' BracketedNumber;
fragment ConditionID : '[C]' BracketedNumber;
mode FigureMode;
FigureTraceString : FigureID InnerFigure?;
fragment FigureID : FloatingEntityNumber;
fragment InnerFigure : Delimiter Caption;
```
```antlr
parser grammar tracestringParser;
options { tokenVocab=tracestringLexer; }
entireString : baseList (Delimiter subList)* (Delimiter floatingEntity)?;
baseList : baseListID (Delimiter baseListID)* paragraphID?;
subList : subListID (Delimiter subListID)* paragraphID?;
baseListID : String;
subListID : String | BulletedListID;
paragraphID : ParagraphID;
floatingEntity : table | figure;
table : Table;
figure : Figure;
```
Input: `1.2.3.4[5].6.[*][7][8].9.[*][1].[t]*.[r][10].Data`
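Purely as a sketch, driving the generated lexer and parser from plain Java could look roughly like this. The class names `tracestringLexer`/`tracestringParser` and the start rule `entireString` come from the grammars above; everything else (the demo class, the hard-coded input) is made up for illustration:

```java
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class TracestringParseDemo {
    public static void main(String[] args) {
        // the example tracestring from above
        String input = "1.2.3.4[5].6.[*][7][8].9.[*][1].[t]*.[r][10].Data";

        // standard ANTLR 4 pipeline: lexer -> token stream -> parser
        tracestringLexer lexer = new tracestringLexer(new ANTLRInputStream(input));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tracestringParser parser = new tracestringParser(tokens);

        // entireString is the start rule of the parser grammar above
        ParseTree tree = parser.entireString();
        System.out.println(tree.toStringTree(parser));
    }
}
```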
Note 1: The numberTexts in the following screenshots do not correspond to the original numberTexts in subset-026. The correct reference is given in the headings.
Note 2: Prefixes present in the numberTexts may sometimes be omitted because the corresponding levels are missing in the shown excerpts (see Rule 7 of the section on how to compute a tracestring for why this is).
will become:

```
1
1.[*][1]
1.[*][2]
1.[f][21a]
1.[f][21a].C
```
will become:
```
1
2
2.1
2.2
2.2.1
2.3
```
will become:
```
A3
A3.1
A3.1[2]
A3.1[2].[t]*
A3.1[2].[t]*.[r][2]
A3.1[2].[t]*.[r][2].Data
A3.1[2].[t]*.[r][2].Value
A3.1[2].[t]*.[r][2].Name
A3.1[2].[t]*.[r][3]
A3.1[2].[t]*.[r][3].Data
A3.1[2].[t]*.[r][3].Value
A3.1[2].[t]*.[r][3].Name
A3.1[2].[t]*.[r][4]
A3.1[2].[t]*.[r][4].Data
A3.1[2].[t]*.[r][4].Value
A3.1[2].[t]*.[r][4].Name
A3.1[2].[t]*.[r][5]
A3.1[2].[t]*.[r][5].Data
A3.1[2].[t]*.[r][5].Value
A3.1[2].[t]*.[r][5].Name
A3.1[2].[t]*.[r][6]
A3.1[2].[t]*.[r][6].Data
A3.1[2].[t]*.[r][6].Value
A3.1[2].[t]*.[r][6].Name
```
@morido - thanks for taking the time for this. Looks great. I have just a few comments:
@jastram
> Tables - As I mentioned, the ReqIF standard does support tables. [...] If you plan to use it after all, let me know [...].
I do not think I need this.
Currently I do the following: create one "fake parent" requirement (the "table requirement") which has all rows (again "fake requirements") as its children. Those rows then give shelter to all columns (the actual requirements). The rows are only placeholders (no metadata or other meaningful content besides backwards tracing information), whereas the table holds a rich text version of the entire table as a kind of "visual helper".
The "DOORS-Table" approach, on the other hand, just shows you a table and all the actual requirements are right in there.
From my perspective my current implementation better suits our special needs. Mainly because:
See the example below (which corresponds to the same table shown in example 3 of my original posting). It shows non-traced cells (the header-row; item 3 above), and the blue boxes (item 4).
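Just to illustrate the data model sketched above (this is not the actual implementation; all class and field names are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the "fake parent" table hierarchy described above:
// the table node carries a rich text rendering of the whole table as a "visual helper",
// the row nodes are mere placeholders, and only the cells are actual requirements.
class TableRequirement {
    final String traceString;        // e.g. "A3.1[2].[t]*"
    final String richTextRendering;  // XHTML dump of the entire table
    final List<RowPlaceholder> rows = new ArrayList<>();

    TableRequirement(String traceString, String richTextRendering) {
        this.traceString = traceString;
        this.richTextRendering = richTextRendering;
    }
}

class RowPlaceholder {
    final String traceString;        // e.g. "A3.1[2].[t]*.[r][2]"; no further content
    final List<CellRequirement> cells = new ArrayList<>();

    RowPlaceholder(String traceString) {
        this.traceString = traceString;
    }
}

class CellRequirement {
    final String traceString;        // e.g. "A3.1[2].[t]*.[r][2].Data"
    final String plainText;          // the actual, traceable cell content

    CellRequirement(String traceString, String plainText) {
        this.traceString = traceString;
        this.plainText = plainText;
    }
}
```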
> Media - ReqIF has information on how embedded content should be represented. This is a little hidden; you find it on page 57, entitled "2. Inclusion of objects that are external to the exchange XML document in the requirements authoring tool". Ignoring this, the only valid content is .png images. This could be a useful approach to representing formulas and similar stuff, but would create additional work for you.
As I have already told you orally, I currently only export the preview information of all the OLE data. For the subset-026 that usually means we end up with either `image/x-emf` or `image/x-wmf` data. Theoretically I could also dump the raw OLE BLOBs somewhere, but as you said that involves some additional work since I have to traverse the internal Word filesystem to find the correct offset within the original `.doc` where this data is stored.
Unless someone desperately needs this, I would postpone any such attempts (effectively we would end up with a plethora of proprietary file formats which are only readable if the tools used to create them are available).
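For the record, locating those BLOBs would mean walking the compound file structure of the `.doc`. A hypothetical sketch using Apache POI (the library choice, the file name and the class are assumptions, not a description of what my tool actually does):

```java
import java.io.FileInputStream;
import java.util.Iterator;

import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.Entry;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// Lists the embedded OLE objects of a .doc file by walking its internal filesystem.
public class OleObjectLister {
    public static void main(String[] args) throws Exception {
        POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("subset026.doc"));
        DirectoryEntry root = fs.getRoot();
        if (root.hasEntry("ObjectPool")) {
            // "ObjectPool" is the storage inside the Word file that holds the raw OLE BLOBs
            DirectoryEntry objectPool = (DirectoryEntry) root.getEntry("ObjectPool");
            Iterator<Entry> embeddedObjects = objectPool.getEntries();
            while (embeddedObjects.hasNext()) {
                System.out.println("Embedded OLE object: " + embeddedObjects.next().getName());
            }
        }
    }
}
```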
> Collisions - I don't see this as a big issue: I would simply perform a check after parsing the document if there are duplicates. If there are, aborting is probably the best course of action.
Since I create the requirement tree on the fly while parsing the input document, the current way of handling this is simply to throw an exception (see example below) if a requirement with a non-unique identifier is about to be added to the tree (basically there is a simple class backed by a HashMap which does a lookup on existing entries every time a requirement is inserted). Aborting is of course a failsafe action here. However, it means my tool will not produce any meaningful output (i.e. it renders itself pretty much useless). So if I come across a situation which triggers this, I will see if there is anything smart (other than rewriting the input document) I can do about it.
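Roughly the kind of lookup described above, as a minimal sketch (class and method names are made up, not the tool's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Every requirement is registered under its tracestring; a second insertion
// with the same tracestring means a collision and aborts the run.
class RequirementRegistry {
    private final Map<String, Object> knownTracestrings = new HashMap<>();

    void register(String traceString, Object requirement) {
        Object previous = knownTracestrings.putIfAbsent(traceString, requirement);
        if (previous != null) {
            throw new IllegalStateException(
                    "Duplicate tracestring '" + traceString + "' - aborting.");
        }
    }
}
```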
Quick example which would currently make the tool throw the exception:
| Raw numberText | Generated tracestring |
| --- | --- |
| 1. | 1 |
| 1.1.1 | 1.1 |
| 1.1.2 | 1.2 |
| 1.2 | 1.2 |
One more thing regarding "2. Inclusion of objects that are external to the exchange XML document in the requirements authoring tool" in the ReqIF standard:
How am I supposed to store captions in there? I guess not in the "alternative text"?
Currently I simply export an image / table and attach a `caption`-tag (resp. `figcaption`-tag) to it, as one would do for an ordinary website. This "compound object" then becomes the rich text of a table- / figure-requirement, and the caption itself is also stored in a separate child (in case anyone ever wants to trace that or the plain text is needed for any subsequent [NLP]-analysis).
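A minimal sketch of what that compound object looks like (method and file names are made up; the real exporter obviously works on the extracted data, not on hard-coded strings):

```java
// Wraps an exported image and its caption into one XHTML "compound object",
// the same way one would do it for an ordinary website.
public class FigureExport {
    static String wrapFigureWithCaption(String imageFileName, String captionText) {
        return "<figure>"
                + "<img src=\"" + imageFileName + "\" alt=\"\"/>"
                + "<figcaption>" + captionText + "</figcaption>"
                + "</figure>";
    }

    public static void main(String[] args) {
        // The result becomes the rich text of the figure requirement; the caption text
        // itself is additionally stored in a separate child requirement.
        System.out.println(wrapFigureWithCaption("figure_21a.emf", "Example caption"));
    }
}
```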
@morido
Regarding tables and objects - I am perfectly fine with the way you're handling this, I just wanted to document that this is in the standard.
Regarding collisions: aborting is in my opinion the right choice.
With regard to captions: the standard does not provide any recommendation for this. As embedded objects/images are just XHTML, you can surround them with further XHTML to add the caption - of course this will result in a human-readable, not a machine-readable caption. Your approach makes sense, from my point of view.
@morido A few remarks from the user's perspective:
@UweSteinkeFromSiemens
Hi Uwe,
> The requirement identifiers must have delimiters at the beginning and at the end (pre- and postfix); that eases automatic detection as requirement IDs without ambiguity.
Would you be OK with adding those delimiters somewhere further downstream in the toolchain? Rationale (from my side): to me this looks like adding a `\0` to a C-style char-array when you always know the string's length - superfluous...

> The identifier generation algorithm should be robust against document modifications: identifiers already assigned in a previous document version should be preserved if the document was extended or shortened.
I believe that is infeasible. The current implementation of the tracestring generation relies only on that single document you throw at it. Hence, it does not have any historical data. And I would like to keep it that way, since it allows a 1:1 mapping (even manually) between tracestrings and the original document.
Revisioning should rather be handled further downstream inside ProR (i.e. by using a diff between two reqif files). Also I would urge the ERA to keep their numbering consistent at least across minor-releases of their documents.
> Requirement identifiers are often used in discussions and conversations as references. The proposed identifiers seem somewhat unpronounceable. And they are a challenge for the human eye if you're working with them in documents. Therefore, a more human-friendly representation would be useful.
Do you have any proposals? Otherwise I would weigh absence of ambiguity over clarity.
I suppose in average conversations you rarely talk about the second paragraph of the third bullet item inside "1.2.3", do you? At most you might mention "1.2.3" - and that's a rather simple identifier which will also be known by my tool (it's just way broader than a specific paragraph).
> The requirements text - that the requirement ID identifies - should be made available for automated grabbing too.
Of course it will be. I just do not plan to expose any API to directly communicate with my tool, but would rather point you at the resulting ReqIF file, from which you may then grab whatever you need.
@morido Please provide short documentation of the requirement ID definition in the documentation wiki here: https://github.com/openETCS/toolchain/wiki/User-Documentation#TODO_Requirements_naming
@jastram @cecilebraun Do you want me to duplicate information? Each of you proposed a different document where my requirement ID definition should be included.
For the time being I just propose it here (feel free to copy it to the final destination if there are no further comments):
Note: This only covers the case of paragraphs which are part of a list. If you need to trace other artifacts (which are always children of such list paragraphs) please look into the detailed explanation in #437.
Take the following example:
Suppose we want to trace the fifth paragraph in the above example, i.e.
• End of mission is performed
1. The nearest numbered paragraph above it is `3.5.3.7`. Set traceString to this number.
2. Untitled paragraphs below that number receive a running number per level. In the first iteration there is only one such paragraph (`If the establishment...`); hence, we do not append anything. In the second iteration there are two such paragraphs (`The on-board shall...` + `If this request is not...`); hence, the second one will receive an `[2]` appendix.
3. Bullet items are identified by `[*][n]` (with n being the running number of that bullet, starting at 1). Prefix this new level with a dot (`.`) and append it to the traceString.
4. `a)` is the identifier of one such sublist item. The trailing brace will be removed. The bullet points form another (less significant) sublist.

This will result in the following requirement ID:

`3.5.3.7.a[2].[*][2]`
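Just to make the composition explicit, a tiny illustration with the values of this example hard-coded (the real tool derives them while parsing, of course):

```java
// Assembles the requirement ID from the pieces described in the steps above.
public class TracestringExample {
    public static void main(String[] args) {
        StringBuilder traceString = new StringBuilder("3.5.3.7"); // nearest numbered paragraph
        traceString.append(".a");      // sublist item "a)", trailing brace removed
        traceString.append("[2]");     // second untitled paragraph on that level
        traceString.append(".[*][2]"); // second bullet of the (less significant) sublist
        System.out.println(traceString); // prints 3.5.3.7.a[2].[*][2]
    }
}
```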
@morido - definitely no duplicates! Please use @cecilebraun's location.
@morido: Referring to your remarks on my input above: as a tool vendor it's essential to understand the user perspective. From that perspective it is irrelevant how a tool works internally and what it is able to achieve. So if the tool is not able to solve the problem automatically, a human-assisted combination of tooling and manual intervention should be planned for and provided in the chain. Therefore, it would be great to elaborate the terms "should be handled further downstream inside ProR" into concrete processing steps.
@UweSteinkeFromSiemens
Hi Uwe,
> Therefore, it would be great to elaborate the terms "should be handled further downstream inside ProR" into concrete processing steps.
I suppose you are referring to the revisioning issue.
Short answer: this is outside the scope of my tool, therefore I do not care. All I do is convert one `*.doc` file into one `*.reqif` file. Hence, I can only process one revision at a time.
Long answer: ProR Essentials includes ReqIF Diff. This could be the way to go if you want to compute the delta between different baselines of the subset-026. DOORS (or other `*.reqif`-capable RM tools) might also be of help here, as this is one of their core strengths.
Generally it should not be required to have requirement IDs stay consistent across baselines (at least as long as they are not consistent in the input `*.doc` files). IMHO this only leads to (massive) confusion. Instead, the requirements of different baselines which share common properties (i.e. which are "equal to a certain degree") should be linked, and that link should then be subject to manual checking. -- But this is only my personal opinion. You may disagree.
@morido - Please document what the format will be here:
https://github.com/openETCS/toolchain/wiki/Process-Documentation
This is needed for @UweSteinkeFromSiemens