lutaml / expressir

Ruby parser for the ISO EXPRESS language

Ability to only extract current node's code (instead of all inner code) #53

Closed ronaldtse closed 3 years ago

ronaldtse commented 3 years ago

In the ISO 10303 series this usage is common:

Screen Shot 2021-01-21 at 12 41 58 AM

Notice here the SCHEMA part differs from the original code (screenshot of Metanorma output):

Screen Shot 2021-01-21 at 12 42 24 AM

Currently in Expressir if we get the source code, it comes with all the inner code. However, in this usage we only want the SCHEMA and REFERENCE FROMs.

There also seems to be some funky formatting in the extracted REFERENCE FROM:

REFERENCE FROM basic_attribute_schema
  (description_attributedescription_attribute_selectget_description_valueget_id_valueget_name_valueget_roleid_attributeid_attribute_selectname_attributename_attribute_selectobject_rolerole_selectrole_association);

Instead of

REFERENCE FROM basic_attribute_schema
  (description_attribute,
   description_attribute_select,
   get_description_value,
   get_id_value,
   get_name_value,
   get_role,
   id_attribute,
   id_attribute_select,
   name_attribute,
   name_attribute_select,
   object_role,
   role_select,
   role_association);

ronaldtse commented 3 years ago

As described by @opoudjis

zakjan commented 3 years ago

Re the bug, it's weird, source attribute should contain original tokens without any change. I added a test that confirms it. https://github.com/lutaml/expressir/blob/7ae5967b01895b7be82427697543495142eb721a/spec/expressir/express_exp/source_spec.rb

Is it possible that the string changes along the way?

zakjan commented 3 years ago

Based on the different schema header, with vs. without schema version id, it seems that this is not the original source, but a generated source.

Source for any item can be obtained by calling Formatter on the object. For all interfaces, it would be schema.interfaces.map{|x| Formatter.format(x)}.join("\n\n"). Is there any progress about the possibility of calling Formatter from Liquid?
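A minimal sketch of the one-liner above, using stand-in classes rather than the real Expressir model. `Interface`, `Schema` and `DemoFormatter` are hypothetical and only mimic the shape implied in this comment (a schema exposing `interfaces`, and a formatter with a class-level `format`); the exact output layout of the real Formatter may differ.

```ruby
# Stand-ins for illustration only; these are NOT the Expressir classes.
Interface = Struct.new(:kind, :schema_id, :items)
Schema = Struct.new(:id, :interfaces)

class DemoFormatter
  # Render one interface specification, one item per line.
  def self.format(interface)
    head = "#{interface.kind} FROM #{interface.schema_id}"
    return "#{head};" if interface.items.empty?

    "#{head}\n  (#{interface.items.join(",\n   ")});"
  end
end

schema = Schema.new(
  "action_schema",
  [
    Interface.new("REFERENCE", "basic_attribute_schema", %w[description_attribute get_role]),
    Interface.new("REFERENCE", "support_resource_schema", %w[bag_to_set identifier label text])
  ]
)

# Same shape as the expression quoted in the comment: format each
# interface separately, then join them with a blank line in between.
interfaces_source = schema.interfaces.map { |x| DemoFormatter.format(x) }.join("\n\n")
puts interfaces_source
```

The point of the sketch is only that the interfaces can be formatted one by one, without touching the declarations inside the schema.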

ronaldtse commented 3 years ago

> Is it possible that the string changes along the way?

For @opoudjis to answer.

> Is there any progress about the possibility of calling Formatter from Liquid?

Ping @w00lf

ronaldtse commented 3 years ago

> Based on the different schema header, with vs. without schema version id, it seems that this is not the original source, but a generated source.

@zakjan this is unfortunately the source and target texts. Moreover, we also need a mechanism to read the ASN.1 representation inside that version string.

@TRThurman can we confirm that the generated output of:

SCHEMA action_schema '{iso standard 10303 part(41) version(7) object(1) action_schema(1)}';

must become:

SCHEMA action_schema;

?

Is there some current mechanism in the EXPRESS XML files that strips away the version string?

zakjan commented 3 years ago

I can either remove the version from the Formatter's default output, or the Formatter can accept an options object to customise the output, in case the version is needed in the output for other purposes. The version string can also be parsed into separate fields. Either is fine.
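As an illustration of parsing the version string into separate fields, here is a rough sketch. `parse_schema_version` is a hypothetical helper, not part of Expressir, and it assumes the version id always has the whitespace-separated shape shown in this thread (bare names, bare numbers, and `name(number)` components inside braces).

```ruby
# Hypothetical helper, not part of Expressir: split a schema version id
# written in ASN.1 object-identifier style into its components.
def parse_schema_version(version)
  # Drop the surrounding quotes and braces, then split on whitespace.
  body = version.strip.delete_prefix("'").delete_suffix("'").strip
  body = body.delete_prefix("{").delete_suffix("}")

  body.split.map do |component|
    if (m = component.match(/\A([A-Za-z_][\w\-]*)\((\d+)\)\z/))
      { name: m[1], value: m[2].to_i }       # name(number) form, e.g. part(41)
    elsif component.match?(/\A\d+\z/)
      { name: nil, value: component.to_i }   # bare number, e.g. 10303
    else
      { name: component, value: nil }        # bare name, e.g. iso
    end
  end
end

fields = parse_schema_version("'{iso standard 10303 part(41) version(7) object(1) action_schema(1)}'")

# Keep only the components that carry both a name and a number.
named = fields.select { |f| f[:name] && f[:value] }.to_h { |f| [f[:name], f[:value]] }
p named
```

With fields like these available, a formatter option could decide whether to print the version string, while other code reads the part and version numbers directly.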

However, we need to find out where to call Formatter in the current chain of libraries, so that it plays well with Liquid.

w00lf commented 3 years ago

> > Is it possible that the string changes along the way?
>
> For @opoudjis to answer.
>
> > Is there any progress about the possibility of calling Formatter from Liquid?
>
> Ping @w00lf

Hi there, I have not looked into it yet. @zakjan, can you please remind me how I can call the formatter in the latest version of expressir?

TRThurman commented 3 years ago

Where do you see this behavior?

ronaldtse commented 3 years ago

@TRThurman the left of this image is the original; the right is from Metanorma.

Screen Shot 2021-01-22 at 2 08 26 PM

Notice that on the left side the "EXPRESS specification" block ends at REFERENCE FROM, but on the right side it includes TYPE and everything down to END_SCHEMA. This is because on the right side we have taken the entire SCHEMA block from the .exp file.

The difference stems from the differing document models.

The point is that the "EXPRESS specification" block that describes SCHEMA action_schema only shows a small part of the schema (SCHEMA and REFERENCE FROM), and does not extend to END_SCHEMA.

In the original, the document text is considered part of the EXPRESS code, but this is applied inconsistently.

The EXPRESS snippets in the document are wrapped with *) ... text ... (*. This means the document text is treated as being contained within "remark tags", i.e. the whole file is considered an .exp file. (I suspect this practice started off in real EXPRESS files).

However this application is inconsistent in at least 2 ways:

  1. The document header definitely does not start with (*. [screenshot]
  2. The document footer does not end with *). [screenshot]

This practice seems to have originated in the early days, when the "document is fully expressed in the EXPRESS code" and each document contained only one schema.

The current publication chain already breaks the assumption that "the document is a valid EXPRESS file": the *) ... text ... (* wrapping is therefore semantically irrelevant and only "pretends" to be meaningful. So I believe this practice is probably outdated.

In Metanorma, we are stitching two things together:

  1. Text from the document
  2. EXPRESS code with annotations

The former exists in the .xml files, and the latter exists in the .exp files (in the new way). Therefore item 2 is extracted from the .exp files and placed inside the document.

Thus this task is to find a way to extract only the SCHEMA action_schema + REFERENCE FROM lines from the .exp file without the other content from the SCHEMA.

Is this assumption correct?
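For illustration, the extraction described above could be approximated directly on the raw text. This is a naive line-based sketch and an assumption of this example, not how Expressir obtains source; `schema_head` is a hypothetical helper. It keeps the SCHEMA line and the interface specifications that follow it, and stops at the first line that starts another declaration.

```ruby
# Hypothetical helper: return the "head" of the first schema in an
# EXPRESS source string (SCHEMA line plus REFERENCE FROM / USE FROM).
def schema_head(exp_source)
  # Keywords that start the schema body or end the schema.
  # EXPRESS keywords are case-insensitive, hence the /i flag.
  body_start = /\A\s*(CONSTANT|TYPE|ENTITY|SUBTYPE_CONSTRAINT|FUNCTION|PROCEDURE|RULE|END_SCHEMA)\b/i

  head = []
  in_schema = false
  exp_source.each_line do |line|
    in_schema = true if line.match?(/\A\s*SCHEMA\b/i)
    next unless in_schema          # skip anything before the SCHEMA line
    break if line.match?(body_start)

    head << line
  end
  head.join.rstrip
end

source = <<~EXP
  SCHEMA action_schema '{iso standard 10303 part(41) version(8) object(1) action_schema(1)}';
  REFERENCE FROM support_resource_schema
    (bag_to_set,
     identifier,
     label,
     text);
  TYPE as_description_attribute_select = SELECT BASED_ON description_attribute_select WITH (action_request_solution);
  END_TYPE;
  END_SCHEMA;
EXP

puts schema_head(source)
```

A line-based approach like this would be fooled by a multi-line remark that happens to contain one of the keywords at the start of a line, which is one reason to prefer doing this on the parsed model instead.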

zakjan commented 3 years ago

@w00lf Currently it's Expressir::ExpressExp::Formatter.format(node).

Btw, I'm considering renaming ExpressExp to just Express, to follow the previous naming. Not now, but later, together with the cleanup of the previous code in #52, #15.

TRThurman commented 3 years ago

Keith, We need your input here: Context is the metanorma EXPRESS source to HTML tool chain.

There is information loss from the EXPRESS resource schemata source to the html in the current publication pipeline. The following screenshots are from SMRLv8. Item 1 is clause 4 of part 41, item 2 is the html of the Annex C computer interpretable listing, and item 3 is the EXPRESS source in the Annex C computer interpretable listing.

The Metanorma team is requesting guidance from us on presentation of the SCHEMA declaration in the SRL moving forward. This is a document presentation issue only. No change to EXPRESS source.

My recommendation to WG12: Align the clause 4 html and Annex C html with the EXPRESS source for the SCHEMA declaration. Rationale: This aligns all presentations of the EXPRESS schema in the document and aids the reader in correlating the clause 4 with the actual EXPRESS.

regards, Tom

P.S. Lower in this email is the original issue from the Metanorma team. Their current generated artifact has a minor layout issue in the lists of items in, e.g., REFERENCE FROM and USE FROM, which will be corrected.

Item 1: [screenshot]

Item 2: [screenshot]

Item 3: [screenshot]

TRThurman commented 3 years ago

There was some issue I cannot recall with putting the full header in the document. I think it may have been that the tool pipeline removes all comments from the EXPRESS. However there is an outside chance ISO CS Editors may object to the statements in the header being published in the document, even though they insisted that those statements be put in the electronic insert.

Is there something smaller we could prototype and send to them for approval?

I have no issue with the full header being in the document myself.

Keith

TRThurman commented 3 years ago

Thanks!

ISO required the full header in the EXPRESS text file, which we do comply with. This issue is only about this line:

"SCHEMA action_schema.....;"

Therefore we will go ahead with making the html reflect the EXPRESS for that declaration.

Tom

ronaldtse commented 3 years ago

> Therefore we will go ahead with making the html reflect the EXPRESS for that declaration.

Thank you @TRThurman and @kahunten!!

ronaldtse commented 3 years ago

@zakjan so there is no need to strip away the EXPRESS version string anymore; the EXPRESS definitions should be faithfully reproduced in the document.

opoudjis commented 3 years ago

@zakjan

> Re the bug, it's weird, source attribute should contain original tokens without any change. I added a test that confirms it. https://github.com/lutaml/expressir/blob/7ae5967b01895b7be82427697543495142eb721a/spec/expressir/express_exp/source_spec.rb
>
> Is it possible that the string changes along the way?

I am having difficulty seeing how. I invoke

[source%unnumbered]
--
{{ schema.sourcecode }}
--

and I get:

<sourcecode id="_0aea3a57-938e-4049-a2a1-6f3b270ffa72" unnumbered="true">SCHEMA action_schema '{iso standard 10303 part(41) version(8) object(1) action_schema(1)}';
REFERENCE FROM basic_attribute_schema
  (description_attributedescription_attribute_selectget_description_valueget_id_valueget_name_valueget_roleid_attributeid_attribute_selectname_attributename_attribute_selectobject_rolerole_selectrole_association);
REFERENCE FROM support_resource_schema
  (bag_to_setidentifierlabeltext);
TYPE as_description_attribute_select = SELECT BASED_ON description_attribute_select WITH (action_request_solution);
END_TYPE;
....

schema is all one big dump of stuff. I am not doing a thing to that dump, it's a one-liner in lutaml, and I have no idea what to debug.

The fact that the commas, not just the carriage returns, are being collapsed in one specific semantic class of text does make me think that this is happening in expressir; metanorma has no reason to differentiate any of those lines, and certainly doesn't touch punctuation.

zakjan commented 3 years ago

Ok, I'm sorry, this is a bug in formatter indeed. I'll fix it later today.

Note that the new attribute with original source is called source.

ronaldtse commented 3 years ago

Let's deal with the source formatter formatting issue in #57.

ronaldtse commented 3 years ago

@zakjan just to clarify, this current task is still active -- we still need to have the ability to obtain the source of the "current node" WITHOUT the inner nodes.

In this particular case, it is SCHEMA ... + the schema's REFERENCE FROM ... nodes.

zakjan commented 3 years ago

Original source or formatter output? What are the other cases?

ronaldtse commented 3 years ago

> Original source or formatter output? What are the other cases?

Preferably configurable.

We must at least support SCHEMA ... + the schema's REFERENCE FROM. As long as there is a way to get this output.

zakjan commented 3 years ago

> Preferably configurable

They are completely separate implementations, one must be chosen first

ronaldtse commented 3 years ago

> They are completely separate implementations, one must be chosen first

Both should be accessible, because the use cases are different. Sometimes we need the raw source. Sometimes we want the formatted source.

I don't understand what needs to be "chosen" here?

zakjan commented 3 years ago

There is nothing shared between them, they come from different places. I'll need to choose one to start with, and it is going to be the first one available. Which one is preferred for the current use case?

ronaldtse commented 3 years ago

If the formatted version can provide all the original remarks (tail remarks, embedded remarks), we should do that first. Otherwise, the raw version.

zakjan commented 3 years ago

Formatted version currently provides only tagged remarks (both tail and embedded), but no untagged remarks. They are printed at the end of entire output, after the last END_SCHEMA. So it seems we're going for raw version, thanks.

ronaldtse commented 3 years ago

@zakjan For the formatted version I'm concerned with this type of tail remarks:

REFERENCE FROM basic_attribute_schema
  (description_attribute,
   description_attribute_select, -- this kind of comment

Is this doable in the formatted version?

zakjan commented 3 years ago

Hmm, I'm not sure, I'll do more research about it and let you know.

zakjan commented 3 years ago

Reopening for research to add untagged remarks to the parsed model, so that it's formatted roughly at a position similar to the original source.

zakjan commented 3 years ago

@w00lf @opoudjis Please use a custom formatter for this from now on:

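# Custom formatter that emits only the schema head: the SCHEMA line
# (with its version string, if any) followed by the schema's interface
# specifications, and none of the inner declarations.
class CustomFormatter < Expressir::ExpressExp::Formatter
  def format_schema(node)
    [
      "SCHEMA #{node.id}#{node.version ? " #{format(node.version)}" : ""};",
      *node.interfaces.map { |x| format(x) }
    ].join("\n")
  end
end

# Pass the custom formatter when serialising the schema.
schema.to_hash(formatter: CustomFormatter)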

Let me know when you do the change, I'll remove schema.head_source.

zakjan commented 3 years ago

Created https://github.com/lutaml/expressir/issues/64 for including untagged remarks into parse tree and formatter output