Tag Snippet with Lexer

flywire commented 2 years ago

The user selects the lexer [language] manually, or Pygments can guess, but the lexer is not saved. It would be useful to save the lexer with the snippet and allow Code Highlighter 2 to read a tag associated with a snippet (eg similar to markdown tagging code blocks with language) :

Pygments could be rerun over a mixed coding language document without user selecting lexer
a snippet would only have one lexer tag
select whole snippet with a click
current interface would allow user to assign new lexer tag
Pygments could be rerun over document with mixed coding languages without selecting lexer

jmzambon commented 2 years ago

I thought this would be easy to implement, but I'm facing some unexpected problems. I will get back to this issue soon.

flywire commented 2 years ago

Can you expose your code for this?

jmzambon commented 2 years ago

I pushed a dedicated branch, with a first attempt.

1. Pygments could be rerun over a mixed coding language document without user selecting lexer

This is done by selecting "automatic" as language.

3. select whole snippet with a click

I don't see an easy way to do this... The best option remains putting the snippet in a text table.

4. current interface would allow user to assign new lexer tag

Could you elaborate? Isn't this already the case?

5. Pygments could be rerun over document with mixed coding languages without selecting lexer

So far, the last chosen language must be "automatic".

flywire commented 2 years ago

Can you briefly explain your approach? I'll look at code and xml but I might not pick up subtleties.

Use case - Rerun: snippet code is changed and needs to be highlighted.

rerun over a mixed coding language document - by selecting "automatic" - want it to use any existing snippet language tag not guess again
...
select whole snippet with a click - no easy way - agree, it's a hotkey or context-sensitive menu select, app already contains selection code
Yes, already the case, if user selects a language the lexer will use it and overwrite snippet language tag, maybe a skip lexer tag will be needed
Pygments could be rerun over document - last chosen language must be "automatic" - agree, see first point
user probably needs to be able to restrict language (lexers) available
~I'm concerned automatic might be too greedy, highlighting random words in text or words styled as code that should not be highlighted as in-line snippets, say OS commands.~

flywire commented 2 years ago

Automatic language lacks integrity and is fairly random. It misses too much (about half), almost all lexers were wrong, and the same code in different files gave different results: HelloWorld.zip HelloWorld.txt

I would be more interested in tagging snippets with language as the document was developed then select the whole document and update the style based on the snippet language tags. Say, forget about language automatic and use <tagged>.

flywire commented 2 years ago

Sorry, I've edited your message instead of replying. And I see no way to revert to your original post...

Despite CH2 being run with Use character styles in Writer, tags are in <office:automatic-styles>

I'd expect an <office:body> lexer tag at the start of each code block (currently coded as paragraphs, ie lines of code)

This is how the OpenDocument standard works. See the specification if you need more information.

LibreOffice is responsible for writing the document in strict compliance with that standard. Code Highlighter 2 is intended to be a LibreOffice extension, not a tool for creating documents from scratch.

a method is needed to verify the lexer for each code block from within the LO document

Please give concrete hints on how you think the user will use it and where he'd find it.

sequential paragraphs of the same style are the same code block

Of course. Please try to be less cryptic, I'm tired of guessing what you mean...

jmzambon commented 2 years ago

Can you briefly explain your approach? I'll look at code and xml but I might not pick up subtleties.

No subtleties here: if "automatic" is chosen as the lexer, CH2 will first search for a lexer name associated with the snippet and apply it; otherwise it will ask Pygments to guess one.

In other words, the first time you highlight a snippet, CH2 will store the lexer name with it (this is actually not a tag, but we can think it is). The next time, CH2 will apply the current lexer or, if this is "automatic", will apply the saved one. This way, you may rerun CH2 on mixed language snippets without worrying about erroneous guessing...
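
A minimal sketch of the lookup order described above, not CH2's actual code. The names `resolve_lexer`, `AUTOMATIC` and `fake_guess` are hypothetical; `fake_guess` merely stands in for a real guesser such as Pygments' `guess_lexer()`.

```python
from typing import Callable, Optional

AUTOMATIC = "automatic"  # assumed sentinel for the "automatic" language entry


def resolve_lexer(chosen: str,
                  stored: Optional[str],
                  guess: Callable[[str], str],
                  code: str) -> str:
    """Return the lexer name to apply to one snippet."""
    if chosen != AUTOMATIC:
        # An explicit choice always wins; the caller would then save it
        # with the snippet, overwriting any previous name.
        return chosen
    if stored:
        # "automatic" with a previously saved name: reuse it, no guessing.
        return stored
    # "automatic" and nothing saved yet: fall back to guessing.
    return guess(code)


def fake_guess(code: str) -> str:
    # Hypothetical stand-in for a real guesser (e.g. Pygments' guess_lexer).
    return "Python" if "def " in code else "Text only"


if __name__ == "__main__":
    print(resolve_lexer("automatic", "Java", fake_guess, "class Main {}"))
```

Under these assumptions, guessing only ever happens for a snippet that has never been highlighted with an explicit language.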

Use case - Rerun: snippet code is changed and needs to be highlighted.
1. rerun over a mixed coding language document - by selecting "automatic" - _want it to use any existing snippet language tag **not guess again**_

So this is done.

2. ...

3. select whole snippet with a click - no easy way - agree, it's a hotkey or context-sensitive menu select, app already contains selection code

I need more time to think about it. But why not use a text frame for your snippets? This is the best approach.

4. Yes, already the case, if user selects a language the lexer will use it and overwrite snippet language tag, maybe a `skip` lexer tag will be needed

This is already done by choosing "automatic" language.

6. user probably needs to be able to restrict language (lexers) available

Why?

jmzambon commented 2 years ago

Automatic language lacks integrity and is fairly random. It misses too much (about half), almost all lexers were wrong, and the same code in different files gave different results: HelloWorld.zip HelloWorld.txt

Please see my previous comment.

I would be more interested in tagging snippets with language as the document was developed then select the whole document and update the style based on the snippet language tags.

This would be a totally new feature. I can't keep adding shortcuts and buttons. But maybe I can think about a small API that you'll be able to use as you want. Please open a new issue for a feature "select whole document and update all previously highlighted snippets."

jmzambon commented 2 years ago

Despite CH2 being run with Use character styles in Writer, tags are in <office:automatic-styles>

I'd expect an <office:body> lexer tag at the start of each code block (currently coded as paragraphs, ie lines of code)

This is how the OpenDocument standard works. See the specification if you need more information.

LibreOffice is responsible for writing the document in strict compliance with that standard. Code Highlighter 2 is intended to be a LibreOffice extension, not a tool for creating documents from scratch.

a method is needed to verify the lexer for each code block from within the LO document

Please give concrete hints on how you think the user will use it and where he'd find it.

sequential paragraphs of the same style are the same code block

Of course. Please try to be less cryptic, I'm tired of guessing what you mean...
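
The rule that sequential paragraphs of the same style form one code block can be sketched with `itertools.groupby`. This is only an illustration: the `(style, text)` pairs and the `code_styles` set are hypothetical stand-ins for whatever the document model exposes, and the function name is made up.

```python
from itertools import groupby
from typing import Iterable, List, Set, Tuple


def group_by_style(paragraphs: Iterable[Tuple[str, str]],
                   code_styles: Set[str]) -> List[Tuple[str, List[str]]]:
    """Merge consecutive paragraphs sharing a code style into one block."""
    blocks = []
    for style, run in groupby(paragraphs, key=lambda p: p[0]):
        if style in code_styles:
            # Every paragraph in this run has the same style name.
            blocks.append((style, [text for _, text in run]))
    return blocks


if __name__ == "__main__":
    doc = [
        ("Standard", "Start>"),
        ("P1", "def open_greeting(args=None):"),
        ("P1", '    print("Hello World")'),
        ("Standard", "<End"),
        ("P3", 'PRINT "Hello, World!"'),
    ]
    for style, lines in group_by_style(doc, {"P1", "P3"}):
        print(style, len(lines))
```

One consequence of this rule: two adjacent snippets with the same style and no paragraph between them would be merged into a single block.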

flywire commented 2 years ago

The instructions weren't clear at https://github.com/jmzambon/libreoffice-code-highlighter/issues/7#issuecomment-1189273094, which I suppose is reasonable to test how intuitive an interface is.

The next time, CH2 will apply the current lexer

Got it. Tested automatic language with python, LibreOffice Basic, and java highlighted snippets - fails and it is not clear why:

Edit:

It seems the problem was the java code block was formatted with the wrong lexer (BBC Basic??). If so, why wouldn't it run again?
How can a user fault-find?
Inserted 3rd line into python code a = 5 before running again and it highlighted the whole block correctly

Test code:

Start>
def open_greeting(args=None):
    # Code lines with a maximum length of 80 characters will not wrap over lines
    print("Hello World" + 1 * "!")
<End

BASIC

PRINT "Hello, World!"

Java

public class Main {
  public static void main(String[] args) {
    System.out.println("Hello, World!");
  }
}

flywire commented 2 years ago

I'm wondering if a better approach would be to take the start and end of the snippet selected by the user, adjust it for start/end of paragraph and leading/trailing blank lines, and write a tag with the lexer. That would provide code blocks.
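
A rough sketch of that idea, assuming the document is modelled as plain text with one paragraph per line and the selection as character offsets. `snap_to_paragraphs` and `tag_block` are hypothetical names, and the `tags` dict merely stands in for wherever the lexer name would be stored.

```python
from typing import Dict, Tuple


def snap_to_paragraphs(text: str, start: int, end: int) -> Tuple[int, int]:
    """Return (first, last) paragraph indexes covering text[start:end],
    with blank paragraphs trimmed from both ends."""
    paragraphs = text.split("\n")
    # The paragraph index of an offset is the number of newlines before it.
    first = text.count("\n", 0, start)
    last = text.count("\n", 0, max(start, end - 1))
    # Drop leading and trailing blank paragraphs from the selection.
    while first < last and not paragraphs[first].strip():
        first += 1
    while last > first and not paragraphs[last].strip():
        last -= 1
    return first, last


def tag_block(tags: Dict[int, str], first: int, last: int, lexer: str) -> None:
    """Record the lexer name for every paragraph in the block."""
    for index in range(first, last + 1):
        tags[index] = lexer


if __name__ == "__main__":
    text = "intro\n\ndef f():\n    return 1\n\noutro"
    tags: Dict[int, str] = {}
    first, last = snap_to_paragraphs(text, 6, 30)
    tag_block(tags, first, last, "Python")
    print(first, last, tags)
```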

jmzambon commented 2 years ago

It seems the problem was the java code block was formatted with the wrong lexer (BBC Basic??). If so, why wouldn't it run again?

I tried your test code with no problem. What I did:

highlight each snippet with an explicit lexer (first three entries in the history)
modify some dummy text
select the three snippets and rerun highlighting only once with the "automatic" lexer (next three entries in the history).

[screenshot: flywire_1]

jmzambon commented 2 years ago

I'm wondering if a better approach would be to take the start and end of the snippet selected by the user, adjust it for start/end of paragraph and leading/trailing blank lines, and write a tag with the lexer. That would provide code blocks.

This is exactly what is already implemented.

flywire commented 2 years ago

We are still talking at cross purposes; let's align here.

take the start and end of the snippet selected by the user

The snippet selected by a user can be many lines, a code block, which becomes a paragraph for each line in Writer.

This is exactly what is already implemented.

No, the XML shows the snippet (code block) can consist of many paragraphs, each wrapped in a paragraph style (i.e. lexer style). Previously you suggested the code blocks could be placed in a frame, presumably to associate the lexer with the frame instead of with each paragraph.

def open_greeting(args=None):
    # Code lines with a maximum length of 80 characters will not wrap over lines
    print("Hello World" + 1 * "!")

Occurs as three P1 paragraphs having <style:paragraph-properties ch2_lexer="Python"/>

    <office:automatic-styles>
        <style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
            <style:paragraph-properties ch2_lexer="Python"/>
            <style:text-properties fo:language="zxx" fo:country="none"/>
        </style:style>
        <style:style style:name="P2" style:family="paragraph" style:parent-style-name="Standard">
            <style:paragraph-properties ch2_lexer="BBC Basic"/>
            <style:text-properties fo:language="zxx" fo:country="none"/>
        </style:style>
        <style:style style:name="P3" style:family="paragraph" style:parent-style-name="Standard">
            <style:paragraph-properties ch2_lexer="VB.net"/>
            <style:text-properties fo:font-size="12pt" fo:language="zxx" fo:country="none"/>
        </style:style>
    </office:automatic-styles>
    <office:body>
        <office:text>
            <text:sequence-decls>
                <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
                <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
                <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
                <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
                <text:sequence-decl text:display-outline-level="0" text:name="Figure"/>
            </text:sequence-decls>
            <text:p text:style-name="Standard">Start&gt;</text:p>
            <text:p text:style-name="P1">
                <text:span text:style-name="Code.Keyword">def</text:span>
                <text:span text:style-name="Code.Text"></text:span>
                <text:span text:style-name="Code.Name.Function">open_greeting</text:span>
                <text:span text:style-name="Code.Punctuation">(</text:span>
                <text:span text:style-name="Code.Name">args</text:span>
                <text:span text:style-name="Code.Operator">=</text:span>
                <text:span text:style-name="Code.Keyword.Constant">None</text:span>
                <text:span text:style-name="Code.Punctuation">):</text:span>
            </text:p>
            <text:p text:style-name="P1">
                <text:span text:style-name="Code.Text">
                    <text:s text:c="4"/>
                </text:span>
                <text:span text:style-name="Code.Comment.Single"># Code lines with a maximum length of 80 characters will not wrap over lines</text:span>
            </text:p>
            <text:p text:style-name="P1">
                <text:span text:style-name="Code.Text">
                    <text:s text:c="4"/>
                </text:span>
                <text:span text:style-name="Code.Name.Builtin">print</text:span>
                <text:span text:style-name="Code.Punctuation">(</text:span>
                <text:span text:style-name="Code.Literal.String.Double">&quot;Hello World&quot;</text:span>
                <text:span text:style-name="Code.Text"></text:span>
                <text:span text:style-name="Code.Operator">+</text:span>
                <text:span text:style-name="Code.Text"></text:span>
                <text:span text:style-name="Code.Literal.Number.Integer">1</text:span>
                <text:span text:style-name="Code.Text"></text:span>
                <text:span text:style-name="Code.Operator">*</text:span>
                <text:span text:style-name="Code.Text"></text:span>
                <text:span text:style-name="Code.Literal.String.Double">&quot;!&quot;</text:span>
                <text:span text:style-name="Code.Punctuation">)</text:span>
            </text:p>
            <text:p text:style-name="Standard">&lt;End</text:p>
            <text:p text:style-name="Standard"/>
            <text:p text:style-name="Standard">BASIC</text:p>
            <text:p text:style-name="Standard"/>
            <text:p text:style-name="P3">
                <text:span text:style-name="Code.Name">PRINT</text:span>
                <text:span text:style-name="Code.Text.Whitespace"></text:span>
                <text:span text:style-name="Code.Literal.String">&quot;Hello, World!&quot;</text:span>
            </text:p>
            <text:p text:style-name="Standard"/>
            <text:p text:style-name="Standard"/>
            <text:p text:style-name="Standard">Java</text:p>
            <text:p text:style-name="Standard"/>
            <text:p text:style-name="P2">
                <text:span text:style-name="Code.Name.Variable">public</text:span>
                <text:span text:style-name="Code.Text.Whitespace"></text:span>
                <text:span text:style-name="Code.Name.Variable">class</text:span>
                <text:span text:style-name="Code.Text.Whitespace"></text:span>
                <text:span text:style-name="Code.Name.Variable">Main</text:span>
                <text:span text:style-name="Code.Text.Whitespace"></text:span>
                <text:span text:style-name="Code.Error">{</text:span>
            </text:p>
            <text:p text:style-name="P2">
                <text:span text:style-name="Code.Text.Whitespace">
                    <text:s text:c="2"/>
                </text:span>
                <text:span text:style-name="Code.Name.Variable">public</text:span>
                <text:span text:style-name="Code.Text.Whitespace"></text:span>
                <text:span text:style-name="Code.Name.Variable">static</text:span>
                <text:span text:style-name="Code.Text.Whitespace"></text:span>
                <text:span text:style-name="Code.Name.Variable">void</text:span>
                <text:span text:style-name="Code.Text.Whitespace"></text:span>
                <text:span text:style-name="Code.Name.Variable">main</text:span>
                <text:span text:style-name="Code.Operator">(</text:span>
                <text:span text:style-name="Code.Name.Variable">String</text:span>
                <text:span text:style-name="Code.Error">[]</text:span>
                <text:span text:style-name="Code.Text.Whitespace"></text:span>
                <text:span text:style-name="Code.Name.Variable">args</text:span>
                <text:span text:style-name="Code.Operator">)</text:span>
                <text:span text:style-name="Code.Text.Whitespace"></text:span>
                <text:span text:style-name="Code.Error">{</text:span>
            </text:p>
            <text:p text:style-name="P2">
                <text:span text:style-name="Code.Text.Whitespace">
                    <text:s text:c="4"/>
                </text:span>
                <text:span text:style-name="Code.Name.Variable">System</text:span>
                <text:span text:style-name="Code.Error">.</text:span>
                <text:span text:style-name="Code.Name.Variable">out</text:span>
                <text:span text:style-name="Code.Error">.</text:span>
                <text:span text:style-name="Code.Name.Variable">println</text:span>
                <text:span text:style-name="Code.Operator">(</text:span>
                <text:span text:style-name="Code.Literal.String.Double">&quot;Hello, World!&quot;</text:span>
                <text:span text:style-name="Code.Operator">);</text:span>
            </text:p>
            <text:p text:style-name="P2">
                <text:span text:style-name="Code.Text.Whitespace">
                    <text:s text:c="2"/>
                </text:span>
                <text:span text:style-name="Code.Error">}</text:span>
            </text:p>
            <text:p text:style-name="P2">
                <text:span text:style-name="Code.Error">}</text:span>
            </text:p>
        </office:text>
    </office:body>
</office:document-content>

jmzambon commented 2 years ago

The lexer tag (actually a user-defined paragraph property) is applied by the extension to the whole code block at once. LibreOffice serializes this in the XML content in the way you observed. There is nothing we can change about this, and it doesn't matter: when the code block is selected again, we can retrieve the tag transparently.

I don't think it's possible to add an object that could be translated as an XML tag spanning multiple paragraphs.
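
For fault-finding outside LibreOffice, the tag can also be read back from `content.xml`. The sketch below is an illustration only: it assumes the property is serialized as an unprefixed `ch2_lexer` attribute exactly as in the dump above, and the helper names and the trimmed `SAMPLE` document are made up for the example.

```python
import xml.etree.ElementTree as ET

# Standard OpenDocument namespaces.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

# Trimmed, hypothetical stand-in for a real content.xml.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:automatic-styles>
    <style:style style:name="P1" style:family="paragraph">
      <style:paragraph-properties ch2_lexer="Python"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <office:text>
      <text:p text:style-name="Standard">Start&gt;</text:p>
      <text:p text:style-name="P1"><text:span>def</text:span></text:p>
      <text:p text:style-name="P1"><text:span>print</text:span></text:p>
      <text:p text:style-name="Standard">&lt;End</text:p>
    </office:text>
  </office:body>
</office:document-content>"""


def qname(prefix, name):
    """Build the '{namespace}name' form ElementTree uses for attributes."""
    return "{%s}%s" % (NS[prefix], name)


def lexer_by_style(root):
    """Map automatic style names to the lexer stored on them (assumed attribute)."""
    result = {}
    for style in root.iterfind(".//style:style", NS):
        props = style.find("style:paragraph-properties", NS)
        if props is not None and props.get("ch2_lexer"):
            result[style.get(qname("style", "name"))] = props.get("ch2_lexer")
    return result


def paragraph_lexers(root):
    """Return (lexer or None, text) for each paragraph, in document order."""
    styles = lexer_by_style(root)
    return [(styles.get(p.get(qname("text", "style-name"))), "".join(p.itertext()))
            for p in root.iterfind(".//text:p", NS)]


if __name__ == "__main__":
    for lexer, text in paragraph_lexers(ET.fromstring(SAMPLE)):
        print(lexer, repr(text))
```

Paragraphs whose style carries no lexer come back with `None`, which makes it easy to see which lexer each highlighted paragraph was actually tagged with.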

flywire commented 2 years ago

Tested with user paragraph style Code. No conflicts, works as expected.
Frame snippet test. Frame can be formatted to look identical to other paragraphs in Writer and pdf. Using XpdfReader to select/copy/paste pdf to np++ gives a different result with leading spaces in code between a frame and paragraphs.

                <draw:frame draw:style-name="fr1" draw:name="Frame3" text:anchor-type="paragraph" draw:z-index="2">
                    <draw:text-box fo:min-height="0.499cm" fo:min-width="17cm">
                        <text:p text:style-name="P1">
                            <text:span text:style-name="Code.Keyword">def</text:span>
                            <text:span text:style-name="Code.Text"></text:span>
                            <text:span text:style-name="Code.Name.Function">open_greeting</text:span>
                            <text:span text:style-name="Code.Punctuation">(</text:span>
                            <text:span text:style-name="Code.Name">args</text:span>
                            <text:span text:style-name="Code.Operator">=</text:span>
                            <text:span text:style-name="Code.Keyword.Constant">None</text:span>
                            <text:span text:style-name="Code.Punctuation">):</text:span>
                        </text:p>
                        <text:p text:style-name="P1">
                            <text:span text:style-name="Code.Text">
                                <text:s text:c="4"/>
                            </text:span>
                            <text:span text:style-name="Code.Comment.Single"># Code lines with a maximum length of 80 characters will not wrap over lines</text:span>
                        </text:p>
                        ...
                    </draw:text-box>
                </draw:frame>

jmzambon commented 2 years ago

Frame snippet test. Frame can be formatted to look identical to other paragraphs in Writer and pdf. Using XpdfReader to select/copy/paste pdf to np++ gives a different result with leading spaces in code between a frame and paragraphs.

This is a problem related to how text is stored internally in a PDF file. It has nothing to do with LibreOffice or Code Highlighter 2.

By the way, with Xreader and sublime-text, I see no difference: leading spaces are all removed.

jmzambon / libreoffice-code-highlighter

Tag Snippet with Lexer #7