uyha / tree-sitter-cmake

A Tree-sitter parser for CMake
MIT License
42 stars 9 forks

Unquoted argument parsing issue #18

Open mahtab-nejati opened 1 year ago

mahtab-nejati commented 1 year ago

Hi,

I've noticed that for unquoted arguments such as in the following code snippet (which is quite a common use case)

single_argument_command(var_name_${varA}_is_compound)

The unquoted text payload ("var_name_" and "_is_compound") is lost.

I've tried unhiding the _unquoted_text but it parses each letter as a node. Another approach is to define some external rules.

Do you think there is a cleaner solution to this?

uyha commented 1 year ago

I'm not sure what you mean by lost here. Could you give a snippet, its parse result (using tree-sitter parse) and the expected parse result?

mahtab-nejati commented 1 year ago

For example, when parsing the following (note that all arguments are valid syntax according to CMake documentation and used in real projects):

include_directories(
  ${outer_${inner_variable}_variable}
  "${BASE_DIR}/sub/directory"
  ${OTHER_DIR}/${SUB_DIR}/included/ir
  )

the output tree looks like this:

<?xml version="1.0" ?>
<tree type="source_file" pos="0" length="128">
    <tree type="normal_command" pos="0" length="127">
        <tree type="identifier" pos="0" length="19" label="include_directories"/>
        <tree type="(" pos="19" length="1" label="("/>
        <tree type="argument_list" pos="20" length="106">
            <tree type="argument" pos="23" length="35">
                <tree type="unquoted_argument" pos="23" length="35">
                    <tree type="variable_ref" pos="23" length="35">
                        <tree type="normal_var" pos="23" length="35">
                            <tree type="$" pos="23" length="1" label="$"/>
                            <tree type="{" pos="24" length="1" label="{"/>
                            <tree type="variable" pos="25" length="32">
                                <tree type="variable_ref" pos="31" length="17">
                                    <tree type="normal_var" pos="31" length="17">
                                        <tree type="$" pos="31" length="1" label="$"/>
                                        <tree type="{" pos="32" length="1" label="{"/>
                                        <tree type="variable" pos="33" length="14" label="inner_variable"/>
                                        <tree type="}" pos="47" length="1" label="}"/>
                                    </tree>
                                </tree>
                            </tree>
                            <tree type="}" pos="57" length="1" label="}"/>
                        </tree>
                    </tree>
                </tree>
            </tree>
            <tree type="argument" pos="61" length="32">
                <tree type="quoted_argument" pos="61" length="32">
                    <tree type="&quot;" pos="61" length="1" label="&quot;"/>
                    <tree type="quoted_element" pos="62" length="30">
                        <tree type="variable_ref" pos="62" length="14">
                            <tree type="normal_var" pos="62" length="14">
                                <tree type="$" pos="62" length="1" label="$"/>
                                <tree type="{" pos="63" length="1" label="{"/>
                                <tree type="variable" pos="64" length="11" label="ZEPHYR_BASE"/>
                                <tree type="}" pos="75" length="1" label="}"/>
                            </tree>
                        </tree>
                        <tree type="$" pos="84" length="1" label="$"/>
                    </tree>
                    <tree type="&quot;" pos="92" length="1" label="&quot;"/>
                </tree>
            </tree>
            <tree type="argument" pos="96" length="27">
                <tree type="unquoted_argument" pos="96" length="27">
                    <tree type="variable_ref" pos="96" length="11">
                        <tree type="normal_var" pos="96" length="11">
                            <tree type="$" pos="96" length="1" label="$"/>
                            <tree type="{" pos="97" length="1" label="{"/>
                            <tree type="variable" pos="98" length="8" label="ARCH_DIR"/>
                            <tree type="}" pos="106" length="1" label="}"/>
                        </tree>
                    </tree>
                    <tree type="variable_ref" pos="108" length="7">
                        <tree type="normal_var" pos="108" length="7">
                            <tree type="$" pos="108" length="1" label="$"/>
                            <tree type="{" pos="109" length="1" label="{"/>
                            <tree type="variable" pos="110" length="4" label="ARCH"/>
                            <tree type="}" pos="114" length="1" label="}"/>
                        </tree>
                    </tree>
                </tree>
            </tree>
        </tree>
        <tree type=")" pos="126" length="1" label=")"/>
    </tree>
</tree>

The problem is that the text immediately concatenated to a variable_ref node, e.g., `outer_` and `_variable` from `${outer_${inner_variable}_variable}`, or `/sub/directory` from `"${BASE_DIR}/sub/directory"`, is lost in the tree, i.e., no node has a payload (label) with these contents.

Another issue with the unquoted argument parsing (according to the documentation) is that the character ' is actually allowed but you have excluded it in your rule. I have seen this used in CMake scripts and the parser runs into an error when encountering this.
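To make the single-quote point concrete, here is a rough sketch in Python of the character set the CMake manual describes for unquoted elements: any character except whitespace or one of `(`, `)`, `#`, `"`, `\`, plus backslash escapes. The regex and the helper name `is_unquoted_argument` are illustrative assumptions, not the grammar's actual rule. The escape handling is deliberately loose (any backslash followed by one character) and legacy unquoted forms are ignored.

```python
import re

# Simplified reading of the CMake manual's unquoted_element rule:
# any character except whitespace or one of ( ) # " \,
# or a backslash followed by one character (looser than the real escape rules).
UNQUOTED_ARGUMENT = re.compile(r'(?:[^\s()#"\\]|\\.)+')


def is_unquoted_argument(text: str) -> bool:
    """Hypothetical helper: True if the whole string fits the simplified rule."""
    return UNQUOTED_ARGUMENT.fullmatch(text) is not None
```

Under this reading a string such as `it's_fine` is accepted, because `'` is not in the excluded set, while a string containing a bare `"` or a space is rejected.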

uyha commented 1 year ago

So you want them to be named nodes instead of anonymous nodes? I remember having difficulties trying to do that, which is why I hide those nodes; it's good enough for me. Could you share why you want that information?

mahtab-nejati commented 1 year ago

Yes, I'd like to have them as named nodes. I am trying to verify the existence of CMake scripts in the code base when an include or add_subdirectory command is invoked. As you can see in the last two arguments, losing this information makes it impossible to find the scripts...

uyha commented 1 year ago

They are not really lost, they are just not explicitly named. You have to do the interpretation yourself. Looking at your output:

<tree type="argument" pos="96" length="27">
  <tree type="unquoted_argument" pos="96" length="27">
    <tree type="variable_ref" pos="96" length="11">
      <tree type="normal_var" pos="96" length="11">
        <tree type="$" pos="96" length="1" label="$"/>
        <tree type="{" pos="97" length="1" label="{"/>
        <tree type="variable" pos="98" length="8" label="ARCH_DIR"/>
        <tree type="}" pos="106" length="1" label="}"/>
      </tree>
    </tree>
    <tree type="variable_ref" pos="108" length="7">
      <tree type="normal_var" pos="108" length="7">
        <tree type="$" pos="108" length="1" label="$"/>
        <tree type="{" pos="109" length="1" label="{"/>
        <tree type="variable" pos="110" length="4" label="ARCH"/>
        <tree type="}" pos="114" length="1" label="}"/>
      </tree>
    </tree>
  </tree>
</tree>

You can collect the content starting at position 96 and spanning 27 characters (the node's length), then process the parts that are covered by the nested nodes; anything not covered by a nested node can be assumed to be normal text. Like I said, I couldn't figure out how to do this easily using just tree-sitter, so I can't help you further.
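A minimal sketch of that interpretation step in Python, assuming you have the source text plus the `pos` and `length` of the argument node and of its direct `variable_ref` children. The function name `literal_fragments` and the sample offsets are hypothetical. Offsets are treated as string indices, which only matches tree-sitter's byte offsets for ASCII input; for other input, slice the encoded bytes instead.

```python
def literal_fragments(source, start, length, child_spans):
    """Return (offset, text) for each stretch of the argument
    that is not covered by any child span.

    child_spans is an iterable of (pos, length) pairs for the
    direct children, e.g. the variable_ref nodes.
    """
    end = start + length
    fragments = []
    cursor = start
    for child_start, child_length in sorted(child_spans):
        # Text between the previous child and this one is plain text.
        if child_start > cursor:
            fragments.append((cursor, source[cursor:child_start]))
        cursor = max(cursor, child_start + child_length)
    # Trailing text after the last child.
    if cursor < end:
        fragments.append((cursor, source[cursor:end]))
    return fragments


# Illustrative input: the third argument from the snippet, parsed on its own.
source = "${OTHER_DIR}/${SUB_DIR}/included/ir"
spans = [(0, 12), (13, 10)]  # the two variable_ref children
fragments = literal_fragments(source, 0, len(source), spans)
```

Here `fragments` holds the `/` between the two references and the trailing `/included/ir`, which together with the resolved variable values is enough to rebuild the path.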

uyha commented 1 year ago

And thank you for reporting the erroneous exclusion of the single quote in unquoted arguments.