microsoft / ApplicationInspector

A source code analyzer built for surfacing features of interest and other characteristics to answer the question 'What's in the code?' quickly using static analysis with a json based rules engine. Ideal for scanning components before use or detecting feature level changes.
MIT License

Pattern type for structured data files (yaml, json...) #420

Closed valinha closed 2 years ago

valinha commented 2 years ago

Is your feature request related to a problem? Please describe.

To detect feature usage in structured configuration files (yaml, json, xml...), it would be very useful to be able to search for certain entries regardless of their declaration order in the file.

An example would be:

If we have a config.yaml file and we have the following configuration:

app:
  myfeature:
     prop1: x
     prop2: x
     propN: x
     enabled: true

If I want a rule to detect whether the 'myfeature' functionality is enabled, I have no reliable solution at the moment. Detecting whether app.myfeature.enabled = true with patterns, modifiers or conditions does not cover all cases, e.g. if the declaration order changes or the configuration is written on one line. For example, take a pattern such as:

"pattern": "app:\s+myfeature:\s+enabled: *true"

It would not cover the above example since the order of the enabled property can be changed as it is a structured data type.
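
To make the limitation concrete, here is a minimal Python sketch (it is not Application Inspector code). It runs the regex above against the example YAML text and, for contrast, looks the key up in a parsed form of the same document. The `parsed` dictionary is written out by hand and is an assumption standing in for what a YAML parser would return.

```python
import re

yaml_text = """\
app:
  myfeature:
     prop1: x
     prop2: x
     propN: x
     enabled: true
"""

# The order-dependent pattern from the comment above.
pattern = re.compile(r"app:\s+myfeature:\s+enabled: *true")

# The regex misses because prop1..propN sit between 'myfeature:' and 'enabled:'.
regex_hit = pattern.search(yaml_text) is not None

# Hand-written stand-in for the parsed document (assumption: a YAML parser
# would produce an equivalent nested mapping).
parsed = {
    "app": {
        "myfeature": {"prop1": "x", "prop2": "x", "propN": "x", "enabled": True}
    }
}

# A key lookup does not care where 'enabled' appears among its siblings.
lookup_hit = parsed["app"]["myfeature"]["enabled"] is True

print(regex_hit, lookup_hit)
```

The lookup succeeds wherever `enabled` is declared, which is the behaviour requested here.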

Describe the solution you'd like

The ability to run patterns against structured data types.

Describe alternatives you've considered

gfs commented 2 years ago

Thanks for the suggestion. This is non-trivial but I think it's doable.

Here's a first thought about how we might be able to specify that from a user perspective in the rule object without changing the schema too much.

If path is specified, then if the file can be parsed as structured data (XML, JSON, YML), find all the paths in the document that match the path, and run the pattern against their value.

  1. This doesn't change the operations you can perform on the end value, so string, regex etc would remain available.
  2. It should be possible to do paths with wildcard parents, but I'm not sure about using wildcards in the path itself.
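
As a rough illustration of that idea, the following Python sketch parses a JSON document, follows a dotted path, and runs an ordinary regex against the value it finds. The helper names `values_at` and `rule_matches` are hypothetical, only JSON is handled, and lists and wildcards are ignored. It is a sketch under those assumptions, not the eventual implementation.

```python
import json
import re
from typing import Any, Iterator


def values_at(document: Any, path: str) -> Iterator[Any]:
    """Yield the value found by following a dotted path from the document root."""
    node = document
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return  # the path does not exist in this document
        node = node[key]
    yield node


def rule_matches(text: str, path: str, pattern: str) -> bool:
    """Parse text as JSON, then run the regex against the value at path."""
    try:
        document = json.loads(text)
    except json.JSONDecodeError:
        return False  # not structured data, so the path cannot apply
    # json.dumps renders booleans as 'true'/'false', matching the file's spelling.
    return any(
        re.search(pattern, json.dumps(value)) for value in values_at(document, path)
    )


config = '{"app": {"myfeature": {"prop1": "x", "enabled": true}}}'
print(rule_matches(config, "app.myfeature.enabled", "true"))
```

Because the pattern only ever sees the value at the path, the existing string and regex operations can stay as they are.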

Here's the rule you want above implemented in this schema:

{
    "name": "My Feature",
    "id": "MYID00001",
    "description": "Detects if My Feature is enabled",
    "applies_to": [
      "json"
    ],
    "tags": [
      "Features.MyFeature"
    ],
    "severity": "Moderate",
    "patterns": [
      {
        "confidence": "High",
        "pattern": "true",
        "type": "string"
      }
    ],
    "path": "app.myfeature.enabled"
  }

What do you think of this proposal?

valinha commented 2 years ago

Thanks for considering the functionality. I'm fine with it, although I would like it not to depend on the applies_to field. Would something like this work as well?

{
    "name": "My Feature",
    "id": "MYID00001",
    "description": "Detects if My Feature is enabled",
    "applies_to_file_regex": [
      "application-?.*.yml"
    ],
    "tags": [
      "Features.MyFeature"
    ],
    "severity": "Moderate",
    "patterns": [
      {
        "confidence": "High",
        "pattern": "true",
        "type": "string"
      }
    ],
    "path": "app.myfeature.enabled"
  }
gfs commented 2 years ago

That's fine; it wouldn't change the applies_to logic.

On Thu, Jan 20, 2022 at 10:27 AM, Alberto Valiña Lema @.***> wrote:

Thanks for considering the functionality. I'm fine with it, if it's true that I would like it not to be dependent on the applies_to field, something like that would work as well?

... "applies_to_file_regex": [ "application-?.*.yml". ], ...


valinha commented 2 years ago

Consider being able to check the value of a path, for example:


{
    "name": "My Feature",
    "id": "MYID00001",
    "description": "Detects if My Feature is enabled",
    "applies_to_file_regex": [
      "application-?.*.yml"
    ],
    "tags": [
      "Features.MyFeature"
    ],
    "severity": "Moderate",
    "patterns": [
      {
        "confidence": "High",
        "pattern": "true",
        "type": "string"
      }
     ],
    "paths":  [
      {
       "path": "app.myfeature.enabled",
       "value-pattern": "*true"
      }
    ]
  }

Here the path expression depends on the file type and uses json-path, yaml-path or xpath, for example. The value-pattern could be optional, so that a rule can check either the existence of the path or its value.

In addition, a list of paths could make sense, just like the list of patterns. If any one of them is fulfilled, it is a match.

Does that make sense?

gfs commented 2 years ago

Consider to be able to check the value of a path, example:

{
    "name": "My Feature",
    "id": "MYID00001",
    "description": "Detects if My Feature is enabled",
    "applies_to_file_regex": [
      "application-?.*.yml"
    ],
    "tags": [
      "Features.MyFeature"
    ],
    "severity": "Moderate",
    "patterns": [
      {
        "confidence": "High",
        "pattern": "true",
        "type": "string"
      }
     ],
    "paths":  [
      {
       "path": "app.myfeature.enabled",
       "value-pattern": "* true"
      }
    ]
  }

Where path expression depends of the file type and use json-path, yaml-path or xpath…

I don't understand the distinction between them from a rule creation perspective. It seems to me they could all be represented in the same format app.myfeature.enabled. Is there something I'm missing?

for example and value-pattern could be optional for check the existence or the value of this path.

  1. What is the purpose of the original patterns field you have populated here if you add the "value-pattern" field? My proposal is already using the patterns on the value of the paths.
  2. I don't know what format the value-pattern you provided is but it doesn't look like valid regex.
  3. Can you give an example of a file where a .* pattern wouldn't detect the existence of a path? I'm not sure we need a separate mechanism to check if something exists at all vs has a specific value.

In addition, a list of paths could make sense as well as the list of patterns. If one is fulfilled it will be a match.

I think a list of paths would be fine. Matches will only be found if the patterns match the value of a path that is specified.

gfs commented 2 years ago

Updated proposal. I haven't decided which default value to use for allow-prefixes yet.

{
    "name": "My Feature",
    "id": "MYID00001",
    "description": "Detects if My Feature is enabled",
    "applies_to": [
      "json"
    ],
    "tags": [
      "Features.MyFeature"
    ],
    "severity": "Moderate",
    "patterns": [
      {
        "confidence": "High",
        "pattern": "true",
        "type": "string"
      }
    ],
    "paths": [
        {   
            "pattern": "app.myfeature.enabled",
            // Require the first specified path component to be at the root of the document. For example:
            // app: 
            //     myfeature: 
            //         enabled:
            "allow-prefixes": false
        },
        {
            "pattern": "app.myfeature.enabled",
            // Allows arbitrary prefixes before the first specified path component, for example:
            // parent: 
            //     app: 
            //         myfeature: 
            //             enabled:
            "allow-prefixes": true
        }
    ]
  }
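
The two allow-prefixes settings could behave roughly as in the Python sketch below. The function `find_path` is a hypothetical name, and the semantics shown (retrying the path from every nested mapping when prefixes are allowed) are an assumption drawn from the comments in the proposal. Lists are not handled.

```python
from typing import Any, Iterator, List


def find_path(node: Any, keys: List[str], allow_prefixes: bool) -> Iterator[Any]:
    """Yield values reached by following keys, optionally starting at any depth."""
    # Try to follow the full key sequence starting at this node.
    current = node
    for key in keys:
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            break
    else:
        # The loop finished without a break, so every key was found.
        yield current
    # With allow-prefixes, also retry from every nested mapping.
    if allow_prefixes and isinstance(node, dict):
        for child in node.values():
            yield from find_path(child, keys, allow_prefixes)


rooted = {"app": {"myfeature": {"enabled": True}}}
nested = {"parent": {"app": {"myfeature": {"enabled": True}}}}
keys = "app.myfeature.enabled".split(".")

print(list(find_path(rooted, keys, allow_prefixes=False)))
print(list(find_path(nested, keys, allow_prefixes=False)))
print(list(find_path(nested, keys, allow_prefixes=True)))
```

With allow-prefixes set to false only the rooted document yields a value; with it set to true the nested document does as well.
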
valinha commented 2 years ago

Sorry, I thought you were proposing something excluding the patterns part and that's why I didn't see how to set the value in the expression. I understand now that your idea is to combine paths and patterns.

In my opinion the 'allow-prefixes' option should default to false, so that complete paths are required by default.

I believe this capability will give Application Inspector more power to detect more cases. Great work and great tool.

gfs commented 2 years ago

Sorry, I thought you were proposing something excluding the patterns part and that's why I didn't see how to set the value in the expression. I understand now that your idea is to combine paths and patterns.

Got it. The idea here is that pattern functionality will not change - we are just changing the target on which the patterns are run - if the path field is populated.

In my opinion the 'allow-prefixes' option would default to false in order to set complete paths by default.

That makes sense to me.

I believe this capability will give to application inspector more power to detect more cases. Great work and great tool.

Thanks for the great suggestion. I think this will add a lot of flexibility.

gfs commented 2 years ago

I think the suggestion for xpath compatibility makes sense as well, although that would be restricted to XML only. I wasn't able to find an equivalent query syntax for json or yml.

Reference for later: https://en.wikipedia.org/wiki/XPath

It looks like can be used as a character in a yml tag so will need to use a different separator.

Additionally need to consider how to query lists in json/yml.

Maybe convert json/yml to xml and then run xpath queries on it? That would make it difficult to provide the correct line number for the original file however.

valinha commented 2 years ago

I think the suggestion for xpath compatibility makes sense as well. although that would be restricted to xml only. I wasn't able to find an equivalent query syntax for json or yml.

Reference for later: https://en.wikipedia.org/wiki/XPath

It looks like can be used as a character in a yml tag so will need to use a different separator.

Additionally need to consider how to query lists in json/yml.

Maybe convert json/yml to xml and then run xpath queries on it? That would make it difficult to provide the correct line number for the original file however.

Equivalent to xpath are (I don't know any libraries in C#)

Perhaps it makes sense to do this in different phases and in different tasks? In my opinion having a simple key search (by path) covers many scenarios.

gfs commented 2 years ago

I was able to find a json path library for C#.

https://github.com/danielaparker/JsonCons.Net

On Mon, Jan 24, 2022 at 6:44 AM, Alberto Valiña Lema @.***> wrote:

I think the suggestion for xpath compatibility makes sense as well. although that would be restricted to xml only. I wasn't able to find an equivalent query syntax for json or yml.

Reference for later: https://en.wikipedia.org/wiki/XPath

It looks like can be used as a character in a yml tag so will need to use a different separator.

Additionally need to consider how to query lists in json/yml.

Maybe convert json/yml to xml and then run xpath queries on it? That would make it difficult to provide the correct line number for the original file however.

Equivalent to xpath are (I don't know any libraries in #C)

Perhaps it makes sense to do this in different phases and in different tasks? In my opinion having a simple key search (by path) covers many scenarios.


gfs commented 2 years ago

I have started an implementation of this and realized the path needs to be inside the pattern in case you have a path and condition on different elements.

{
    "name": "My Feature",
    "id": "MYID00001",
    "description": "Detects if My Feature is enabled",
    "applies_to": [
      "json"
    ],
    "tags": [
      "Features.MyFeature"
    ],
    "severity": "Moderate",
    "patterns": [
      {
        "confidence": "High",
        "pattern": "true",
        "type": "string",
        "paths": [
            {
                "pattern": "app.myfeature.enabled",
                // Require the first specified path component to be at the root of the document. For example:
                // app: 
                //     myfeature: 
                //         enabled:
                "allow-prefixes": false
            },
            {
                "pattern": "app.myfeature.enabled",
                // Allows arbitrary prefixes before the first specified path component, for example:
                // parent: 
                //     app: 
                //         myfeature: 
                //             enabled:
                "allow-prefixes": true
            }
        ]
      }
    ]
}
valinha commented 2 years ago

Hi, Any plans to do this? Thank you very much

gfs commented 2 years ago

It has been partially implemented but there have been higher priority items so I have not been able to finish it.

I do plan to add this but at this time I cannot provide a date it will be done by.

valinha commented 2 years ago

Don't worry, I just wanted to know if this was still on. I'm sorry I can't really contribute as I have no knowledge of .NET 😓

jaimebp commented 2 years ago

Hi @gfs any news on this issue? It would be really useful

gfs commented 2 years ago

Thanks for the reminder. I may be able to squeeze this into 1.6 that I'm currently working on.

gfs commented 2 years ago

I revisited this and rediscovered the issue I had hit before. Once the content is parsed as JSON, XML etc., we lose track of where in the file the match is. For example, to extract the value at a specific location in an XML document you can do something like this, but the nodeIter and the elements it iterates do not provide the offset in the original file that they were derived from.

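// Fragment from inside an iterator method; FullContent, Path and DocType are defined outside this fragment.
XPathDocument? xmlDoc;
try
{
    xmlDoc = new XPathDocument(new StringReader(FullContent));
    DocType = StructuredDocType.Xml;
}
catch (Exception)
{
    // The content could not be parsed as XML.
    xmlDoc = null;
}

if (xmlDoc is not null)
{
    var navigator = xmlDoc.CreateNavigator();
    var nodeIter = navigator.Select(Path);
    while (nodeIter.MoveNext())
    {
        if (nodeIter.Current is not null)
        {
            // The value is available, but not its offset in FullContent, hence the null location.
            yield return (nodeIter.Current.Value, null);
        }
    }
}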
gfs commented 2 years ago

I made some progress with an experimental XML implementation which searches the document for the XML node found to get the location. JSON has been less successful. I tried both JsonCons and JsonEverything, but neither provides a Parent element for a secondary search or an index (it seems the index is a private field of the JsonElement in System.Text.Json, but unfortunately there's no way to access it due to its protection level).
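
The "search the document for the node found" approach can be sketched in Python as follows. This is only an illustration of the idea, not the experimental C# implementation: `values_with_offsets` is a hypothetical helper, and it assumes elements without attributes whose text appears verbatim in the source.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def values_with_offsets(content: str, path: str) -> Iterator[Tuple[str, int]]:
    """Yield (value, offset) for each element selected by path.

    The parser does not report source offsets, so each value is located by a
    secondary text search that resumes where the previous hit ended.
    """
    root = ET.fromstring(content)
    cursor = 0
    for element in root.findall(path):
        value = element.text or ""
        needle = f"<{element.tag}>{value}</{element.tag}>"
        offset = content.find(needle, cursor)
        if offset == -1:
            continue  # the markup differs from this naive serialization
        value_offset = offset + len(element.tag) + 2  # skip past '<tag>'
        cursor = offset + len(needle)
        yield value, value_offset


xml_text = (
    "<bookstore>"
    "<book><title>The Gorgias</title></book>"
    "<book><title>The Gorgias</title></book>"
    "</bookstore>"
)
hits = list(values_with_offsets(xml_text, "./book/title"))
for value, offset in hits:
    print(value, offset)
```

Advancing the cursor after each hit keeps duplicate values, such as the repeated title here, from resolving to the same location.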

gfs commented 2 years ago

#489 has an implementation for XML and JSON that I'd be interested in receiving feedback on. I still need to implement this for the substring (non-regex) matching and then I'll merge it.

Once the beta with this functionality is released, it would be helpful if you could find edge cases that don't work.

To use it, add an "xpath" or "jsonpath" field in the SearchPattern portion of a rule.

Here are the samples from the test cases and the data they match.

 private const string jsonRule = @"[
        {
            ""id"": ""SA000005"",
            ""name"": ""Testing.Rules.JSON"",
            ""tags"": [
                ""Testing.Rules.JSON""
            ],
            ""severity"": ""Critical"",
            ""description"": ""This rule finds books from the JSON titled with Sheep."",
            ""patterns"": [
                {
                    ""pattern"": ""Sheep"",
                    ""type"": ""regex"",
                    ""confidence"": ""High"",
                    ""scopes"": [
                        ""code""
                    ],
                    ""jsonpath"" : ""$.books[*].title""
                }
            ],
            ""_comment"": """"
        }
    ]";

        private const string xmlRule = @"[
    {
        ""id"": ""SA000005"",
        ""name"": ""Testing.Rules.XML"",
        ""tags"": [
            ""Testing.Rules.XML""
        ],
        ""severity"": ""Critical"",
        ""description"": ""This rule finds books from the XML titled with Franklin."",
        ""patterns"": [
            {
                ""pattern"": ""Franklin"",
                ""type"": ""regex"",
                ""confidence"": ""High"",
                ""scopes"": [
                    ""code""
                ],
                ""xpath"" : ""/bookstore/book/title""
            }
        ],
        ""_comment"": """"
    }
]";

        private const string jsonData = 
@"{
    ""books"":
    [
        {
            ""category"": ""fiction"",
            ""title"" : ""A Wild Sheep Chase"",
            ""author"" : ""Haruki Murakami"",
            ""price"" : 22.72
        },
        {
            ""category"": ""fiction"",
            ""title"" : ""The Night Watch"",
            ""author"" : ""Sergei Lukyanenko"",
            ""price"" : 23.58
        },
        {
            ""category"": ""fiction"",
            ""title"" : ""The Comedians"",
            ""author"" : ""Graham Greene"",
            ""price"" : 21.99
        },
        {
            ""category"": ""memoir"",
            ""title"" : ""The Night Watch"",
            ""author"" : ""David Atlee Phillips"",
            ""price"" : 260.90
        }
    ]
}
";

        private const string xmlData = 
@"<?xml version=""1.0"" encoding=""utf-8"" ?>   
  <bookstore>  
      <book genre=""autobiography"" publicationdate=""1981-03-22"" ISBN=""1-861003-11-0"">  
          <title>The Autobiography of Benjamin Franklin</title>  
          <author>  
              <first-name>Benjamin</first-name>  
              <last-name>Franklin</last-name>  
          </author>  
          <price>8.99</price>  
      </book>  
      <book genre=""novel"" publicationdate=""1967-11-17"" ISBN=""0-201-63361-2"">  
          <title>The Confidence Man</title>  
          <author>  
              <first-name>Herman</first-name>  
              <last-name>Melville</last-name>  
          </author>  
          <price>11.99</price>  
      </book>  
      <book genre=""philosophy"" publicationdate=""1991-02-15"" ISBN=""1-861001-57-6"">  
          <title>The Gorgias</title>  
          <author>  
              <name>Plato</name>  
          </author>  
          <price>9.99</price>  
      </book>  
  </bookstore>
";

YML is blocked by https://github.com/aaubry/YamlDotNet/issues/333 unless you know of an alternate YAML parser for .NET which supports YamlPath functionality.

gfs commented 2 years ago

I just merged this for XML and JSON, if you can give it a try. The new 1.6-beta with the functionality should be published shortly, and I'd appreciate any feedback before finalizing the interfaces and calling the release stable. If you can provide any samples of XML/JSON + Rule combos that don't work as you expect, that would be very helpful.

@jaimebp @valinha

gfs commented 2 years ago

Rereading the thread, I see there was a request for the paths to be a list. I'll have a revised version with that shortly.

gfs commented 2 years ago

#491 will change the parameters to be arrays.

I did not go with a unified query type. You can instead use standard JsonPath for JSON and standard XPath for XML. I would recommend limiting use of this to files of the appropriate type using applies_to. It will attempt to parse as JSON or XML each file that matches the applies_to (or the regex version) filter; there is no additional hidden filtering. Performing that operation for many files may cause high overhead, although it is only done once for each file.

Sample Rule:

[
    {
        "id": "SA000005",
        "name": "Testing.Rules.JSONandXML",
        "tags": [
            "Testing.Rules.JSON.JSONandXML"
        ],
        "severity": "Critical",
        "description": "This rule finds books titled with Franklin located either at the specified JSONPath in JSON or the specified xpath in XML files.",
        "patterns": [
            {
                "pattern": "Franklin",
                "type": "regex",
                "confidence": "High",
                "scopes": [
                    "code"
                ],
                "jsonpaths" : ["$.books[*].title"],
                "xpaths" : ["/bookstore/book/title"]
            }
        ],
        "_comment": ""
    }
]

This matches these sample files:

{
    "books":
    [
        {
            "category": "fiction",
            "title" : "A Wild Sheep Chase",
            "author" : "Haruki Murakami",
            "price" : 22.72
        },
        {
            "category": "fiction",
            "title" : "The Night Watch",
            "author" : "Sergei Lukyanenko",
            "price" : 23.58
        },
        {
            "category": "fiction",
            "title" : "The Comedians",
            "author" : "Graham Greene",
            "price" : 21.99
        },
        {
            "category": "memoir",
            "title" : "The Night Watch",
            "author" : "David Atlee Phillips",
            "price" : 260.90
        },
        {
            "category": "memoir",
            "title" : "The Autobiography of Benjamin Franklin",
            "author" : "Benjamin Franklin",
            "price" : 123.45
        }
    ]
}

<?xml version="1.0" encoding="utf-8" ?>   
<bookstore>  
    <book genre="autobiography" publicationdate="1981-03-22" ISBN="1-861003-11-0">  
        <title>The Autobiography of Benjamin Franklin</title>  
        <author>  
            <first-name>Benjamin</first-name>  
            <last-name>Franklin</last-name>  
        </author>  
        <price>8.99</price>  
    </book>  
    <book genre="novel" publicationdate="1967-11-17" ISBN="0-201-63361-2">  
        <title>The Confidence Man</title>  
        <author>  
            <first-name>Herman</first-name>  
            <last-name>Melville</last-name>  
        </author>  
        <price>11.99</price>  
    </book>  
    <book genre="philosophy" publicationdate="1991-02-15" ISBN="1-861001-57-6">  
        <title>The Gorgias</title>  
        <author>  
            <name>Plato</name>  
        </author>  
        <price>9.99</price>  
    </book>  
</bookstore>
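
To see what the two path expressions select, here is a small Python approximation using only the standard library. It is not how Application Inspector evaluates rules. The JSONPath `$.books[*].title` is emulated by hand with a list comprehension, and the XPath is expressed in ElementTree's limited subset. The sample data is a shortened version of the files above.

```python
import json
import re
import xml.etree.ElementTree as ET

json_text = """
{"books": [
    {"category": "fiction", "title": "A Wild Sheep Chase"},
    {"category": "memoir", "title": "The Autobiography of Benjamin Franklin"}
]}
"""

xml_text = """
<bookstore>
    <book genre="autobiography"><title>The Autobiography of Benjamin Franklin</title></book>
    <book genre="novel"><title>The Confidence Man</title></book>
</bookstore>
"""

pattern = re.compile("Franklin")

# Hand-written equivalent of the JSONPath $.books[*].title.
json_titles = [book["title"] for book in json.loads(json_text)["books"]]

# './book/title' relative to the root element corresponds to /bookstore/book/title.
xml_root = ET.fromstring(xml_text.strip())
xml_titles = [title.text for title in xml_root.findall("./book/title")]

# The pattern runs only against the selected values, not the whole file.
json_matches = [t for t in json_titles if pattern.search(t)]
xml_matches = [t for t in xml_titles if pattern.search(t)]

print(json_matches)
print(xml_matches)
```

In both files only the Franklin title is reported, since the author fields are outside the selected paths.
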
valinha commented 2 years ago

I just merged this for XML and JSON if you can give it a try, the new 1.6-beta with the functionality should be published shortly and I'd appreciate any feedback before finalizing the interfaces/calling the release stable. If you can provide any samples of XML/JSON + Rule combos that don't work as you expect that would be very helpful.

Of course, I will test the functionality today and give you feedback. Thank you very much.

valinha commented 2 years ago

I just tried xpath and it didn't work.

Rule:

{
    "name": "Source code: Java 17",
    "id": "CODEJAVA000000",
    "description": "Java 17 maven configuration",
    "applies_to": [
      "pom.xml"
    ],
    "tags": [
      "Code.Java.17"
    ],
    "severity": "critical",
    "patterns": [
      {
        "pattern": "17",
        "xpaths" : ["/project/properties/java.version"],
        "type": "regex",
        "scopes": [
          "code"
        ],
        "modifiers": [
          "i"
        ],
        "confidence": "high"
      }
    ]
  }

Xml:


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>xxx</groupId>
  <artifactId>xxx</artifactId>
  <version>0.1.0-SNAPSHOT</version>
  <packaging>pom</packaging>

  <name>${project.groupId}:${project.artifactId}</name>
  <description />

  <properties>
    <java.version>17</java.version>
  </properties>

</project>
gfs commented 2 years ago

Thanks for the feedback. I'll check this today.

gfs commented 2 years ago

I found two issues with your example.

  1. Applies to is for languages - pom.xml is not a language by default so you'd need to provide custom languages. If you are already doing that, then this isn't an issue. You could instead use applies_to_file_regex with pom.xml if you don't want to provide custom languages.
  2. Your xpath expression may not be valid. '.' is a reserved character in xpath expressions. When I use an online checker it doesn't like the period.

I think this alternate way should work for the Xpath expression: /project/properties/*[name(.) = 'java.version']

However, I'm having trouble getting any xpath query to work with your sample xml. Will continue to investigate to see if I can resolve.

gfs commented 2 years ago

After a bit more testing the above modified query does work - however, it only works when I remove the attributes from the root element. It's not clear to me why this is the case yet, or how I can work around it.

So this rule:

{
    "name": "Source code: Java 17",
    "id": "CODEJAVA000000",
    "description": "Java 17 maven configuration",
    "applies_to_file_regex": [
      "pom.xml"
    ],
    "tags": [
      "Code.Java.17"
    ],
    "severity": "critical",
    "patterns": [
      {
        "pattern": "17",
        "xpaths" : ["/project/properties/*[name(.)='java.version']"],
        "type": "regex",
        "scopes": [
          "code"
        ],
        "modifiers": [
          "i"
        ],
        "confidence": "high"
      }
    ]
  }

Matches:

<?xml version="1.0" encoding="UTF-8"?>
<project>
  <modelVersion>4.0.0</modelVersion>

  <groupId>xxx</groupId>
  <artifactId>xxx</artifactId>
  <version>0.1.0-SNAPSHOT</version>
  <packaging>pom</packaging>

  <name>${project.groupId}:${project.artifactId}</name>
  <description />

  <properties>
    <java.version>17</java.version>
  </properties>

</project>

but not

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>xxx</groupId>
  <artifactId>xxx</artifactId>
  <version>0.1.0-SNAPSHOT</version>
  <packaging>pom</packaging>

  <name>${project.groupId}:${project.artifactId}</name>
  <description />

  <properties>
    <java.version>17</java.version>
  </properties>

</project>
gfs commented 2 years ago

I opened a new issue (#497) to cover adding support for xml docs with namespace specified and will continue tracking this issue there.

gfs commented 2 years ago

I can successfully query your sample now with a small modification to the xpath. Because the XML has a namespace, querying with just the local name doesn't work; you need to specify the namespace too. Alternatively, you can match on the local name with the 'local-name' XPath function. See #499 for the modified query.
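
The namespace effect is not specific to .NET and can be reproduced with Python's standard library, as in the sketch below. It uses a trimmed copy of the sample pom.xml. The prefix `m` is an arbitrary choice, and because ElementTree does not support the XPath `local-name()` function, the second option filters on the local part of each tag in Python instead. It is not the query from #499.

```python
import xml.etree.ElementTree as ET

pom = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <properties>
    <java.version>17</java.version>
  </properties>
</project>"""

root = ET.fromstring(pom)

# Unqualified names do not match: every element lives in the POM namespace.
plain = root.find("properties/java.version")

# Option 1: bind a prefix (here 'm', an arbitrary choice) to the namespace.
ns = {"m": "http://maven.apache.org/POM/4.0.0"}
qualified = root.find("m:properties/m:java.version", ns)

# Option 2: compare local names only, ignoring whichever namespace is declared.
by_local_name = [
    element
    for element in root.iter()
    if element.tag.rsplit("}", 1)[-1] == "java.version"
]

print(plain, qualified.text, by_local_name[0].text)
```

The unqualified lookup finds nothing, while both the prefixed query and the local-name filter reach the value 17.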

valinha commented 2 years ago
  1. Applies to is for languages - pom.xml is not a language by default so you'd need to provide custom languages. If you are already doing that, then this isn't an issue. You could instead use applies_to_file_regex with pom.xml if you don't want to provide custom languages.

In all the rules we use I have always considered pom.xml as the language as indicated in the doc and it has always worked fine:

https://github.com/microsoft/ApplicationInspector/wiki/3.4-Applies_to-(languages)#language-support

gfs commented 2 years ago

That is odd. Pom is listed as a language there, but I wasn't seeing it work with applies_to. I'll double check that as well.

gfs commented 2 years ago

Double checked by updating the test rule I created for #499 and you are correct, it also works with pom.xml as applies to.