microsoft / infersharp

Infer# is an interprocedural and scalable static code analyzer for C#. Via the capabilities of Facebook's Infer, this tool detects null dereferences, resource leaks, and thread-safety violations. It also performs taint flow tracking to detect critical security vulnerabilities like SQL injections.
MIT License
727 stars 28 forks source link

Questions about infersharp implementation #191

Closed skyleaworlder closed 1 year ago

skyleaworlder commented 1 year ago

Hello!

Thanks for this tool and the PR about language-agnostic layer.

This issue is not going to report some bugs and features. I just wanna get answers to my a few questions, which are mainly about infersharp implementation.

Many thanks!

matjin commented 1 year ago

Hello!

Thanks for this tool and the PR about language-agnostic layer.

This issue is not going to report some bugs and features. I just wanna get answers to my a few questions, which are mainly about infersharp implementation.

  • Why MethodBody can be found everywhere in cfg.json? And what's the logic of it? Perhaps I fail to figure out what MethodBody means? (e.g Why LdargParser generates MethodBody node, while LdcParser doesn't.)

MethodBody type nodes refer to nodes in the Control-Flow-Graph (CFG) which contain instructions representing the body of the method. As for your second question: there is no guarantee that a particular CIL instruction has a 1:1 mapping to a particular SIL instruction. In this case, there is an SIL instruction corresponding to ldarg because ldarg is the pushing of an argument value onto the program stack; in other words, this is the loading of a value of a location on the heap, or an SIL Load instruction applied to the method argument. On the other hand, ldc, or the pushing of a constant onto the stack, doesn't have a corresponding SIL representation because it has no association with locations on the heap.

  • Is Metadata instruction useful? I wanna know the reason infersharp doesn't implement it.

Well, we do use metadata information insofar as we do use complete method signatures (including method name, class name, method arguments) to identify translated methods. From a control flow representation perspective, CIL Metadata doesn't really have much relevance as it does not actually convey information about a program's control flow.

  • The point I mostly concerned is whether infer can run some language-unrelated checkers with data (cfg.json & tenv.json) received from language-agnostic layer interface.

Depending on the checkers you have in mind, you might want to look at Infer's taint analysis or other checkers they have implemented on top of the SIL, to save yourself work.

Many thanks!

skyleaworlder commented 1 year ago

Thanks for your kindly answers, but I still feel loss understanding some concepts.

MethodBody type nodes refer to nodes in the Control-Flow-Graph (CFG) which contain instructions representing the body of the method. As for your second question: there is no guarantee that a particular CIL instruction has a 1:1 mapping to a particular SIL instruction. In this case, there is an SIL instruction corresponding to ldarg because ldarg is the pushing of an argument value onto the program stack; in other words, this is the loading of a value of a location on the heap, or an SIL Load instruction applied to the method argument. On the other hand, ldc, or the pushing of a constant onto the stack, doesn't have a corresponding SIL representation because it has no association with locations on the heap.

Oh, I see. I also noticed the same description in wiki: "The nodes which contain the instructions associated with the body of the method body are mostly identified with the MethodBody type, though nodes containing invocations to other methods are identified as Call nodes."

So does MethodBody mean "a statement in original source code"? Or can I say that MethodBody represents one statement and the very next successors of MethodBody node represent some components of this statement.

Well, we do use metadata information insofar as we do use complete method signatures (including method name, class name, method arguments) to identify translated methods. From a control flow representation perspective, CIL Metadata doesn't really have much relevance as it does not actually convey information about a program's control flow.

Sorry I didn't describe my question clearly. Metadata mentioned here is not "C# Metadata", but a type of instruction of SIL.

matjin commented 1 year ago

Thanks for your kindly answers, but I still feel loss understanding some concepts.

MethodBody type nodes refer to nodes in the Control-Flow-Graph (CFG) which contain instructions representing the body of the method. As for your second question: there is no guarantee that a particular CIL instruction has a 1:1 mapping to a particular SIL instruction. In this case, there is an SIL instruction corresponding to ldarg because ldarg is the pushing of an argument value onto the program stack; in other words, this is the loading of a value of a location on the heap, or an SIL Load instruction applied to the method argument. On the other hand, ldc, or the pushing of a constant onto the stack, doesn't have a corresponding SIL representation because it has no association with locations on the heap.

Oh, I see. I also noticed the same description in wiki: "The nodes which contain the instructions associated with the body of the method body are mostly identified with the MethodBody type, though nodes containing invocations to other methods are identified as Call nodes."

So does MethodBody mean "a statement in original source code"? Or can I say that MethodBody represents one statement and the very next successors of MethodBody node represent some components of this statement.

A MethodBody node stores SIL instructions which are translated from CIL instructions composing the original .NET method. The successor nodes of a MethodBody node contain the succeeding instructions in the control flow -- they are not "components" of the prior instructions.

Well, we do use metadata information insofar as we do use complete method signatures (including method name, class name, method arguments) to identify translated methods. From a control flow representation perspective, CIL Metadata doesn't really have much relevance as it does not actually convey information about a program's control flow.

Sorry I didn't describe my question clearly. Metadata mentioned here is not "C# Metadata", but a type of instruction of SIL.

We haven't translated that because Infer# is focused only on representing the semantic control flow of the program; as the program documents, Metadata is not strictly needed for this purpose. I believe other frontends, like the Java one (see jTrans.ml), similarly does not generate Metadata in its output.

skyleaworlder commented 1 year ago

We haven't translated that because Infer# is focused only on representing the semantic control flow of the program; as the program documents, Metadata is not strictly needed for this purpose. I believe other frontends, like the Java one (see jTrans.ml), similarly does not generate Metadata in its output.

Make sense, Thank you!

But I still have a quick question. If the succeeding nodes of MethodBody represent succeeding instructions, I wonder if the number of MethodBody should be "1" in each method, because there is only one "method body" of each method.

image

The above image is CFG of Examples/Examples/ResourceLeakInterproceduralOK:

        /// <summary>
        /// Interprocedural resource usage example, no leaks expected.
        /// </summary>
        public void ResourceLeakInterproceduralOK(){
            StreamWriter stream = AllocateStreamWriter();
            if (stream == null)
                return;

            try
            {
                stream.WriteLine(12);
            }
            finally
            {
                stream.Close();
            }
        }

Actually, I want to figure out the usage of MethodBody from the beginning, but I found MethodBody seems not representing "the body of method".

matjin commented 1 year ago

We haven't translated that because Infer# is focused only on representing the semantic control flow of the program; as the program documents, Metadata is not strictly needed for this purpose. I believe other frontends, like the Java one (see jTrans.ml), similarly does not generate Metadata in its output.

Make sense, Thank you!

But I still have a quick question. If the succeeding nodes of MethodBody represent succeeding instructions, I wonder if the number of MethodBody should be "1" in each method, because there is only one "method body" of each method.

image

The above image is CFG of Examples/Examples/ResourceLeakInterproceduralOK:

        /// <summary>
        /// Interprocedural resource usage example, no leaks expected.
        /// </summary>
        public void ResourceLeakInterproceduralOK(){
            StreamWriter stream = AllocateStreamWriter();
            if (stream == null)
                return;

            try
            {
                stream.WriteLine(12);
            }
            finally
            {
                stream.Close();
            }
        }

Actually, I want to figure out the usage of MethodBody from the beginning, but I found MethodBody seems not representing "the body of method".

Ah, MethodBody nodes store at least 1 instruction from the body of the method, but not necessarily the entire body of the method. Hence there can be greater than one MethodBody node.

skyleaworlder commented 1 year ago

Ah, MethodBody nodes store at least 1 instruction from the body of the method, but not necessarily the entire body of the method. Hence there can be greater than one MethodBody node.

emm... if so, I want to know how you design for it. I can hardly find the relationship and summarize the rules between CIL instruction and the reason why InstructionParsers of those emit MethodBody.

matjin commented 1 year ago

Essentially, we simulate the CIL program stack (that's the program stack object in C#) as we iterate over each CIL instruction in succession. As we do so, we render the corresponding SIL instructions. Were there particular InstructionParsers/program translations in particular you were thinking about?

skyleaworlder commented 1 year ago

Essentially, we simulate the CIL program stack (that's the program stack object in C#) as we iterate over each CIL instruction in succession. As we do so, we render the corresponding SIL instructions. Were there particular InstructionParsers/program translations in particular you were thinking about?

Yes, I have read all C# files under Cilsil/Cil/Parsers including InstructionParsers. Though I haven't learnt .NET platform structure, I knonw the parsers extend InstructionParsers and process expression simulate a stack (I notice there is stack.Pop and Push).

I guess Expression is not the part I feel confused. After having a look at CFG output, I found there are several MethodBody nodes and I just cannot figure out what MethodBody means and "when and why" infersharp should register a new MethodBody node into CFG.

skyleaworlder commented 1 year ago

I find several examples in infersharp

1. InitObjParser

https://github.com/microsoft/infersharp/blob/main/Cilsil/Cil/Parsers/InitObjParser.cs#L89

var node = AddMethodBdyInstructionsToCfg(
                        state,
                        structInitializationInstructions.ToArray());

Obejct initialization seems a part of statement.

2. Ldarg

                if (ParameterIsByReference(index, state.Method))
                {
                    // Strips out the pointer because the type produced by CreateArg already
                    // carries it.
                    var argAddressType =
                        new Address(Tptr.PtrKind.Pk_pointer,
                                    argType.StripPointer(),
                                    argVar,
                                    referenceKind: Address.ReferenceKind.Parameter);
                    state.PushExpr(argVar, argAddressType);
                }
                else
                {
                    var loadArgument = state.PushAndLoad(argVar, argType);
                    node = AddMethodBodyInstructionsToCfg(state, loadArgument);
                    state.AppendToPreviousNode = true;
                }

If parameters are not passed in as reference, LdargParser will create a MethodBody node. But what's the relationship between loading an argument and method body node.

3. Newarr

https://github.com/microsoft/infersharp/blob/main/Cilsil/Cil/Parsers/NewarrParser.cs#L66

// Represents memory allocation.
                    var callInstr = new Call(returnId: arrayIdentifier,
                                             returnType: arrayTypeWithPtr,
                                             functionExpression: new ConstExpression(
                                                 ProcedureName.BuiltIn__new_array),
                                             args: args,
                                             flags: new Call.CallFlags(),
                                             location: state.CurrentLocation);
                    var newNode = AddMethodBodyInstructionsToCfg(state, callInstr);
                    state.PushExpr(new VarExpression(arrayIdentifier), arrayTypeWithPtr);
                    state.PushInstruction(instruction.Next, newNode);

New an array seems a statement in C#, and once I create an array, I create a related MethodBody. This makes me really confused.

Some other examples like Stind, Stloc, Ret, Ldelem are the same for me.

matjin commented 1 year ago

In general, when the CIL corresponds to an SIL instruction, we create the corresponding SIL instruction which in turn gets added to a MethodBody node. Not all CIL have such a correspondence -- for example, ldc refers to the pushing of a constant onto the CIL program stack but has no direct corresponding operation in SIL because the constant doesn't come from a register. By contrast, ldarg refers to the pushing of an argument's value onto the CIL program stack; this corresponds to the load of the argument register. Similarly, Initobj/Newarr refer to the creation of objects, which correspond to Call instructions in the SIL, hence we create MethodBody nodes to store these instructions.

skyleaworlder commented 1 year ago

In general, when the CIL corresponds to an SIL instruction, we create the corresponding SIL instruction which in turn gets added to a MethodBody node. Not all CIL have such a correspondence -- for example, ldc refers to the pushing of a constant onto the CIL program stack but has no direct corresponding operation in SIL because the constant doesn't come from a register. By contrast, ldarg refers to the pushing of an argument's value onto the CIL program stack; this corresponds to the load of the argument register. Similarly, Initobj/Newarr refer to the creation of objects, which correspond to Call instructions in the SIL, hence we create MethodBody nodes to store these instructions.

Thanks for your patient answer! Make sense here.