joernio / joern

Open-source code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs. Discord https://discord.gg/vv4MH284Hc
https://joern.io/
Apache License 2.0
1.95k stars 263 forks source link

[C#]:Type recovery and data flow completion in CPG #4674

Closed chengkenyong closed 4 days ago

chengkenyong commented 2 months ago

When I use joern to scan C# projects, I encountered several issues:

  1. For nodes of type 'fieldAccess', it cannot determine the type of the node. Additionally, there are occasional missing properties for other types of nodes. 微信截图_20240618135636 微信截图_20240618135942
  2. Joern seems to miss scanning content within Object Initializer (or anonymous method bodies) in C#. 微信截图_20240618143512 微信截图_20240618142953
  3. There are instances where data flow is incomplete in certain nodes. 微信截图_20240614162500 I am particularly interested in understanding the principles behind attribute recovery, node recovery, and data flow completion. I would like to contribute to addressing these issues and hope you can provide me with guidance in this area. I understand you may be busy, but I sincerely hope you can offer me some relevant advice. I am committed to making a meaningful contribution. Thank you.
DavidBakerEffendi commented 2 months ago

Hi @chengkenyong, development of C# frontend has been largely funded by Privado, but they have since scaled it back a bit due to priorities on other languages. The C# frontend relies on two components (1) the Roslyn parser wrapped in a neat binary which can be found here https://github.com/joernio/DotNetAstGen, and (2) the AstCreation in Joern.

For the missing AST features you've described, it could be that they need to be filtered in on the parser. So that change would be in DotNetAstGen. One way to check is to create a JSON AST dump of the file you're interested in, and see if it is present.

If the entity is there, then it probably needs a handler in one of the classes within the astcreation package in csharpsrc2cpg.

Regarding type information, it is not uncommon for fieldAccess nodes to not have type info, especially if the receiver is an object, since this might have multiple types in the case of polymorphism. This may require a bit more of a "dynamic" query that looks at the fieldAccess.argument(1).evalType.baseTypeDeclTransitive step to find all possible types, and then determine which of those types have members matching the canonicalName of the fieldIdentifier.

chengkenyong commented 2 months ago

Hi @DavidBakerEffendi , Your explanation has been very helpful to me, and I will follow your advice to learn more about the relevant topics. Thank you for your assistance.

chengkenyong commented 2 months ago

Hi @DavidBakerEffendi ,Based on your suggestion, I have restored the missing type of the fieldAccess node to the desired state in the first case. However, I am still unclear about how to address the third case of missing data flow. Specifically, I want to restore the data flow in the third case where the variable childSqlMapSource cannot reach sqlMapPath through the foreach loop. Could you please provide me with some more advice on how to achieve this? Thank you!

DavidBakerEffendi commented 2 months ago

Hi @chengkenyong, congrats! So the simplest would be to have a look at how javasrc2cpg tests a similar structure, and to check using something like dotAst/dotCfg/dotDdg to determine what the differences are. My guess is, usually the equivalent data-flow is supported in Java, but that the resulting AST in C# might be a bit off and need adjusting in order to produce the same CFG/DDG that will eventually give you your flow. If your finding isn't obvious, feel free to post the comparison in this thread and I can continue to guide you

chengkenyong commented 1 month ago

Hi @DavidBakerEffendi, I'm sorry for the delay in responding lately as I had some things to handle! Since Java doesn't have the same "foreach" structure as in C#, I compared "for" and "foreach" loops using dotDdg and dotAst in C# as you suggested. 微信截图_20240711151637 微信截图_20240711151708 I found that in dotAst, both loops appear quite similar, dotAst dotAst_2 but in dotDdg, it seems that there's a missing edge in the foreach loop, dotDdg_2 dotDdg which results in a lack of data flow from the identifier 'numbers' to the call 'WriteLine'. 微信截图_20240711153226 微信截图_20240711153326

DavidBakerEffendi commented 1 month ago

@chengkenyong Thanks for the info! This should put me in the right direction. Will have a look into it when I get some more bandwidth (most likely early next week)

DavidBakerEffendi commented 2 weeks ago

Sorry for the delay, back from vacation and picking up a bunch of work that's been waiting around. Will get back on this

chengkenyong commented 2 weeks ago

Hi! @DavidBakerEffend,I would be immensely grateful if you could respond to my message at your convenience. In the meantime, I have studied some of Joern's implementation logic and have done some simple work, including restoring some structure types and data flows. I will submit the relevant modifications once I have organized everything.

chengkenyong commented 2 weeks ago

Hi @DavidBakerEffendi ,It seems that the C# engine hasn't implemented special handling for polymorphism. For example, in the following code snippet, the methodFullName property of the method invocation node for a.Display() is still assigned the value IA.Display:void(). This causes an issue where it cannot correspond to the fullName of the Display method in class B (B.Display:void()), making it impossible to establish the edge from the callInvoke node in a.Display() to the methodDecl node for the Display method in class B. 微信截图_20240816143014

DavidBakerEffendi commented 4 days ago

Hi @chengkenyong, finally got around to it, but this should fix the missing flow in foreach https://github.com/joernio/joern/pull/4877