sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0
20 stars · 13 forks

Scitools' Understand Parse #308

Open RavenMarQ opened 2 months ago

RavenMarQ commented 2 months ago

Purpose

The purpose of this issue is to create functionality similar to both parse_r_dependencies and parse_dependencies, using Scitools' Understand as a third-party tool to generate the list of dependencies in a format Kaiaulu can read. parse_r_dependencies reads the native language R and can analyze files at both the file level and the function level. parse_dependencies, however, only works at the file level, meaning we can't generate anything regarding function dependencies in other languages. Understand would allow us to see this, so the end goal is to mimic the endpoint functionality of parse_r_dependencies with the method used in parse_dependencies.

Process

After familiarizing myself with the documentation and behavior of both Understand and the parse_dependencies function, the first step is to get Understand to generate an XML file for Kaiaulu to read. The next step is to generate a table from the data found in the XML file and transform it into a network. At that point, Kaiaulu can generate a graph from the parsed network.

Tasks

Functions


build_understand_project(project_path, language, output_dir)

Description: build_understand_project will build the Understand project database needed to generate the XML data used by parse_understand_dependencies.

Dependencies: Scitools' Understand - must be installed on the local machine so the bash script can output dependencies using Understand.

Parameter(s):
project_path - Path to the Understand project folder to scan
language - Primary language of the project
output_dir - Directory of the output folder


parse_understand_dependencies(understand_dir, parse_type)

Description: parse_understand_dependencies will create a dependencies XML at either the "file" or "class" level, as specified by parse_type, from the built Understand project folder provided by understand_dir.

Parameter(s):
understand_dir - Directory of the output folder
parse_type - Type of XML dependencies to generate (either "file" or "class")


transform_understand_dependencies_to_network(parsed, weights)

Description: transform_understand_dependencies_to_network will filter the parsed data from parse_understand_dependencies, pruning edges whose dependency kinds do not match any of the requested weights. Afterwards, it recreates the parsed data as a network.

Parameter(s):
parsed - The parsed Understand data from parse_understand_dependencies
weights - The dependency kinds whose edges to keep (can be a vector or a single dependency-kind string)
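
Putting the proposed functions together, a hypothetical end-to-end usage might look like the sketch below. None of these functions exist yet; the names and signatures follow the specification above, and the returned list structure with nodes and edgelist is an assumption mirroring parse_dependencies:

```r
# Hypothetical pipeline sketch: the three understand_* functions are the
# proposed spec, not yet implemented; the igraph step mirrors how Kaiaulu
# plots other parsed networks.
library(igraph)

# 1. Build the Understand project database from the source code
build_understand_project(project_path = "path/to/project",
                         language = "java",
                         output_dir = "path/to/output")

# 2. Export and parse the dependency XML ("file" or "class" level)
parsed <- parse_understand_dependencies(understand_dir = "path/to/output",
                                        parse_type = "class")

# 3. Keep only "Call" and "Use" edges and rebuild them as a network
network <- transform_understand_dependencies_to_network(parsed = parsed,
                                                        weights = c("Call", "Use"))

# 4. Plot with igraph (assumes the network is a list of nodes and edgelist)
g <- igraph::graph_from_data_frame(network[["edgelist"]],
                                   vertices = network[["nodes"]])
plot(g)
```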

carlosparadis commented 2 months ago

@RavenMarQ, excellent task description, well done!

RavenMarQ commented 2 months ago

As I read through the documentation for SciTools Understand, the only APIs that allow us to externally trigger the creation of a file (which, it turns out, is sent out as a CSV file) are for Python, Perl, C, Java, and .NET. However, I found that Understand can be run from the command line. I am unsure whether I am looking in the right direction as I update the issue specification and work on tasks 2 and 3 listed in the issue.

carlosparadis commented 2 months ago

@RavenMarQ

I was looking at some old scripts and I believe what you seek lies within this:

# Analyze By Understand
cd ~/
lang="java"
lang2="C++"

for i in $projectPath/git-repo/*;
do
    echo $i

    udb=$(echo ${i} | sed -e "s|/git-repo|/udb|")
    udb=$udb.udb
    echo $udb

    xml=$(echo ${i} | sed -e "s|/git-repo|/xml|")
    xml=$xml.xml
    echo $xml

    ~/opt/scitools/bin/linux64/und create -db $udb -languages $lang $lang2
    ~/opt/scitools/bin/linux64/und -db $udb add $i
    ~/opt/scitools/bin/linux64/und settings -MetricMetrics all $udb
    ~/opt/scitools/bin/linux64/und settings -MetricFileNameDisplayMode FullPath $udb
    ~/opt/scitools/bin/linux64/und analyze $udb
    ~/opt/scitools/bin/linux64/und export -dependencies file cytoscape $xml $udb
    ~/opt/scitools/bin/linux64/und metrics $udb
done

If you are on a MacBook, the und command can be found under /Applications/Understand.app/Contents/MacOS/und.

You can also use the Scitools GUI to try to generate said file; it may be easier to explore via the GUI first and then try to do it in the terminal using the "und" command.

See: https://documentation.scitools.com/pdf/understand.pdf p.352 talks a bit more about it, but what you want is the configuration mentioned above.

If the bash script does not make sense, let me know. The way this scitools interface works is, you first use und to parse data from a project, which outputs a database file that Scitools will be able to get things out from:

    ~/opt/scitools/bin/linux64/und create -db $udb -languages $lang $lang2
    ~/opt/scitools/bin/linux64/und -db $udb add $i

I recommend you manually replace the bash variables and simply point at an actual project: the $i variable is the path to the source code folder, and $udb the Understand database file created from it. $lang should be replaced by the project language (you can find this on the project GitHub; just use the one the project is mostly written in).

The commands that give dependencies are:

    ~/opt/scitools/bin/linux64/und analyze $udb
    ~/opt/scitools/bin/linux64/und export -dependencies file cytoscape $xml $udb

The metrics command is for something else. You can ignore it for now.

carlosparadis commented 2 months ago

@RavenMarQ

Here's some more context on how to get the files via the GUI. You can use one of the sample projects Scitools offer to learn about it.

Class-level (the one we are interested in):

[Screenshot: class-level dependency export in the Understand GUI]

File-level (we have another tool that does it, but it is still insightful for you to learn about it):

[Screenshot: file-level dependency export in the Understand GUI]

Once you do this, you can save these files to disk and open them with Visual Studio Code. They are XML files. This is the file whose syntax you will want to learn in order to write a parser that converts it to a table. Here is a headstart. The file can be thought of as having a part that defines the nodes (the circles in the network):

   <node id="16109" label="Bitbucket id:16109">
           <att type="string" name="node.shape" value="rect"/>
           <att type="string" name="node.fontSize" value="5"/>
           <att type="string" name="node.label" value="Bitbucket"/>
           <att type="string" name="longName" value="Bitbucket"/>
           <att type="string" name="kind" value="Class"/>
           <graphics type="RECTANGLE" h="35" w="35" x="3680" y="115" fill="#ffffff" width="1" outline="#000000" cy:nodeTransparency="1.0" cy:nodeLabelFont="Default-0-8" cy:borderLineType="solid"/>
   </node>
   <node id="589" label="Repository id:589">
           <att type="string" name="node.shape" value="rect"/>
           <att type="string" name="node.fontSize" value="5"/>
           <att type="string" name="node.label" value="Repository"/>
           <att type="string" name="longName" value="Repository"/>
           <att type="string" name="kind" value="Class"/>
           <graphics type="RECTANGLE" h="35" w="35" x="3450" y="3220" fill="#ffffff" width="1" outline="#000000" cy:nodeTransparency="1.0" cy:nodeLabelFont="Default-0-8" cy:borderLineType="solid"/>
   </node>

And the edges that connect said circles:

    <edge source="16109" target="589" label="Bitbucket(Depends On)Repository">
             <att type="string" name="edge.targetArrowShape" value="ARROW"/>
             <att type="string" name="edge.color" value="#0000FF"/>
             <att type="string" name="canonicalName" value="Bitbucket(Depends On)Repository"/>
             <att type="string" name="interaction" value="Depends On"/>
             <att type="string" name="dependency kind" value="Call, Use"/>
     </edge>

Notice how the edge tag states the ids of the two nodes in the block I selected above. In tabular format, you would have one table for the nodes and another for the edges. The node table could have as columns: id, label. The edge table could have id_from, id_to, dependency_kind. So in the example above, the edge table would have two rows and the node table two rows:

node table:

id     label
16109  Bitbucket
589    Repository

edge table:

id_from  id_to  dependency_kind
16109    589    Call
16109    589    Use
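
As an illustration of that expansion (not Kaiaulu code; the column names follow the example above), the comma-separated dependency kind on a single exported edge can be split into one row per kind with data.table:

```r
library(data.table)

# One exported edge, as read from the cytoscape XML attributes above
edge <- list(id_from = "16109", id_to = "589", dependency_kind = "Call, Use")

# Split "Call, Use" into one dependency kind per row
kinds <- trimws(strsplit(edge[["dependency_kind"]], ",")[[1]])
edge_table <- data.table(
  id_from = edge[["id_from"]],   # recycled across the two rows
  id_to   = edge[["id_to"]],
  dependency_kind = kinds
)

# Matching node table from the two node blocks
node_table <- data.table(
  id    = c("16109", "589"),
  label = c("Bitbucket", "Repository")
)
```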

Also note this is an open source project being analyzed, so you can actually find said class Bitbucket: https://github.com/gitahead/gitahead/blob/master/src/host/Bitbucket.cpp and see that it has a call to the Repository class:

https://github.com/gitahead/gitahead/blob/74398336cb2779c2b3864a4be0ba25d8de5c6a3b/src/host/Bitbucket.cpp#L61

I'd recommend you pick a tiny project like https://github.com/HouariZegai/Calculator and run the steps above against it. Then look at the code side by side to the dependency graph file to get more familiarity with what is being extracted from the source code. This will greatly improve your understanding of what you are doing and substantially make your M2 easier in the other tool.

RavenMarQ commented 1 month ago

For parsing the xml file, I have found an R package called xml2 that would simplify the process of creating the data frame, by letting me read the xml file and extract its attributes. For example, from this xml list of nodes:

   <node id="13" label="Theme.java id:13">
           <att type="string" name="node.shape" value="rect"/>
           <att type="string" name="node.fontSize" value="5"/>
           <att type="string" name="node.label" value="Theme.java"/>
           <att type="string" name="longName" value="F:\GitHub Desktop\Calc\src\main\java\com\houarizegai\calculator\theme\properties\Theme.java"/>
           <graphics type="RECTANGLE" h="35" w="35" x="115" y="0" fill="#ffffff" width="1" outline="#000000" cy:nodeTransparency="1.0" cy:nodeLabelFont="Default-0-8" cy:borderLineType="solid"/>
   </node>
   <node id="45" label="ThemeList.java id:45">
           <att type="string" name="node.shape" value="rect"/>
           <att type="string" name="node.fontSize" value="5"/>
           <att type="string" name="node.label" value="ThemeList.java"/>
           <att type="string" name="longName" value="F:\GitHub Desktop\Calc\src\main\java\com\houarizegai\calculator\theme\properties\ThemeList.java"/>
           <graphics type="RECTANGLE" h="35" w="35" x="0" y="115" fill="#ffffff" width="1" outline="#000000" cy:nodeTransparency="1.0" cy:nodeLabelFont="Default-0-8" cy:borderLineType="solid"/>
   </node>
   <node id="63" label="ThemeLoader.java id:63">
           <att type="string" name="node.shape" value="rect"/>
           <att type="string" name="node.fontSize" value="5"/>
           <att type="string" name="node.label" value="ThemeLoader.java"/>
           <att type="string" name="longName" value="F:\GitHub Desktop\Calc\src\main\java\com\houarizegai\calculator\theme\ThemeLoader.java"/>
           <graphics type="RECTANGLE" h="35" w="35" x="115" y="115" fill="#ffffff" width="1" outline="#000000" cy:nodeTransparency="1.0" cy:nodeLabelFont="Default-0-8" cy:borderLineType="solid"/>
   </node>

I can possibly process it with the code piece:

  library(xml2)

  # Read the XML file using xml2's read_xml
  # Currently file_path is hard-coded
  xml_data <- read_xml(file_path)

  # Extract the node elements using xml2's xml_find_all
  nodes <- xml_find_all(xml_data, ".//node")

  # Iterate through all the nodes using lapply, creating one
  # data frame per node with its id and label attributes
  data_list <- lapply(nodes, function(node) {
    data.frame(
      id = xml_attr(node, "id"),
      label = xml_attr(xml_find_first(node, ".//att[@name='node.label']"), "value"),
      stringsAsFactors = FALSE
    )
  })

If this is a valid direction, since I'm not familiar with R, would it be possible to explain how we could work on using this library? Or should I manually sort through the xml file? If so, could you direct me to functions that would help me do so?

A possible function output in the end is to create a list similar to parse_dependencies with:

graph <- list(nodes=data_list,edgelist=edge_list)

This is why I haven't completed the exact specification: I'm unsure of the direction I'm taking with the function.

carlosparadis commented 1 month ago

In https://github.com/sailuh/kaiaulu/issues/308#issue-2518232771:

understand_parse_dependencies(project_path, language_one, language_two, output_dir)

This language_one and language_two does not make sense. Please discuss with your group why that may be the case.

Description: understand_parse_dependencies will create a dependencies xml from the project specified in project_path analyzing files and dependencies of files using language_one and language_two. Afterwards, a dependencies file will be created in output_dir for later use.

Describe for me what the parse_dependencies function does (which we went over on call). If yours is supposed to have equivalent behavior, this does not seem correct.

I do not see the other function needed as part of the task, which we also discussed on call.

In https://github.com/sailuh/kaiaulu/issues/308#issuecomment-2372526889:

This should be edited into your specification comment, not throughout the issue.

For parsing the xml file, I have found an R package called xml2 that would simplify the process of creating the data frame, by letting me read the xml file and extract its attributes. For example, from this xml list of nodes:

On the call, I said you should use the libraries Kaiaulu already depends on. See the repo DESCRIPTION file. We use the XML library there.

If this is a valid direction, since I'm not familiar with R, would it be possible to explain how we could work on using this library? Or should I manually sort through the xml file? If so, could you direct me to functions that would help me do so?

Code review should be done via PR even if the code is not yet functional; I can't comment in-line on an issue. However, this is not the way forward: you should use the data.table library (also in DESCRIPTION). The native data.frame structure is slow and will not scale with the data processing pipeline.

You should look at the code that already exists for parse_dependencies and go by example.

This is why I haven't completed the exact specification: I'm unsure of the direction I'm taking with the function.

This is why discussing early and often, so we can iterate on checking directions, will be very important to the success of this milestone.

Thanks!

RavenMarQ commented 1 month ago

For the parsing, does it matter to you if I use xmlTreeParse or xmlParse? I am unsure if there's some kind of optimization difference or some other factor (like using data.table instead of data.frame) that I should be aware of.

carlosparadis commented 1 month ago

Not that I am aware of, could you check?

RavenMarQ commented 1 month ago

I have already asked my team members, but I will also post this here:

When attempting to grab out data from the xml file, I always receive NULL or an empty list using the xpathSApply function as so:

library(XML)

understand_parse_dependencies <- function(project_path, language, output_dir = "../tmp/") {
    # Create the variables used in command lines
    project_path <- paste0("\"", project_path, "\"")
    db_dir <- paste0(output_dir, "/Understand.und")
    xml_dir <- paste0(db_dir, "/Dependencies.xml")
    command <- "und"

    # Generate the XML file
    system2(command, c("create", "-db", db_dir, "-languages", language))
    system2(command, c("-db", db_dir, "add", project_path))
    system2(command, c("analyze", db_dir))
    system2(command, c("export", "-dependencies", "file", "cytoscape", xml_dir, db_dir))

    # Parse the XML file
    xml_data <- xmlParse(xml_dir)

    result <- xpathSApply(xml_data, "//graph/node", xmlValue)

    return(result)
}

# Call the function and view the result
result <- understand_parse_dependencies(project_path = "F:/GitHub Desktop/Calc", language = "java")
View(result)

At this point, actually retrieving the data is the issue, as I've managed to create tables out of hard-coded lists. I have tried different combinations and even checked the formatting of the xml file. The xml hierarchy is a graph which contains all the nodes, as shown by Carlos' snippet of the file above. The issue also isn't in xml_data, as xml_data properly points to the file, and printing it yields the file.

I have tried all combinations of changing:

the XPath: //graph/node, //node, and //object/graph/node
the extraction function: xmlValue, xmlToList, and a custom function:

function(node) {
  id <- xmlGetAttr(node, "id")
  label <- xpathSApply(node, ".//att[@name='node.label']", xmlGetAttr, "value")
  longName <- xpathSApply(node, ".//att[@name='longName']", xmlGetAttr, "value")
  data.table(id = id, label = label, longName = longName)
}


The examples I find online handle simple xml files, in which the elements and attributes aren't nested like this. Thus, the question is what I should do to retrieve the data.

carlosparadis commented 1 month ago

You should check how Kaiaulu uses the XML library we depend on to parse files. You should not be trying to access the elements via an xpath string; just access one element of the tree at a time. It is far saner and less assumption-heavy than the xpath.

In conceptual terms: You access one node in the XML tree, then you obtain its children, then you access the next, etc.

RavenMarQ commented 1 month ago

Just to ensure that the output of the tables AFTER calling transform to network functions would be something like (if I filtered for only Call dependencies):

node table:

label
BitBucket
Repository

edge table:

id_from    id_to
BitBucket  Repository

Right? Or if not:

node table:

label
BitBucket
Repository

edge table:

id_from    id_to       dependency_kind
BitBucket  Repository  Call
BitBucket  Repository  Use

Or would I just replace the ids with the labels from the node table, return the same structure, and only need to pass the edge table into the igraph function?

carlosparadis commented 1 month ago

Preserve the original ids and labels as separate columns, just in case the same file name or class name re-appears on the code. i.e. id_from, id_to, id_label_from, id_label_to. Similar to the node table.

Node table should also preserve the full path info on its own column:

           <att type="string" name="longName" value="F:\GitHub Desktop\Calc\src\main\java\com\houarizegai\calculator\theme\properties\Theme.java"/>

Second option, we need to know the dependency_kind. You also want to preserve interaction:

             <att type="string" name="interaction" value="Depends On"/>

Basically we want to preserve as much relevant information as possible. The coordinates or shape scitools use on its graph is not interesting for us, details about the interaction from the source code is.

RavenMarQ commented 1 month ago

Just to be clear, the weights parameter in both transform to network functions filters by the provided weight, correct?

calling transform_und_class_dependencies_to_network(parsed, weights) with weights = "Use" would return the preserved node table with the edge table pruned for only those that contain a "Use" dependency_kind?

carlosparadis commented 1 month ago

The parameter can be a vector, c("Use", "Some Other valid dependency"). You can then throw an error if the user passed a parameter that does not exist. If it exists, then that becomes the weight count in the transform function.

Think what you can do with the data, and what is the easiest way you can allow the user to do that!
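
One way to sketch that validation and pruning (the helper name and column names are hypothetical, matching the edge table discussed above):

```r
library(data.table)

# Hypothetical edge table from the parsed Understand XML
edge_table <- data.table(
  id_from = c("16109", "16109"),
  id_to   = c("589", "589"),
  dependency_kind = c("Call", "Use")
)

# Hypothetical helper: error on unknown kinds, then keep matching edges
filter_edges_by_weights <- function(edge_table, weights) {
  valid_kinds <- unique(edge_table[["dependency_kind"]])
  unknown <- setdiff(weights, valid_kinds)
  if (length(unknown) > 0) {
    stop("Unknown dependency kind(s): ", paste(unknown, collapse = ", "))
  }
  edge_table[dependency_kind %in% weights]
}

filter_edges_by_weights(edge_table, "Use")  # keeps only the "Use" edge
```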

RavenMarQ commented 1 month ago

I found that the four functions I am creating are eerily similar. If I may, I could add a new parameter to the parse function to check whether the user wants "file" or "class" dependencies. This way, I don't have two functions that act the same except for a few characters here and there. The same applies to the transform-to-network functions, since their outputs can be handled in the same way.

carlosparadis commented 1 month ago

Sure!

RavenMarQ commented 1 month ago

For the notebook, would you like me to create a separate one or would you like me to work towards linking it / adding to an existing one?

I am unsure on this front, but I have already discussed with Dao on how she made her notebook.

RavenMarQ commented 1 month ago

Also, we as a team are having difficulties committing on our branches, and are having errors like so:

@github-actions R-CMD-check / R-CMD-check (4.2) (pull_request): Failing after 4m
@github-actions test-coverage / test-coverage (4.2) (pull_request): Failing after 5m

Is there something we should know?

carlosparadis commented 1 month ago

https://github.com/sailuh/kaiaulu/issues/308#issuecomment-2387501147

Make a new one.

https://github.com/sailuh/kaiaulu/issues/308#issuecomment-2387520608

I am not sure I understand, are you saying you can't push the commits, or that you can push them but the GitHub Actions fail?

RavenMarQ commented 1 month ago

#308 (comment)

I am not sure I understand, are you saying you can't push the commits, or that you can push them but the GitHub Actions fail?

The merging jobs fail. I don't know what this means, but I am concerned because it sends me failure emails every time I commit.

Besides that, I am currently in the process of creating the notebook. It is complete; however, I can't seem to get the functions added to src.R to run, even after I updated the NAMESPACE file. What should I do so that the notebook can run on its own without throwing the error:

could not find function "understand_parse_dependencies"

?

carlosparadis commented 1 month ago

The merging jobs fail. I don't know what this means, but I am concerned because it sends me failure emails every time I commit.

Give me a screenshot of the issue where you see it, I am not sure I understand the issue since I am missing the context.

Besides that, I am currently in the process of creating the notebook. It is complete; however, I can't seem to get the functions added to src.R to run, even after I updated the NAMESPACE file. What should I do so that the notebook can run on its own without throwing the error:

I do not see any text in the notebook; unless I am missing something, you still have a ways to go on it. The "could not find function" error likely means you did not build the package.

Also, can you rename the functions and update the specification to parse_understand_dependencies so it is consistent with parse_r_dependencies?

RavenMarQ commented 1 month ago

@beydlern @crepesAlot

This comment is for documentation sake, as discussions have already occurred on Discord.

As it stands, I'll need assistance changing global settings for my working directories, and I suggest a section in the yml like:

#text explaining something about the filepath to understand_showcase.Rmd
understand:
    file_path: path_to

beydlern commented 1 month ago

@RavenMarQ In RStudio, to change the working directory for notebooks (.Rmd) files:

Tools -> Global Options -> R Markdown -> Evaluate chunks in directory: -> Project

crepesAlot commented 1 month ago

@carlosparadis @RavenMarQ The proposed conf file formatting using kaiaulu.yml as an example is:

understand:
  # accepts one language at a time: ada, assembly, c, c++, c#, fortran, java, jovial, delphi, pascal, python, vhdl, visual-basic, javascript
  code_language: java
  # Where the files to analyze should be stored
  input_path: ../../rawdata/kaiaulu/git_repo/understand/
  # Where the output for the understands analysis is stored
  output_path: ../../analysis/kaiaulu/understand/

carlosparadis commented 1 month ago

What does the current Depends config section look like? Since they are both extracting dependencies, it is a good idea to check for consistency.

Can more than one code language be specified in Scitools, @RavenMarQ?

Scitools will export two types of files too, but I guess it is okay for them to just sit in the folder for the time being.

crepesAlot commented 1 month ago

The current Depends config section looks like:

tool:
  # Depends allow to parse file-file static dependencies.
  depends:
    # accepts one language at a time: cpp, java, ruby, python, pom
    # You can obtain this information on OpenHub or the project GiHub page right pane.
    code_language: java
    # Specify which types of Dependencies to keep - see the Depends tool README.md for details.
    keep_dependencies_type:
      - Cast
      - Call
      - Import
      - Return
      - Set
      - Use
      - Implement
      - ImplLink
      - Extend
      - Create
      - Throw
      - Parameter
      - Contain

What Raven was showing me suggested that Understand supports more languages than Depends, but it still only accepts one language at a time, as code_language is a parameter used in one of the function calls.

carlosparadis commented 1 month ago

@crepesAlot Ah, perfect. Do you see how Depends lets the user specify the types to select in the config? We should have the same for the Scitools functions, since @RavenMarQ's functions also let you do that. @RavenMarQ, you need to look at their docs to see which types are possible. Our wiki should also refer to their docs / section / page where users can find the definitions (but don't add that to Kaiaulu since that's their work, not ours --- we are just creating an interface to the tool).

RavenMarQ commented 3 days ago

As I work on this, I just want to clear up the specifications for the notebook while I await responses from SciTools about dependency types and language inputs:

For your viewing, I have uploaded the html of the knitted workbook I have not edited into the shared Google drive.

RavenMarQ commented 3 days ago

I have just heard back from SciTools' support team and received some answers. These are the possible (case-insensitive) string inputs for specifying the project's language in our config file (shorthands and abbreviations like cpp or py do not work):

Languages: Ada, Assembly, Basic, C++, C#, Fortran, Java, Jovial, Pascal, Python, VHDL, Web

They also sent a link to all the possible dependency kinds. Each language has its own dependency kinds, and there are a lot, so I'm thinking of simply linking to the list in the notebook, or possibly in the wiki.

carlosparadis commented 2 days ago

@RavenMarQ please move your findings here so we can find in the future: https://github.com/sailuh/kaiaulu/wiki/Scitools

Thank you for finding this out!

RavenMarQ commented 2 days ago

What should I move over to the wiki page? Should it be a simple description of all the functions (like the specification here) and the specific details of the related configurations that I found out?

carlosparadis commented 2 days ago

Code and docs of functions live on itm0. The wiki has information about Scitools, regardless of Kaiaulu. That way, if a function name changes in the future, we don't need to worry about the wiki going outdated. So it is really the language types and the dependency-types URL (be mindful of copying and pasting anything from their page; that counts as re-distribution). So please reference everything you can to their website.

carlosparadis commented 2 days ago

To be very specific: we just want to make it easier for users who own a Scitools license to use Kaiaulu and to locate the relevant documentation in our wiki; we do not want to copy their content into our wiki. If the information only exists in the e-mail response, I guess it is OK to place it in our wiki (but if you know where it is in their docs, reference that source too, please).

RavenMarQ commented 2 days ago

Wiki updated. I listed all the user-side ways to access the information so that we do not have to update the wiki.