sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

gitlog_to_hdsmj() number of variables differ from DV8 output + Refactor into R/graph.R and R/network.R #184

Closed leilani-reich closed 1 year ago

leilani-reich commented 1 year ago

Refer to https://github.com/sailuh/kaiaulu/pull/171#issuecomment-1503153951 for info.

For my project bridging the gap, my gitlog_to_hdsmj() outputs a JSON with 162 variables (the "variables" field in the JSON), whereas the JSON output by DV8 has 108 variables. For the project QtNotepad, on the other hand, my gitlog_to_hdsmj() outputs a JSON with 8 variables, whereas the JSON output by DV8 has 10 variables.

gitlog-output-april-9.zip

carlosparadis commented 1 year ago

@leilani-reich As agreed on call, please let me know if you can pinpoint more precisely where the issue is introduced on your smaller project. E.g. if the gitlog table has a mismatch, if the perceval output has a mismatch in json, etc.

leilani-reich commented 1 year ago

Pinpointing filenames not matching in QtNotepad

Hi Carlos, I looked at my QtNotepad project and at where in the timeline its files changed.

Here's what the filenames (variables field in the json) look like at the end of running my gitlog_to_hdsmj() versus the DV8 gitlog hdsm json.

| gitlog_to_hdsmj | DV8 hdsm |
| --- | --- |
| ".gitignore" | ".gitignore" |
| ".gititgnore" | ".gititgnore" |
| "QtNotepad/QtNotepad.pro" | "QtNotepad/QtNotepad.pro" |
| "QtNotepad/main.cpp" | "QtNotepad/main.cpp" |
| "QtNotepad/mainwindow.cpp" | "QtNotepad/mainwindow.cpp" |
| "QtNotepad/mainwindow.h" | "QtNotepad/mainwindow.h" |
| "QtNotepad/mainwindow.ui" | "QtNotepad/mainwindow.ui" |
| "QtNotepad/resources.qrc" | "QtNotepad/resources.qrc" |
| "README.md" | "README.md" |
| - | LICENSE |
| - | LICENSE.md |

So the LICENSE and LICENSE.md files end up missing by the end of my gitlog_to_hdsmj().

How does LICENSE.md go missing? - parse_gitlog()

I found that the LICENSE.md file goes missing in the parse_gitlog() function. In particular, on this line https://github.com/sailuh/kaiaulu/blob/851097eed2462c437535379fd82d71a101ecf771/R/parser.R#L67-L74

we are only getting the columns called data.files.file, data.files.added, and data.files.removed. However, if we print perceval_parsed before this line reassigns it, we see that the file "LICENSE.md" is in a different column called "data.files.newfile".

Screenshot 2023-04-13 at 10 21 24 PM

So since the "newfile" column isn't accounted for in parse_gitlog, this results in the LICENSE.md missing from the parsed_gitlog table.
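To make the effect concrete, here is a minimal sketch in base R. It is not the code in parse_gitlog(): the data frame is a toy stand-in for the per-commit file tables, using the values Perceval reports for the LICENSE commits, and the "fix" shown is only one hypothetical way to pick up the rename target.

```r
# Toy stand-in for the rows Perceval reports for the two LICENSE commits
# (one "A" commit adding LICENSE, one "R100" commit renaming it).
commit_files <- data.frame(
  action  = c("A", "R100"),
  file    = c("LICENSE", "LICENSE"),
  newfile = c(NA, "LICENSE.md"),
  added   = c(24, 0),
  removed = c(0, 0),
  stringsAsFactors = FALSE
)

# Reading only the file column never sees the rename target.
without_newfile <- unique(commit_files$file)

# Hypothetical alternative: also collect the non-missing rename targets.
rename_targets <- commit_files$newfile[!is.na(commit_files$newfile)]
with_newfile <- unique(c(commit_files$file, rename_targets))

print(without_newfile)
print(with_newfile)
```

With only the file column, LICENSE.md never shows up, which matches what I see in the parsed table.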

leilani-reich commented 1 year ago

How does LICENSE go missing? - bipartite_graph_projection()

To see how LICENSE went missing, I next ran the code

gitlog_graph <- transform_gitlog_to_bipartite_network(gitlog_table, mode = "commit-file")
# To check the files
unique(gitlog_graph$edgelist$to)

The LICENSE file was still in the table, so I moved on to the next step.

gitlog_graph_proj <- bipartite_graph_projection(gitlog_graph, mode = FALSE, is_intermediate_projection = FALSE)
# To check the files
unique(c(gitlog_graph_proj[[2]]$from, gitlog_graph_proj[[2]]$to))

When I looked at my gitlog_graph_proj, I noticed the LICENSE file is missing. I took a closer look at the bipartite_graph_projection() function, and this is the line where the LICENSE file seems to disappear.

https://github.com/sailuh/kaiaulu/blob/851097eed2462c437535379fd82d71a101ecf771/R/graph.R#L106-L108

Thanks.

carlosparadis commented 1 year ago

Reply to https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1508154883

Hi,

Thank you so much for narrowing the problem space for me. It seems the fundamental issue here is how Perceval handles file renaming in the git log when no changes are made to the file. Below is the full output of what you provided in the screenshot.

[[1]]
  action added        file          indexes          modes removed
1      A     0 .gititgnore 0000000, e69de29 000000, 100644       0
2      A     1   README.md 0000000, 68722c3 000000, 100644       0

[[2]]
  action added                     file          indexes          modes removed
1      A    34  QtNotepad/QtNotepad.pro 0000000, 76d2668 000000, 100644       0
2      A    11       QtNotepad/main.cpp 0000000, b48f94e 000000, 100644       0
3      A    14 QtNotepad/mainwindow.cpp 0000000, 49d64fc 000000, 100644       0
4      A    22   QtNotepad/mainwindow.h 0000000, a3948a9 000000, 100644       0
5      A    24  QtNotepad/mainwindow.ui 0000000, 6050363 000000, 100644       0
6      A     2  QtNotepad/resources.qrc 0000000, 90f4a83 000000, 100644       0

[[3]]
  action added                     file          indexes          modes removed
1      A     4               .gitignore 0000000, 087848a 000000, 100644       0
2      M     4  QtNotepad/QtNotepad.pro 76d2668, 4b35934 100644, 100644       1
3      M    26 QtNotepad/mainwindow.cpp 49d64fc, 21629b2 100644, 100644       0
4      M     8   QtNotepad/mainwindow.h a3948a9, abc2fcb 100644, 100644       0
5      M   115  QtNotepad/mainwindow.ui 6050363, e49f526 100644, 100644      11
6      M     3  QtNotepad/resources.qrc 90f4a83, c07d98c 100644, 100644       2

[[4]]
  action added                     file          indexes          modes removed
1      M    16 QtNotepad/mainwindow.cpp 21629b2, ac72143 100644, 100644       3
2      M     7   QtNotepad/mainwindow.h abc2fcb, 8a5e0dd 100644, 100644       0

[[5]]
  action added                     file          indexes          modes removed
1      M    12 QtNotepad/mainwindow.cpp ac72143, 2573681 100644, 100644       0
2      M     2   QtNotepad/mainwindow.h 8a5e0dd, 1d962a8 100644, 100644       0

[[6]]
  action added                     file          indexes          modes removed
1      M     5 QtNotepad/mainwindow.cpp 2573681, 0d5eac4 100644, 100644       0
2      M     2   QtNotepad/mainwindow.h 1d962a8, c0b8db7 100644, 100644       0
3      M     1  QtNotepad/mainwindow.ui e49f526, e4132ac 100644, 100644       0

[[7]]
  action added                     file          indexes          modes removed
1      M     5 QtNotepad/mainwindow.cpp 0d5eac4, 443ee09 100644, 100644       0
2      M     2   QtNotepad/mainwindow.h c0b8db7, 07eb6a5 100644, 100644       0

[[8]]
  action added                     file          indexes          modes removed
1      M     5 QtNotepad/mainwindow.cpp 443ee09, 7987c57 100644, 100644       0
2      M     2   QtNotepad/mainwindow.h 07eb6a5, 3340b73 100644, 100644       0

[[9]]
  action added                     file          indexes          modes removed
1      M    15 QtNotepad/mainwindow.cpp 7987c57, c528b0c 100644, 100644       0
2      M     6   QtNotepad/mainwindow.h 3340b73, 4015afc 100644, 100644       0

[[10]]
  action added                     file          indexes          modes removed
1      M     3 QtNotepad/mainwindow.cpp c528b0c, 5e149b9 100644, 100644       3
2      M    11  QtNotepad/mainwindow.ui e4132ac, 1085b2f 100644, 100644      13

[[11]]
  action added       file          indexes          modes removed
1      M     1 .gitignore 087848a, 10e0e7c 100644, 100644       1

[[12]]
  action added        file          indexes          modes removed
1      D     0 .gititgnore e69de29, 0000000 100644, 000000       0

[[13]]
  action added      file          indexes          modes removed
1      M     8 README.md 68722c3, 89a661c 100644, 100644       0

[[14]]
  action added    file          indexes          modes removed
1      A    24 LICENSE 0000000, cf1ab25 000000, 100644       0

[[15]]
  action added    file          indexes          modes    newfile removed
1   R100     0 LICENSE cf1ab25, cf1ab25 100644, 100644 LICENSE.md       0

[[16]]
  action added      file          indexes          modes removed
1      M     1 README.md 89a661c, cc65b57 100644, 100644       0

Moreover, here's the table version of it, mid-way through parse_gitlog():

Screen Shot 2023-04-14 at 2 15 12 AM

Originally I thought newfile would apply to every commit that introduced a new file, but the name is a bit misleading; if that were the case, we would see the column being handled here. The file ends up being missed in the log, likely because no further changes were ever made to LICENSE.md after it was renamed, so it never had a chance to appear under file.

So the question that follows is whether all the files Kaiaulu misses in the other project you tested are just file renames that never had any code change afterward. Would you be able to verify that? In essence, take one of the filepaths that setdiff reported as not included in the other project, and search for it in the repository on GitHub. Check the commits option to see when it was modified. Let me know if this doesn't make sense. You should be able to do all this via your browser with the information you already posted in a prior issue.
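If checking by hand in the browser gets tedious, the same question can be asked of the parsed data. The sketch below is only an illustration: rename_only_targets() is a hypothetical helper, not a Kaiaulu function, and it assumes you have the file and newfile values of every commit as two parallel vectors.

```r
# Hypothetical helper: paths that appear as a rename target (newfile)
# but never afterwards under the regular file column.
rename_only_targets <- function(files, newfiles) {
  targets <- unique(newfiles[!is.na(newfiles)])
  setdiff(targets, unique(files))
}

# Toy input modeled on the QtNotepad log above.
files    <- c("LICENSE", "LICENSE", "README.md")
newfiles <- c(NA, "LICENSE.md", NA)

print(rename_only_targets(files, newfiles))
```

Intersecting that result with the setdiff against DV8 would tell us how many of the missing files this explains.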

This is an annoying problem to deal with, as there is a whole research literature on how to trace files that were originally the same and just got a new name versus treating them as a completely different entity. It seems DV8 does the latter.

carlosparadis commented 1 year ago

Reply to https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1508155948

Ok, I will take a look at this second part of the problem too.

The issue likely lies within get_combinations() not getting the name of the variable from the to field in

https://github.com/sailuh/kaiaulu/blob/851097eed2462c437535379fd82d71a101ecf771/R/graph.R#L61

Edit: The error for part 2 lies in the assumption, made in the associated PR, that the DV8 variables field should be extracted from the edgelist table. It is not a bug in the existing codebase. If the edgelist is used, then isolated nodes will not appear, since they have no edges. This is the case for the files pointed out, LICENSE.md and LICENSE. The code should use the nodes table for variables. This is consistent with DV8 behavior: LICENSE.md and LICENSE only appear in the "variables" field, but not in the "cells" field (which correspond to nodes and edgelist respectively in DV8).
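A small sketch of the difference, using toy tables rather than real Kaiaulu output. The column names (name, from, to, weight) are assumptions made for illustration.

```r
# Toy projected graph: LICENSE and LICENSE.md never co-change with another
# file, so they are nodes without any edge.
nodes <- data.frame(
  name = c("README.md", ".gititgnore", "LICENSE", "LICENSE.md"),
  stringsAsFactors = FALSE
)
edgelist <- data.frame(
  from = ".gititgnore", to = "README.md", weight = 1,
  stringsAsFactors = FALSE
)

# Variables rebuilt from the edgelist lose the isolated nodes.
variables_from_edges <- unique(c(edgelist$from, edgelist$to))
# Variables taken from the nodes table keep them.
variables_from_nodes <- nodes$name

print(setdiff(variables_from_nodes, variables_from_edges))
```

The setdiff is exactly the set of isolated files, which is what goes missing when the edgelist is used.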

carlosparadis commented 1 year ago

Reply to: https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1508154883

This should now be fixed with commit: https://github.com/sailuh/kaiaulu/commit/6d407d629a2646b3bd00d45e483447b1db7b7c01

Could you please check that LICENSE.md now appears? As for both files being missing from your final file, please see my comment here: https://github.com/sailuh/kaiaulu/pull/171/files#r1167230670

If so, then I can close the issue.

leilani-reich commented 1 year ago

Hi Carlos, sorry for the late response. I can see the LICENSE.md file now. Thank you!

carlosparadis commented 1 year ago

@leilani-reich Sounds good, I guess what is left then are the changes to your gitlog_hdsmj function and the other sub-tasks I sent via e-mail so we can see if all the remaining files match. Let me know how it goes, so I can then replace into the code I am testing on the Notebook I am making 👍 .

leilani-reich commented 1 year ago

Hi Carlos, on that note, I am noticing the number of cells and the cochange values produced by my gitlog_to_hdsmj() do not match with DV8's hdsm json (which we get from dv8_gitlog_to_gitnumstat() -> dv8_gitnumstat_to_hdsmb() -> dv8_dsmb_to_dsmj()). There are more cells in the json for the DV8 output.

Here are the first few steps I am using for my gitlog_to_hdsmj().

gitlog_table <- parse_gitlog(perceval_path,git_repo_path,save_path=NA)
gitlog_graph <- transform_gitlog_to_bipartite_network(gitlog_table, mode ="commit-file")
cochange_table <- bipartite_graph_projection(gitlog_graph, mode = FALSE, is_intermediate_projection = FALSE)

cochange_table

Looking at the cochange_table produced by bipartite_graph_projection(), it has 21 rows of data, which gives 21 cells for my hdsm json. Meanwhile, DV8 has 43 cells in the hdsmj.
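For reference, here is a rough base R sketch of what a commit-file to file-file projection does conceptually. It is not the implementation of bipartite_graph_projection(); the commit ids and file names are made up, and each unordered pair is stored once, which is an assumption that may or may not match how DV8 counts cells.

```r
# Toy commit-file edgelist (commit ids are made up for illustration).
commit_file <- data.frame(
  commit = c("c1", "c1", "c2", "c2", "c3"),
  file   = c("a.cpp", "a.h", "a.cpp", "a.h", "b.cpp"),
  stringsAsFactors = FALSE
)

# For each commit, enumerate every unordered pair of files it touched.
pairs_per_commit <- lapply(split(commit_file$file, commit_file$commit), function(files) {
  files <- sort(unique(files))
  if (length(files) < 2) return(NULL)  # a single-file commit yields no pair
  pairs <- t(combn(files, 2))
  data.frame(from = pairs[, 1], to = pairs[, 2], stringsAsFactors = FALSE)
})
all_pairs <- do.call(rbind, Filter(Negate(is.null), pairs_per_commit))

# Count how often each pair co-changed; each resulting row would be one cell.
cochange <- aggregate(weight ~ from + to,
                      data = cbind(all_pairs, weight = 1),
                      FUN = sum)
print(cochange)
```

In this toy case b.cpp produces no row at all, and whether a pair is stored once or in both directions changes the row count, so both are worth checking when comparing against DV8's cells.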

I can try and narrow the problem again. Should this be on a separate issue?

Zip with my hdsmj output file versus DV8. gitlog-check-april16th-github.zip

carlosparadis commented 1 year ago

Reply to https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1510988502

I think it is ok for now to stick to this issue, since the first part is addressed and the second part is related to the overall problem the issue title describes. Here's a question I have before you dive deep into this:

What is the situation with the variables field versus Kaiaulu nodes[["filepath"]]? Does it now match DV8?

Also, how many days do you think you would need for the DV8 Cheatsheet? I think that is also important for your poster / presentation, which is why I want to see how we can shuffle this around if possible.

Lastly, what project are you using again for this hdsmj? Is it the helix project/config?

leilani-reich commented 1 year ago

Hi Carlos, no, the variables do not always match in my gitlog_to_hdsmj() vs DV8's. Sorry, should have said that first. For QTNotepad it does, but for larger projects like helix my number of variables exceeds DV8's. I will look into this.

For the DV8 cheatsheet, maybe 3 days?

I am trying the helix project for the hdsmj. I also tested QTNotepad.

carlosparadis commented 1 year ago

Try to focus on the Cheatsheet for now so we get that out of the way. I have the Notebook mostly done for core functionality. Hopefully this will help you diagnose the issue faster by the time you are done with the cheatsheet.

carlosparadis commented 1 year ago

Hi Leilani,

Let's focus on #165 Cheat Sheet and then this issue for now (and of course your poster/presentation as deadlines demand). I closed your group DV8 integration functions issues and PRs and pushed into the codebase, now that I had the chance to test most of them.

Please create a new pull request against the current codebase for this issue when you are ready. You can work on this issue while simplifying the code for R/graph.R and R/network.R too, between dependencies_to_sdsmj() and gitlog_to_hdsmj().

Expect a few commits from my end to normalize R/dv8.R docs and function parameters, but you can initiate your PR in parallel as we have done so far.

@malialiu and @nicoelee123 can focus on wrapping Milestones 1 and 2 and working Milestone 2 Notebook. I will be unable to take that one from you, so please make sure to account for the level of detail in the documentation and doe QoL we discussed.

Let me know if you have any questions!

Thanks!

leilani-reich commented 1 year ago

Got it.

will be unable to take that one from you, so please make sure to account for the level of detail in the documentation and doe QoL we discussed.

Also, what is "doe QoL"?

carlosparadis commented 1 year ago

Oops! Quality of Life :) Documentation clarity, parameter name clarity, hyperlinking parameters that require other function output, descriptions of the function body that explain the file extensions, or basically anything that keeps the user of the function from losing sleep!

leilani-reich commented 1 year ago

Oh I see now, thanks!

leilani-reich commented 1 year ago

Hi Carlos, I am noticing something strange with parse_gitlog(), which seems to be the reason (or at least one of the reasons) the variable filenames from my hdsm.json don't match DV8's.

Here's what the parse_gitlog() table looks like. The "build/classes/Calculator$10.class" and similar files seem like bugs.

Screenshot 2023-04-21 at 11 17 02 PM

I believe the section in the code this issue stems from is here: https://github.com/sailuh/kaiaulu/blob/57c64b512b3cf7134f9bdb31ca26e2529ed047f8/R/parser.R#L95-L98

In particular, I think it should be replaced by

  perceval_parsed <- perceval_parsed[, .(file=unlist(data.files[[1]]$file[[1]]),
                                         added=unlist(data.files[[1]]$added[[1]]),
                                         removed=unlist(data.files[[1]]$removed[[1]]),
                                         newfile=unlist(data.files[[1]]$newfile[[1]])),, ...

When I do so, the table looks like this:

Screenshot 2023-04-21 at 11 26 08 PM

Here's the hdsm.json from DV8 versus mine from my gitlog_to_hdsmj() for comparison. calc-gitlog-apr21.zip

Edit: Also not sure why the screenshot files are in my parsed_gitlog() table, but those aren't in dv8's hdsm.json.

Also, the project I tried this on was https://github.com/HouariZegai/Calculator

carlosparadis commented 1 year ago

Could you give me the project configuration file you have for this?

Also, you said:

Edit: Also not sure why the screenshot files are in my parsed_gitlog() table, but those aren't in dv8's hdsm.json.

In your computer folder, was the screenshot actually there? Meaning, does that mean DV8 has a built in filter we were unaware of, that is removing the .png from the log?

leilani-reich commented 1 year ago

I'm not sure what you mean by project configuration file. I didn't use any of the conf files. Here's the gitlog I used to get DV8 output if it's of interest: calc-log.txt

Not all of the screenshots are in the directory for the calculator git project, although they all seem to be mentioned in the git log. I think DV8 does have a built in filter of some sort to get rid of .png and probably other images too, based on the DV8 output.

carlosparadis commented 1 year ago

I see, so you just used the parse_gitlog() directly.

If DV8 has some sort of built-in filter, then it stands to reason we should expect file mismatches. The question then is whether we can incrementally add filter conditions to ours to see if we eventually arrive at the same set of files.

My suggestion: Create a project configuration file for this project. On the filter conditions:

https://github.com/sailuh/kaiaulu/blob/57c64b512b3cf7134f9bdb31ca26e2529ed047f8/conf/apr.yml#L81-L91

Let's start without making any assumptions. Add to remove_filepaths_containing only that paths should not contain .png, as DV8 seems to do. Additionally, try adding .class to be removed. You can see an example of how these functions are used after parse_gitlog() in the Git Log Notebook. You can, of course, just apply them directly in memory, but at least this would save us time, since I believe having a config for this project would be helpful for test discussions like this.

Then try to use the setdiff as we discussed before to see where we have more or fewer files than DV8.

I am also not sure why you are suggesting the replacement to:

  perceval_parsed <- perceval_parsed[, .(file=unlist(data.files[[1]]$file[[1]]),
                                         added=unlist(data.files[[1]]$added[[1]]),
                                         removed=unlist(data.files[[1]]$removed[[1]]),
                                         newfile=unlist(data.files[[1]]$newfile[[1]])),, ...

Is it because the .class files no longer appear, or because of your understanding of what it is doing?

A single commit may modify multiple files. If memory serves me right, data.files[[1]] is serving the intent of unlisting the table of files changed in a given commit, since it is a list of only one element. However, the $file, $added, etc, are referring to the table of files modified in that particular commit. Your change suggestion, again if memory serves me right, would only obtain the first file of every commit, instead of all file changes of every commit, which would not make sense to me since we are losing actual data here.

Is my understanding of your proposal incorrect?
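To illustrate the indexing I am describing, here is a minimal base R sketch. The list below is a toy stand-in for one commit's data.files entry (values taken from the fourth commit of the QtNotepad output earlier in this thread); the real column is built by Perceval and jsonlite, so treat the shape as an assumption.

```r
# One commit's entry: a list holding a single table of changed files.
data.files <- list(
  data.frame(
    file    = c("QtNotepad/mainwindow.cpp", "QtNotepad/mainwindow.h"),
    added   = c(16, 7),
    removed = c(3, 0),
    stringsAsFactors = FALSE
  )
)

all_files  <- data.files[[1]]$file       # every file changed in the commit
first_file <- data.files[[1]]$file[[1]]  # only the first file of the commit

print(all_files)
print(first_file)
```

The extra [[1]] after $file keeps one row per commit, which is how file changes would be dropped.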

carlosparadis commented 1 year ago

Here's the commit that adds a bunch of .class files. It is the very first one:

https://github.com/HouariZegai/Calculator/commit/89668bf3f7dfa6336e4ae6432b34c0d7058d2b75

Also here's the config file I made for this:

# -*- yaml -*-
# https://github.com/sailuh/kaiaulu
#
# Copying and distribution of this file, with or without modification,
# are permitted in any medium without royalty provided the copyright
# notice and this notice are preserved.  This file is offered as-is,
# without any warranty.

# Project Configuration File #
#
# To perform analysis on open source projects, you need to manually
# collect some information from the project's website. As there is
# no standardized website format, this file serves to distill
# important data source information so it can be reused by others
# and understood by Kaiaulu.
#
# Please check https://github.com/sailuh/kaiaulu/tree/master/conf to
# see if a project configuration file already exists. Otherwise, we
# would appreciate if you share your curated file with us by sending a
# Pull Request: https://github.com/sailuh/kaiaulu/pulls
#
# Note, you do NOT need to specify this entire file to conduct analysis.
# Each R Notebook uses a different portion of this file. To know what
# information is used, see the project configuration file section at
# the start of each R Notebook.
#
# Please comment unused parameters instead of deleting them for clarity.
# If you have questions, please open a discussion:
# https://github.com/sailuh/kaiaulu/discussions

project:
  #website: https://apr.apache.org/
  #openhub: https://www.openhub.net/p/apache_portable_runtime

version_control:
  # Where is the git log located locally?
  # This is the path to the .git of the project repository you are analyzing.
  # The .git is hidden, so you can see it using `ls -a`
  log: ../../rawdata/git_repo/Calculator/.git
  # From where the git log was downloaded?
  log_url: https://github.com/HouariZegai/Calculator
  # List of branches used for analysis
  branch:
    - master

mailing_list:
  # Where is the mbox located locally?
  #mbox: ../../rawdata/mbox/apr-dev_2012_2019.mbox
  # What is the domain of the chosen mailing list archive?
  #domain: http://mail-archives.apache.org/mod_mbox
  # Which lists of the domain will be used?
  #list_key:
  #  - apr-dev

#issue_tracker:
#  jira:
    # Obtained from the project's JIRA URL
#    domain: https://issues.apache.org/jira
    #project_key: HELIX
    # Download using `download_jira_data.Rmd`
    #issues: ../../rawdata/issue_tracker/helix_issues.json
    #issue_comments: ../../rawdata/issue_tracker/helix_issue_comments.json
  github:
    # Obtained from the project's GitHub URL
    owner: HouariZegai
    repo: Calculator
    # Download using `download_github_comments.Rmd`
    replies: ../../rawdata/github/Calculator/

#vulnerabilities:
  # Folder path with nvd cve feeds (e.g. nvdcve-1.1-2018.json)
  # Download at: https://nvd.nist.gov/vuln/data-feeds
  #nvd_feed: rawdata/nvdfeed

# Commit message CVE or Issue Regular Expression (regex)
# See project's commit message for examples to create the regex
commit_message_id_regex:
  issue_id: \#[0-9]+
  #cve_id: ?

filter:
  keep_filepaths_ending_with:
    - cpp
    - c
    - h
    - java
    - js
    - py
    - cc
  remove_filepaths_containing:
    - .png
  #  - test

# Third Party Tools Configuration #
#
# See Kaiaulu's README.md for details on how to setup these tools.
tool:
  # Depends allow to parse file-file static dependencies.
  depends:
    # accepts one language at a time: cpp, java, ruby, python, pom
    # You can obtain this information on OpenHub or the project GitHub page right pane.
    code_language: cpp
    # Specify which types of Dependencies to keep - see the Depends tool README.md for details.
    keep_dependencies_type:
      - Cast
      - Call
      - Import
      - Return
      - Set
      - Use
      - Implement
      - ImplLink
      - Extend
      - Create
      - Throw
      - Parameter
      - Contain
  dv8:
    # The project folder path to store various intermediate
    # files for DV8 Analysis
    # The folder name will be used in the file names.
    folder_path: ../../analysis/dv8/apr
    # the architectural flaws thresholds that should be used
    architectural_flaws:
      cliqueDepends:
        - call
        - use
      crossingCochange: 2
      crossingFanIn: 4
      crossingFanOut: 4
      mvCochange: 2
      uiCochange: 2
      uihDepends:
        - call
        - use
      uihInheritance:
        - extend
        - implement
        - public
        - private
        - virtual
      uiHistoryImpact: 10
      uiStructImpact: 0.01
  # Uctags allows finer file-file dependency parsing (e.g. functions, classes, structs)
  uctags:
    # See https://github.com/sailuh/kaiaulu/wiki/Universal-Ctags for details
    # What types of file-file dependencies should be considered? If all
    # dependencies are specified, Kaiaulu will use all of them if available.
    keep_lines_type:
      c:
        - f # function definition
      cpp:
        - c # classes
        - f # function definition
      java:
        - c # classes
        - m # methods
      python:
        - c # classes
        - f # functions
      r:
        - f # functions

# Analysis Configuration #
analysis:
  # You can specify the intervals in 2 ways: window, or enumeration
  window:
    # If using gitlog, use start_commit and end_commit. Timestamp is inferred from gitlog
    start_commit: 9eae9e96f15e1f216162810cef4271a439a74223
    end_commit: f8f9ec1f249dd552065aa37c983bed4d4d869bb0
    # Use datetime only if no gitlog is used in the analysis.
    #start_datetime: 2013-05-01 00:00:00
    #end_datetime: 2013-11-01 00:00:00
    size_days: 90
#  enumeration:
     # If using gitlog, specify the commits
#    commit:
#      - 9eae9e96f15e1f216162810cef4271a439a74223
#      - f1d2d568776b3708dd6a3077376e2331f9268b04
#      - c33a2ce74c84f0d435bfa2dd8953d132ebf7a77a
     # Use datetime only if no gitlog is used in the analysis. Timestamp is inferred from gitlog
#    datetime:
#      - 2013-05-01 00:00:00
#      - 2013-08-01 00:00:00
#      - 2013-11-01 00:00:00

In gitlog_showcase.Rmd I am just executing up to, but not including, the filter functions, and experimenting with them now.

carlosparadis commented 1 year ago

Before filter, I get 302 rows total. I also see the commit "Adding screenshots of app" that adds the .png. After filtering for .png using the config file above, I now get 291 rows in my parsed table.

I then added a .class to the filter:

  remove_filepaths_containing:
    - .png
    - .class

The Notebook code is now re-run up to just before the filter, and then the filter is applied with the two conditions. Before filter, I get 302. After filter with the two conditions, I get 117.

How many files did you get through DV8?
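One thing to keep in mind while adding conditions: how a pattern like .class is interpreted changes what gets removed. The sketch below is a hypothetical helper in base R, not Kaiaulu's filter implementation, and I have not confirmed here which interpretation Kaiaulu uses. It compares treating each entry as a case-sensitive regular expression versus a literal substring.

```r
# Hypothetical helper, not Kaiaulu's implementation:
# drop every path that matches at least one of the patterns.
remove_paths_containing <- function(paths, patterns, fixed = FALSE) {
  hits <- lapply(patterns, function(p) grepl(p, paths, fixed = fixed))
  drop <- Reduce(`|`, hits, rep(FALSE, length(paths)))
  paths[!drop]
}

paths <- c("src/com/houarizegai/calculator/Calculator.java",
           "build/classes/Calculator$10.class",
           ".classpath",
           "build/classes/.netbeans_automatic_build",
           "screenshots/v1.2/colored.PNG")

# Patterns read as regular expressions: "." matches any character,
# so ".class" also matches "/class" inside "build/classes/".
as_regex <- remove_paths_containing(paths, c(".png", ".class"))

# Patterns read as literal substrings.
as_literal <- remove_paths_containing(paths, c(".png", ".class"), fixed = TRUE)

print(as_regex)
print(as_literal)
```

In both readings the match is case-sensitive, so an uppercase .PNG survives a .png condition, and .classpath is removed by .class either way.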

carlosparadis commented 1 year ago

(Going to keep going on the commentary while I do this in parallel so you also have the thought process that is easier for me to reason through this for the subsequent inconsistencies).

I see the DV8 json "variables" field is actually in a very nice format for me to just paste into R. Typing c( <paste_variables_sequenceof"strings"> ) and assigning it to a vector in R saves me from dealing with jsonlite, for variables at least.

I get 37 files from DV8, and 45 files from Kaiaulu with the .png and .class filters. Here are the DV8 variables field and the unique files that occur in the commit changes.

> DV8_variables
 [1] "src/main/java/com/houarizegai/calculator/ui/CalculatorUI.java"           
 [2] "pom.xml"                                                                 
 [3] "src/main/java/com/houarizegai/calculator/App.java"                       
 [4] "src/main/java/com/houarizegai/calculator/Calculator.java"                
 [5] "src/main/java/com/houarizegai/calculator/Theme.java"                     
 [6] "src/main/java/com/houarizegai/calculator/theme/ThemeLoader.java"         
 [7] "src/main/java/com/houarizegai/calculator/theme/properties/Theme.java"    
 [8] "src/main/java/com/houarizegai/calculator/theme/properties/ThemeList.java"
 [9] "src/main/java/com/houarizegai/calculator/util/ColorUtil.java"            
[10] "src/main/resources/application.yaml"                                     
[11] "src/test/java/com/houarizegai/calculator/CalculatorTest.java"            
[12] "src/test/java/com/houarizegai/calculator/CalculatorUITest.java"          
[13] "README.md"                                                               
[14] ".classpath"                                                              
[15] ".project"                                                                
[16] ".settings/org.eclipse.jdt.apt.core.prefs"                                
[17] ".settings/org.eclipse.jdt.core.prefs"                                    
[18] ".settings/org.eclipse.m2e.core.prefs"                                    
[19] ".gitignore"                                                              
[20] "src/com/houarizegai/calculator/Calculator.java"                          
[21] "src/com/houarizegai/calculator/CalculatorTest.java"                      
[22] ".idea/misc.xml"                                                          
[23] ".idea/modules.xml"                                                       
[24] ".idea/vcs.xml"                                                           
[25] ".idea/workspace.xml"                                                     
[26] "Calculator.iml"                                                          
[27] "build.xml"                                                               
[28] "build/classes/.netbeans_automatic_build"                                 
[29] "build/classes/.netbeans_update_resources"                                
[30] "build/classes/com/houarizegai/calculator/Calculator.rs"                  
[31] "manifest.mf"                                                             
[32] "nbproject/build-impl.xml"                                                
[33] "nbproject/genfiles.properties"                                           
[34] "nbproject/private/private.properties"                                    
[35] "nbproject/project.properties"                                            
[36] "nbproject/project.xml"                                                   
[37] "LICENSE" 
> unique(project_git$file_pathname)
 [1] "build.xml"                                                               
 [2] "manifest.mf"                                                             
 [3] "nbproject/build-impl.xml"                                                
 [4] "nbproject/genfiles.properties"                                           
 [5] "nbproject/private/private.properties"                                    
 [6] "nbproject/project.properties"                                            
 [7] "nbproject/project.xml"                                                   
 [8] "src/com/houarizegai/calculator/Calculator.java"                          
 [9] "README.md"                                                               
[10] "LICENSE"                                                                 
[11] ".idea/misc.xml"                                                          
[12] ".idea/modules.xml"                                                       
[13] ".idea/vcs.xml"                                                           
[14] ".idea/workspace.xml"                                                     
[15] "Calculator.iml"                                                          
[16] ".gitignore"                                                              
[17] "screenshots/colored_calculator.jpg"                                      
[18] "screenshots/sample_calculator.jpg"                                       
[19] "screenshots/colored_calculator_v1.0.jpg"                                 
[20] "screenshots/sample_calculator_v1.0.jpg"                                  
[21] "screenshots/v1.0/colored.jpg"                                            
[22] "screenshots/v1.0/sample.jpg"                                             
[23] "screenshots/v1.2/colored.PNG"                                            
[24] "screenshots/v1.2/sample.PNG"                                             
[25] "pom.xml"                                                                 
[26] "src//com/houarizegai/calculator/Calculator.java"                         
[27] "src/main/java/com/houarizegai/calculator/Calculator.java"                
[28] ".project"                                                                
[29] "src/com/houarizegai/calculator/CalculatorTest.java"                      
[30] "src/test/java/com/houarizegai/calculator/CalculatorTest.java"            
[31] "screenshots/colored.PNG"                                                 
[32] "screenshots/simple.PNG"                                                  
[33] ".settings/org.eclipse.jdt.apt.core.prefs"                                
[34] ".settings/org.eclipse.jdt.core.prefs"                                    
[35] ".settings/org.eclipse.m2e.core.prefs"                                    
[36] "screenshots/dark.PNG"                                                    
[37] "src/main/java/com/houarizegai/calculator/Theme.java"                     
[38] "src/main/java/com/houarizegai/calculator/App.java"                       
[39] "src/main/java/com/houarizegai/calculator/theme/ThemeLoader.java"         
[40] "src/main/java/com/houarizegai/calculator/theme/properties/Theme.java"    
[41] "src/main/java/com/houarizegai/calculator/theme/properties/ThemeList.java"
[42] "src/main/java/com/houarizegai/calculator/ui/CalculatorUI.java"           
[43] "src/main/java/com/houarizegai/calculator/util/ColorUtil.java"            
[44] "src/main/resources/application.yaml"                                     
[45] "src/test/java/com/houarizegai/calculator/CalculatorUITest.java" 
leilani-reich commented 1 year ago

Yes, in the variables field in my DV8 hdsm.json, I also get 37 filenames.

carlosparadis commented 1 year ago

Then we do the setdiff both ways:

> setdiff(DV8_variables,unique(project_git$file_pathname))
[1] ".classpath"                                             "build/classes/.netbeans_automatic_build"               
[3] "build/classes/.netbeans_update_resources"               "build/classes/com/houarizegai/calculator/Calculator.rs"
> setdiff(unique(project_git$file_pathname),DV8_variables)
 [1] "screenshots/colored_calculator.jpg"              "screenshots/sample_calculator.jpg"              
 [3] "screenshots/colored_calculator_v1.0.jpg"         "screenshots/sample_calculator_v1.0.jpg"         
 [5] "screenshots/v1.0/colored.jpg"                    "screenshots/v1.0/sample.jpg"                    
 [7] "screenshots/v1.2/colored.PNG"                    "screenshots/v1.2/sample.PNG"                    
 [9] "src//com/houarizegai/calculator/Calculator.java" "screenshots/colored.PNG"                        
[11] "screenshots/simple.PNG"                          "screenshots/dark.PNG"     
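
Because `setdiff()` is not symmetric, both directions are needed to see what each tool has that the other lacks. A minimal sketch with shortened, hypothetical vectors (the file names come from the outputs above, but these are not the real lists):

```r
# Hypothetical, shortened file lists; names borrowed from the outputs above.
dv8_files <- c("README.md", ".classpath", "pom.xml")
kaiaulu_files <- c("README.md", "pom.xml", "screenshots/dark.PNG")

# Files DV8 reports that Kaiaulu does not.
only_in_dv8 <- setdiff(dv8_files, kaiaulu_files)

# Files Kaiaulu reports that DV8 does not.
only_in_kaiaulu <- setdiff(kaiaulu_files, dv8_files)

print(only_in_dv8)
print(only_in_kaiaulu)
```
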
carlosparadis commented 1 year ago

So what Kaiaulu is getting beyond DV8 seems to be basically... images! The reason our filter didn't capture them is that they are either .jpg or .PNG. So we go back to the config file again and add those.

remove_filepaths_containing:
    - .png
    - .class
    - .PNG
    - .jpg

Then we re-run the notebook again up to the filter part:

setdiff(DV8_variables,unique(project_git$file_pathname))
[1] ".classpath"                                             "build/classes/.netbeans_automatic_build"               
[3] "build/classes/.netbeans_update_resources"               "build/classes/com/houarizegai/calculator/Calculator.rs"
> 
> setdiff(unique(project_git$file_pathname),DV8_variables)
[1] "src//com/houarizegai/calculator/Calculator.java"
> 
carlosparadis commented 1 year ago

I went to your dv8 file and I see that DV8 is representing the path correctly:

src/com/houarizegai/calculator/Calculator.java

So this second block seems to be due to that. Now this may be an actual bug in parse_gitlog() or in Perceval. We would need to check the Perceval output directly, since I don't see how Kaiaulu would introduce that to none but this one file. Did you generate that .json too?

leilani-reich commented 1 year ago

Looking at the json output directly from the Perceval command, the 26th json object (index 25) contains the "src//com/houarizegai/calculator/Calculator.java" path.

perceval-apr19.json.zip

carlosparadis commented 1 year ago

So it seems Perceval introduces that instead of Kaiaulu then. But either way, it is fine: Ultimately even if we call one file different than DV8, the dependencies should still work as intended since we would name every instance of it with //.

As for this:

setdiff(DV8_variables,unique(project_git$file_pathname))
[1] ".classpath"                                             "build/classes/.netbeans_automatic_build"               
[3] "build/classes/.netbeans_update_resources"               "build/classes/com/houarizegai/calculator/Calculator.rs"

It makes sense that we would filter more than DV8, since the filter used was an "anywhere that class appears should be eliminated" rule (it seems that, for the stringi function doing the filtering under the hood, typing a leading dot does not restrict the match to the extension). However, this is still okay, since normally we would not even worry about these files.
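
One possible explanation for the over-matching, sketched with base R `grepl()` as a stand-in (whether Kaiaulu's stringi-based filter treats the pattern as a regular expression is an assumption here, not something confirmed in this thread): in a regex, `.` matches any character, so `.class` also matches `/class`.

```r
# Paths taken from the setdiff output above, plus one ordinary source file.
paths <- c(".classpath",
           "build/classes/.netbeans_automatic_build",
           "build/classes/com/houarizegai/calculator/Calculator.rs",
           "src/main/java/com/houarizegai/calculator/App.java")

# As a regular expression, "." matches any character, so ".class"
# also matches "/class" inside "build/classes/".
regex_hit <- grepl(".class", paths)

# As a literal string, only paths containing ".class" verbatim match.
literal_hit <- grepl(".class", paths, fixed = TRUE)

# Anchored at the end: only paths whose name ends in ".class".
suffix_hit <- endsWith(paths, ".class")

print(data.frame(paths, regex_hit, literal_hit, suffix_hit))
```

Under the regex reading, the first three paths are removed, which is consistent with the files missing from Kaiaulu's side above.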

Throughout all of this exchange I omitted the first filter:

filter:
  keep_filepaths_ending_with:
    - cpp
    - c
    - h
    - java
    - js
    - py
    - cc

Note this would auto eliminate both .class and the images too. So let's try now running with the usual config that Kaiaulu uses and see what we get:

setdiff(DV8_variables,unique(project_git$file_pathname))
 [1] "pom.xml"                                                        "src/main/resources/application.yaml"                           
 [3] "src/test/java/com/houarizegai/calculator/CalculatorTest.java"   "src/test/java/com/houarizegai/calculator/CalculatorUITest.java"
 [5] "README.md"                                                      ".classpath"                                                    
 [7] ".project"                                                       ".settings/org.eclipse.jdt.apt.core.prefs"                      
 [9] ".settings/org.eclipse.jdt.core.prefs"                           ".settings/org.eclipse.m2e.core.prefs"                          
[11] ".gitignore"                                                     ".idea/misc.xml"                                                
[13] ".idea/modules.xml"                                              ".idea/vcs.xml"                                                 
[15] ".idea/workspace.xml"                                            "Calculator.iml"                                                
[17] "build.xml"                                                      "build/classes/.netbeans_automatic_build"                       
[19] "build/classes/.netbeans_update_resources"                       "build/classes/com/houarizegai/calculator/Calculator.rs"        
[21] "manifest.mf"                                                    "nbproject/build-impl.xml"                                      
[23] "nbproject/genfiles.properties"                                  "nbproject/private/private.properties"                          
[25] "nbproject/project.properties"                                   "nbproject/project.xml"                                         
[27] "LICENSE"                                       
> setdiff(unique(project_git$file_pathname),DV8_variables)
[1] "src//com/houarizegai/calculator/Calculator.java"
carlosparadis commented 1 year ago

So, we lose one file because of Perceval inconsistency as we saw. But it does not affect us, since all files are named consistently throughout our analysis.

As for what DV8 keeps in this case by default, and Kaiaulu does not as part of our filter choice, I'd say ours is the more accurate one for Rick's needs: we are normally interested in analyzing the main program source code. And if that is ever not the case, we can just change it, as I just did, with a few words!

You see DV8 keeps some of the test files, such as "src/test/java/com/houarizegai/calculator/CalculatorTest.java" and "src/test/java/com/houarizegai/calculator/CalculatorUITest.java". These are unit tests, as the ones you did on Milestone 1 and 2. If we want to analyze them, we normally do so separately.

There are also some .idea files. These are configuration files generated by IntelliJ IDEA. Again, nothing we care about.

And so on. So, I guess what this little exercise showed us is that, while in Kaiaulu you know nothing is being filtered once you remove the filters from the project configuration file, we in turn do not know what DV8 is filtering by default (unless I missed it in the docs of the command). This is why I insisted that parameters be surfaced to the project configuration file as part of this milestone, like the ones that appear on arch:issue-arch:issue.

Moreover, you can now also see the value of your task 3.4 here: because we can now convert from a table in R that we can filter, we have a much higher level of precision over what we wish to filter before passing it to DV8 for analysis.


So, for what remains for us to test this function, since we can't compare with DV8 directly, we should make an example git log with a few files, similarly to how I made a function to generate a git log for your unit tests.

For this case, more specifically, what I want you to do is simple. Create a bunch of text files in one folder with at least one letter in them so they are not all empty. Let's call this our prototype folder. Make sure you give them names you can easily distinguish. Say, Files 1,2,3,4,5... etc.

Then, pick 3 of these files and make your first commit. Then next, pick 2 files, maybe 1 that overlaps with the 3, etc. Make 5 or so commits like this, experimenting with overlaps of the files or isolation of the files.

Then run your function against this git log. The content of the files doesn't matter, nor that they are source code for this function. All that matters is whether they were part of the same commit or not, because that's what co-change is counting!

Now you have an example you know how you committed, and you can validate if the value of co-change of each file makes sense to reality once you compute it.

Does that make sense? We can no longer compare to DV8, since with a different number of commits due to a different number of files, co-change will inevitably diverge.
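
The expected co-change values for such a hand-made log can also be tallied independently of Kaiaulu. A minimal sketch in base R, assuming a hypothetical set of three commits (the commit names and file lists are made up for illustration):

```r
# Hypothetical commits: each element lists the files changed in one commit.
commits <- list(
  c1 = c("1.txt", "2.txt", "3.txt"),
  c2 = c("2.txt", "3.txt"),
  c3 = c("3.txt", "4.txt")
)

# Every unordered pair of files within a commit counts as one co-change.
pairs_per_commit <- lapply(commits, function(files) {
  if (length(files) < 2) return(character(0))
  combn(sort(files), 2, FUN = paste, collapse = " | ")
})

# Tally how many commits each pair shares.
expected_cochange <- table(unlist(pairs_per_commit))
print(expected_cochange)
```

Here 2.txt and 3.txt share two commits, so their expected co-change is 2; every other pair shares one.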

carlosparadis commented 1 year ago

And that concludes the thought process on this and the next step! Let me know if anything is not clear 👍

leilani-reich commented 1 year ago

update on testing cochange

Hi Carlos, thank you for going through this process of filtering and seeing the difference between files.

I made a new git repo and added some files and did some co-commits as you suggested.

I am noticing that the Cochange values I get in my hdsm.json from my gitlog_to_hdsmj() are always twice the value of what I think they should be.

For instance, files 2.txt and 5.txt were co-committed 3 times looking at the git log, but the Cochange value in the hdsm.json is 6. Is that supposed to be the case for Kaiaulu?

Here is my git log, my gitlog_to_hdsmj hdsm.json, and the dv8 hdsm.json. cochange-git-tests.zip

leilani-reich commented 1 year ago

DV8 hdsm json concerns

Also, I know you said we shouldn't compare to dv8, at least for filtering, but just to be sure, I wanted to note that the dv8 hdsm.json makes double the cells as the gitlog_to_hdsmj json does, since it uses one cell for a certain file being a src and the other being a dest, and another for the files switched but still the same Cochange. For example,

  "cells": [
    {
      "src": 0,
      "dest": 1,
      "values": {
        "Cochange": 1
      }
    },
....
    {
      "src": 1,
      "dest": 0,
      "values": {
        "Cochange": 1
      }
    },

Also the DV8 hdsm.json Cochange values match up more with what I thought I would get from my hdsm.json. So for example my file 2.txt and 5.txt have a Cochange of 3, which I thought I would get from the gitlog_to_hdsmj json (but instead I got 6).

carlosparadis commented 1 year ago

Reply to: https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1518970305

Also, I know you said we shouldn't compare to dv8, at least for filtering, but just to be sure, I wanted to note that the dv8 hdsm.json makes double the cells as the gitlog_to_hdsmj json does, since it uses one cell for a certain file being a src and the other being a dest, and another for the files switched but still the same Cochange. For example,

This is important, so thank you for bringing it up. In graph theory, we have what are called directed and undirected graphs. As the names suggest, directed means the edges have directions, i.e. they are visually arrows. Undirected edges do not. And this affects how a clustering algorithm processes grouping. Can you check the DV8 SDSM file to see if it duplicates? I would assume not.

Conceptually, when graphs are derived from dependencies between files they should be directed graphs. If file A calls a function from file B, but file B never calls a function from file A, then you have direction in your edge between the two files. In a co-change graph you don't. You just know that, say, all 3 files are involved in the commit, but there is no special relation among the files. This is likely why in DV8 you see the file duplicating the src and dest in reverse: if direction flows both ways, it is equivalent to an undirected graph. Does that make sense conceptually to you?

Could you send me a google drive link for this example project you did so I could try it on my end too?

I want to check something on the code. I will reply to this in a bit with further details.

leilani-reich commented 1 year ago

Do you mean the github repo? It is here https://github.com/leilani-reich/testing-cochange.

leilani-reich commented 1 year ago

Can you check the DV8 SDSM file to see if it duplicates? I would assume not.

Yeah, I'm not seeing any duplication in terms of having two cells for src and dest with the same two files switching src/dest position.

carlosparadis commented 1 year ago

A github repository instead of Google Drive? Even better :) Good call.

Also, I know you said we shouldn't compare to dv8, at least for filtering, but just to be sure, I wanted to note that the dv8 hdsm.json makes double the cells as the gitlog_to_hdsmj json does, since it uses one cell for a certain file being a src and the other being a dest, and another for the files switched but still the same Cochange. For example,

Yeah, I'm not seeing any duplication in terms of having two cells for src and dest with the same two files switching src/dest position.

1) So if this is the case, then it confirms my conceptual assumption explained above. What you need to do then is, add an additional parameter to your graph_to_dsmj function. Let's call it is_directed_graph, and it will accept, as the name suggests, either TRUE or FALSE.

This should be a required parameter, do not include defaults.

If is_directed_graph is TRUE, then the behavior of the function is as-is: Simply map the edgelist to the json format as you would do, and nodes to variables.

However, if is_directed_graph is FALSE, then you now have to do what DV8 is doing. Basically, for every row in the edgelist table, you will add an extra reverse arrow for it, in essence duplicating.

What your function is doing conceptually, again, is representing an undirected graph, as a bi-directional graph, since DV8 probably adopted in memory a directed graph model as it is more generic.
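
The duplication step described above can be sketched as follows. This is not Kaiaulu's implementation: the helper name `add_reverse_edges` and the `from`/`to`/`weight` column names are assumptions made for illustration.

```r
library(data.table)

# Hypothetical helper: for an undirected graph, append a reversed copy of
# every edge so it reads as a bi-directional graph; directed graphs pass through.
add_reverse_edges <- function(edgelist, is_directed_graph) {
  if (is_directed_graph) return(edgelist)
  reversed <- edgelist[, .(from = to, to = from, weight = weight)]
  rbind(edgelist, reversed, use.names = TRUE)
}

# Assumed edgelist layout: one row per undirected file pair.
edgelist <- data.table(from = c("2.txt", "3.txt"),
                       to = c("5.txt", "5.txt"),
                       weight = c(3, 2))

cells <- add_reverse_edges(edgelist, is_directed_graph = FALSE)
print(cells)
```

Each pair keeps its weight in both directions, mirroring the two DV8 cells per file pair discussed earlier in the thread.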

2) Now, your transform function has to know somehow whether a graph is directed or undirected, to decide what to pass to your graph_dsmj. For that, I need to fix the model_directed_graph() function in R/graph.R. Basically, the returned list will contain not only the node and edgelist tables, but also a third boolean variable called is_directed. That's where your transform functions will pull it from when passing it to graph_to_dsmj. (And it seems I also need to add an optional node table parameter too, although this doesn't affect you right now, as a gitlog table is basically an edgelist-only table.)

Does that make sense?

3.1) Also, could you please check our understanding here with Yi again by sending him an e-mail and Cc'ing Rick, me and Yuanfang? More specifically, that dv8_gitnumstat_to_hdsmb() (scm:history:gittxt:convert-matrix) duplicates edges behind the scenes because it is an undirected graph represented as a directed graph, and that you observed this after using the command to turn it into a json (I can't remember the command name right now, but please mention it explicitly to him just in case); whereas dv8_depends_to_sdsmj (core:convert_matrix) does not create duplicates since it is a directed graph.

3.2) Lastly, also ask if the clustering algorithm cares about this, i.e. if the clustering algorithm cares about the graph being directed or not. This will be important to consider for other graphs we will provide as input. Might as well check now.

This is why I was hopeful for a specification of the conversions. Had you not identified the duplicates, this would probably be an impossible-to-identify problem without a lot of scrutiny. I appreciate your attention to detail! 💯

carlosparadis commented 1 year ago

p.s.: Please make sure to add some of these findings to the function documentation too. I'd add to both the transform functions, but especially to the graph function that DV8 represents undirected graphs as directed graphs by duplicating the edges in reverse direction.

In general, assume any information discussed in this sea of discussion will be lost to time. Anything essential has to live through the function docs and/or notebooks.

leilani-reich commented 1 year ago

I'd add to both the transform functions, but especially to the graph function that DV8 represents undirected graphs as directed graphs by duplicating the edges in reverse direction.

What do you mean by DV8 representing undirected graphs as directed graphs? Isn't it just undirected since the edges are duplicated in the reverse direction?

Also, I think you missed my comment here. I will repeat again here:

"I am noticing that the Cochange values I get in my hdsm.json from my gitlog_to_hdsmj() are always twice the value of what I think they should be. For instance, files 2.txt and 5.txt were co-committed 3 times looking a the git log, but the Cochange value in the hdsm.json is 6. Is that supposed to be the case for Kaiaulu?"

So is this because the current hdsm from my gitlog_to_hdsmj is directed and the Cochange values are getting added together as well so the Cochange is twice the value it should be?

carlosparadis commented 1 year ago

What do you mean by DV8 representing undirected graphs as directed graphs? Isn't it just undirected since the edges are duplicated in the reverse direction?

In concept yes, in the data structure no. I am talking about the actual structure in code/memory, not in concept. Generally in graph libraries, for example igraph, you are able to run functions against a graph representation to know if it is directed or not. This can be accomplished, for example, with the change I suggested in Kaiaulu by simply adding a third returned value is_directed. If you have that, you do not need to duplicate directed edges. Conversely, when parsing the dsmj, there is no easy way for us to tell if a graph is directed or not from the data.

We would need to parse the entire graph, then check whether every edge has a reverse pair, if we did not already know that DV8 does this for hdsmj and not for sdsmj. Long story short, the dsmj representation has limitations. For example, it cannot encode information about the nodes. We can annotate the edges with various weights in this format, but the nodes, expressed in variables, only allow for ids (note that in Kaiaulu this is not the case: nodes can be annotated with attributes to hold information such as size and color). But there is nothing you can do about the format.

Hence why I am asking you to duplicate when exporting to conform to it. In Kaiaulu, however, if I were to parse from dsmj in the future, I would need to code something to cut the duplicates in half and turn the is_directed as false. I hope the distinction makes sense.
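
The check and the "cut the duplicates in half" step could look roughly like this. It is only a sketch over a hypothetical table of already-parsed cells (the `src`/`dest` names follow the json above; the `cochange` column and the values are made up), not existing Kaiaulu code.

```r
library(data.table)

# Hypothetical cells, as they might look after parsing a dsmj "cells" array.
cells <- data.table(src = c(2L, 5L, 0L, 1L),
                    dest = c(5L, 2L, 1L, 0L),
                    cochange = c(3, 3, 1, 1))

# The cells encode an undirected graph only if every (src, dest) cell has
# a matching (dest, src) cell carrying the same value.
reversed <- cells[, .(src = dest, dest = src, cochange = cochange)]
is_undirected <- nrow(fsetdiff(cells, reversed)) == 0L

# Keep one cell per unordered pair to recover the undirected edgelist.
undirected_edges <- cells[src < dest]

print(is_undirected)
print(undirected_edges)
```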

So that's what I mean by representing undirected graphs as directed graphs: In Kaiaulu, with the change, you can represent both directed and undirected graphs, by means of the is_directed variable. In the dsmj, you can't. You have to duplicate the edges so you can create the conceptual equivalent to it. Duplicating edges to do this is not the norm, as far as I remember. Graph tools like Gephi also have the distinction.

"I am noticing that the Cochange values I get in my hdsm.json from my gitlog_to_hdsmj() are always twice the value of what I think they should be. For instance, files 2.txt and 5.txt were co-committed 3 times looking a the git log, but the Cochange value in the hdsm.json is 6. Is that supposed to be the case for Kaiaulu?"

So is this because the current hdsm from my gitlog_to_hdsmj is directed and the Cochange values are getting added together as well so the Cochange is twice the value it should be?

If files 2 and 5 are co-committed 3 times, then files 2,5 should have weight 3, not 6 in memory for Kaiaulu. I am confused about the hdsmj.json here. Is Kaiaulu saying co-change is 6? Or is DV8 saying the co-change is 6? This is also where the duplicates can be confusing. Does DV8 say weight 3 in one direction, and 3 in the other?

leilani-reich commented 1 year ago

Okay, just to confirm, "representing undirected graphs as directed graphs", specifically you are representing them as bi-directional graphs.

If files 2 and 5 are co-committed 3 times, then files 2,5 should have weight 3, not 6 in memory for Kaiaulu. I am confused about the hdsmj.json here. Is Kaiaulu saying co-change is 6? Or is DV8 saying the co-change is 6? This is also where the duplicates can be confusing. Does DV8 say weight 3 in one direction, and 3 in the other?

Yes, for files 2 & 5, Kaiaulu is saying co-change is 6 whereas DV8 has two cells for the same two files (src & dest switch), each cell has cochange 3.

gitlog_to_hdsmj() json:

{
      "src": 2,
      "dest": 5,
      "values": {
        "Cochange": 6
      }

versus DV8 hdsm.json:

{
    "src" : 2,
    "dest" : 5,
    "values" : {
      "Cochange" : 3.0
    }
  }
{
    "src" : 5,
    "dest" : 2,
    "values" : {
      "Cochange" : 3.0
    }
  }
carlosparadis commented 1 year ago

Ok, thanks. This is a bug in Kaiaulu then (sigh). I will take a look at it. This function was never used before in any prior study (social smells do not calculate metrics over weighted graphs yet), so I am glad this got caught now.

Here's what I suggest you do in the meantime on this task:

Take note of what their co-changes should be. And then we can test both DV8 and Kaiaulu against it.

Better yet, if at all possible, do so by creating a replica of this function and putting it in R/git.R (don't worry about redundancy, I can generalize this in the future):

https://github.com/sailuh/kaiaulu/blob/9db3295f65748819c438becbdd11e4f4ebed8c8e/R/git.R#L125

This would allow us to write a co-change unit test straight into Kaiaulu so it never misses the mark again (the check of 3 and 6 etc would all become assertions). But if it is too much overhead, just creating the log on the repo is fine too.


Lastly, to test you truly generalized graph_sdsmj(), try replacing this transform_gitlog_to_bipartite_network:

https://github.com/sailuh/kaiaulu/blob/9db3295f65748819c438becbdd11e4f4ebed8c8e/R/network.R#L41

with:

https://github.com/sailuh/kaiaulu/blob/9db3295f65748819c438becbdd11e4f4ebed8c8e/R/network.R#L108

And choose mode author. This should still give you a node and edgelist table. The name of the edge doesn't matter nor will make sense. We just want to check the function has been properly generalized.

This would cover checking generalization for graph.R.


One of the other issues with this function was performance. But we can get to it later if there is any time left with this co-change issue...

Thanks!

carlosparadis commented 1 year ago

Ok, I identified the issue with the weights. It lies in the projection. But it is not...a bug I guess..? At least the solution is clear so you can proceed.

In this part of your code:

https://github.com/sailuh/kaiaulu/blob/0e4b3b9e0a2b42c6b93939a43672050948d72604/R/network.R#L42

In your transform function, you want to call bipartite_graph_projection(..., is_intermediate_projection = TRUE). You will get this table:

> graph[["edgelist"]]
                                        from   to_projection from_projection from_weight to_weight weight
 1: 68f19fa1115e22a9543d84590e9bd51a0baecf63 prototype/2.txt prototype/1.txt           1         1      2
 2: 56e88ea920fd09d397728d26f8442512c2aaf84e prototype/4.txt prototype/3.txt           1         1      2
 3: 56e88ea920fd09d397728d26f8442512c2aaf84e prototype/5.txt prototype/3.txt           1         1      2
 4: 56e88ea920fd09d397728d26f8442512c2aaf84e prototype/5.txt prototype/4.txt           1         1      2
 5: 1c83105223f3e73dc919fcc83b3aeb3a2539f1ba prototype/5.txt prototype/2.txt           1         1      2
 6: 3e05c672d261c87f67607ebee00aa6fcaeacb2d8 prototype/3.txt prototype/2.txt           1         1      2
 7: 3e05c672d261c87f67607ebee00aa6fcaeacb2d8 prototype/5.txt prototype/2.txt           1         1      2
 8: 3e05c672d261c87f67607ebee00aa6fcaeacb2d8 prototype/5.txt prototype/3.txt           1         1      2
 9: a466275ac8656a0a8371962c7fcf3b69f35bcae0 prototype/4.txt prototype/2.txt           1         1      2
10: a466275ac8656a0a8371962c7fcf3b69f35bcae0 prototype/5.txt prototype/2.txt           1         1      2
11: a466275ac8656a0a8371962c7fcf3b69f35bcae0 prototype/5.txt prototype/4.txt           1         1      2

You will want to use either the from_weight or the to_weight as your co-change value. That's it.

All making is_intermediate_projection = FALSE does is:

https://github.com/sailuh/kaiaulu/blob/0e4b3b9e0a2b42c6b93939a43672050948d72604/R/graph.R#L193-L197

When this table is returned, you will instead do on your transform function the following:

https://github.com/sailuh/kaiaulu/blob/0e4b3b9e0a2b42c6b93939a43672050948d72604/R/graph.R#L196 but instead of weight=sum(weight) you will do weight=sum(from_weight)

Send a commit with this change to your PR, then try comparing to DV8 to see if it matches.
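
To see the difference between the two aggregations, here is a small sketch that rebuilds five rows of the intermediate projection table above (the 2/5 and 3/5 file pairs, with commit hashes abbreviated) and sums both columns:

```r
library(data.table)

# Rows for two file pairs from the intermediate projection table above
# (commit hashes abbreviated to 7 characters).
edgelist <- data.table(
  from = c("1c83105", "3e05c67", "a466275", "56e88ea", "3e05c67"),
  to_projection = "prototype/5.txt",
  from_projection = c("prototype/2.txt", "prototype/2.txt", "prototype/2.txt",
                      "prototype/3.txt", "prototype/3.txt"),
  from_weight = 1, to_weight = 1, weight = 2
)

# weight_sum doubles the count; from_weight_sum matches the number of commits.
summed <- edgelist[, .(weight_sum = sum(weight),
                       from_weight_sum = sum(from_weight)),
                   by = c("from_projection", "to_projection")]
print(summed)
```

For files 2 and 5, `sum(weight)` gives 6 while `sum(from_weight)` gives 3, the number of commits the two files share.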

The rationale

So is this because the current hdsm from my gitlog_to_hdsmj is directed and the Cochange values are getting added together as well so the Cochange is twice the value it should be?

Basically this.

[Image: Screen Shot 2023-04-23 at 4 14 25 PM]

Here's a doodle of the files 2 and 5 situation. The 2 black circles are the files 2 and 5. The blue circles are the commit hashes. During a projection, as you may recall from your Social Smells Cheat Sheet, one of the two nodes get deleted. When this happens, the files tied to it have to be inter-connected. Each weight contributes 1, so you end up with 2 on the pair of files.

However, this is the correct behavior of a weighted sum projection, so it is not a bug. If we had an entirely different graph here for the sake of example, where the black circles are 2 people and the blue circles are 3 different e-mail threads, and one person sent 2 e-mail replies and the other 3, it would not make sense to choose either 2 or 3 as the weight connecting them. Rather, 2+3 = 5 would be the more appropriate weight.

So... until I can wrap my head around this, or if anything comes to mind for you, I suggest we proceed with this implementation suggested above. Please do check all co-change values match to DV8 on this small example, then try on a larger example. The unit test proposed would still be helpful too just in case.

carlosparadis commented 1 year ago

Ok! I have a better conceptual view of this now.

Quantitatively, what co-change really counts during a projection transformation (i.e. the elimination of one of the two node types in a bipartite graph and the re-wiring of adjacent nodes) is the number of deleted nodes (blue circles with crosses in the image above) between a given pair of files, which is then assigned as the weight of said pair of files.

The usual criterion for assigning weight to the edge is the sum of the weights of the re-wired edges. But you could also choose any other scheme for how to assign the weight to the re-wired edges.

The mistake was on my part, when I told you to use the bipartite projection as is: On my first guess, I assumed co-change to be the sum of the weights, instead of the count of deleted nodes.

So the correct implementation is not the above, that's a workaround. To count the number of blue circles excluded from this table:

> graph[["edgelist"]]
                                        from   to_projection from_projection from_weight to_weight weight
 1: 68f19fa1115e22a9543d84590e9bd51a0baecf63 prototype/2.txt prototype/1.txt           1         1      2
 2: 56e88ea920fd09d397728d26f8442512c2aaf84e prototype/4.txt prototype/3.txt           1         1      2
 3: 56e88ea920fd09d397728d26f8442512c2aaf84e prototype/5.txt prototype/3.txt           1         1      2
 4: 56e88ea920fd09d397728d26f8442512c2aaf84e prototype/5.txt prototype/4.txt           1         1      2
 5: 1c83105223f3e73dc919fcc83b3aeb3a2539f1ba prototype/5.txt prototype/2.txt           1         1      2
 6: 3e05c672d261c87f67607ebee00aa6fcaeacb2d8 prototype/3.txt prototype/2.txt           1         1      2
 7: 3e05c672d261c87f67607ebee00aa6fcaeacb2d8 prototype/5.txt prototype/2.txt           1         1      2
 8: 3e05c672d261c87f67607ebee00aa6fcaeacb2d8 prototype/5.txt prototype/3.txt           1         1      2
 9: a466275ac8656a0a8371962c7fcf3b69f35bcae0 prototype/4.txt prototype/2.txt           1         1      2
10: a466275ac8656a0a8371962c7fcf3b69f35bcae0 prototype/5.txt prototype/2.txt           1         1      2
11: a466275ac8656a0a8371962c7fcf3b69f35bcae0 prototype/5.txt prototype/4.txt           1         1      2

We should instead use:

graph[["edgelist"]] <- graph[["edgelist"]][,.(weight=length(from)),by=c("from_projection","to_projection")]

graph[["edgelist"]][,.(weight=length(from)),by=c("from_projection","to_projection")]
   from_projection   to_projection weight
1: prototype/1.txt prototype/2.txt      1
2: prototype/3.txt prototype/4.txt      1
3: prototype/3.txt prototype/5.txt      2
4: prototype/4.txt prototype/5.txt      2
5: prototype/2.txt prototype/5.txt      3
6: prototype/2.txt prototype/3.txt      1
7: prototype/2.txt prototype/4.txt      1

More generally, the node weights could also be used instead to assign the weight to the pair of nodes after projection. But I will defer on that part.

Let me make a commit to this. I will let you know again when it is ready.

carlosparadis commented 1 year ago

@leilani-reich This commit should hopefully fix your issue. I have also modified your equivalent function (transform_gitlog_to_hdsmj) to use the appropriate weight scheme, which was the issue described above (it was using weight sum).

Can you try running your function to see if co-change now matches?

I still need to do #193, but in the meantime you can work on adding additional code to the graph_ function logic so that it duplicates the edges in reverse order. Once that logic is done, and assuming this did address the co-change problem, Milestone 3.4 ends.

leilani-reich commented 1 year ago

Hi Carlos, I ran my function with the new changes from your commit and the cochange is now matching with DV8. So for instance I see that my files 2 & 5 have cochange 3 instead of 6 (which I mentioned prior).

Here's the updated hdsm.json using your new commit fixes: april23-gitlog-cochange.json.zip

I still need to do https://github.com/sailuh/kaiaulu/issues/193, but in the meantime you can work on adding additional code to the graph_ function logic so that it duplicates the edges in reverse order.

I will work on this.

Also, do you want me to replicate the order of the filenames that DV8 uses in its hdsm.json output? DV8 appears to be ordering the filenames from newest (most recent committed/changed according to gitlog) to oldest. If so, I believe I could take the commiter_datetimetz column in the gitlog table to get the order of the filenames in my variables field.

I have provided the QTNotepad hdsm.json from DV8 & the QTNotepad gitlog, which shows that the order of the variables matches the most recently committed files in the gitlog. QT-gitlog-check-april23.zip
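A minimal sketch of that ordering idea, assuming a gitlog table with one row per file change. The column names `file_pathname` and `committer_datetime` and the sample rows are hypothetical stand-ins, not the actual parse_gitlog() output.

```r
library(data.table)

# Hypothetical gitlog excerpt; column names and values are illustrative only.
gitlog <- data.table(
  file_pathname = c("README.md", "main.cpp", "README.md", "LICENSE"),
  committer_datetime = as.POSIXct(
    c("2019-02-01 10:00:00", "2019-02-03 09:00:00",
      "2019-02-09 18:00:00", "2019-02-05 12:00:00"),
    tz = "UTC"
  )
)

# Latest change per file, then sort newest first.
latest <- gitlog[, .(last_change = max(committer_datetime)), by = file_pathname]
setorder(latest, -last_change)

# File names ordered from most recently changed to oldest.
variables <- latest$file_pathname
print(variables)
```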

carlosparadis commented 1 year ago

@leilani-reich

I made a few small changes to your gitlog_to_hdsmj and dependencies_to_sdsmj just so I could use them in the DV8 Showcase. I have also modified the DV8 Notebook so I could test them end-to-end. A config file for the calculator project was also added and used for that.

I could run most functions, but the Excel export no longer works. Could you try re-running the Notebook on your end as is using the calculator project, and then try to remove the eval = FALSE from the excel code block and see if it works for you? This used to work fine for me on the legacy functions DV8 has, so I am a bit worried something is missing on the dsmj functions.

A few other minor things: Do not use data.frame. It is substantially slower than data.table, since the former is implemented in R, the latter in C behind the scenes. I saw graph_dsmj has a call to data.frame.

Avoid referring to columns by their indices. This is easy to break and considered a risky practice. Use the column names instead. Examples of this were in the sdsmj and hdsmj functions and your graph_to_dsmj. Access the node and edgelist tables by column name, not by [1] and [2].
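To illustrate why index-based access is fragile, here is a small sketch on a made-up edgelist table (the column names `from`, `to` and `weight` are assumptions for the example):

```r
library(data.table)

# Hypothetical edgelist table; the column names are illustrative.
edgelist <- data.table(from = c("a.txt", "b.txt"),
                       to = c("b.txt", "c.txt"),
                       weight = c(2L, 1L))

# Fragile: depends on "from" being the first column.
by_index <- edgelist[[1]]

# Robust: keeps working regardless of column order.
by_name <- edgelist[["from"]]

# Reordering columns changes what index 1 points at, but not the name lookup.
setcolorder(edgelist, c("weight", "from", "to"))
index_still_matches <- identical(edgelist[[1]], by_index)
name_still_matches <- identical(edgelist[["from"]], by_name)
print(c(index = index_still_matches, name = name_still_matches))
```

After the reorder, index 1 returns the weight column, while the name-based lookup still returns the same values.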


The second thing is #193. I won't have much time this week to work on it, so I will defer. To work around it, in your transform_gitlog_to_hdsmj and transform_dependencies_to_sdsmj, simply hardcode a third parameter, is_directed, in their calls to graph_to_dsmj for now.

graph_to_dsmj can then use the third parameter as a TRUE or FALSE variable in an if statement to decide whether it should duplicate the edges in reverse or not. If is_directed = TRUE, you skip the duplication. If is_directed = FALSE, you perform it.
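A minimal sketch of that branch, assuming an edgelist with `from`, `to` and `weight` columns. The helper name `mirror_edges_if_undirected` is hypothetical and only shows the mirroring step, not the real graph_to_dsmj.

```r
library(data.table)

# Hypothetical helper, not kaiaulu's actual graph_to_dsmj: it only shows the
# edge-mirroring step controlled by is_directed.
mirror_edges_if_undirected <- function(edgelist, is_directed) {
  if (is_directed) {
    return(edgelist)  # directed graph: keep the edges as they are
  }
  # Undirected graph: add a copy of every edge with from and to swapped.
  reversed <- edgelist[, .(from = to, to = from, weight)]
  rbind(edgelist, reversed)
}

edgelist <- data.table(from = c("a.txt", "b.txt"),
                       to = c("b.txt", "c.txt"),
                       weight = c(2L, 2L))

undirected <- mirror_edges_if_undirected(edgelist, is_directed = FALSE)
directed <- mirror_edges_if_undirected(edgelist, is_directed = TRUE)
print(undirected)
```

The transform functions would then pass the flag explicitly in their call, e.g. FALSE for co-change and TRUE for dependencies, under the assumption that co-change is treated as undirected.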

With that, pending what is noted above, you should be able to finish your task.


I am not clear from your reply if, for the example co-change project you made, you believe the co-change values are now all correct. Let me know.

Thanks!

leilani-reich commented 1 year ago

reply to: https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1519888957

Hi Carlos,

Cochange concern

For the co-change example project I made, the co-change values all match DV8 and look correct to me.

I thought it worked on QTNotepad too, but I noticed one difference. The DV8 hdsm.json shows co-change between LICENSE and LICENSE.md (LICENSE.md was the "newfile" that caused issues earlier, since LICENSE was renamed to LICENSE.md). However, in my gitlog_to_hdsmj() output there is no co-change between LICENSE and LICENSE.md, since in the parse_gitlog() output they belong to different commit hashes.

Here's where the file gets renamed from LICENSE -> LICENSE.md in the QTNotepad gitlog.

commit fce0c10dc9dcaca394509065cb18cb93b949268d
Author: rattle99 <24495512+rattle99@users.noreply.github.com>
Date:   2019-02-09 23:35:28 +0530

    Rename LICENSE to LICENSE.md

0   0   LICENSE => LICENSE.md
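For illustration only, a rename entry like the one above can be split into its old and new path. This is a sketch with a hypothetical helper, `split_rename`, and it is not part of parse_gitlog(). It only handles the simple `old => new` form, not git's brace form such as `src/{a => b}/f.c`.

```r
# Hypothetical helper: splits a simple "old => new" numstat path into its
# two file names. Paths without " => " are returned unchanged on both sides.
split_rename <- function(path) {
  parts <- strsplit(path, " => ", fixed = TRUE)[[1]]
  if (length(parts) == 2) {
    list(old = parts[1], new = parts[2])
  } else {
    list(old = path, new = path)  # not a (simple) rename
  }
}

renamed <- split_rename("LICENSE => LICENSE.md")
unchanged <- split_rename("README.md")
print(renamed)
```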

Order of filenames in variables field in hdsmj

Also, I think you missed my question here.

> Also, do you want me to replicate the order of the filenames that DV8 uses in its hdsm.json output? DV8 appears to order the filenames from newest (most recently committed/changed according to the gitlog) to oldest. If so, I believe I could use the commiter_datetimetz column in the gitlog table to get the order of the filenames in my variables field.

leilani-reich commented 1 year ago

reply to: https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1519888957

Update on excel notebook

> Could you try re-running the Notebook on your end as is using the calculator project, and then try to remove the eval = FALSE from the excel code block and see if it works for you?

This works for me. I didn't get an error running the function.

[Image: Screenshot 2023-04-26 at 11 38 20 PM]

I will work on the other asks (issue #193, doubling cells but switching src & dest to match DV8, and fixing the code).

carlosparadis commented 1 year ago

Reply to https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1525282728

Thank you for detecting this difference. I think it is OK for us not to consider co-change between a file and its renamed variant. Just include a note in the function documentation that it treats renames this way.

I believe this distinction is more fundamental than the code itself: it comes from the flags passed to Perceval when generating the git log, which are in turn passed to git. I am adding this here just for completeness' sake.

https://github.com/sailuh/kaiaulu/blob/2b10c5b399ae7c8b4b9e1058cf1a55c9c905211f/R/parser.R#L29-L40

I also intend to move these flags to the project configuration file one day. The -C and -M flags affect the behavior of git log. Years ago I looked into this a bit for git blame (which enables us to do git log at the function level): https://github.com/sailuh/kaiaulu/issues/68#issuecomment-662234718

Git Log has the same flags: https://www.git-scm.com/docs/git-log, although much more buried.

[Image: Screen Shot 2023-04-27 at 3 57 25 AM]

So again, don't worry about it. Just add a note on the function. I am going to take the chance to create another issue for myself with the links above, referring to this comment, so that in the future I move these parameters to the project configuration file as well.