sailuh / kaiaulu

An R package for mining software repositories

http://itm0.shidler.hawaii.edu/kaiaulu

Mozilla Public License 2.0

18 stars 12 forks source link

Create a git blame parser #68

Closed carlosparadis closed 4 years ago

carlosparadis commented 4 years ago

Part of the effort towards #2

The idea is that we have a git_blame() parser function added to R/git.R

massihonda commented 4 years ago

Hi Carlos,

I created a function that stores the blame commit of each line of all files of a checkout. I used the conf file that I am attaching, where I had to insert the absolute path to the git directory in order to run. I was not able to do it with path.expand unfortunately. I added the functions in the git.R file and created a vignette show_case to run the blame function. The show_case uses the drill repository as an example. These files are attached too, but you would have to clone the drill repository into the git_repo folder, since it was too large to upload. If you think there are some things to change I can do it.

The function has two optional parameters: one is all, the other is since and until (to use together). If all is set to TRUE, instead of retrieving only the last commit that touched the lines, it retrieves all the commits the touched the lines in all the history. If a date is specified in since and until, it retrieves all the commits that changed the lines within these two dates. When all is set to TRUE the time to compute the data frame increases, reasonably.

Kind regards,

Massimo

carlosparadis commented 4 years ago

Hi Massimo,

I added some overall feedback concerning the code on #71 , however, I noticed one assumption you have on your code that may be going in a different direction than what you wanted to do for your thesis, which to my understanding is a collaboration network at function level, so I am double checking again.

In your git_blame() function, your parameters are:

parse_git_blame <- function(git_repo_path, all=FALSE, since, until){...}

However, the call to git blame via terminal is, for example on Kaiaulu:

git blame -p -C -w -M 9b2137579609144e98ca27e858edb76e05655eb2 git.R

A portion of the output, in turn, looks like this:

9da2897fe4dc320072fb8fd98b13c0c24430b3e6 1 1 4
author Carlos Paradis
author-mail <carlosviansi@gmail.com>
author-time 1590633783
author-tz -1000
committer Carlos Paradis
committer-mail <carlosviansi@gmail.com>
committer-time 1590633783
committer-tz -1000
summary i #12 Add MPL 2.0 License
previous 05dc1768f3e7d4c53881448222cc8cc06d2a52da R/parsers.R
filename R/parsers.R
        # This Source Code Form is subject to the terms of the Mozilla Public
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 2 2
        # License, v. 2.0. If a copy of the MPL was not distributed with this
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 3 3
        # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 4 4

9b2137579609144e98ca27e858edb76e05655eb2 5 5 39
author Carlos Paradis
author-mail <carlosviansi@gmail.com>
author-time 1593782688
author-tz -1000
committer Carlos Paradis
committer-mail <carlosviansi@gmail.com>
committer-time 1593782688
committer-tz -1000
summary i #62 add git interface functions
filename R/git.R
        #' Performs a git checkout on specified repo
9b2137579609144e98ca27e858edb76e05655eb2 6 6
        #'

The above is the top portion of the blame shown here. In order for me to finish the git log at function level I understood you wanted, you need to create a function that takes the following parameters:

git_blame(git_repo_path,commit_hash,file_name)

And returns that to memory. In the case of git_blame, you should not need to save to /tmp since it is a small output (a single file). git_log saves to /tmp/ because it can range from 400MB to GBs. Either way, your main challenge is on parse_git_blame(git_repo_path,commit_hash,file_name). This function will be in charge of actually parsing the output above into a tabular format.

If you can parse it with those 2 functions, I can finish integrating to the rest of code already in Kaiaulu to get your git log at function level.

Also, you can still reuse the existing branch and PR. You would just need to make another commit editing so git.R contains the original code plus this function, and R/parsers.R contains the parser function. We can squash all commits later to clean up.

Let me know if you have questions, or if I misunderstood your interest was not on the git log functions part.

massihonda commented 4 years ago

I see, my purpose is indeed also to arrive to create a network at function level. I am going to adjust it as you wish, I would just like to explain to you the reason for this implementation to better understand how to evolve.

At the beginning I also put the commit in the parameter, then I took it out, because I thought I could use the git_checkout function before to go to a specific commit. I stored the output into /tmp because when I called the terminal command it did not return anything I could store in a normal variable, so I used it as a workaround (I understand it is not the proper way though).

Instead of inserting the filename as a parameter, I created a function list_files_in_repo, that lists all the files in a repository at a specific commit. I called this function just before the first while loop, so that it stores all the files into a variable. Then the while loop iteratively calls the git_blame function on each file with an extension in the form of a programming language (in order to avoid parsing an image for example). The function returns a dataframe with the information about each line of all files in a specific commit. I thought it would have been possible to filter the data frame when you have to create the network at function level, i.e. selecting only those rows belonging to one file each time to see who committed each line, and joining the data frame of parse_gitlog to see the author of the commit and the data frame from the function that parse ctags to see which function is defined in that line.

I understand you want to put this loop outside the function parse_git_blame, so that parse_git_blame would return only the blame of only one file at the time, is it correct? Is it fine that the function returns a dataframe with the format I used (file_name, path, line, commit) , but containing the information of only one file? What should I do with the function list_files_in_repo?

On Fri, Jul 17, 2020 at 5:49 PM Carlos Paradis <SRS0=Id6m=A4=github.com= notifications@mijnuvt.nl> wrote:

Hi Massimo,

I added some overall feedback concerning the code on #71 https://github.com/sailuh/kaiaulu/pull/71 , however, I noticed one assumption you have on your code that may be going in a different direction than what you wanted to do for your thesis, which to my understanding is a collaboration network at function level, so I am double checking again.

In your git_blame() function, your parameters are:

parse_git_blame <- function(git_repo_path, all=FALSE, since, until){...}

However, the call to git blame via terminal is, for example on Kaiaulu:

git blame -p -C -w -M 9b2137579609144e98ca27e858edb76e05655eb2 git.R

A portion of the output, in turn, looks like this:

9da2897fe4dc320072fb8fd98b13c0c24430b3e6 1 1 4

author Carlos Paradis

author-mail carlosviansi@gmail.com

author-time 1590633783

author-tz -1000

committer Carlos Paradis

committer-mail carlosviansi@gmail.com

committer-time 1590633783

committer-tz -1000

summary i #12 Add MPL 2.0 License

previous 05dc1768f3e7d4c53881448222cc8cc06d2a52da R/parsers.R

filename R/parsers.R
    # This Source Code Form is subject to the terms of the Mozilla Public
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 2 2
    # License, v. 2.0. If a copy of the MPL was not distributed with this
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 3 3
    # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 4 4

9b2137579609144e98ca27e858edb76e05655eb2 5 5 39

author Carlos Paradis

author-mail carlosviansi@gmail.com

author-time 1593782688

author-tz -1000

committer Carlos Paradis

committer-mail carlosviansi@gmail.com

committer-time 1593782688

committer-tz -1000

summary i #62 add git interface functions

filename R/git.R
    #' Performs a git checkout on specified repo
9b2137579609144e98ca27e858edb76e05655eb2 6 6
    #'
The above is the top portion of the blame shown here https://github.com/sailuh/kaiaulu/blame/master/R/git.R. In order for me to finish the git log at function level I understood you wanted, you need to create a function that takes the following parameters:

git_blame(git_repo_path,commit_hash,file_name)

And returns that to memory. In the case of git_blame, you should not need to save to /tmp since it is a small output (a single file). git_log saves to /tmp/ because it can range from 400MB to GBs. Either way, your main challenge is on parse_git_blame(git_repo_path,commit_hash,file_name). This function will be in charge of actually parsing the output above into a tabular format.

If you can parse it with those 2 functions, I can finish integrating to the rest of code already in Kaiaulu to get your git log at function level.

Also, you can still reuse the existing branch and PR. You would just need to make another commit editing so git.R contains the original code plus this function, and R/parsers.R contains the parser function. We can squash all commits later to clean up.

Let me know if you have questions, or if I misunderstood your interest was not on the git log functions part.

and return a parsed table of the output above

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/sailuh/kaiaulu/issues/68#issuecomment-660182047, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL42SQE3IYD5Y6FQXWGXMHDR4BXHNANCNFSM4OXDOL6Q .

carlosparadis commented 4 years ago

At the beginning I also put the commit in the parameter, then I took it out, because I thought I could use the git_checkout function before to go to a specific commit. I stored the output into /tmp because when I called the terminal command it did not return anything I could store in a normal variable, so I used it as a workaround (I understand it is not the proper way though).

Your assumption of using git checkout is not incorrect, and it is indeed to the best of my knowledge the only way I have seen to date to compute static metrics in all implementations such as architectural flaws. However, it is my understanding also passing the commit hash will save you an immense amount of computational time, if not the only way viable to do this per commit. To be more specific:

The reason why when computing architectural flaws we have to use git checkout given a commit happens, is because even though we may know a given commit only changed, say, 4 files, we do not know by just looking these 4 files the new dependencies it created to the rest of the codebase. Specifically, without using an abstract syntax tree, which in itself adds also another overhead, we have no means to know all types of dependencies (eg. inheritances) generated to any of the other files in the codebase, even if only those 4 files changed. Hence, we must perform git_checkout, and then analyze all the files. Tools, like Understand and Depends, will require a folder path and will not leverage the git commit because of that. A side effect is that most publications, if not all, will compute these metrics per time window, or per release rather than per commit as the cost is prohibitive.

However, in your specific case, you do not need to do so: If a commit change 4 files, we can, with the implementation Kaiaulu already have of ctags and I also believe codeface does this way, only look at those 4 files instead of the entire folder tree. This means that if you are looking at a project like Chromium of thousands of files and over 900k commits, instead of ending up with 900k*5000 git blames, you may end up, on average, with something like 900k*5, as the average commit and human shouldn't be changing so many files at every commit, unless the code is really bad. Hence your execution time T should be something like O(n_commits)*O(5 ish), instead of O(n_commits)*O(n_files_in_folder) on the average case if my math serves me right.

You can also already do a git checkout iteration by using: https://github.com/sailuh/kaiaulu/blob/9b2137579609144e98ca27e858edb76e05655eb2/R/git.R#L5-L26

In the following manner on a vignette (and it is indeed what I did to compute "Data for Bob" for architectural flaws:

project_gitlog <- parse_gitlog(perceval_path,git_repo_path)
project_gitlog <- project_gitlog[order(data.AuthorDate)]
metrics <- list()
for(commit in 1:nrow(project_gitlog)){
  commit_hash <- project_gitlog$data.commit
  git_checkout(commit_hash,git_repo_path)
  metrics[[commit_hash]] <- parse_line_metrics(scc_path,git_repo_path)
}
# Reset the detached HEAD of the repo to the most current commit as it is by default
git_checkout("master",git_repo_path)

But back to answer your question:

I understand you want to put this loop outside the function parse_git_blame, so that parse_git_blame would return only the blame of only one file at the time, is it correct? Is it fine that the function returns a dataframe with the format I used (file_name, path, line, commit) , but containing the information of only one file? What should I do with the function list_files_in_repo?

You should not need to define a function that scans all file names in a folder, as we will already know what their filepath is and commit hash while looping through the gitlog. Your git_blame() function should only do one task generate the log and pass directly on memory to avoid I/O, and parse_git_blame(), which uses git_blame() should also only have one responsibility, which is to encode the logic to turn that output I've shown into a well-formatted data.table.

I will then put these together in a vignette with the other necessary functions, and finish your gitlog network at function level, and when I think it is good enough, turn into another function (which at that point I need to think which module it would go, if parse, or something else, but certainly not git as git.R is solely to define functions to interact with the git interface).

As for the file filter or time windows, don't worry about that for now. In the vignette, this information is obtained from the config file. I can just file filter in a vignette the rows of project_git and then pass to your function:

This is done in gitlog_showcase.Rmd:

project_git <- project_git  %>%
  filter_by_file_extension(file_extensions,"file")  %>% 
  filter_by_filepath_substring(substring_filepath,"file"

Also:

project_git_slice <- project_git[data.AuthorDate >= as.POSIXct("2019-01-01", format = "%Y-%m-%d",tz = "UTC") & data.AuthorDate < as.POSIXct("2019-12-31", format = "%Y-%m-%d",tz = "UTC")]

Hence in a similar manner when we call:

parse_git_blame(git_repo_path,commit_hash,file_name)

One column will contain the file_path, as shown on the output before there is one line that says filename R/parsers.R. We could use the same method above to filter, in the vignette, with the parameters passed on the config file, as we need. In the future, I may change the git() API to leverage more of its various parameters, but it's best to keep it simple for now and go on as needed basis.

On a side note, I noticed my output above, despite specifying a single file, contains information of more than one. My understanding of git blame output may be flawed. Maybe we should start by making sure when we pass a commit and a file name, it does indeed report just the lines and commit of that file, and not of all files the commit touches.

Let me know if the above makes sense or if you have questions.

carlosparadis commented 4 years ago

At the beginning I also put the commit in the parameter, then I took it out, because I thought I could use the git_checkout function before to go to a specific commit. I stored the output into /tmp because when I called the terminal command it did not return anything I could store in a normal variable, so I used it as a workaround (I understand it is not the proper way though).

Take a look at this older version of parse_gitlog to see how you can do it without saving to /tmp/:

https://github.com/sailuh/kaiaulu/blob/6c62c97eeb5668515f14495019117662184ebd15/R/parsers.R#L18-L24

The reason you can't get the output is because of the > in the list of parameters you pass:

https://github.com/sailuh/kaiaulu/blob/9b2137579609144e98ca27e858edb76e05655eb2/R/parsers.R#L66-L67

On a bash terminal, if you time a command like git log > somefile.txt what would be otherwise output to stdout is saved to somefile.txt instead. You can reason system2 as a function that copy and paste a command line to the terminal. With stdout = TRUE it will capture the output of the terminal. It may be counter intuitive to have stdout = TRUE when the output is saved to /tmp but that is required for a file to be saved as we use >. I didn't find a flag to explicitly have git to save the output. Most stackoverflow questions just suggest using >. Another example:

https://github.com/sailuh/kaiaulu/blob/9b2137579609144e98ca27e858edb76e05655eb2/R/parsers.R#L448-L467

You may need to fiddle with the exact function you read the input directly from the terminal. In the first I use jsonlite because Perceval gives a json output. In the second I use fread because the output to stdout can be a .csv. Gitlog may be neither, so you may need to use readlines instead, but it should be doable.

In the future, I will make it so all git_ functions can do both after I am certain there is no other way. For now, you can use the way the old parse_git used to obtain the output on memory.

massihonda commented 4 years ago

I understand, that's indeed a better way, I didn't think about it and I made it just intuitively. So I am going to refactor it in this way and let you know.

Regards,

Massimo

On Sat, Jul 18, 2020 at 5:29 AM Carlos Paradis <SRS0=ud+N=A5=github.com= notifications@mijnuvt.nl> wrote:

At the beginning I also put the commit in the parameter, then I took it out, because I thought I could use the git_checkout function before to go to a specific commit. I stored the output into /tmp because when I called the terminal command it did not return anything I could store in a normal variable, so I used it as a workaround (I understand it is not the proper way though).

Take a look at this older version of parse_gitlog to see how you can do it without saving to /tmp/:

https://github.com/sailuh/kaiaulu/blob/6c62c97eeb5668515f14495019117662184ebd15/R/parsers.R#L18-L24

The reason you can't get the output is because of the > in the list of parameters you pass:

https://github.com/sailuh/kaiaulu/blob/9b2137579609144e98ca27e858edb76e05655eb2/R/parsers.R#L66-L67

On a bash terminal, if you time a command like git log > somefile.txt what would be otherwise output to stdout is saved to somefile.txt instead. You can reason system2 as a function that copy and paste a command line to the terminal. With stdout = TRUE it will capture the output of the terminal. It may be counter intuitive to have stdout = TRUE when the output is saved to /tmp but that is required for a file to be saved as we use >. I didn't find a flag to explicitly have git to save the output. Most stackoverflow questions just suggest using >. Another example:

https://github.com/sailuh/kaiaulu/blob/9b2137579609144e98ca27e858edb76e05655eb2/R/parsers.R#L448-L467

You may need to fiddle with the exact function you read the input directly from the terminal. In the first I use jsonlite because Perceval gives a json output. In the second I use fread because the output to stdout can be a .csv. Gitlog may be neither, so you may need to use readlines instead, but it should be doable.

In the future, I will make it so all git_ functions can do both after I am certain there is no other way. For now, you can use the way the old parse_git used to obtain the output on memory.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/sailuh/kaiaulu/issues/68#issuecomment-660416746, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL42SQBD6SBIDSISGZDXIMTR4EJJ5ANCNFSM4OXDOL6Q .

carlosparadis commented 4 years ago

Thanks! Feel free to post questions here. I am still intrigued on why the git blame is giving information mentioning other file in Kaiaulu. I hope my understanding of it is not incorrect!

massihonda commented 4 years ago

HI Carlos,

I think that the blame message, when the parameter -p is specified, returns in the metadata the hash of the previous commit and the file it touched. However, it is only in this context and it should not harm the parsing I think. All the lines that contain an integer beside the hash commit belong to the blame of the file specified in the terminal command.

I would like to ask you a few questions. I could easily extract from the blame file you showed at the first mail the hash that modified the lines for each line in the file and put them in a data.table. For example:

line 1 '9da2897fe4dc320072fb8fd98b13c0c24430b3e6', line 2 '9da2897fe4dc320072fb8fd98b13c0c24430b3e6', ... line 41 '9da2897fe4dc320072fb8fd98b13c0c24430b3e6'

Do you think this information would be enough, considering that afterwards you could join this table with the table of git log on the commit and having more detailed info such as author, date, etc?

Secondly, should it return the hash for each line in the file, or only the lines that the commit specified in the parameter has touched?

Thanks,

Massimo

On Sat, Jul 18, 2020 at 11:05 AM Carlos Paradis <SRS0=ud+N=A5=github.com= notifications@mijnuvt.nl> wrote:

Thanks! Feel free to post questions here. I am still intrigued on why the git blame is giving information mentioning other file in Kaiaulu. I hope my understanding of it is not incorrect!

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/sailuh/kaiaulu/issues/68#issuecomment-660453416, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL42SQDOA264GUW4PC3DPWTR4FQW7ANCNFSM4OXDOL6Q .

carlosparadis commented 4 years ago

You may want to look this response on GitHub so the embedded links appear and is easier to read:

I think that the blame message, when the parameter -p is specified, returns in the metadata the hash of the previous commit and the file it touched.

If you mean by previous the ancestor commit to the one you pass as a parameter, then that should not be the case. For example,

Try: git blame -p -C -w -M 8459acac8d5d42ce2dc326269bd4ea65b1a5f0da R/filter.R

And the output should be a single file, filter.R, and the commit hash will be the same as the one provided as a parameter.

Take a look at the options man page and the formatting man page. -C and -M seem to be the reason we saw multiple files being shown despite providing only one. Because of that, I am afraid we can't just parse commit hashes but instead need to parse everything. But this shouldn't be too hard, considering the doc above explains the layout of the file.

Also, in regards to my comment 5 days ago and the associated command, try removing -p, so it is easier to read:

git blame -C -w -M 9b2137579609144e98ca27e858edb76e05655eb2 R/git.R

and you will get:

9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000  1) # This Source Code Form is subject to the terms of the Mozilla Public
9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000  2) # License, v. 2.0. If a copy of the MPL was not distributed with this
9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000  3) # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000  4) 
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  5) #' Performs a git checkout on specified repo
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  6) #'
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  7) #' @param commit_hash The commit hash the repo should be checkout
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  8) #' @param git_repo_path The git repo path
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000  9) #' @return Any error message generated by git
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000 10) #' @export
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000 11) git_checkout <- function(commit_hash,git_repo_path){
9b213757 R/git.R     (Carlos Paradis 2020-07-03 03:24:48 -1000 12)   # Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop")
...

As you can see, it is just the license header that adds another file instead of R/git.R. Why? If you see the commits:

You will notice the license header commit and the file were where I first added the license header, and the commit was when I first explicitly added the license to the R package. This is likely the behavior of -C and -M.

Try experimenting removing the flags -C and -M, and you will see when you remove -C, the file now shows the git blame as we originally expected:

git blame -w -M 9b2137579609144e98ca27e858edb76e05655eb2 R/git.R outputs:

9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  1) # This Source Code Form is subject to the terms of the Mozilla Public
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  2) # License, v. 2.0. If a copy of the MPL was not distributed with this
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  3) # file, You can obtain one at https://mozilla.org/MPL/2.0/.
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  4) 
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  5) #' Performs a git checkout on specified repo
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  6) #'
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  7) #' @param commit_hash The commit hash the repo should be checkout
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  8) #' @param git_repo_path The git repo path
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000  9) #' @return Any error message generated by git
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 10) #' @export
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 11) git_checkout <- function(commit_hash,git_repo_path){
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 12)   # Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop")
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 13)   git_repo_path <- path.expand(git_repo_path)
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 14)   # Remove ".git"
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 15)   folder_path <- stri_replace_last(git_repo_path,replacement="",regex=".git")
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 16)   error <- system2('git',
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 17)                   args = c('--git-dir',
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 18)                            git_repo_path,
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 19)                            '--work-tree',
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 20)                            folder_path,
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 21)                            'checkout',
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 22)                            commit_hash),
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 23)                   stdout = TRUE,
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 24)                   stderr = FALSE)
9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 25)   return(error)
...

So the bottom line is, take a look at what is the specified format of -p and encode in your parse_git_blame() function, and reason -C and -M to pass the explanation onto the function @description. This is where, over time, I think, it pays off to build these functions to wrap around these interfaces: They never make immediate sense, and once we wrap and document them properly, we don't need to go through this again. Stackoverflow has several questions on what git blame does as well, so I am led to believe it is confusing for its name.

Let me know if you have any other questions.

massihonda commented 4 years ago

Hi Carlos,

Thanks for your reply. I made progress on the implementation, but I have a doubt on the git_blame function. You want to build the function with the parameter (git_repo_path,commit_hash,file_name), where the git_repo_path is the path of the .git folder.

I noticed that the command

git blame -p -C -w -M 9b2137579609144e98ca27e858edb76e05655eb2 git.R

works if you are in the directory that contains git.R, so using setwd() in R or using the parameter of git -C, which run the command as if git was started in instead of the current working directory.

git -C ../kaiaulu/R/ blame -p -C -w -M 9b2137579609144e98ca27e858edb76e05655eb2 git.R

Thus, the function that I wrote is:

git_blame<- function(git_repo_path,commit_hash,file_name){

blame_text <- system2("git", args = c('-C', git_repo_path, 'blame', '--line-porcelain', '-C', '-w', '-M', commit_hash, file_name), stdout = TRUE, stderr = FALSE) return(blame_text) }

Where git_repo_path is the path of the folder that contains the file.

I noticed in git_log the git_repo path is used in the parameter --git-dir, but I do not understand how I should use it in this case, since the scenario seems very different.

It seems to me that I need to specify the path of the folder containing the file in the parameter of the function, or it is not the case?

On Wed, Jul 22, 2020 at 6:31 AM Carlos Paradis wrote:

You may want to look this response on GitHub so the embedded links appear and is easier to read:

I think that the blame message, when the parameter -p is specified, returns in the metadata the hash of the previous commit and the file it touched.

If you mean by previous the ancestor commit to the one you pass as a parameter, then that should not be the case. For example,

Try: git blame -p -C -w -M 8459acac8d5d42ce2dc326269bd4ea65b1a5f0da R/filter.R

And the output should be a single file, filter.R, and the commit hash will be the same as the one provided as a parameter.

Take a look at the options man page https://www.git-scm.com/docs/git-blame#_options and the formatting man page https://www.git-scm.com/docs/git-blame#_the_porcelain_format. -C and -M seem to be the reason we saw multiple files being shown despite providing only one. Because of that, I am afraid we can't just parse commit hashes but instead parse everything. But this shouldn't be too hard, considering the doc above explains the layout of the file.

Also, in regards to my comment 5 days ago https://github.com/sailuh/kaiaulu/issues/68#issuecomment-660182047 and the associated command, try removing -p, so it is easier to read:

git blame -C -w -M 9b2137579609144e98ca27e858edb76e05655eb2 R/git.R

and you will get:

9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 1) # This Source Code Form is subject to the terms of the Mozilla Public 9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 2) # License, v. 2.0. If a copy of the MPL was not distributed with this 9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 3) # file, You can obtain one at https://mozilla.org/MPL/2.0/. 9da2897f https://mozilla.org/MPL/2.0/.9da2897f R/parsers.R (Carlos Paradis 2020-05-27 16:43:03 -1000 4) 9b213757 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 5) #' Performs a git checkout on specified repo 9b213757 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 6) #' 9b213757 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 7) #' @param commit_hash The commit hash the repo should be checkout 9b213757 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 8) #' @param git_repo_path The git repo path 9b213757 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 9) #' @return Any error message generated by git 9b213757 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 10) #' @export 9b213757 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 11) git_checkout <- function(commit_hash,git_repo_path){ 9b213757 R/git.R (Carlos Paradis 2020-07-03 03:24:48 -1000 12) # Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop") ...

As you can see, it is just the license header that adds another file instead of R/git.R. Why? If you see the commits:

9da2897 https://github.com/sailuh/kaiaulu/commit/9da2897fe4dc320072fb8fd98b13c0c24430b3e6

9b21375 https://github.com/sailuh/kaiaulu/commit/9b2137579609144e98ca27e858edb76e05655eb2

You will notice the license header commit and the file was where I first added the license header, and the commit was when I first explicitly added the license to the R package. This is likely the behavior of -C and -M.

Try experimenting removing the flags -C and -M, and you will see when you remove -C, the file now shows the git blame as we originally expected:

git blame -w -M 9b2137579609144e98ca27e858edb76e05655eb2 R/git.R outputs:

9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 1) # This Source Code Form is subject to the terms of the Mozilla Public 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 2) # License, v. 2.0. If a copy of the MPL was not distributed with this 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 3) # file, You can obtain one at https://mozilla.org/MPL/2.0/. 9b213757 https://mozilla.org/MPL/2.0/.9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 4) 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 5) #' Performs a git checkout on specified repo 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 6) #' 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 7) #' @param commit_hash The commit hash the repo should be checkout 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 8) #' @param git_repo_path The git repo path 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 9) #' @return Any error message generated by git 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 10) #' @export 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 11) git_checkout <- function(commit_hash,git_repo_path){ 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 12) # Expand paths (e.g. "~/Desktop" => "/Users/someuser/Desktop") 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 13) git_repo_path <- path.expand(git_repo_path) 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 14) # Remove ".git" 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 15) folder_path <- stri_replace_last(git_repo_path,replacement="",regex=".git") 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 16) error <- system2('git', 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 17) args = c('--git-dir', 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 18) git_repo_path, 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 19) '--work-tree', 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 20) folder_path, 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 21) 'checkout', 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 22) commit_hash), 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 23) stdout = TRUE, 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 24) stderr = FALSE) 9b213757 (Carlos Paradis 2020-07-03 03:24:48 -1000 25) return(error) ...

So the bottom line is, take a look at what is the specified format of -p and encode in your parse_git_blame() function, and reason -C and -M to pass the explanation onto the function @description. This is where, over time, I think, it pays off to build these functions to wrap around these interfaces: They never make immediate sense, and once we wrap and document them properly, we don't need to go through this again. Stackoverflow has several questions on what git blame does as well, so I am led to believe it is confusing for its name.

Let me know if you have any other questions.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/sailuh/kaiaulu/issues/68#issuecomment-662234718, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL42SQDV2TTUO6VXBY6YBATR4ZTSLANCNFSM4OXDOL6Q .

carlosparadis commented 4 years ago

Hi Massimo,

I went ahead and pushed to the repo a git_blame function. That way you don't need to worry about the issues of setwd, getwd, and how it is established in the config file, and can fully focus on parse_git_blame() by calling it.

As you said, git_blame differs from the other git_functions in that it also requires a file and a commit_hash, and I can see how this can lead to confusion if it is not clear what exactly is providing the inputs to it so you can decide where your assumption of the current working directory should be. Add that to the fact git requires an extra flag to be executed from a different folder and this just becomes plain confusing.

You can use the git_blame() I just pushed as follows:

git_blame(git_repo_path = "rawdata/git_repo/OpenSSL/.git",
          flags = c('-p','-C','-w','-M'),
          commit_hash = '1940c092a52afd8bc919b8faa5f3d51004503f3a',
          file_path = 'apps/genpkey.c'
          )

This function will return a character vector, where each element of the character is a line of the commithash and the file you specified (plus what I explained earlier concerning -C and -M).

This was executed from the working directory "~/Desktop/kaiaulu". I know, the config files use ../rawdata/git_repo/OpenSSL/.git and that's also correct. This is because when you are in RStudio, if you are running the commands from an R Notebook after opening kaiaulu.Rproj, your working directory will be: "~/Desktop/kaiaulu". However, when you Knit the vignette into html, it will actually be "~/Desktop/kaiaulu/vignettes". Hence it is the later that the config it is accounting for, not the former when it states the data lies at "../rawdata/" instead of simply "rawdata". As it should be: When you knit a vignette, the start of each vignette will be used to load the config file variables and the associated paths, i.e. the later case. Whereas when you are running the code block by block on R console, you have full control of what goes in each function call, the former case, and my example can simply be pasted.

The truth is, there is no silver bullet way to code this so the config works from any possible path. This is also why the API makes no assumption whatsoever of the config files when you make function calls. It is the responsibility of the user to say where things are.

However, and not versioned yet, there is also a component of this repo that is batch mode (will eventually appear under a exec folder). That component will assume a config file as input since it is a CLI, so calling the CLI won't look something like this. But you don't need to worry about it on this PR.

Take note how I call git_blame(), because this is exactly how parse_git_log will be expected to be called. The only difference being that parse_git_log will be handling the parsing of the file provided by git blame. I should have provided you this from the beginning, but my understanding of git blame itself was crude. Hence, I think this will facilitate your life :)

I hope this helps!

carlosparadis commented 4 years ago

Just to clarify, the git_repo_path parameter will be still supplied by the user, which already is available in the config file. The others will not. That will be obtained via parse_gitlog. You can use my example call above as an example call for your parse_git_blame function to ignore these details too.

For now, don't worry about making additional files for .Rmd showcase, or additional functions. Just commit parse_git_blame(git_repo_path,flags,commit_hash,file_path) in the same branch you already have a Pull Request. Inside of the function, call git_blame with the parameters, obtain the blame data, and write your code for the parser. Make a commit with just this function.

Also, don't worry about how to update your branch with the commit I just did so git_blame appears on it (you can just copy and paste on the R Console the function definition to use it for now). We can go over that afterward to avoid mixing the code effort with the GitHub workflow nuances.

Let me know if you have questions!

massihonda commented 4 years ago

Okay. I would like to ask one thing, because in the function I wrote so far of parse_git_blame I used the flag --line-porcelain instead of -p, which shows the porcelain format, but output commit information for each line, not just the first time a commit is referenced. Since I am storing all the information for each line in a tabular format such as author name,committer name, email, etc, this made the parsing easier, otherwise I could work to re-do with -p, but it would become more complex I think. Is it right, or there are reasons to use -p instead of --line-porcelain? Is the information other than the commit hash and line relevant or it should be more minimalistic?

Thanks,

Massimo

On Thu, Jul 23, 2020 at 2:48 AM Carlos Paradis wrote:

Just to clarify, the git_repo_path parameter will be still supplied by the user, which already is available in the config file. The others will not. That will be obtained via parse_gitlog. You can use my example call above as an example call for your parse_git_blame function to ignore these details too.

For now, don't worry about making additional files for .Rmd showcase, or additional functions. Just commit parse_git_blame(git_repo_path,flags,commit_hash,file_path) in the same branch you already have a Pull Request. Inside of the function, call git_blame with the parameters, obtain the blame data, and write your code for the parser. Make a commit with just this function.

Also, don't worry about how to update your branch with the commit I just did so git_blame appears on it (you can just copy and paste on the R Console the function definition to use it for now). We can go over that afterward to avoid mixing the code effort with the GitHub workflow nuances.

Let me know if you have questions!

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/sailuh/kaiaulu/issues/68#issuecomment-662767903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL42SQF7GY56J4LKFEQEZT3R46CEBANCNFSM4OXDOL6Q .

carlosparadis commented 4 years ago

I don't know. Codeface chose -p over --line-porcelain and git blame documentation says -p is the one to be used for machine consumption, not --line-porcelain. I also do not know the output, can you paste it here as I did in my responses?

The main problem, I believe, in putting everything in 1 line is that you run the risk the actual content of the code, which should be parsed, ends up being affected by whatever you use to separate the various columns. Meanwhile -p already does the work for you, so there is no risk of the content of the code being affected.

You should parse all data made available by git blame. There is no reason not to since they can all be potentially relevant to other work.

massihonda commented 4 years ago

the beginning of the output is :

9da2897fe4dc320072fb8fd98b13c0c24430b3e6 1 1 4
author Carlos Paradis
author-mail <carlosviansi@gmail.com>
author-time 1590633783
author-tz -1000
committer Carlos Paradis
committer-mail <carlosviansi@gmail.com>
committer-time 1590633783
committer-tz -1000
summary i #12 Add MPL 2.0 License
previous 05dc1768f3e7d4c53881448222cc8cc06d2a52da R/parsers.R
filename R/parsers.R
        # This Source Code Form is subject to the terms of the Mozilla
Public
9da2897fe4dc320072fb8fd98b13c0c24430b3e6 2 2
author Carlos Paradis
author-mail <carlosviansi@gmail.com>
author-time 1590633783
author-tz -1000
committer Carlos Paradis
committer-mail <carlosviansi@gmail.com>
committer-time 1590633783
committer-tz -1000
summary i #12 Add MPL 2.0 License

It is the same as with -p, but with -p, the information of a commit such as author name,mail etc are elencated only the first time the commit appear, then, when the same commit reappear in another line, it only specifies the hash commit, the lines, and the content of the line like this:

9da2897fe4dc320072fb8fd98b13c0c24430b3e6 2 2

    # License, v. 2.0. If a copy of the MPL was not distributed with this

On the contrary, with --line-porcelain it would be like this, because it repeats the info every time:

9da2897fe4dc320072fb8fd98b13c0c24430b3e6 2 2 author Carlos Paradis author-mail carlosviansi@gmail.com author-time 1590633783 author-tz -1000 committer Carlos Paradis committer-mail carlosviansi@gmail.com committer-time 1590633783 committer-tz -1000 summary i #12 Add MPL 2.0 License

I wonder if the info like author, author-email etc should be included in the final table the function parse_git_blame gives. If that's the case, --line-porcelain would make the work easier, whereas if they are not needed -p would be a better option.

On Thu, Jul 23, 2020 at 8:54 AM Carlos Paradis wrote:

I don't know. Codeface chose -p over --line-porcelain and git blame documentation says -p is the one to be used for machine consumption, not --line-porcelain. I also do not know the output, can you paste it here as I did in my responses?

The main problem, I believe, in putting everything in 1 line is that you run the risk the actual content of the code, which should be parsed, ends up being affected by whatever you use to separate the various columns. Meanwhile -p already does the work for you, so there is no risk of the content of the code being affected.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/sailuh/kaiaulu/issues/68#issuecomment-662847967, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL42SQBJVNGQEO6UUQRDCPDR47ND7ANCNFSM4OXDOL6Q .

carlosparadis commented 4 years ago

I see what you're saying. If that is the case, then --line-porcelain just considerably inflates the number of lines of a file repeating the same information. I don't know if this will be a good price to pay on a git log like Chromium which has 20GB. Try to go with -p and if you get stuck I can try to help.

Also, try to avoid for loops if you can and use vectorized operations instead. Loops in R are terrible, and parse_git_blame will be used in a very expensive computation. A for loop here may push the process to several hours.

Looking at the file example from:

git_blame_output <- git_blame(git_repo_path = "rawdata/git_repo/OpenSSL/.git",
          flags = c('-p','-C','-w','-M'),
          commit_hash = '1940c092a52afd8bc919b8faa5f3d51004503f3a',
          file_path = 'apps/genpkey.c'
          )

For example on the above if I use stri_match_all_regex(git_blame_output,"^([a-f0-9]{40}) (\\d+) (\\d+)"):

I get (a slice of the output):

.
.
.

[[768]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[769]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[770]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[771]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[772]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[773]]
     [,1]                                               [,2]                                      
[1,] "b3c6a331853c125985cdf0d3ae60e1894974c3eb 383 272" "b3c6a331853c125985cdf0d3ae60e1894974c3eb"
     [,3]  [,4] 
[1,] "383" "272"

[[774]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[775]]
     [,1]                                               [,2]                                      
[1,] "0f113f3ee4d629ef9a4a30911b22b224772085e5 349 273" "0f113f3ee4d629ef9a4a30911b22b224772085e5"
     [,3]  [,4] 
[1,] "349" "273"

[[776]]
     [,1] [,2] [,3] [,4]
[1,] NA   NA   NA   NA  

[[777]]
     [,1]                                               [,2]                                      
[1,] "f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 327 274" "f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab"
     [,3]  [,4] 
[1,] "327" "274"

.
.
.

With this function, you should be able to parse out the content and line numbers of commits. See the holes of NAs from 768 to 772? Maybe you can capitalize on the fact -p doesn't keep repeating things to get these holes properly parsed. Otherwise, you just need to parse the index + 1 of the content "as is", even if the entire line is just blank, so the code can be recreated.

The other regex symbols you can find here: https://rdrr.io/cran/stringi/man/stringi-search-regex.html

massihonda commented 4 years ago

Hi Carlos,

I tried to follow your advice as much as I could and I created a function that converts the log message into a table. The function is:

parse_git_blame<-function(git_repo_path,flags,commit_hash,file_path){

blame_file <- git_blame(git_repo_path, flags, commit_hash, file_path )

cond<-'(?<=.)(?=[a-f0-9]{40} \d+)' pasted <- paste(blame_file,collapse = ' ') splitted <- stri_split_regex(pasted,cond)

commit_line <- stri_match_all_regex(unlist(splitted),"^([a-f0-9]{40}) (\d+) (\d+)")

commits <- unlist(lapply(commit_line, [[, 2)) line_number_of_the_line_in_the_original_file <- unlist(lapply(commit_line, [[, 3)) line_number_of_the_line_in_the_final_file <- unlist(lapply(commit_line, [[, 4)) authors <- unlist(stri_match_all_regex(unlist(splitted), "(?<=author).(?=author-mail)")) author_mails <- unlist(stri_match_all_regex(unlist(splitted), "(?<=author-mail).(?=author-time)")) author_time <- unlist(stri_match_all_regex(unlist(splitted), "(?<=author-time).*(?=author-tz)"))

author_tz <- stri_match_all_regex(unlist(splitted),

"(?<=author-tz).(?=committer)") committers <- unlist(stri_match_all_regex(unlist(splitted), "(?<=committer).(?=committer-mail)")) committer_mails <- unlist(stri_match_all_regex(unlist(splitted), "(?<=committer-mail).(?=committer-time)")) committer_time <- unlist(stri_match_all_regex(unlist(splitted), "(?<=committer-time).(?=committer-tz)")) committer_tz <- unlist(stri_match_all_regex(unlist(splitted), "(?<=committer-tz).(?=summary)")) summaries <- unlist(stri_match_all_regex(unlist(splitted), "(?<=summary).(?=filename)")) file_names <- unlist(stri_match_all_regex(unlist(splitted), "(?<=filename).(?=\t)")) lines <- unlist(stri_match_all_regex(unlist(splitted), "(?<=\t).$"))

df <- data.table(commits,line_number_of_the_line_in_the_original_file,line_number_of_the_line_in_the_final_file,

authors,author_mails,author_time,committers,committer_mails,committer_time,committer_tz,summaries, file_names,lines) return(df) }

In pasted <- paste(blame_file,collapse = ' ') it joins all the lines of the blame file generated with your function git_blame(), so that all the text is in one line. Then, in splitted <- stri_split_regex(pasted,cond) it splits this block on the hash commit, so that each string contains the metadata of one line of the file to parse, for example:

"f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 375 324 author Dr. Stephen Henson author-mail steve@openssl.org author-time 1144762132 author-tz +0000 committer Dr. Stephen Henson committer-mail steve@openssl.org committer-time 1144762132 committer-tz +0000 summary Initial keygen support. filename apps/genpkey.c \t}"

After, the regex functions are applied for retrieving each one of this metadata, and then they are converted into a table. The output is like:

parse_git_blame(git_repo_path,flags,commit_hash,file_path) commits line_number_of_the_line_in_the_original_file 1: 0f113f3ee4d629ef9a4a30911b22b224772085e5 2 2: 6738bf1417289a14758590fca5a26b62c9b2c0be 2 3: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 56 4: 846e33c729311169d9c988ceba29484b3783f244 4 5: 846e33c729311169d9c988ceba29484b3783f244 5

320: 0f113f3ee4d629ef9a4a30911b22b224772085e5 398 321: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 369 322: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 370 323: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 374 324: f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 375 line_number_of_the_line_in_the_final_file authors author_mails author_time 1: 1 Matt Caswell < matt@openssl.org> 1421898055 2: 2 Matt Caswell < matt@openssl.org> 1518526289 3: 3 Dr. Stephen Henson < steve@openssl.org> 1144762132 4: 4 Rich Salz < rsalz@openssl.org> 1463509110 5: 5 Rich Salz < rsalz@openssl.org> 1463509110

320: 320 Matt Caswell < matt@openssl.org> 1421898055 321: 321 Dr. Stephen Henson < steve@openssl.org> 1144762132 322: 322 Dr. Stephen Henson < steve@openssl.org> 1144762132 323: 323 Dr. Stephen Henson < steve@openssl.org> 1144762132 324: 324 Dr. Stephen Henson < steve@openssl.org> 1144762132 committers committer_mails committer_time committer_tz 1: Matt Caswell matt@openssl.org 1421918409 +0000 2: Matt Caswell matt@openssl.org 1518530365 +0000 3: Dr. Stephen Henson steve@openssl.org 1144762132 +0000 4: Rich Salz rsalz@openssl.org 1463509159 -0400 5: Rich Salz rsalz@openssl.org 1463509159 -0400

320: Matt Caswell matt@openssl.org 1421918409 +0000 321: Dr. Stephen Henson steve@openssl.org 1144762132 +0000 322: Dr. Stephen Henson steve@openssl.org 1144762132 +0000 323: Dr. Stephen Henson steve@openssl.org 1144762132 +0000 324: Dr. Stephen Henson steve@openssl.org 1144762132 +0000

                     summaries

1: Run util/openssl-format-source -v -c . previous 22b52164aaed31d6e93dbd2d397ace041360e6aa apps/genpkey.c 2: Update copyright year previous abfee348c613ea8037561ee850cac94218f73f64 apps/genpkey.c 3: Initial keygen support. 4: Copyright consolidation 01/10 previous be9c8deb7de92feb5e5300f2e46d3516bcc43c00 apps/genpkey.c 5: Copyright consolidation 01/10 previous be9c8deb7de92feb5e5300f2e46d3516bcc43c00 apps/genpkey.c

320: Run util/openssl-format-source -v -c . previous 22b52164aaed31d6e93dbd2d397ace041360e6aa apps/genpkey.c 321: Initial keygen support. 322: Initial keygen support. 323: Initial keygen support. 324: Initial keygen support. file_names lines 1: apps/genpkey.c / 2: apps/genpkey.c Copyright 2006-2018 The OpenSSL Project Authors. All Rights Reserved. 3: apps/genpkey.c 4: apps/genpkey.c Licensed under the OpenSSL license (the "License"). You may not use 5: apps/genpkey.c * this file except in compliance with the License. You can obtain a copy

320: apps/genpkey.c c = '\n'; 321: apps/genpkey.c BIO_write(b, &c, 1); 322: apps/genpkey.c (void)BIO_flush(b); 323: apps/genpkey.c return 1; 324: apps/genpkey.c }

The content of the original file is on the column named lines, that is the last one. The other columns, like author, without using --line-porcelain would be mostly NA, because they are defined only once, but the columns lines, commits, line_number_of_the_line_in_the_original_file,line_number_of_the_line_in_the_final_file would be full anyway.

Please let me know what you think about it.

kind regards,

Massimo

On Thu, Jul 23, 2020 at 10:31 AM Carlos Paradis wrote:

I see what you're saying. If that is the case, then --line-porcelain just considerably inflates the number of lines of a file repeating the same information. I don't know if this will be a good price to pay on a git log like Chromium which has 20GB. Try to go with -p and if you get stuck I can try to help.

Also, try to avoid for loops if you can and use vectorized operations instead. Loops in R are terrible, and parse_git_blame will be used in a very expensive computation. A for loop here may push the process to several hours.

Looking at the file example from:

git_blame_output <- git_blame(git_repo_path = "rawdata/git_repo/OpenSSL/.git", flags = c('-p','-C','-w','-M'), commit_hash = '1940c092a52afd8bc919b8faa5f3d51004503f3a', file_path = 'apps/genpkey.c' )

For example on the above if I use stri_match_all_regex(a,"^([a-f0-9]{40}) (\d+) (\d+)"):

I get (a slice of the output):

. .. .

[[768]] [,1] [,2] [,3] [,4] [1,] NA NA NA NA

[[769]] [,1] [,2] [,3] [,4] [1,] NA NA NA NA

[[770]] [,1] [,2] [,3] [,4] [1,] NA NA NA NA

[[771]] [,1] [,2] [,3] [,4] [1,] NA NA NA NA

[[772]] [,1] [,2] [,3] [,4] [1,] NA NA NA NA

[[773]] [,1] [,2] [1,] "b3c6a331853c125985cdf0d3ae60e1894974c3eb 383 272" "b3c6a331853c125985cdf0d3ae60e1894974c3eb" [,3] [,4] [1,] "383" "272"

[[774]] [,1] [,2] [,3] [,4] [1,] NA NA NA NA

[[775]] [,1] [,2] [1,] "0f113f3ee4d629ef9a4a30911b22b224772085e5 349 273" "0f113f3ee4d629ef9a4a30911b22b224772085e5" [,3] [,4] [1,] "349" "273"

[[776]] [,1] [,2] [,3] [,4] [1,] NA NA NA NA

[[777]] [,1] [,2] [1,] "f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab 327 274" "f5cda4cbb17c908ceef33f4f52d94e8e04b7c1ab" [,3] [,4] [1,] "327" "274"

.. . ..

With this function, you should be able to parse out the content and line numbers of commits. See the holes of NAs from 768 to 772? Maybe you can capitalize on the fact -p doesn't keep repeating things to get these holes properly parsed. Otherwise, you just need to parse the index + 1 of the content "as is", even if the entire line is just blank, so the code can be recreated.

The other regex symbols you can find here: https://rdrr.io/cran/stringi/man/stringi-search-regex.html

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/sailuh/kaiaulu/issues/68#issuecomment-662884147, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL42SQAYIPKZAM7WCXHF4GDR47YNNANCNFSM4OXDOL6Q .

carlosparadis commented 4 years ago

Hi Massimo,

Could you format your post on GitHub and enclose the code portions in code blocks and the table? It is really hard to read and for some reason, my editing will not work. Also for the table, maybe it is best you send me via e-mail an attachment to it. R way to break columns instead of providing a horizontal bar was always very confusing to me. Thanks for the work :)

carlosparadis commented 4 years ago

parse_git_blame<-function(git_repo_path,flags,commit_hash,file_path){

  blame_file <- git_blame(git_repo_path,
          flags,
          commit_hash,
          file_path
)

cond<-'(?<=.)(?=[a-f0-9]{40} \\d+)'
pasted <- paste(blame_file,collapse = ' ')
splitted <- stri_split_regex(pasted,cond)

commit_line <- stri_match_all_regex(unlist(splitted),"^([a-f0-9]{40})
(\\d+) (\\d+)")

commits <- unlist(lapply(commit_line, `[[`, 2))
line_number_of_the_line_in_the_original_file <- unlist(lapply(commit_line,
`[[`, 3))
line_number_of_the_line_in_the_final_file <- unlist(lapply(commit_line,
`[[`, 4))
authors <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=author).*(?=author-mail)"))
author_mails <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=author-mail).*(?=author-time)"))
author_time <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=author-time).*(?=author-tz)"))
#author_tz <- stri_match_all_regex(unlist(splitted),
"(?<=author-tz).*(?=committer)")
committers <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=committer).*(?=committer-mail)"))
committer_mails <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=committer-mail).*(?=committer-time)"))
committer_time <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=committer-time).*(?=committer-tz)"))
committer_tz <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=committer-tz).*(?=summary)"))
summaries <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=summary).*(?=filename)"))
file_names <- unlist(stri_match_all_regex(unlist(splitted),
"(?<=filename).*(?=\t)"))
lines <- unlist(stri_match_all_regex(unlist(splitted), "(?<=\t).*$"))

df <-
data.table(commits,line_number_of_the_line_in_the_original_file,line_number_of_the_line_in_the_final_file,

 authors,author_mails,author_time,committers,committer_mails,committer_time,committer_tz,summaries,
                 file_names,lines)
return(df)
}

Actually, scratch that. Can you go ahead and commit the code on the branch you already have a PR? We are on the same page on the function, I will make any final comments there :) Thanks.

massihonda commented 4 years ago

Ok, I committed. I'll wait for your review.

Thank you,

Massimo

On Sun, Jul 26, 2020 at 1:00 PM Carlos Paradis <SRS0=DiD/=BF=github.com= notifications@mijnuvt.nl> wrote:

parse_git_blame<-function(git_repo_path,flags,commit_hash,file_path){

blame_file <- git_blame(git_repo_path, flags, commit_hash, file_path )

cond<-'(?<=.)(?=[a-f0-9]{40} \d+)' pasted <- paste(blame_file,collapse = ' ') splitted <- stri_split_regex(pasted,cond)

commit_line <- stri_match_all_regex(unlist(splitted),"^([a-f0-9]{40}) (\d+) (\d+)")

commits <- unlist(lapply(commit_line, [[, 2)) line_number_of_the_line_in_the_original_file <- unlist(lapply(commit_line, [[, 3)) line_number_of_the_line_in_the_final_file <- unlist(lapply(commit_line, [[, 4)) authors <- unlist(stri_match_all_regex(unlist(splitted), "(?<=author).(?=author-mail)")) author_mails <- unlist(stri_match_all_regex(unlist(splitted), "(?<=author-mail).(?=author-time)")) author_time <- unlist(stri_match_all_regex(unlist(splitted), "(?<=author-time).*(?=author-tz)"))

author_tz <- stri_match_all_regex(unlist(splitted),

"(?<=author-tz).(?=committer)") committers <- unlist(stri_match_all_regex(unlist(splitted), "(?<=committer).(?=committer-mail)")) committer_mails <- unlist(stri_match_all_regex(unlist(splitted), "(?<=committer-mail).(?=committer-time)")) committer_time <- unlist(stri_match_all_regex(unlist(splitted), "(?<=committer-time).(?=committer-tz)")) committer_tz <- unlist(stri_match_all_regex(unlist(splitted), "(?<=committer-tz).(?=summary)")) summaries <- unlist(stri_match_all_regex(unlist(splitted), "(?<=summary).(?=filename)")) file_names <- unlist(stri_match_all_regex(unlist(splitted), "(?<=filename).(?=\t)")) lines <- unlist(stri_match_all_regex(unlist(splitted), "(?<=\t).$"))

df <- data.table(commits,line_number_of_the_line_in_the_original_file,line_number_of_the_line_in_the_final_file,

authors,author_mails,author_time,committers,committer_mails,committer_time,committer_tz,summaries, file_names,lines) return(df) }

Actualy, scratch that. Can you go ahead and commit the code on the branch you already have a PR? We are on the same on the function, I will make any final comments there :) Thanks.

Regards,

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/sailuh/kaiaulu/issues/68#issuecomment-663973454, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL42SQDMVEQJELNNTWJUZTDR5QEC3ANCNFSM4OXDOL6Q .

carlosparadis commented 4 years ago

@massihonda thank you for the changes, I have been working on this ever since you sent me the new commit. There are some changes I had to do locally because of how utags works.

The good news is that as of yesterday night I have a vignette that will successfully blame a file and use it's blamed content column to parse using utags, generating the additional information of whether a line is a function or not: I.e. we are able now to get all the information of git blame and add an additional column if said line is a function or not. So we're almost there with the gitlog at function level.

There is one thing I overlooked however: When we use git blame, we obtain information of every single line. I originally thought utags would do the same: but it does not.

The current flags I am using of utags do not output what every line of code is. Rather it output a few lines, some of which functions are one of them. There is a way to output the end line of functions but the output format is really weird. Technically if we are able to consistently parse where a function starts and ends given a file though, we should be able to tell if a person changed a function and which it is.

What is strange is that codeface does not use said approach, it uses more flags and doesn't include e:

  # f = functions, s = structs, c = classes, n = namespace
  # p = function prototype, g = enum, d = macro, t= typedef, u = union
  # e = end line
["u","f", "s", "c", "n", "p", "g", "d", "t"]

As you see, they do more than functons too. This is very important to understand as it affects the dependencies we are really looking at when we say "at funciton level".

So that's basically where I am at. Remember Codeface uses Exuberant Ctags according to the paper, but we use Universal Ctags (uctags). Exuberant is now dead code as far as I understand, and Universal Ctags is the sucessor. If you want to take a look at this to speed up us getting to the gitlog functions, check the README.md on how to install uctags, and try running a few examples to see if you can make sense of all these flags.

Let me know if anything doesn't make sense!

carlosparadis commented 4 years ago

I have summarized my findings on #2 since it is not git blame related. I think I have everything I need now. Give me a few days and we will have the gitlog at the function level.

carlosparadis commented 4 years ago

@massihonda when you have a chance, take a look at this function. I added you as a ctb on DESCRIPTION as part of this commit, and when I accept your PR for the configs git will add you as a contributor to the page. I will be pushing a vignette in a moment. A few comments:

The actual git blame output needed was 1 line of the original source code file annotated with the metadata. Otherwise, we can't easily use the line_content column to pass it to uctags.
When I suggested using a regex to parse commit hashes and not using loops, what I was trying to say is that you could observe how the format of the file changed, and then apply the regex to the lines. For example, author- information is followed by committer information, etc. Using this approach, I was able to narrow down the possible git blame files to be of one of the 3 cases (see sub-function parse_line_content). In your PR, you scan the entire git blame file every time you call stri_match_all_regex() over 10 times, even if you already know the line was associated with something else from a prior call of stri_match_all_regex(). The pushed function only applies the regex matching for commit hash in all lines in the pushed function. For the remaining ones, it leverages the possible 3 ways the blame file format appears to parse exactly the lines for the regex once.

I went ahead and extended the API files to have regex. This will allow us to unit test them as well. If you can think of counter cases, please let me know.

I will send the remaining commits in a moment for the use of uctags to a separate issue.

massihonda commented 4 years ago

Ok Carlos, I will pay more attention next time if I shall work on another part!

On Sun, Aug 2, 2020 at 1:06 PM Carlos Paradis wrote:

@massihonda https://github.com/massihonda when you have a chance, take a look at this function. I added you as a ctb on DESCRIPTION as part of this commit, and when I accept your PR for the configs git will add you as a contributor to the page. I will be pushing a vignette in a moment. A few comments:

The actual git blame output needed was 1 line of the original source code file annotated with the metadata. Otherwise, we can't easily use the line_content column to pass it to uctags.

When I suggested using a regex to parse commit hashes and not using loops, what I was trying to say is that you could observe how the format of the file changed, and then apply the regex to the lines. For example, author- information is followed by committer information, etc. Using this approach, I was able to narrow down the possible git blame files to be of one of the 3 cases (see sub-function parse_line_content). In your PR, you scan the entire git blame file every time you call stri_match_all_regex() over 10 times, even if you already know the line was associated with something else from a prior call of stri_match_all_regex(). The pushed function only applies the regex matching for commit hash in all lines in the pushed function. For the remaining ones, it leverages the possible 3 ways the blame file format appears to parse exactly the lines for the regex once.

I went ahead and extended the API files to have regex. This will allow us to unit test them as well. If you can think of counter cases, please let me know.

I will send the remaining commits in a moment for the use of uctags to a separate issue.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sailuh/kaiaulu/issues/68#issuecomment-667659865, or unsubscribe https://github.com/notifications/unsubscribe-auth/AL42SQEVJPRQDB4RNQPOBELR6VCEPANCNFSM4OXDOL6Q .