sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0
20 stars 12 forks source link

gitlog_to_hdsmj() number of variables differ from DV8 output + Refactor into R/graph.R and R/network.R #184

Closed leilani-reich closed 1 year ago

leilani-reich commented 1 year ago

Refer to https://github.com/sailuh/kaiaulu/pull/171#issuecomment-1503153951 for info.

For my project bridging the gap, my gitlog_to_hdsmj() outputs a json with 162 variables (the "variables" field in the json) whereas the json output by dv8 gives 108 variables. On the other hand, for the project QtNotepad, my gitlog_to_hdsmj() outputs a json with 8 variables whereas the json output by dv8 gives 10 variables.

gitlog-output-april-9.zip

carlosparadis commented 1 year ago

Reply to https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1525282728 second part:

Also, I think you missed my question https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1519865657.

Also, do you want me to replicate the order of the filenames that DV8 uses in its hdsm.json output? DV8 appears to be ordering the filenames from newest (most recent committed/changed according to gitlog) to oldest. If so, I believe I could take the commiter_datetimetz column in the gitlog table to get the order of the filenames in my variables field.

Wasn't there already a alphabetical sort in one of those functions? If it is not too much of a hassle, you could include as a second sorting criteria using the same parameter. The default parameter would be NULL. Alternative parameters would be one of two words choosing the method (see how I do that for the transform functions when deciding if author-committer, etc).

But I would save this for the very, absolute last unless it is a few lines of code. There are still a number of other higher priorities here. Both finishing the duplication in reverse order so we can finish Milestone 3.4, followed by the generalization of graph_dsm(), and that download / parser function from Bugzilla to greatly reduce the number of gets. And we only have this week left.

carlosparadis commented 1 year ago

Reply to https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1525283273

Just to be clear, are you saying it works just fine on your end the entire Notebook on the most current commit version? If so, I will try re-running this again on my end.

I will work on the other asks (issue https://github.com/sailuh/kaiaulu/issues/193, doubling cells but switching src&dest (to match dv8), fixing code)).

Sounds good!

leilani-reich commented 1 year ago

reply to: https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1525492406

Just to be clear, are you saying it works just fine on your end the entire Notebook on the most current commit version?

Yes, I had no issues running the entire notebook with the new commits.

leilani-reich commented 1 year ago

Do you want me to change the dependencies_to_sdsmj and gitlog_to_hdsmj currently in the R/dv8.R file? Or is it ok to leave it be? Ultimately, we are going to drop these for the transform_dependencies_to_sdsmj, transform_gitlog_to_hdsmj, and graph_to_dsmj, right?

carlosparadis commented 1 year ago

Reply to: https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1525504430

For now, we can leave them be on your Pull Request, until we can test the graph_todsmj and their functionality is mapped to the associated transform* functions and subsequently replace on the Notebook. Changing them now will break the Notebook until we have the replacements.

But you are correct that the intent is to remove them once that is done.

Once graph_to_dsmj() and the transformer functions are done, you want to test it works with the:

https://github.com/sailuh/kaiaulu/blob/2b10c5b399ae7c8b4b9e1058cf1a55c9c905211f/R/network.R#L109

to see if you truly generalized the function.

Note the above generates a directed graph, contrary to the default parse_gitlog() that generates a bipartite and requires projection which leads to a bipartite graph. It is also not representing co-change, but your function should work nonetheless since it is just a general representation to represent graphs in JSON at that point. You can see their distinction visually here:

http://itm0.shidler.hawaii.edu/kaiaulu/articles/gitlog_showcase.html#bipartite-networks-and-projections

http://itm0.shidler.hawaii.edu/kaiaulu/articles/gitlog_showcase.html#temporal-networks

You may have questions on how to fully generalize the function across these, so feel free to iterate here as much as you need if it helps save you time.

leilani-reich commented 1 year ago

reply to: https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1525516627

Hi Carlos, to be clear, do you want me to change this line https://github.com/sailuh/kaiaulu/blob/2b10c5b399ae7c8b4b9e1058cf1a55c9c905211f/R/network.R#L41

with this line in transform_gitlog_to_hdsmj()

https://github.com/sailuh/kaiaulu/blob/9db3295f65748819c438becbdd11e4f4ebed8c8e/R/network.R#L108

as mentioned here?

And then I don't use any projection since you said transform_gitlog_to_temporal_network gives the directed graph, right?

carlosparadis commented 1 year ago

Reply to https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1527253324

You would create a new function, much as you did for transform_gitlog_to_hdsmj and transform_dependencies_to_sdsmj. Here, you would call them, I guess, transform_temporal_gitlog_to_adsmj. Whereas a, stands for author`.

As you said, and I hope you saw on the linked Git Log Notebook, the graph generated by the temporal git log is already projected, and is a directed graph of author to author.

Your additional temporal function, shouldn't really have to do much, other than calling transform_gitlog_to_temporal_network over project_git, and subsequently passing the graph to graph_to_dsmj.

If you are capable to do that, then you have succeeded in fully generalizing graph_to_dsmj to any graph in Kaiaulu.

Note this also showcases how in the future, other transform functions may be written reusing your graph_to_dsmj function. For example, maybe in the future we will have to save a issue network , or a CVE network, etc. They will either be projected networks that are undirected or directed, which your function already handles by passing a flag of either creating duplicate reverse edges or not.

The other thing that you should be aware on the generalization, is the type of edges that the graph contains so you can appropriately encode that information on the JSON.

For example, for the gitlog, we have to pass the edges as labeled cochange, the file dependencies can be any DV8 capture, adsmj may be called something else, etc. The function has to be capable to extract that information from the graph table passed as input (contrary to undirected and directed which is currently passed manually as a TRUE/FALSE flag parameter). I,e. your function graph_to_dsmj can't assume what the edge weight is called, just how to obtain it. In which case I need some idea of how you are getting that information in your graph_to_dsmj. Could you point it out the lines to me?

I can't remember if the graph tables currently provide you sufficient information, and the dependencies graph you obtain may be lacking too.

carlosparadis commented 1 year ago

@leilani-reich Here's some code and more concrete examples.

SDSM

melt(project_dependencies[["edgelist"]],id.vars = c("src_filepath","dest_filepath"),variable.name = "label")

If you apply the above to the edgelist on your transform_dependencies_to_sdsmj you obtain a table like so:

Screen Shot 2023-04-28 at 3 27 18 PM

Let's call this the long form of the edgelist table. Whereas project_dependencies is on the wide form. More generally:

image

You can go back and forth between these formats using melt and dcast functions in R.

What I was trying to say above, is that your graph_to_sdsmj should always expect a edgelist table of the long form, with the columns from, to,weight and label. It is the transform_* function responsibility to format the data, and re-name the columns accordingly for graph_to_dsmj. So the melt function code above would be placed on your transform_dependencies_to_sdsmj.

Here you don't need to use model_directed_graph since what you get from parse_dependencies() is already a graph.

Note the flag passed to graph_to_dsmj here should indicate this is a directed graph, and thus no duplication of edges.

HDSM

Meanwhile, for the transform_gitlog_to_hsdsmj you would have:

Screen Shot 2023-04-28 at 3 44 10 PM

This table is mostly on the right format. So you only have to perform:

table$label <- "Cochange"

Here we have an undirected graph, and hence it should pass the flag indicating duplicating edges in reverse direction.

ADSM

It is important the label you assign here is consistent with what DV8 is expecting for the two above. Granted, DV8 does not currently expect a network of authors as I recommended above, so we can choose any label we want and assume it to be a future functionality to DV8. It is still useful for us to export the table and check your function generalizes.

Screen Shot 2023-04-28 at 3 48 58 PM

You can see the temporal gitlog edgelist is mostly in the right format. You just need to insert a label column again. Call it collaborate.

The graph is directed, so you should not duplicate edges here.

That's it. Make sure your graph_to_dsmj function docs describe on the parameter for graph the expected table format and column names. And remember to refer to column names through your code, instead of column positions.

Let me know if you have any questions.

leilani-reich commented 1 year ago

reply to: https://github.com/sailuh/kaiaulu/issues/184#issuecomment-1527990872

For example, for the gitlog, we have to pass the edges as labeled cochange, the file dependencies can be any DV8 capture, adsmj may be called something else, etc. The function has to be capable to extract that information from the graph table passed as input (contrary to undirected and directed which is currently passed manually as a TRUE/FALSE flag parameter). I,e. your function graph_to_dsmj can't assume what the edge weight is called, just how to obtain it. In which case I need some idea of how you are getting that information in your graph_to_dsmj. Could you point it out the lines to me?

Hi Carlos, in regards to how I am getting the right name of the edge weight for graph_to_dsmj, for the gitlog_to_hdsmj, first I rename the weight value to "Cochange" (I will do something similar for transform_temporal_gitlog_to_adsmj to rename weight to "Collaborate"). https://github.com/sailuh/kaiaulu/blob/2b10c5b399ae7c8b4b9e1058cf1a55c9c905211f/R/network.R#L46

Then, in graph_to_dsmj, I use this code to get all the parameter names actually used in my table out of a list of potential parameter names (note I will add the parameter 'Collaborate' for the ADSM): https://github.com/sailuh/kaiaulu/blob/2b10c5b399ae7c8b4b9e1058cf1a55c9c905211f/R/graph.R#L36-L39

And only if a parameter is present (length > 0), will it be added into a cell https://github.com/sailuh/kaiaulu/blob/2b10c5b399ae7c8b4b9e1058cf1a55c9c905211f/R/graph.R#L50-L55

Also, thank you for providing more code & differences between SDSM, HDSM, and ADSM in your other comment.

carlosparadis commented 1 year ago

To be clear:

https://github.com/sailuh/kaiaulu/blob/2b10c5b399ae7c8b4b9e1058cf1a55c9c905211f/R/graph.R#L36-L39

This should not exist in graph_to_dsmj: This function should make no assumption on the values of the table. Think about it: Suppose in the future there are 50 transforms. If graph had to assess their values, the function would run very long and confusing.

This information should be on the transform_dependencies_to_sdsmj(), because this function is only in charge of that conversion.

The only assumption graph_to_dsmj should make are the column names and what should be contained in them. You should assume if the value is available on the "label" column, then you should include the associated edges on the export.

I can make inline comments to the function too if you think it will be easier.

leilani-reich commented 1 year ago

Hi Carlos, I understand. I was using the wide format table for parse_dependencies before, which was why it was like that. I will fix it, as well as the inline corrections you just sent. Also, so the column names for the table passed to graph_to_dsmj() should be from, to, label, and weight? So we use the name "weight" instead of "value"? Just wanted to clarify this column name.

carlosparadis commented 1 year ago

Yes. Use specifically the order from, to, weight, label.

The melt function should have a parameter that lets you rename value called value.name so just specify value.name = "label" and it should return the correct name.

carlosparadis commented 1 year ago

PR was accepted, so I am closing this issue. We can open a new one for any problems associated to this.