uptake / pkgnet

R package for analyzing other R packages via graph representations of their dependencies
https://uptake.github.io/pkgnet/
Other
155 stars 37 forks source link

Only round the numeric columns in the output table that have at least… #249

Closed j450h1 closed 4 years ago

j450h1 commented 5 years ago

… one non-Integer values. Right now those columns are betweenness and pageRank. Resolves issue #228

j450h1 commented 5 years ago

I've found the block of code to change the data types it is in the node_measures function.

                if (!m %in% names(self$nodes)) {
                    log_info(sprintf("Calculating %s...", m))
                    result <- private$node_measure_functions[[m]](self)
                    #result <- as.character(result) #didn't work for some reason
                    resultDT <- data.table::data.table(
                        node_name = names(result)
                        , result = as.character(result) #change here
                    )
                    setkeyv(resultDT, 'node_name')
                    self$nodes[, eval(m) := resultDT[node, result]]
                }

A new parameter will need to be added to the node_measures feature to specify the type since I couldn't figure out a way to adjust the type outside of this function. I see two different approaches here, the new parameter could have a default type of character and then only the double columns (betweenness and pageRank) would not have this character conversion.

The 2nd approach is to have no default type and it would be a required argument to specify the type. If the type was integer, then a conversion would be done otherwise everything stays the same.

Let me know what approach is preferred and if you have any questions before I proceed.

Here is an example of where the node_measures function is called:

   graphOutDegree = function(self){
                measure <- 'outDegree'
                igraph::centralize(
                    scores = self$node_measures(measure)[, get(measure)]
                    , theoretical.max = igraph::centr_degree_tmax(
                        graph = self$igraph
                        , mode = "out"
                        , loops = TRUE
                    )
                    , normalized = TRUE
                )
            }

With the updated function I think this line would look something like:

scores = self$node_measures(measure, "integer")[, get(measure)]
j450h1 commented 5 years ago

Another approach for the new argument is to focus on integers only and have a boolean which defaults to FALSE and would need to adjusted to TRUE for the 4 integer columns.

codecov-io commented 5 years ago

Codecov Report

Merging #249 into master will increase coverage by 0.01%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #249      +/-   ##
==========================================
+ Coverage   92.44%   92.46%   +0.01%     
==========================================
  Files          12       12              
  Lines         927      929       +2     
==========================================
+ Hits          857      859       +2     
  Misses         70       70
Impacted Files Coverage Δ
R/GraphClasses.R 93.15% <ø> (ø) :arrow_up:
R/AbstractGraphReporter.R 90% <100%> (ø) :arrow_up:
R/FunctionReporter.R 93.25% <0%> (+0.02%) :arrow_up:
R/testing_utils.R 82.85% <0%> (+0.5%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update f6de57b...d74dcb8. Read the comment docs.

j450h1 commented 5 years ago

@jameslamb Let me know if you have any preference before I proceed with the updated way to change the node_measures function

bburns632 commented 5 years ago

@j450h1 , James will be AFK for a while, so I apologize for some feedback that may be redundant (there's a lot you've discussed here).

The issue is with how the DT in the report is handling the rounding. At first thought, I would not want to affect the datatypes in the data that is available after a CreatePackageReport() run. I think we were on the same page with that since I see the first attempts were to modify the code HERE, between the node data.table creation and the report formatting.

HOWEVER, as you point out @j450h1, the root issue is that the output of outDegree,inDegree,numRecursiveDeps,numRecursiveRevDeps are the wrong data type. They should be integers but are double. I think this is masked a little b/c of how R console handles this.

Example:

library(pkgnet)
t <- CreatePackageReport('ggplot2')
str(t$DependencyReporter$nodes)

image

Which looks fine in the terminal: image

But DT appropriately rounds these doubles (b/c they are doubles): image


SUGGESTION Just cast the function output to integer with the function.

Example (not tested and tabs a wreak):

outDegree = function(self){
as.integer(
                igraph::degree(
                    graph = self$igraph
                    , mode = "out"
                    , loops = TRUE
                )
)
            }

Theoretical Output with one column type changed: image image

@jameslamb I believe I saw your concern above about changing functionality, but I think that risk is minimal. Worth seeing what unit tests come back with.

So @j450h1, it's not as elaborate of a suggestion, but what do you think about casting the output of those functions to integer?

jayqi commented 5 years ago

I agree with @bburns632 suggestion of casting to int within the functions where those specific four measures are calculated.

j450h1 commented 5 years ago

Okay great. I will work on update later today.

j450h1 commented 5 years ago

Adjusting as such:

           outDegree = function(self){
                result <- igraph::degree(
                    graph = self$igraph
                    , mode = "out"
                    , loops = TRUE
                )
                as.integer(result)
            }

leads to this error:

## Error in setkeyv(resultDT, "node_name"): some columns are not in the data.table: node_name

Looks like this is related to the fact that names(result) is now NULL and is assigned to node_name as per below:

                # If not already calculated it, calculate and add to node DT
                if (!m %in% names(self$nodes)) {
                    log_info(sprintf("Calculating %s...", m))
                    result <- private$node_measure_functions[[m]](self)
                    resultDT <- data.table::data.table(
                        node_name = names(result)
                        , result = result
                    )
                    setkeyv(resultDT, 'node_name')
                    self$nodes[, eval(m) := resultDT[node, result]]
                }

I'm not very familiar with the Datatable package, so would appreciate any thoughts on why just changing the type to integer would affect the names?

jayqi commented 4 years ago

@j450h1 very sorry for the long delay in helping you figure this out. I think we're close, so I hope you still have time to take a look at this.


The igraph::degree function returns named vectors. Later, where we put it into a data.table, it grabs the names from the vector to assign to the node name column.

Apparently, as we are both learning, the as.integer function strips names from vectors. Quick example:

myvec <- c(foo=1.0, bar=2.0)
myvec
#> foo bar 
#>   1   2
names(myvec)
#> [1] "foo" "bar"
is.integer(myvec)
#> [1] FALSE
myvec <- as.integer(myvec)
names(myvec)
#> NULL
is.integer(myvec)
#> [1] TRUE

Based on a quick look at this StackOverflow question, I think the best option is to instead use mode to make those vectors into integers.

myvec <- c(foo=1.0, bar=2.0)
myvec
#> foo bar 
#>   1   2
mode(myvec) <- "integer"
myvec
#> foo bar 
#>   1   2
names(myvec)
#> [1] "foo" "bar"
is.integer(myvec)
#> [1] TRUE

Created on 2020-01-12 by the reprex package (v0.3.0)


Give that a try, and see if it works. Thanks so much for your patience and for taking a shot at this issue!

j450h1 commented 4 years ago

Sure @jayqi I will give it a crack, thanks for pointing me in the right direction!

j450h1 commented 4 years ago

Looks like that did the trick! See latest commit:

image

With the assumption that these two which I don't see in the output (Out-Closeness and In-Closeness) are not integers and should not be converted like the 4 as shown in above screenshot:

            # Out-Closeness
            # Closeness doesn't really work for directed graphs that are not
            # strongly connected.
            # igraph calculates a thing anyways and gives a warning
            # Typically given as normalized values
            , outCloseness = function(self){
                suppressWarnings(igraph::closeness(
                    graph = self$igraph
                    , mode = "out"
                    , normalized = TRUE
                ))
            }

            # In-Closeness
            # Closeness doesn't really work for directed graphs that are not
            # strongly connected.
            # igraph calculates a thing anyways and gives a warning
            # Typically given as normalized values
            , inCloseness = function(self){
                suppressWarnings(igraph::closeness(
                    graph = self$igraph
                    , mode = "out"
                    , normalized = TRUE
                ))
            }
j450h1 commented 4 years ago

Likewise I appreciate your help and patience as I worked through these final updates.

Sent from my iPhone

On Jan 16, 2020, at 6:33 PM, Jay Qi notifications@github.com wrote:

 Merged #249 into master.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.