laubichler-lab / innovation-rich-club

Calculating rich club network and innovation statistics

Review request #1

Open bcdaniels opened 4 years ago

bcdaniels commented 4 years ago

This set of IPython notebooks implements: 1) computation of the rich club coefficient as a function of degree cutoff, and 2) comparison of citation count distributions across two groups of papers, testing whether they are statistically distinct.

The data this code relies on is not yet uploaded here. Should we put it in the GitHub repository? (Large files can be a pain in git.)

@deryc may have more up-to-date versions of the code that he is using for more recent analysis.

jdamerow commented 4 years ago

Cool! 😀 Some example data would be nice, but if you put it in a Dropbox or something, we can also get it from there. Also, can you add some explanation to the Readme regarding what the different notebooks do, what one would need to know to run them (e.g. adjust path here, then just run all cells, or whatever is needed), etc? Thanks!

bcdaniels commented 4 years ago

Going through this again, I think the most reusable code boils down to just a few lines. It makes sense to keep the rest of the code around, too, but much of it probably isn't worth developing any further right now.

The input data for the rich club coefficient was a bibliographic coupling matrix (Bibliographic_Coupling_Matrix.xlsx), from which we calculated the rich club coefficient (see the last section, "5.24.2018 use networkx's version of rich club coefficient", of the file paper-figures-rich-club-innovation.ipynb). We could provide the xlsx file as an example if @deryc is okay with it.

The relevant code for the rich club coefficient calculation is just a few lines:

import networkx as nx
import pandas as pd
import numpy as np

coupling = pd.read_excel('Bibliographic_Coupling_Matrix.xlsx')
# Threshold to an unweighted adjacency matrix and zero the diagonal:
# a paper's coupling with itself would create self-loops, which
# nx.rich_club_coefficient rejects.
adjacency = np.array(coupling) > 0
np.fill_diagonal(adjacency, False)
G = nx.from_numpy_array(adjacency)
rcc = nx.rich_club_coefficient(G)
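One detail worth noting: networkx normalizes the coefficient against a degree-preserving randomization by default (`normalized=True`), while `normalized=False` returns the raw coefficient. Since the .xlsx isn't in the repo yet, here is a self-contained sketch of the same pipeline on a small synthetic matrix (the data below is illustrative stand-in data, not the real coupling matrix):

```python
import networkx as nx
import numpy as np

# Synthetic stand-in for the coupling matrix: a symmetric integer
# matrix with a nonzero diagonal, mimicking bibliographic coupling
# counts. This is illustrative data, not the real file.
rng = np.random.default_rng(0)
n = 30
counts = rng.integers(0, 3, size=(n, n))
counts = counts + counts.T        # symmetrize
np.fill_diagonal(counts, 5)       # a paper always "couples" with itself

# Same steps as with the real matrix: threshold, drop self-loops.
adjacency = counts > 0
np.fill_diagonal(adjacency, False)

G = nx.from_numpy_array(adjacency)
# normalized=False gives the raw coefficient phi(k) for each degree k,
# skipping the randomized null-model comparison.
rcc = nx.rich_club_coefficient(G, normalized=False)
```

The raw coefficient is always between 0 and 1 (it is the edge density among nodes of degree greater than k), which makes it easy to sanity-check.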

Also maybe useful is the code for making the plot:

import pylab
# rcc.keys() is a dict view; convert to a list before np.sort
sortedDegrees = np.sort(list(rcc.keys()))
pylab.plot(sortedDegrees, [rcc[d] for d in sortedDegrees], 'o:')
pylab.xlabel('Bibliographic coupling degree threshold')
pylab.ylabel('Rich club coefficient')

Otherwise, a lot of this was working with data that Deryc had pre-calculated in terms of the number of innovative keywords (e.g. this code uses the file InnovativePapers_Top100_Post2006-editedBCD.csv). So I'm not sure the code would be useful to anyone else unless Deryc is still using the same format to output innovative papers. I'm guessing his workflow has changed since then?

The only other code I imagine could be reused is the cdfPlot defined near the beginning of paper-figures-rich-club-innovation.ipynb, which plots cumulative distributions to determine whether they are statistically distinct. I think @deryc has been doing many more of these plots recently, so he may have a more updated version.
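For reference, here is a minimal sketch of what a cdfPlot-style comparison might look like; the function name, its signature, and the use of scipy's two-sample Kolmogorov-Smirnov test are all my assumptions, not necessarily what the notebook actually does:

```python
import numpy as np
import pylab
from scipy import stats

def cdf_plot_sketch(sample1, sample2, labels=('group 1', 'group 2')):
    """Plot empirical CDFs of two samples and report a two-sample
    Kolmogorov-Smirnov test of whether they differ. This is a
    hypothetical reimplementation, not the notebook's cdfPlot."""
    for sample, label in zip((sample1, sample2), labels):
        x = np.sort(sample)
        # empirical CDF: fraction of the sample at or below each value
        y = np.arange(1, len(x) + 1) / len(x)
        pylab.step(x, y, where='post', label=label)
    ks = stats.ks_2samp(sample1, sample2)
    pylab.xlabel('Citation count')
    pylab.ylabel('Cumulative fraction of papers')
    pylab.legend(title='KS p = {:.2g}'.format(ks.pvalue))
    return ks

# e.g. cdf_plot_sketch(citations_groupA, citations_groupB)
```

A small p-value from the KS test would indicate the two citation-count distributions are statistically distinct, which matches how I understand the plots are being used.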

So in the end maybe it would be best to condense this code down into just one or two useful functions? It seems like these would be useful to review as Deryc continues developing his code. What do you think, @deryc?

deryc commented 4 years ago

Yes, I am fine with providing the example .xlsx file we used. I will clean up some code that identifies the innovations; I still use the same steps from this code, but it is much more efficient now. I use a modified version of cdfPlot in almost all the analyses I have performed recently. I think condensing the code is probably the smartest move, since that makes it more portable in the long run. This will become even more important after the first of the year, when I start adapting my code to new projects using Scopus or MSAcademic.