Review Ticket for Exploring and Analyzing Network Data with Python

walshbr commented 7 years ago

The Programming Historian has received the following tutorial on 'Exploring and Analyzing Network Data with Python' by @jrladd @jmotis and @scottbot. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/exploring-and-analyzing-network-data-with-python

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

walshbr commented 7 years ago

I think this is a really thorough and well-written piece. I learned a lot, and I think our readers will too. I've provided a round of initial and more substantial feedback offline at the authors' request. My one suggestion given there that I would reiterate here was a recommendation to incorporate analysis of the results throughout the lesson rather than predominantly at the end. I think the revisions you made do this nicely, but it might be useful to hear what others think as well.

External reviewers have been found for this lesson, and at this point I feel that the piece is ready for them to take a look at. Let me know if you have any questions!

walshbr commented 7 years ago

The reviewers for this lesson will be @anneschao @liqiwei2000 and @ebeshero - looking forward to talking more about the lesson. Let me know if you have any questions about procedures!

anneschao commented 7 years ago

@jrladd Hi John, Do you mind telling us (Qiwei and I) where we can find the data set for Society of Friends?

ebeshero commented 7 years ago

Yes, please--It looks like these files we're to work with in the tutorial are missing download links: quakers_nodelist.csv quakers_edgelist.csv

@walshbr @jmotis @scottbot @jrladd

walshbr commented 7 years ago

Sorry I missed this - was working from a different local copy than the live staging repository. @jrladd you can place those files in assets/lesson-name. I committed empty files to that effect that you can fill in here - https://github.com/programminghistorian/ph-submissions/commit/27d674caea75247b91ccb54f3e99ce5137b868a3. So you'll just need to overwrite the files that live at these two paths:

/assets/exploring-and-analyzing-network-data-with-python/quakers_nodelist.csv
/assets/exploring-and-analyzing-network-data-with-python/quakers_edgelist.csv

jrladd commented 7 years ago

I'm sorry I missed this message earlier in the week! I'll post the files as soon as I'm back at a computer later today.

jrladd commented 7 years ago

@anneschao @ebeshero I just posted the data in the asset folder that @walshbr described. Happy to make adjustments if that's not working, or if there's anything else you need.

walshbr commented 7 years ago

Thanks @jrladd! I would also link to them from the text of the lesson itself. Paragraph 8 makes a lot of sense. Let us know when you get a chance to do so that the reviewers know they can get going.

jrladd commented 7 years ago

Okay, I added some download links. They take you to the github page for each file, rather than download it directly. @walshbr do you know how to get a direct download link?

walshbr commented 7 years ago

@jrladd try modifying the beginning of the path to be "../assets/" instead of "/assets/" in each case.

jrladd commented 7 years ago

@walshbr So adjusting the path in that way sent me to the same Github page, but I think I was able to get it by adding a link to the "raw" version of the file. @ebeshero @anneschao If you try it now you should be able to save both CSV files locally and work with them. Both links are in paragraph 8.

anneschao commented 7 years ago

@jrladd Qiwei and I have finished reviewing your lesson. Your codes are well written and the instructions and explanations are really great. Our enclosed comments are really more cosmetic, designed to ease a beginner into using Python. Chao and Li comments on Python and Networks.docx

walshbr commented 7 years ago

Thanks @anneschao - would you mind pulling those notes out of the word document and listing them in a comment directly on the thread? GitHub doesn't play so well with things other than plain text, and it'd be good to document the conversation in a way that the GH system can handle.

anneschao commented 7 years ago

OK, will do asap!

On Aug 1, 2017, at 9:58 AM, Brandon Walsh notifications@github.com wrote:

Thanks @anneschao https://github.com/anneschao - would you mind pulling those notes out of the word document and listing them in a comment directly on the thread? GitHub doesn't play so well with things other than plain text, and it'd be good to document the conversation in a way that the GH system can handle.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/programminghistorian/ph-submissions/issues/92#issuecomment-319396099, or mute the thread https://github.com/notifications/unsubscribe-auth/AcW_9U4VBtE8Kws7SiWsGh_jfoJnj4Rgks5sTzz4gaJpZM4OAFfz.

anneschao commented 7 years ago

@walshbr @jrladd Here are our comments, thanks!

In the beginning, please tell us what computer you are using, so that we have an idea of the amount of time it takes to perform the operation.
Also, at the start please point the user to IDLE as a Python workstation.
After that, please let the user know that he/she needs to start a new file or a new module in order to begin.
Line 17, in the sentence “set up a plaintext file in the same directory”: the tutorial linked to this passage does not answer the question and is more confusing.
When you talk about betweenness centrality and state “that betweenness centrality will take longer to calculate than other centrality measures, “ can you give us a ballpark figure on the amount of time it takes to run, same with the time it’ll take to calculate the diameter?
Line 84, For consistency, output “print modularity” as well.
Finally, the title seems to suggest that Python or Networkx will create the visualization, but it does not. So perhaps this should be stated early on so as not to mislead.

walshbr commented 7 years ago

Thanks @anneschao! And @jrladd - if you would, wait until we get the last review in to make changes to the lesson. That way we can make sure that the other reviewer is looking at a consistent version and that we can have any necessary conversation.

jrladd commented 7 years ago

Sounds good, @walshbr. And thanks for the helpful feedback, @anneschao and @liqiwei2000!

ebeshero commented 7 years ago

Just a quick ping to say I'm working on this right now and sorry for the delay! @walshbr

ebeshero commented 7 years ago

@jrladd @jmotis @scottbot @walshbr and @anneschao Thanks for the opportunity to review this really terrific tutorial!

First, a question and a comment for the first reviewer (@anneschao) : Why recommend IDLE in particular for this tutorial? There doesn't seem to be a need to use any particular Python IDE, and the user might well run this out of a Jupyter notebook or wherever comfortable. I did everything in a text editor, though in real life I work with PyCharm sometimes. Still, I don't think the user needs to install anything beyond the pip (or pip3) packages indicated here, and pointing back to the information on Python installation elsewhere in Programming Historian seems sufficient to me. Note: I have Anaconda 3.6 installed and use pip, not pip3 for installing packages, but this didn't really cause me any hiccups in progressing through the tutorial.

I agree with your last point, though, @anneschao, that someone checking out this tutorial might expect to produce a snazzy visualization at the end. I'm glad that this isn't the goal of the tutorial--that makes it quite distinctive, but I think you should tell us up front something about how this is a tutorial about exploring networks as graphs, and that that means exploring graph statistics, and applying them to make statistically significant assertions about historical relationships--that this tutorial will help the historian to see more than what a visualization will be able to show on its flashy own. The tutorial writers do indicate this further along in the tutorial, but I think it would be helpful to see it up front. I had a distinct impression while playing with the Python script that I was exploring a network in the dark without seeing it on the screen--but really that brings up important issues of what we expect to be seeing when we study networks, and for most of us who are new to the subject, we do tend to presume circular nodes and linear edges in pretty clusters or monstrously complicated hairballs. Talk to us about our assumptions for a bit before showing us how to read the stats!

To this goal of giving the reader a better survey of the topics covered, you might also, following the GOALS section, just give us a quick overview of the tools we'll be using: Python with NetworkX to look under the hood at the nuts and bolts of networks.

From here on, I'll key my comments to specific paragraphs and notes:

Paragraph 10:

I'd like to see a broader or more extensive discussion of undirected networks here. You show us a good way to differentiate between undirected and directed, but you might also indicate that undirected networks are applicable to co-occurrence relationships, without implying that one person knows another person at all. (That's an interest of mine, and I imagine it's an interest to other scholars as well.)

Paragraph 43

Connected to the issue with Paragraph 10: Paragraph 43 implies a view of undirected networks as "limited". I think I understand why you use that word, because less information is available about relationships among participants in an undirected graph, but I also tend to think that I'm making fewer assumptions about parties when I'm working with an undirected graph--I may know that some nodes initiated relationships with other nodes, but I don't know that about everyone, and perhaps I'd best study how they cluster based on some known related context. I prefer working with the undirected when I don't want to spend time nailing down every single relationship, and I want to avoid making assumptions about my data set. In that way, perhaps studying an undirected graph offers a more solid basis for conjecturing about relationships in a context where little information is available or what information we have is questionable. Perhaps more explanation would help scholars to decide what approach to take, and what extra steps might be necessary for a directed graph to be sensible.

Here's something more: Rather than implying that an undirected graph seems "limited" because we can't get information about directed flow, how about pointing out how you'll be showing us related alternatives (looking at the connections and paths taken between nodes and clusters) that help us to expose people and relationships pivotal to particular events? Tell us something more about the specific benefits of working with an undirected graph.

Paragraph 36

Suggestion: For a more satisfying or interesting output at this point, how about just concatenating n with the birth year, thus?

print(n, G.node[n]['birth_year'])

Paragraph 40

To avoid repeating "get" twice in the same sentence, I'd suggest this rephrase: "When you start work on a new data set..."

Paragraph 48

I got a little confused here about which graph you were referring to, which was a little silly of me because of course you're referring to the Quaker network graph embedded in the tutorial--usually. Still, because you point us to two different network graph resources here with pictorial graphs (one in Wikipedia, and one from the Six Degrees page) it might help to clarify when you return to the Quaker network graph (I think in sentences 3 and 4 of this paragraph).

Footnote 7 (paragraph 102)

This doesn't seem well connected to its signal placement in paragraph 48: Does this footnote actually belong elsewhere, or is there something missing here?

Footnote 8

This seems a bit of a teaser! Since you show us how to isolate components further on in the tutorial, I wonder if you might just find a place to show us how to get a reading of each component's density after all?

Footnote 13

Another teaser? Why not just show us how to get that global modularity score? It would be pretty helpful to know. :-)

Paragraph 64

Would you explain the difference between estimating triadic closure with the clustering coefficient vs. with transitivity? I hunted around for information and found some cryptic stuff on the web like this: http://homepage.divms.uiowa.edu/~sriram/196/spring12/lectureNotes/Lecture3.pdf And I think if you can work out a cogent explanation for the Programming Historian community (or just folks like me) we'd really appreciate it! (I've worked with clustering coefficient before, and never really understood what the difference is.)

Perhaps more importantly, can we discuss the danger of making assumptions here? The explanation of transitive properties might lead some to make more positive assertions than perhaps we should...that is, this isn't so much about determining whether someone knew someone else as it is calculating a likelihood, and that doesn't seem quite the same thing as a transitive property. I guess I'm just wanting to sound a note of caution here, in case we're tempted to treat our Python script as a calculator.

Paragraph 80

In the last sentence you say "Python will give you the flexibility to explore your network computationally in ways other interfaces cannot." I stopped a moment and wondered about other interfaces and whether this seemed a bit exaggerated. I've worked a lot with Cytoscape, and its network analyzer does actually give me a readout of lots of the network stats you discuss here. I'm aware that R has a pretty powerful data analysis package, too. I think Python has probably emerged as a programming language of choice among DHers and there's lots of great material on Programming Historian about it, so it's probably the toolkit of choice for the community here, but I wonder if you might just open up some possibilities here and indicate a few other paths that might compare with the one you've showed us--just in case there's an R person out there who might be wondering if the same things can be done in other ways.

Thanks, overall, for making lots of network stats concepts clearly understandable! I think this is a really encouraging tutorial for people to work with and adapt to their own data sets, and I especially appreciated how you pointed out multiple paths to visualization, even without making that the focus of the tutorial. I'm looking forward to experimenting more with the programs, software, and methods you shared here.

walshbr commented 7 years ago

Thanks so much @anneschao and @ebeshero for the great and thorough reviews. I appreciate the lively engagement, and feel free to use this thread to engage one another in conversation about any of the points. I'm happy to answer any questions as well. I especially think the suggestions about regulating readers' expectations about the kinds of work and output they'll be getting are useful. Also like the further suggestions for analyzing the results and readers assumptions that came out of the reviewer comments.

I agree that that the discussion of IDLE/IDE's more generally is probably out of the scope of the lesson. My sense of the comments from @anneschao is that introducing IDLE would be a way to guide beginners into the tutorial. Perhaps another way to deal with this, rather than suggesting a particular IDE or setup, would be to spend a bit more time in the prerequisites section elaborating on setup and pointing to other resources at Programming Historian if readers wish to learn more about working with Python (though being careful to note that our tutorials on setting up on Python use Python 2, not 3 as in this tutorial). I don't think we want to recommend any particular development environment unless it is necessary, but we could point people to places where they could learn more about setting one up that works for them.

@jrladd - let me know when you've pushed up your revisions. I'll do one last read through at that time.

jrladd commented 7 years ago

Thanks very much for your comments, @ebeshero! I look forward to coming back to this discussion as we revise, beginning tomorrow. I'll let you all know when a new draft is posted.

I'm sure other questions will come up as we work, but to your final point about Paragraph 80, I do think you're right that we exaggerated a bit. For me, the payoff to using a programming language rather than a GUI is that you can immediately work with the results of your algorithms---sorting them and combining them with other measures/attributes as we do in the tutorial. We should probably explain a little more about what the "ways other interfaces cannot" actually are. Of course, this flexibility of programming languages isn't limited to Python, and I know there are lots of great tools in R and other languages that do many of the same things we do here. @walshbr do you have thoughts about mentioning other languages in the tutorial? I'm happy to point scholars familiar with R in the right direction.

As for IDEs, we also went back and forth on how much background on Python to give and whether we should mention specific tools like IDLE, Jupyter, etc. We're happy to expand this a bit and link to other resources. The Python 2 vs. 3 issue may make this a little tough for readers, so we'll be careful to add a caveat.

Thanks again!

walshbr commented 7 years ago

I think it's fine to point people using R in a particular direction. It sounds like you're already prepared to be careful not to let the discussion get too sidetracked, which is the only concern I would have. I'm happy to take a look at some text if you want feedback on an addition though.

jrladd commented 7 years ago

Hi @walshbr et al., the changes have been made and the document is updated. Let me know what more we can do. Thanks again to all of you!

One quick question re: the node and edge files. We tried to update them to work as download links (so that a reader would only have to click and the download dialog would come up, instead of the file displaying in the browser), but the Markdown converter seems to be changing them back to regular links. It's a minor thing, but do you know a workaround?

walshbr commented 7 years ago

Thanks @jrladd I'll take a look at this early next week.

walshbr commented 7 years ago

@jrladd to your question about downloading - I think that behavior differs from browser to browser. Just tested, and Chrome downloads the files as expected. Safari and Firefox open the files in new tabs in the browser. If you're concerned, you might try modifying the language of your text to something similar to this lesson, which explicitly recommends the user right click and save/download the link. That language will differ browser to browser, but I think people will get the idea. What do you think?

When doing my final editorial check I'll move those data files over to be hosted from the site as per here. But that same issue will persist.

jbartonthomas commented 7 years ago

@jrladd , I tried to do a little bit of digging for you to follow-up. Like @walshbr mentioned, the download attribute just isn't supported in certain browsers/versions. Check out this list: http://caniuse.com/#feat=download (though it seems like at least some of the browsers listed in green here don't actually have the tag implemented).

It might be possible to set the MIME type for a file to force the download behavior you want. After some googling, it did seem that github pages supports this (https://help.github.com/articles/mime-types-on-github-pages/). Again, after googling, I think you'd use 'Content Type : attachment'.

Definitely not suggesting you should try this if you don't want to. I haven't done this before, so I don't know if there are any tricks.

jrladd commented 7 years ago

@walshbr @jbartonthomas Thanks to you both for looking into this. I think if it works on some browsers and not others, that's okay for our purposes. I'll add a sentence that says to right-click and "save link as" to avoid confusion.

walshbr commented 7 years ago

I've got the lesson archived on this repo and moved it over to the production repository. It's living on a pull request at the moment - https://github.com/programminghistorian/jekyll/pull/572. I'll do one last read through in the next couple days, soft launch and notify you to check it out one last time, and then publicize it next week. But if you wanted to clone that repository and switch to the "networks-with-python" branch you could check it now now.

So far, I deleted a stale image file that wasn't being used, updated the assets routes to fit with the usual syntax, renamed the image files to allow for our image naming policy, and filled out all the appropriate frontmatter.

walshbr commented 7 years ago

Preview of the lesson here @jrladd. Would you mind giving it a read to make sure that everything looks fine to you? I'll plan on making it go live early next week, so you'll have the weekend to take a look.

I had a couple formatting things I changed, mostly to put some superscript notes before code blocks. A couple other technical questions:

Am I right in thinking that the input and output don't line up re: the line "From this statement, you’ll get a line of output for each node in the network. It should look like a simple list of years." Output in terminal is names and years, not just years, which leads me to believe that the print statement in the loop should actually just be print(G.node[n]['birth_year']) instead of print(n, G.node[n]['birth_year'])

Also last thing - I was getting an error when using community/louvain. This line - communities = community.best_partition(G) Looks like it was a version compatibility issue according to this link. And one that can be fixed by modifying the installation line to specify a specific version:

pip3 install python-louvain==0.5 What do you think? If those two fixes sound right to you I will modify accordingly, push up the final version, merge it early next week, and publicize.

jrladd commented 7 years ago

Thanks, @walshbr. It looks good to me!

To your questions:

Good catch! I updated the code on the recommendation of the reviewer and forgot to update the results or the prose. It should read, "It should look like a simple list of names and years:"

And the result should be:

Anne Camm 1627
Sir Charles Wager 1666
John Bellers 1654
Dorcas Erbery 1656
Mary Pennyman 1630
Humphrey Woolrich 1633
John Stubbs 1618
Richard Hubberthorne 1628
Robert Barclay 1648
William Coddington 1601

Yes, thanks for catching the compatibility issue. That update sounds good to me.

I'll let you know if anything else comes up, but from my look just now I think we're ready to go!

walshbr commented 7 years ago

Lesson is merged and now live. Congratulations, all! I'll be publicizing on twitter shortly. Let me know if any problems come up by filing an issue on our main repository. Thanks for all of your work.

walshbr commented 7 years ago

And the last thing - be sure to promote your hard work! Looks like there is already some of that going on on twitter, but I'd encourage you to promote it, use it in your work, reference it in publications and on blog posts, etc. It's been fantastic working with you all, and I hope the process has been as productive for you as it has been for me.

MrMethodmanMAN commented 7 years ago

Hi Guys,

This looks like an interesting and comprehensive analysis, I am interested in implementing the code to see how it works, but I am having trouble finding quakers_edgelist.csv. Can you post a link to the data ?

walshbr commented 7 years ago

Glad you're finding it useful! @MrMethodmanMAN - the live version of the site with links to the code is at https://programminghistorian.org/lessons/exploring-and-analyzing-network-data-with-python. The link to the edgelist is here - https://programminghistorian.org/assets/exploring-and-analyzing-network-data-with-python/quakers_edgelist.csv

I'll close your other issue since it seems to duplicate this one, but let me know if you have any other questions.

MrMethodmanMAN commented 7 years ago

HI Brandon,

No it is the same issue, I only noticed the issues bottom after seeing and felt that it was the more correct channel. Thanks again for the link. Great tutorial, it is very much appreciated!

programminghistorian / ph-submissions