Review Ticket for "Correspondence Analysis for Historical Research with R"

mdlincoln commented 7 years ago

The Programming Historian has received the following tutorial on "Correspondence Analysis for Historical Research with R" by @greebie. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/correspondence-analysis-in-R

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I will provide initial feedback here before inviting the reviewers to comment.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

greebie commented 7 years ago

One formatting issue I am having is that Github does not render Mathjax or any nice formula-producing syntax. I do not think the formulas are mandatory, but they do make the document look more professional. Has anyone in the community encountered this issue before and how did you handle it?

mdlincoln commented 7 years ago

Good question. I don't believe we've yet had a lesson that needed to use mathjax, but if it looks like we'll really want it for this one, then we might consider adding the necessary javascript to the Programming Historian site. For now, please include all the equation syntax you originally intended to use, and lets see how editing and revisions go. If we do need to add in mathjax, we can incorporate it with our upcoming general redesign.

greebie commented 7 years ago

Okay. Another option is to add an image for the formulas. If PH considers more tutorials in R, it may make sense to have some math capability. In general, I find equations to take away from a paper, but on the other hand, I wouldn't want to be standing in a VIVA saying "oh I don't know how the statistics work, I just let R do it for me!"

mdlincoln commented 7 years ago

Actually, on closer inspection, it looks like we do have mathjax linked to in our HTML headers... but the implementation may be out of date as I see they've just switched their hosting service. Let me look in to fixing it on our home site as well as on the ph-submissions version.

greebie commented 7 years ago

Okay - assuming things are running as desired, I am willing to draft a quick Mathjax cheatsheet for PH. (So many flavors to get mathy things in a md.) Thanks.

mdlincoln commented 7 years ago

OK, I've made two changes:

I enabled mathjax on the ph-submissions repo (it was active on our published site, but not this one)
I replaced your $...$ notation with \$...\$ according to http://docs.mathjax.org/en/latest/start.html#tex-and-latex-input

mdlincoln commented 7 years ago

@greebie On Line 33, the image link seems to be broken.

greebie commented 7 years ago

Apologies. Minor error in my fix which is now fixed!

mdlincoln commented 7 years ago

I only have one editorial note before sending this off to reviewers.

In the introduction, you note "You would like to have a nice 2 x 2 matrix to show these relationships in some non-confusing manner." Most of our readers won't know what a 2 x 2 matrix is... in fact, I am not entirely sure what you meant here either. Is it a plot? A table with numbers?

Clearly state the research results of doing CA here, so that our reviewers have a solid idea of what you hope to demonstrate in the lesson.

greebie commented 7 years ago

Thanks Matthew. I have included slightly different language to describe the "2 x 2 matrix" idea. The 2 x 2 is popular in business administration literature where correspondence analysis is used quite frequently. In this case I called it a "2 dimensional plot."

mdlincoln commented 7 years ago

So, a plot with an x and a y axis?

greebie commented 7 years ago

Yes, but with high emphasis on the meaning and interpretation of the four quadrants, thus "matrix." That said, I don't emphasize interpreting quadrants in this tutorial, so your point is well taken.

Maybe more information on quadrants analysis in the tutorial? The thing about CA is that interpretation is multi-dimensional as well! Part of what makes it so powerful imho!

Thanks so much for all your help and work here.

mdlincoln commented 7 years ago

OK, that does help clarify.

I think we're in good enough shape to send it off to reviewers, so I'll begin arranging them. Once I have both reviewers set, I'll add an update here with the target deadline for them to submit their formal reviews. As mentioned in our contribution process, please wait until we get both formal reviews in before beginning any revision work. I will write up my own synthesis of the reviews and outline the work that looks like needs to be done. And then we'll proceed from there!

greebie commented 7 years ago

Thanks Matthew. Look forward to hearing from the reviewers.

greebie commented 7 years ago

Just flagging that its probably better to include a copy of the json data to programming historian for long-term sustainability. That will require some editing in the text as well.

mdlincoln commented 7 years ago

@statsmaths has agreed to serve as one of the peer reviewers for this lesson 🎊 Taylor will be submitting a review by July 1st.

I'll post on here again once I've arranged the second reviewer.

mdlincoln commented 7 years ago

@sandravg has agreed to serve as our other peer reviewer for this lesson 😸

Full disclosure: Sandra and I work together at the Getty Research Institute on a research project.

As a reminder of our process: We will not do any active editing of the lesson until both Taylor and Sandra have had a chance to submit their reviews, and I have written my own summary of the two formal reviews and any other comments logged by the community. Once I have written my summary, we ask the community to hold off on further notes until @greebie and I have worked through the revisions.

statsmaths commented 7 years ago

This tutorial does a nice job of presenting and motivating the technique of correspondence analysis. I personally have not seen it used very much in the humanities but it has the potential to be a powerful technique for historians and other humanists. Here are some comments for possible revision:

The data should probably be uploaded to the programming historian site rather than your own GitHub account. I believe that is the usual standard for PH unless there is a good alternative.
I do not see the pedagogical benefit of making users parse the JSON files. Could you not just start with the datasets harper_df and trudeau_df? It is useful to show users how to make abbreviated labels and converting data frames to tables as both of these will be very common roadblocks for applying this type of analysis. Creating data frames from JSON, however, is less likely to be a specific issue for correspondence analysis (though it could be a great separate tutorial).
The mathematical details from paragraphs 22 to 32 are relatively clear and do a good job of describing the high-level motivations without alienating users unfamiliar with the underlying linear algebra. I find its placement a bit confusing, however, because (1) you are showing R output from objects we have not created yet and (2) most of these steps are actually done automatically by the factoextra library but this is not clear from the text. One way to address this would be to move this section after the section "Correspondence Analysis of the Canadian Parliamentary Committees 2006 & 2016." and adding "(Optional)" the section heading.
The level of R required for this tutorial is reasonable (though, as mentioned above, it would be more accessible if you only worked with data frames and tables and avoid JSON and lists). I'm not sure about the prerequisite "some statistical knowledge (an understanding of chi-squared tests would be especially helpful)". Is this actually required and is it reasonable to assume for readers of Programming Historian?
Throughout the text you link to many digital resources. This is great and makes it easy for users to find these resources. Where applicable, such as other PH submissions and R packages, these should also be accompanied with full endnote citations.

Overall I think this is will be a very useful tutorial. The application to comparing the two governments in the Canadian parliament are excellent and does a wonderful job of illustrating the power of this technique.

greebie commented 7 years ago

Thanks a million, Taylor not only for the helpful feedback, but also for taking the time to review. The suggested revisions all seem logical to me. I like the idea of an "optional" section, perhaps even an appendix to include the math stuff. Thanks for the comments about the parliaments cases.

mdlincoln commented 7 years ago

@statsmaths Thanks so much for your comments. @greebie just a reminder, we'll hold off on doing any edits to the draft until @sandravg has a chance to post her own review. Once she has submitted, I'll formulate my own summary and then we can get to work.

mdlincoln commented 7 years ago

I've been in touch with @sandravg and we confirmed that she will submit her review here no later than July 15th.

greebie commented 7 years ago

Thanks for letting me know! Look forward to your response, @sandravg !

sandravg commented 7 years ago

@greebie, everyone, here my review, thanks for your patience!

Great lesson that I think it will be very useful for humanists. I appreciate how it demonstrates that working with categorical data can be extremely rich for analysis. Here my comments and suggestions:

The links throughout the lesson are very helpful, and connect to various topics at different levels of expertise, so they constitute a great resource for readers.
Paragraph 13: this paragraph could perhaps make more explicit to other historians why CA is suited for the question and the data (vis-à-vis other methods) because, for instance, you are not just interested in who goes into what committee (your last sentence), but in the relationship (or lack thereof) between committees (therefore agendas) by the (dis)similarity of their member MPs. Perhaps another placement for paragraph 13 is after paragraph 17. I say this because this section would provide the opportunity for readers to fully connect the contents of this paragraph and CA with the nature of their data and the questions they are asking.
I find the explanations of the method and the underlying transformations very clear. Connected to the idea that knowledge of chi-square tests is useful but not really addressed in the lesson, perhaps it could make sense to bring attention to the chi-square test of independence in the output presented in paragraph 49.
I saw two typos in the code: paragraph 42 (the “]” at the end), and paragraph 48 (CA_Harper instead of CA_harper).
Paragraph 50: I really really like this because this is exactly what usually happens and tends to be discouraging, especially for those initially approaching these methods, and you guide the reader through the “what now?”
Paragraph 58: I would specify ESPE for Equal Pay because it can get confusing without familiarity with the CPC names and acronyms. I’m not completely clear about what is the final set of CPCs used for the Trudeau graph. What I followed: ESPE was included along the other 3 in place of JUST, INAN and FAAE, but I get a no cross-relationship for ETHI and when I exclude it I get other results. I wonder if that would end up being the case for other readers.

The following have to do with parts in the interpretation that read a bit unclear to me, so I point them out in case others not familiar with the context also face similar questions:

Paragraph 58, when you say “that makes some sense given the nature of Violence against Indigenous Women versus Equal Pay”: As someone new to the context of your data, it is not evident to me what you are referring to by “nature”-- I am guessing it’s their recent creation, so that they would be populated by new/different MPs, therefore no connections encountered? But I am left wondering if you meant something else, like some sociopolitical event that I am not familiar with. It would be helpful to clarify, so readers like me can follow your explanations very closely while considering what the data looks like.
In keeping with the research question formulated, and clarity given the many different acronyms to keep track of, I wonder if it would be an idea to again formulate it in terms of FEWO’s relationships (like you did in paragraph 52), so when you propose to take some CPCs and add others for the Trudeau case, then it is not a arbitrary decision but is in keeping with the working hypothesis of dependence or correspondence between FEWO and other big ticket CPCs.
Although not a result of CA, it stands out that filtering the data to members in more than one committee results in much less data points for Trudeau than for Harper, in itself quite interesting in light of your exploratory questions and initial motivation.
Paragraph 61: I understand that the interpretation is based on the alignment with the dimensions, but is there also something to be said about the their position along the axes, so that FEWO has more in common to HESA than to CIMM or FINA? Finally, when you say “the evidence is not strong,” it seems to want some qualification. For someone who comes to this method anew, what indicates a strong relationship in such a context? Perhaps point to something specific in this or a previous graph, or refer to the CA summary output? Unless you conceived it more of an overall conclusion from the preceding sentences, in which case then I agree that the point is that it deserves further research.

Thank you, I think this lesson is a great demonstration of the method, made very accessible with an example that really showcases its advantages (and how to overcome some challenges)!

greebie commented 7 years ago

Thank you so much @sandravg for your very helpful and thorough review! A lot here that can really improve the lesson. I want to get right at it but I know that @mdlincoln is going to tell me to wait until he provides his summary. :)

mdlincoln commented 7 years ago

Yes I will, @greebie 😝

Thanks so much @sandravg for all your thoughts, and thanks again to @statsmaths. I'm going to post up a synthesis of everything no later than next Friday (2017-07-14), and then we can proceed.

mdlincoln commented 7 years ago

These are two great endorsements of your lesson, @greebie, with some excellent advice on how to develop it. Here are my takeaways after reading through both reviews:

Given the complexity of CA and the nuance of the interepretation secion, I agree with Taylor that you should start with the data already in tabular format. Users who looked at the network diagrams lesson on PH can learn more about varieties of data serializations for nodes and network ties.
Both Taylor and Sandra commented on the framing of the statistical explanation. I would strongly consider Taylor's recommendation about relocating the discussion so that the lesson makes a better logical flow. Similarly, for the chi-square test, I believe you could remove reference to it as Taylor suggests, or, alternatively, incorporate it more meaningfully into the analytical section at the end, as Sandra suggests. I have a slight preference for the latter.
Sandra's note about line 13 is a good point. Try to balance explaining this particular dataset and research question with a more cogent abstraction/generalization that explains how this approach can be more broadly useful. This goes not only for introducing CA, but also for each of the interpretive steps (e.g. "inertia" means something for the CPC data - but what can it mean in general?)
Sandra made some particularly substantive comments about the interpretation section. This might require the most additional work on your part, however I think it would benefit the lesson immensely if you would consider each of them, particularly 1, 2, and 4. I believe this would go a long way to abstracting out the interpretation that you're making on this particular dataset to the way that the same interpretive acts could be used by historians of any specialty.
I also agree with Sandra that explaining the CPC acronyms would be a welcome clarification - maybe make a little markdown table when introducing the dataset, so that we know what the letters stand for?
Taylor suggested adding an endnotes section for all the links you point to. This is not a required practice for PH lessons, so only add this if you feel you would like to - although for R packages in particular, including their correct citations (as output by citation()) as endnotes would be good. You do not need to add full citations of other PH posts.

I think we're in a very good position here. Although these comments will require a bit of reordering and rewording on your part, the backbone of the lesson is quite strong. Take some time to revise according to these points, and then I'll do a closer read-through with an eye towards wording/proofreading.

We ask authors to submit revisions within 4 weeks - for this lesson, 6 August. Please let me know if that timeline sounds doable.

greebie commented 7 years ago

Thanks Matthew. 4 weeks seems reasonable. I usually like to front-load this sort of thing, so you may get a revised draft earlier than that.

greebie commented 7 years ago

Hi Matthew. Please bear with me as I work through the changes, but for my own organization, I thought I'd put the notes in as I work on them.

As requested, I placed the datasets in csv / tabular format and placed them in assets/correspondence-analysis-in-R/. In the code, I refer to what I imagine would be the correct url once it is published.
I moved the mathematical explanations to the appendix as recommended and refer to it under "What is Correspondence Analysis?" I included an explanation of what the chi squared test for independence means for the dataset. In general, the chi squared results are not particularly meaningful because MPs are assigned to CPCs in an evenly distributed manner. However, I did provide an example of where we might see a low p-value in the chi squared and why it would occur.
I spent some time considering ways to provide more general abstractions for correspondence analysis. The approach I took was to provide additional examples and abstract from there. Again, I kept the numbers in the appendix, but included an overview of inertia.
As suggested, I elaborated more to address some of Sandra's ideas with an eye to expanding the method to more general historical uses.
I provided a markdown table to match the abbreviations to the committee names as suggested.
I included the R packages in the endnotes for the libraries as suggested.

Please give me an extra few days to look over the lesson and fix any new inconsistencies that may have occurred while working on the revisions. I'll post another comment once I think its ready (probably tomorrow)!

greebie commented 7 years ago

Hello Matthew,

I think I have addressed all the changes and fixed inconsistencies caused by the revisions. It is ready for your review now, I believe.

Thank you.

Ryan. .

mdlincoln commented 7 years ago

I'm working on some much closer line-by-line notes now, but I can't get the TrudeauCPC.csv to work - the column names in the csv do not match up to those that you call in the R code. Update the csv when you have a moment.

mdlincoln commented 7 years ago

In addition to the column names, I think the Trudeau data have changed a bit as well - I'm getting very slightly different results than what you show in the lesson. Run everything again and either update the data, or update the plots and text to make sure that they're consistent with one another.

mdlincoln commented 7 years ago

FYI note that I updated the figure syntax with 808e7a353ba3bbc22362d8eaec0e8688d9cbe043 to follow our author guidelines - you'll see that you just need to point to the specific filename, and include a caption.

mdlincoln commented 7 years ago

@greebie you need to rerun all your code from the updated data - I'm now getting the following even on the Harper data, not just the Trudeau data:

harper_df2 <- harper_df[which(harper_df$abbr %in% 
                                 c("HESA", "JUST", "FEWO", "INAN", "FINA", "FAAE", "IWFA")),]
harper_table2 <- table(harper_df2$abbr, harper_df2$membership)
harper_table2 <- harper_table2[, colSums(harper_table2) > 1]
CA_Harper2 <- CA(harper_table2)

Error in eigen(crossprod(X, X), symmetric = TRUE) : 
  infinite or missing values in 'x'

Once you've updated the lesson, I'll be able to finish my comments.

greebie commented 7 years ago

Okay - I'll take a look. Should be done by tomorrow.

greebie commented 7 years ago

Hi Matthew. The issue was that I forgot that R saves strings as Factors by default. A confusing, but easy to resolve problem. There was also a problem with labelling in the Trudeau data. I tested on my system and got the same answer as before. Note that sometimes the labels will shift around depending on the size of the screen. The points should be in the correct spots however.

I also included code to reproduce the working Trudeau graph, rather than just asking people to "replace x with y." Most people will just want to cut and paste I assume.

Also, the links in the code refer what will be the likely url once the item is published, so the uri will need to be replaced with the ph-submissions uri instead.

mdlincoln commented 7 years ago

stringsAsFactors!!! shakes fist

I'll take a look in the next day or two and then post my notes.

mdlincoln commented 7 years ago

I'll post my full comments right after this, but re: the differences in the plot - it's not an issue of label layout. On my machine at least, the Trudeau data is producing inverses of the plots you've posted. The points are all in the same relative positions, but CA has somehow decided to invert the signs on its multidimensional scaling. I've posted the exact results I get from re-running your code (I even added a set.seed(100) command - although I don't think CA is stochastic, so that shouldn't affect anything?) The Harper visualizations are just fine.

Is it possible we are running slightly different R package versions? I've posed the report of my devtools::session_info() as well.

# import the libraries:
library(FactoMineR)
library(factoextra)
#> Loading required package: ggplot2

# read the csv files

set.seed(100)

harper_df <- read.csv("http://programminghistorian.github.io/ph-submissions/assets/correspondence-analysis-in-R/HarperCPC.csv", stringsAsFactors=FALSE)

harper_table <- table(harper_df$abbr, harper_df$membership)

harper_table <- harper_table[,colSums(harper_table) > 1]
CA_harper <- CA(harper_table)


trudeau_df <- read.csv("http://programminghistorian.github.io/ph-submissions/assets/correspondence-analysis-in-R/TrudeauCPC.csv", stringsAsFactors=FALSE)
trudeau_table <- table(trudeau_df$abbr, trudeau_df$membership)
trudeau_table <- trudeau_table[,colSums(trudeau_table) > 1]
CA_trudeau <- CA(trudeau_table)


fviz_ca_biplot(CA_harper, repel=TRUE)

fviz_ca_biplot(CA_trudeau, repel=TRUE)


#include only the desired committees
# HESA: Health, JUST: Justice, FEWO: Status of Women, 
# INAN: Indigenous and Northern Affairs, FINA: Finance
# FAAE: Foreign Affairs and International Trade
# IWFA: Violence against Indigenous Women

harper_df2 <- harper_df[which(harper_df$abbr %in% 
                                c("HESA", "JUST", "FEWO", "INAN", "FINA", "FAAE", "IWFA")),]
harper_table2 <- table(harper_df2$abbr, harper_df2$membership)

# remove the singles again
harper_table2 <- harper_table2[, colSums(harper_table2) > 1] 
CA_Harper2 <- CA(harper_table2)


trudeau_df2 <- trudeau_df[which(trudeau_df$abbr %in% 
                                  c("HESA", "JUST", "FEWO", "INAN", "FINA", "FAAE", "ESPE")),]
trudeau_table2 <- table(trudeau_df2$abbr, trudeau_df2$membership)
trudeau_table2 <- trudeau_table2[, colSums(trudeau_table2) > 1] # remove the singles again
CA_trudeau2 <- CA(trudeau_table2)
#> Error in eigen(crossprod(X, X), symmetric = TRUE): infinite or missing values in 'x'

trudeau_df3 <- trudeau_df[which(trudeau_df$abbr %in% 
                                  c("HESA", "CIMM", "FEWO", "ETHI", "FINA", "HUMA", "ESPE")),]
trudeau_table3 <- table(trudeau_df3$abbr, trudeau_df3$membership)
trudeau_table3 <- trudeau_table3[, colSums(trudeau_table3) > 1] # remove the singles again
CA_trudeau3 <- CA(trudeau_table3)

Session info

``` r devtools::session_info() #> Session info ------------------------------------------------------------- #> setting value #> version R version 3.4.1 (2017-06-30) #> system x86_64, darwin16.5.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> tz America/Los_Angeles #> date 2017-07-21 #> Packages ----------------------------------------------------------------- #> package * version date source #> assertthat 0.2.0 2017-04-11 CRAN (R 3.4.1) #> backports 1.1.0 2017-05-22 CRAN (R 3.4.1) #> base * 3.4.1 2017-07-10 local #> bindr 0.1 2016-11-13 CRAN (R 3.4.1) #> bindrcpp 0.2 2017-06-17 CRAN (R 3.4.1) #> bitops 1.0-6 2013-08-17 CRAN (R 3.4.1) #> cluster 2.0.6 2017-03-10 CRAN (R 3.4.1) #> colorspace 1.3-2 2016-12-14 CRAN (R 3.4.1) #> compiler 3.4.1 2017-07-10 local #> datasets * 3.4.1 2017-07-10 local #> devtools 1.13.2 2017-06-02 CRAN (R 3.4.1) #> digest 0.6.12 2017-01-27 CRAN (R 3.4.1) #> dplyr 0.7.1 2017-06-22 CRAN (R 3.4.1) #> evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1) #> factoextra * 1.0.4 2017-01-09 CRAN (R 3.4.1) #> FactoMineR * 1.36 2017-06-15 CRAN (R 3.4.1) #> flashClust 1.01-2 2012-08-21 CRAN (R 3.4.1) #> ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.1) #> ggpubr 0.1.4 2017-06-28 CRAN (R 3.4.1) #> ggrepel 0.6.5 2016-11-24 CRAN (R 3.4.1) #> glue 1.1.1 2017-06-21 CRAN (R 3.4.1) #> graphics * 3.4.1 2017-07-10 local #> grDevices * 3.4.1 2017-07-10 local #> grid 3.4.1 2017-07-10 local #> gtable 0.2.0 2016-02-26 CRAN (R 3.4.1) #> htmltools 0.3.6 2017-04-28 CRAN (R 3.4.1) #> knitr 1.16 2017-05-18 CRAN (R 3.4.1) #> labeling 0.3 2014-08-23 CRAN (R 3.4.1) #> lattice 0.20-35 2017-03-25 CRAN (R 3.4.1) #> lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.1) #> leaps 3.0 2017-01-10 CRAN (R 3.4.1) #> magrittr 1.5 2014-11-22 CRAN (R 3.4.1) #> MASS 7.3-47 2017-02-26 CRAN (R 3.4.1) #> memoise 1.1.0 2017-04-21 CRAN (R 3.4.1) #> methods * 3.4.1 2017-07-10 local #> munsell 0.4.3 2016-02-13 CRAN (R 3.4.1) #> pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.1) #> plyr 1.8.4 2016-06-08 CRAN (R 3.4.1) #> purrr 0.2.2.2 2017-05-11 CRAN (R 3.4.1) #> R6 2.2.2 2017-06-17 CRAN (R 3.4.1) #> Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.1) #> RCurl 1.95-4.8 2016-03-01 CRAN (R 3.4.1) #> rlang 0.1.1 2017-05-18 CRAN (R 3.4.1) #> rmarkdown 1.6 2017-06-15 CRAN (R 3.4.1) #> rprojroot 1.2 2017-01-16 CRAN (R 3.4.1) #> scales 0.4.1 2016-11-09 CRAN (R 3.4.1) #> scatterplot3d 0.3-40 2017-04-22 CRAN (R 3.4.1) #> stats * 3.4.1 2017-07-10 local #> stringi 1.1.5 2017-04-07 CRAN (R 3.4.1) #> stringr 1.2.0 2017-02-18 CRAN (R 3.4.1) #> tibble 1.3.3 2017-05-28 CRAN (R 3.4.1) #> tools 3.4.1 2017-07-10 local #> utils * 3.4.1 2017-07-10 local #> withr 1.0.2 2016-06-20 CRAN (R 3.4.1) #> XML 3.98-1.9 2017-06-19 CRAN (R 3.4.1) #> yaml 2.1.14 2016-11-12 CRAN (R 3.4.1) ```

mdlincoln commented 7 years ago

You have done a good job of addressing many of the reviewers comments. I still think you could clarify both the introduction as well as the interpretation section a bit more, particularly in order to address Sandra's critiques.

I'll first offer some general comments, and then some section-by-section notes. Please don't be disheartened by the length of this review! Most of the information is already present in the lesson, but I think some further reorganization and reframing of key concepts will really perfect this tutorial.

General comments

Overall, be sure to organize this lesson so that you present simple, easy-to-grasp ideas first, and slowly build to more complex and nuanced questions. I'll address this more in the section-by-section notes, but that will be a key theme of all my suggestions.
Be attentive to terminology for "categories" vs. "groups". Since we're talking about categorical variables, it's easy to get mixed up between the two categories/variables being compared (in your case, politicians and committees), and the "groups" (or "levels", in R terminology) within each of those categories (FEWO vs. HESA vs. INAN, for example). I leave it to you to pick the terms you want to use for each of these concepts, but make sure to define them explicitly early on, and then use the same terms consistently for the remainder of the lesson. I know that usage of these terms in the wilds of software and statistics documentation is also really varied (you could even warn the reader about this in a footnote!) but at least we can keep things stable within this one lesson :)
You highlight three different statistics when discussing how to interpret the results: explanatory value, p-value, and inertia. It will help to more clearly state the definitions of each of these. This may happen explicitly when you first dig in to the summary results of the first CPC CA results... but they need to be hammered home repeatedly.
- p-value: the chance that the observed relationship mistakenly rejects the null hypothesis of "no relationship between the two categories" (aka a way to judge whether or not there exists some significant relationship - of whatever power/explanatory value - between the categories [runs and hides from the statisticians])
- explanatory value: the strength of that modeled relationship
- inertia: a measure of individual nodes' eccentricity or distinctiveness. Important to highlight: unlike the first two measures, which assess the usefulness of a given CA overall, inertia is specific to individual observations. This in particular needs some more careful explanation, since it's distinct to CA.
I like the approachable tone of the lesson, but be careful not to veer into glibness. You don't want to come off (however unintentionally) as patronizing or oversimplifying. For example, at one point you define multivariate exploratory data analysis by writing: "In regular English that means 'data that happened for a number of interesting reasons and we want to use statistics to find out why.'" This could be rephrased in a way that is more directly informative about why this data analysis is called both "multivariate" and "exploratory" (neither of which directly show "why", for what it's worth - so this needs a more accurate explanation, not just a rephrased one.) Other examples that could be toned down or removed:
- "The ‘table’ command is magical."
Make sure that any inline references to R functions or objects (e.g. table, colSums, CA_harper, CA) are wrapped in backticks, so they display as fixed-width text when rendered. This helps signal to the reader that those words are used in R specifically.

I'll structure the remainder of my comments by subheading.

Introduction

I think we can tighten the introduction more. My suggestion is to take the first sentence of line 2 and make it your opening line, and then immediately present types of categories. Eg:

Correspondence analysis (CA) produces a two or three dimensional plot of based on relationships among two or more categories of data. These categories could be “members” and “clubs,” “words” and “books”, or “countries” and “trade agreements”. CA is a method to calculate and visualize the nature and strength of these relationships, helping identify which values of one category correspond to which values of another.

That bolded language is my own text, which you could use directly, or reword as you like. I think variations on that core idea belong not only in the intro, but should be sprinkled evenly throughout the text. It's the biggest takeaway of what CA is, and why someone would want to use it.

In paragraph 2, you note that CA can relate two or more categories of data. This lesson just looks at two at a time - a good choice, I think! But it would be useful to note explicitly that this lesson will only look at 2-dimensional CA. You might add a footnote stating that, while you won't be exploring n-dimensional CA in the lesson, it is mathematically permissible. (And if you know of a link somewhere that someone could go to read more about n-dimensional CA, by all means add it in.)

For the sake of standardization, I think you can replace all but the first mention of "correspondence analysis" with "CA" throughout the lesson.

Pre-requisites

I still agree with Taylor that it isn't accurate to say that a familiarity with chi squared tests is helpful for this lesson - you can probably remove this. I think it may be unnecessarily off-putting to some readers.

What is Correspondence Analysis

This is going to be one of the most important sections of the whole lesson, so I want to be particularly attentive to how you introduce CA.

I'm nervous about foregrounding the spatial example first, as it's not the most intuitive way of using CA. I understand that you wouldn't be directly using spatial coordinates in that use of CA - but it took me a few reads through the lesson to understand what you meant.

Why not work through the 3 examples (or 2 of the 3 examples) that you give in the introduction - members/clubs, words/books, countries/trade. Stick to basic relationship questions first (e.g. "Which clubs share the most common members?", "Do books cluster in to discrete genres based on shared terms?") before getting into higher-order caveats and applications like language bias, or inferring spatial relationships. Only after reading very basic explanations of what CA does in each of these contexts will users of these lessons be able to understand more complex tradeoffs and applications.

In fact, given that you end this section by saying "Perhaps it is easier to just show ..." Yes, in fact, it is! Show the CPC plot first, and explain the abstract concept that blue is one group, red the other, spatial arrangement suggesting their correspondence, then walk through other examples, which will seem much more concrete after grasping the simple-to-understand MP/CPC membership relationship.

FWIW, I would consider entirely removing the pseudo-spatial-analysis example. While I get the gist of the approach, it seems a potentially tricky application of CA given that you'd not be encoding or testing against actual coordinates, or even shared boundaries. If there's some interesting published example of using CA in this way, maybe it's worth linking to as further reading, rather than offering it as an offhand example? This, and the extended caveat about language bias in the books/words example, are rather advanced CA issues that would fit better at the end of the lesson, rather than in the introduction. Stick to core concepts at the start, and once readers see you work through the CPC question, you can briefly outline these issues as opportunities for further consideration and investigation by the readers.

In general, a CA will plot the two most important factors.

If I understand this correctly, this statement is potentially misleading, as the axes are composite weightings of constituent variables, not "factors" in and of themselves. Can you clarify this?

The final paragraph is a great chance to reiterate the meaning of CA in the context of the CPC data: committees closer together have more similar membership, and (conversely) MPs closer together share more common committee posts. That's the core idea of CA - let's repeat it until we're blue in the face :)

Another small issue in the last paragraph: points being "in the same quadrant" of a two-dimensional plot is not the significant point here - rather, what is important is when the are close to one another. Saying "quadrants" may mislead a reader in to thinking that the quadrants and axis intercepts have a special significance, when I don't believe they do - outside of the inertia property, which has bearing on distance from the axis, rather than the literal quadrant a point falls in. (Please clarify if I'm wrong about this!)

Canadian Parliamentary Committees

This section is a good setup of the research question - it just needs a transition, either at the start of this section, or at the end of the preceding one, noting that you're now going to explain the research source and questions you'll be exploring for the remainder of the lesson.

The table is great - though I note that it doesn't display very nicely on our site right now! I'll make sure we update our CSS so the table looks handsome.

The Data

You present three different links, without noting that the reader doesn't actually have to download any of the files attached, as they'll directly connect to those links via R. At the end of the first paragraph, you also say "Here is a sample of the data for the first session of Stephen Harper's government:"... and then nothing? I think this just needs to be tidied up to fit the changes you made by adding the already-tabular data formats.

You also need to explain the tables more straightforwardly. Just note that the rows are committees, the columns MPs, and the cells show a 1 when the MP is a member, and a 0 when they are not.

The sentence "Now that you have seen a correspondence analysis, it is time to show how to create and interpret the findings" seems like an artifact from the reorganization of the lesson, and should be deleted.

Setting Up R for Correspondence Analysis

I think you meant to say "Pop the data", not "them" (the libraries), in to R.

Unless I'm mistaken, I believe you do not need to install curl for read.csv to pull from an http url - I tried it using just base R and could read the data fine. Remove this call.

Warn users that installing FactoMineR and its many dependencies for the first time may take five to ten minutes even on a fast machine, just so they know not to worry while it compiles all that C code.

In that vein, add a comment line to the code noting that you only need to run install.packages() the first time you try this lesson. When running code later, you only need to import the packages with library(). (Make this short - you don't need to explain the entire R package ecosystem.)

Correspondence Analysis of the Canadian Parliamentary Committees 2006 & 2016.

Quite clear - just take note of the issues about using backticks that I suggested above.

Interpreting the Correspondence Analysis

This could be a good section to do as I noted above, and highlight (literally! boldface!) explanatory value, p-value, and intertia - and end with a bulleted list of the terms and definitions, just to drive it home.

When you say "two steps" - what is a step here? A standard deviation?

When you show an excerpt of the summary() results, you include all these eigenvalues, but you don't talk about them. I'd remove them, and only show the line that discusses the p-value

You write The null hypothesis is that the "two factors are mutually-exclusive categories" - I think this needs to be reworded so it is easier to understand that this means the null H. is that the two categories have no relationship to one another.

Did Trudeau Expand the Agenda for Women’s Equality in Parliament?

Explain a bit more fully that you are tightening the focus of your research question now. If I understand correctly, instead of looking for clustering in CPC membership writ large, which is what we started out doing, you will focus specifically on the interrelationships of a subset of seven different committees.

I note that you put 4 committees in the comments, but the desired committees found with the %in% operator number 7 - I think you need to update the comment?

Remove the text "(the same one as at the beginning!)" - for a moment I thought you were saying the graph was unchanged, when really all you meant was that this picture was used in the intro example.

Include the call summary(CAPHarper2) and its abbreviated results as a reminder to readers of how they access the p-value, which you discuss after showing the CA factor map.

When you say "Roughly, we could say that economic concerns fall to the right of the y-axis and social concerns fall to the left" you need to explain why that's interesting.

In paragraph 53, you say "That is a result in and of itself." I think it may be to you... but not to the reader! You need to explain explicitly that what you're trying to say is that (if I understand correctly) in the Trudeau government there is no overlap at all between the CPCs related to women's rights and equality, and the "powerful" CPCs of foreign affairs, Justice, and Finance - and by extension, that the members of the women's rights/equality committees are sequestered from real power centers in parliament.

In paragraph 54 you show the results of another filtering of the data - you must include the code to produce this.

Paragraph 56 still leaves me wondering what results you're drawing from the inertia. What is the different philosophy for CPC membership selection that you think is being suggested?

Analysis

This is where we start to really connect the dots between the visual & statistical results, and real world interpretation.

You call back to several charts in this section, but it isn't always clear which ones, since we've made a few. Perhaps insert the graphs again here, making sure the captions explain not only which government we're looking at, but also which subset of the committees we're seeing.

I think I get the gist of your analytical conclusions, however the story you're telling feels a bit convoluted. Can you reorder this section so that we get a better flow from 1) what we see in the Harper gov't, to 2) what we see in the Trudeau gov't, and 3) what the differences (or lack of differences) between them suggest in historical context.

Paragraphs 62 and 63 look like they belong in the conclusion, not the specific analysis of the CPC data.

Conclusion

64 sounds a bit anodyne. How about you reiterate the core goal of CA, which is looking at interrelationships between categorical data.

The following paragraph lists the technical bits we did, but remind the user they also learned how to read the plots, and understand explanatory value, p-values, and inertia.

Appendix

I think this is quite clear - you may want to move, or even just copy, the definitions of inertia, explanatory value, p-values you give here into the body of the text, as I suggest above. SVD does not need to be in the body of the tutorial though - and I think it's just great to discuss it in the appendix here.

mdlincoln commented 7 years ago

@greebie sorry, I ought to have asked this in my previous comment: Do you think you would be able to send in revisions by August 25? I'll be away from PH duties August 6-20, so there's no enormous rush on these revisions - especially since I gave you a fair amount to do. Please let me know what you think.

greebie commented 7 years ago

@mdlincoln Yes no worries. I might try to front load the changes to get this off my mind. If this is the case, I will have no particular expectation that you should reply by the 20th. It's summer! R&R is still much more important than my soon-to-be masterpiece! :)

As for the graphs, this is common and is unlikely to do with your R version. In general, the way plots are produced for many network graphs is similar to python dictionaries -- they are processed at random. Because of this, early decisions on matrix measurement will determine what ends up with a positive value and later items are processed relative to those decisions. So point inversions are kind of expected depending on what random seed gets selected (by default the seed is produced using the current date and time).

There are three potential solutions. I like option 1 or 3, but am willing to do #2 if you feel it necessary.

Do nothing (hoping that readers will be familiar with inversions that are common with network graphs).
Define a random seed as part of the tutorial and confirm that we produce matched graphs. I don't like this option because it can create problems elsewhere. The most important is trying to explain random numbers generation (although this would make a great tutorial elsewhere! I might even take it on because it's an interesting subject).
Use language like "you should get a graph that looks something like this ..." around the images.

mdlincoln commented 7 years ago

Use set.seed(), and just briefly explain that the CA is a semi-random process.

greebie commented 7 years ago

Hi Matthew,

I pushed some new changes that I think address the above.

I originally had a more obvious example using trade agreements and countries, and I revisited this approach in the new edit.

Also, because you said:

Saying "quadrants" may mislead a reader in to thinking that the quadrants and axis intercepts have a special significance, when I don't believe they do - outside of the inertia property, which has bearing on distance from the axis, rather than the literal quadrant a point falls in. (Please clarify if I'm wrong about this!)

That led me to believe that the purpose of the axis was not well explained, so I tried a slightly different approach. The quadrants and axes do have special significance. They represent the top two "dimensions" (I call them "reasons") why the data points diverge from the centre of the graph. The "reasons" are to be interpreted in the graph based on the axis and the quadrants.

I need to take one more look at this to ensure i have everything. After that I'll have a more substantive response to all your points. Cheers!

mdlincoln commented 7 years ago

Thanks for the update, @greebie. Will you be able to submit changes in the next two weeks, say, by September 9th?

greebie commented 7 years ago

Yes I will. Meant to have final details on friday, but life happened. :)

mdlincoln commented 7 years ago

Perfect - I won't be able to look at it before then, so take your time. I look forward to it!

greebie commented 7 years ago

Hi Matthew. Changes have been submitted. I believe I went through the details step by step with one major exception -- the issue about quadrants did not seem clear.

In general the quadrants matter very much because the each axis represents one facet (I call them "reasons" in the article) for why inertia occurs. I used the example of "members and clubs" in the paper to explain this better. Club members might be separated because some prefer sports clubs more than art clubs and others are vice versa. If this were the case, there would roughly be a separation of members of sports clubs on one side and art clubs on the other. A secondary reason might be members who join clubs that meet on weekends versus those that meet in the evenings during the week. That would show up on another dimension. The CA shows the two "reasons" that have the largest overall inertia to show. These patterns are not explained by the data, but they must be interpreted based on the patterns in the graph (similar to factor analysis). Since the two dimensions provide two "separations" the quadrants can often (but not always) provide a four-way separation among the categories. In the members and clubs situation that would be 1. People who attend sports clubs on weekends. 2. People who attend sports clubs that happen during the week. 3. People who attend art clubs on weekends. 4. People who attend art clubs during the week. (Reverse for the sports clubs data points). In general, this gets more complex with more data points, but even then there can still be interesting patterns emerging from the data.

Basically, I think your overall advice helped me get to the point where this is more clear. What I did was 1) provide a more detailed example using more straight-forward examples & 2) improve the consistency of terms to avoid pitfalls among categories, elements, dimensions and datapoints.

Beyond that, I think I covered the majority of the points as written. Instead of a glossary, I provided endnotes to define the terms you suggested.

Thanks again for the advice and apologies on the delay in getting this out to you.

mdlincoln commented 7 years ago

Thank you for sending in these revisions. I very much like to clearer examples. I've committed a host of small formatting fixes and a few wording changes.

One major change I made: I still find the introduction of what CA is to be overcomplicated, particularly grafs 2 and 3. The axes of the plot are not "reasons" in the way a historian would understand them. They illustrate the relative co-incidence of categories, nothing more. This illustration may suggest the existence of possible clustering motivations or phenomena in the real-world subject. That's exactly why CA is useful - and you demonstrate this really well later in the lesson. But in and of themselves, the axes are just components of a reprojected mathematical space, so calling them "reasons" is confusing. The goal of this lesson should be to clarify the boundary between CA results and CA interpretation - and right now, the introduction still muddies the waters a bit. 59c96b7c48768fd852eaeadc4812f28fa452ff29 is my attempt at removing this confusing language.

Other to-do:

I think you need to put the "Setting up R for CA" section immediately before "The Data" section - it's only in setting up R that you call the table() function that produces the table that you show in "The Data" section. Also, in the text you discuss using the substr() function but your code doesn't actually use it - if this isn't something the user needs to do, then take it out of the instructions.
Please archive the raw data (https://github.com/greebie/Compare/blob/master/walkcompare/data/parl_comm_minority.json) using Zenodo. I agree it's not necessary to host the raw data on the PH site since it isn't integral to the lesson, but I do want to make sure we don't lose access to it should you move/delete/rename any of your GitHub repos in the future!
Can you add a link to a good news article or general-audience analysis of why the Robert Pickton trial led to the national inquiry?
You need a transition sentence to conclude your "Analysis" section by tying it back to CA. Right now, you go heavily in to political analysis and we lose the thread of how CA helped you get there.

mdlincoln commented 7 years ago

(I should have also noted: after these last changes, I think we'll be good to publish! Just gotta find some nice header image for your lesson :)

greebie commented 7 years ago

Hi Matthew. Thanks for the help with this. I think I've addressed all the changes as requested. Thanks so much for your help in refining this to a publishable tutorial!

mdlincoln commented 7 years ago

Thanks @greebie. I'm going to open a Pull Request on the main repo now with the lesson and its files. Final discussion and checks can happen there.

programminghistorian / ph-submissions