programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Review Ticket for "Correspondence Analysis for Historical Research with R" #78

Closed mdlincoln closed 7 years ago

mdlincoln commented 7 years ago

The Programming Historian has received the following tutorial on "Correspondence Analysis for Historical Research with R" by @greebie. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/correspondence-analysis-in-R

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I will provide initial feedback here before inviting the reviewers to comment.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, to make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age, religion, or technical experience. We do not tolerate harassment or ad hominem attacks on community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above-described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

greebie commented 7 years ago

One formatting issue I am having is that GitHub does not render MathJax or any nice formula-producing syntax. I do not think the formulas are mandatory, but they do make the document look more professional. Has anyone in the community encountered this issue before, and how did you handle it?

mdlincoln commented 7 years ago

Good question. I don't believe we've yet had a lesson that needed to use MathJax, but if it looks like we'll really want it for this one, then we might consider adding the necessary JavaScript to the Programming Historian site. For now, please include all the equation syntax you originally intended to use, and let's see how editing and revisions go. If we do need to add in MathJax, we can incorporate it with our upcoming general redesign.

greebie commented 7 years ago

Okay. Another option is to add an image for the formulas. If PH considers more tutorials in R, it may make sense to have some math capability. In general, I find that equations take away from a paper, but on the other hand, I wouldn't want to be standing in a viva saying "oh, I don't know how the statistics work, I just let R do it for me!"

mdlincoln commented 7 years ago

Actually, on closer inspection, it looks like we do have MathJax linked in our HTML headers... but the implementation may be out of date, as I see they've just switched their hosting service. Let me look into fixing it on our home site as well as on the ph-submissions version.

greebie commented 7 years ago

Okay - assuming things are running as desired, I am willing to draft a quick MathJax cheatsheet for PH. (So many flavors for getting mathy things into Markdown.) Thanks.

mdlincoln commented 7 years ago

OK, I've made two changes:

  1. I enabled mathjax on the ph-submissions repo (it was active on our published site, but not this one)
  2. I replaced your $...$ notation with \\(...\\) according to http://docs.mathjax.org/en/latest/start.html#tex-and-latex-input
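For anyone following along, the delimiter change in item 2 looks like this (a schematic illustration using a made-up expression, not one of the lesson's actual equations):

```
The total inertia is $\chi^2 / n$.        <- plain TeX delimiters, not picked up here
The total inertia is \\(\chi^2 / n\\).    <- MathJax inline delimiters, rendered on the site
```

The double backslashes are needed because kramdown consumes one level of escaping before MathJax ever sees the text.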
mdlincoln commented 7 years ago

@greebie On Line 33, the image link seems to be broken.

greebie commented 7 years ago

Apologies. Minor error in my fix which is now fixed!

mdlincoln commented 7 years ago

I only have one editorial note before sending this off to reviewers.

In the introduction, you note "You would like to have a nice 2 x 2 matrix to show these relationships in some non-confusing manner." Most of our readers won't know what a 2 x 2 matrix is... in fact, I am not entirely sure what you meant here either. Is it a plot? A table with numbers?

Clearly state the research results of doing CA here, so that our reviewers have a solid idea of what you hope to demonstrate in the lesson.

greebie commented 7 years ago

Thanks Matthew. I have included slightly different language to describe the "2 x 2 matrix" idea. The 2 x 2 is popular in the business administration literature, where correspondence analysis is used quite frequently. In this case I called it a "2 dimensional plot."

mdlincoln commented 7 years ago

So, a plot with an x and a y axis?

greebie commented 7 years ago

Yes, but with high emphasis on the meaning and interpretation of the four quadrants, thus "matrix." That said, I don't emphasize interpreting quadrants in this tutorial, so your point is well taken.

Maybe more information on quadrant analysis in the tutorial? The thing about CA is that interpretation is multi-dimensional as well! Part of what makes it so powerful, imho!

Thanks so much for all your help and work here.

mdlincoln commented 7 years ago

OK, that does help clarify.

I think we're in good enough shape to send it off to reviewers, so I'll begin arranging them. Once I have both reviewers set, I'll add an update here with the target deadline for them to submit their formal reviews. As mentioned in our contribution process, please wait until we get both formal reviews in before beginning any revision work. I will write up my own synthesis of the reviews and outline the work that looks like it needs to be done. And then we'll proceed from there!

greebie commented 7 years ago

Thanks Matthew. Look forward to hearing from the reviewers.

greebie commented 7 years ago

Just flagging that it's probably better to deposit a copy of the JSON data with Programming Historian for long-term sustainability. That will require some editing in the text as well.

mdlincoln commented 7 years ago

@statsmaths has agreed to serve as one of the peer reviewers for this lesson 🎊 Taylor will be submitting a review by July 1st.

I'll post on here again once I've arranged the second reviewer.

mdlincoln commented 7 years ago

@sandravg has agreed to serve as our other peer reviewer for this lesson 😸

Full disclosure: Sandra and I work together at the Getty Research Institute on a research project.

As a reminder of our process: We will not do any active editing of the lesson until both Taylor and Sandra have had a chance to submit their reviews, and I have written my own summary of the two formal reviews and any other comments logged by the community. Once I have written my summary, we ask the community to hold off on further notes until @greebie and I have worked through the revisions.

statsmaths commented 7 years ago

This tutorial does a nice job of presenting and motivating the technique of correspondence analysis. I personally have not seen it used very much in the humanities but it has the potential to be a powerful technique for historians and other humanists. Here are some comments for possible revision:

Overall I think this will be a very useful tutorial. The application comparing the two governments in the Canadian Parliament is excellent and does a wonderful job of illustrating the power of this technique.

greebie commented 7 years ago

Thanks a million, Taylor not only for the helpful feedback, but also for taking the time to review. The suggested revisions all seem logical to me. I like the idea of an "optional" section, perhaps even an appendix to include the math stuff. Thanks for the comments about the parliaments cases.

mdlincoln commented 7 years ago

@statsmaths Thanks so much for your comments. @greebie just a reminder, we'll hold off on doing any edits to the draft until @sandravg has a chance to post her own review. Once she has submitted, I'll formulate my own summary and then we can get to work.

mdlincoln commented 7 years ago

I've been in touch with @sandravg and we confirmed that she will submit her review here no later than July 15th.

greebie commented 7 years ago

Thanks for letting me know! Look forward to your response, @sandravg !

sandravg commented 7 years ago

@greebie, everyone, here is my review - thanks for your patience!

Great lesson that I think will be very useful for humanists. I appreciate how it demonstrates that working with categorical data can be extremely rich for analysis. Here are my comments and suggestions:

The following have to do with parts of the interpretation that read a bit unclear to me, so I point them out in case others unfamiliar with the context face similar questions:

Thank you, I think this lesson is a great demonstration of the method, made very accessible with an example that really showcases its advantages (and how to overcome some challenges)!

greebie commented 7 years ago

Thank you so much @sandravg for your very helpful and thorough review! A lot here that can really improve the lesson. I want to get right at it but I know that @mdlincoln is going to tell me to wait until he provides his summary. :)

mdlincoln commented 7 years ago

Yes I will, @greebie 😝

Thanks so much @sandravg for all your thoughts, and thanks again to @statsmaths. I'm going to post up a synthesis of everything no later than next Friday (2017-07-14), and then we can proceed.

mdlincoln commented 7 years ago

These are two great endorsements of your lesson, @greebie, with some excellent advice on how to develop it. Here are my takeaways after reading through both reviews:

  1. Given the complexity of CA and the nuance of the interpretation section, I agree with Taylor that you should start with the data already in tabular format. Users who looked at the network diagrams lesson on PH can learn more about varieties of data serializations for nodes and network ties.

  2. Both Taylor and Sandra commented on the framing of the statistical explanation. I would strongly consider Taylor's recommendation about relocating the discussion so that the lesson makes a better logical flow. Similarly, for the chi-square test, I believe you could remove reference to it as Taylor suggests, or, alternatively, incorporate it more meaningfully into the analytical section at the end, as Sandra suggests. I have a slight preference for the latter.

  3. Sandra's note about line 13 is a good point. Try to balance explaining this particular dataset and research question with a more cogent abstraction/generalization that explains how this approach can be more broadly useful. This goes not only for introducing CA, but also for each of the interpretive steps (e.g. "inertia" means something for the CPC data - but what can it mean in general?)

  4. Sandra made some particularly substantive comments about the interpretation section. This might require the most additional work on your part, however I think it would benefit the lesson immensely if you would consider each of them, particularly 1, 2, and 4. I believe this would go a long way to abstracting out the interpretation that you're making on this particular dataset to the way that the same interpretive acts could be used by historians of any specialty.

  5. I also agree with Sandra that explaining the CPC acronyms would be a welcome clarification - maybe make a little markdown table when introducing the dataset, so that we know what the letters stand for?

  6. Taylor suggested adding an endnotes section for all the links you point to. This is not a required practice for PH lessons, so only add this if you feel you would like to - although for R packages in particular, including their correct citations (as output by citation()) as endnotes would be good. You do not need to add full citations of other PH posts.
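On point 5 about the acronyms: a sketch of what such a markdown table might look like, using only abbreviations and committee names that already appear in the lesson's code comments:

```markdown
| Abbreviation | Committee                               |
| ------------ | --------------------------------------- |
| HESA         | Health                                  |
| JUST         | Justice                                 |
| FEWO         | Status of Women                         |
| INAN         | Indigenous and Northern Affairs         |
| FINA         | Finance                                 |
| FAAE         | Foreign Affairs and International Trade |
| IWFA         | Violence against Indigenous Women       |
```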

I think we're in a very good position here. Although these comments will require a bit of reordering and rewording on your part, the backbone of the lesson is quite strong. Take some time to revise according to these points, and then I'll do a closer read-through with an eye towards wording/proofreading.

We ask authors to submit revisions within 4 weeks - for this lesson, 6 August. Please let me know if that timeline sounds doable.

greebie commented 7 years ago

Thanks Matthew. 4 weeks seems reasonable. I usually like to front-load this sort of thing, so you may get a revised draft earlier than that.

greebie commented 7 years ago

Hi Matthew. Please bear with me as I work through the changes, but for my own organization, I thought I'd put the notes in as I work on them.

  1. As requested, I placed the datasets in csv / tabular format and placed them in assets/correspondence-analysis-in-R/. In the code, I refer to what I imagine would be the correct url once it is published.

  2. I moved the mathematical explanations to the appendix as recommended and refer to it under "What is Correspondence Analysis?" I included an explanation of what the chi squared test for independence means for the dataset. In general, the chi squared results are not particularly meaningful because MPs are assigned to CPCs in an evenly distributed manner. However, I did provide an example of where we might see a low p-value in the chi squared and why it would occur.

  3. I spent some time considering ways to provide more general abstractions for correspondence analysis. The approach I took was to provide additional examples and abstract from there. Again, I kept the numbers in the appendix, but included an overview of inertia.

  4. As suggested, I elaborated more to address some of Sandra's ideas with an eye to expanding the method to more general historical uses.

  5. I provided a markdown table to match the abbreviations to the committee names as suggested.

  6. I included the R packages in the endnotes for the libraries as suggested.

Please give me an extra few days to look over the lesson and fix any new inconsistencies that may have occurred while working on the revisions. I'll post another comment once I think it's ready (probably tomorrow)!

greebie commented 7 years ago

Hello Matthew,

I think I have addressed all the changes and fixed inconsistencies caused by the revisions. It is ready for your review now, I believe.

Thank you.

Ryan.

mdlincoln commented 7 years ago

I'm working on some much closer line-by-line notes now, but I can't get the TrudeauCPC.csv to work - the column names in the csv do not match up to those that you call in the R code. Update the csv when you have a moment.

mdlincoln commented 7 years ago

In addition to the column names, I think the Trudeau data have changed a bit as well - I'm getting very slightly different results than what you show in the lesson. Run everything again and either update the data, or update the plots and text to make sure that they're consistent with one another.

mdlincoln commented 7 years ago

FYI note that I updated the figure syntax with 808e7a353ba3bbc22362d8eaec0e8688d9cbe043 to follow our author guidelines - you'll see that you just need to point to the specific filename, and include a caption.

mdlincoln commented 7 years ago

@greebie you need to rerun all your code from the updated data - I'm now getting the following even on the Harper data, not just the Trudeau data:

harper_df2 <- harper_df[which(harper_df$abbr %in% 
                                 c("HESA", "JUST", "FEWO", "INAN", "FINA", "FAAE", "IWFA")),]
harper_table2 <- table(harper_df2$abbr, harper_df2$membership)
harper_table2 <- harper_table2[, colSums(harper_table2) > 1]
CA_Harper2 <- CA(harper_table2)
Error in eigen(crossprod(X, X), symmetric = TRUE) : 
  infinite or missing values in 'x'

Once you've updated the lesson, I'll be able to finish my comments.

greebie commented 7 years ago

Okay - I'll take a look. Should be done by tomorrow.

greebie commented 7 years ago

Hi Matthew. The issue was that I forgot that R saves strings as Factors by default. A confusing, but easy to resolve problem. There was also a problem with labelling in the Trudeau data. I tested on my system and got the same answer as before. Note that sometimes the labels will shift around depending on the size of the screen. The points should be in the correct spots however.

I also included code to reproduce the working Trudeau graph, rather than just asking people to "replace x with y." Most people will just want to cut and paste I assume.

Also, the links in the code refer to what will likely be the URL once the item is published, so that URI will need to be replaced with the ph-submissions URI for now.
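For readers who hit the same snag, here is a minimal base-R sketch of the factor-conversion default Ryan describes (toy values, not the lesson's data). data.frame() mirrors the historical default behaviour of read.csv():

```r
# With the default stringsAsFactors = TRUE (R < 4.0), character columns
# are silently converted to factors, which can break later subsetting
# and comparison operations:
df <- data.frame(abbr = c("HESA", "JUST"), stringsAsFactors = TRUE)
class(df$abbr)   # "factor"

# Passing stringsAsFactors = FALSE keeps plain character strings:
df2 <- data.frame(abbr = c("HESA", "JUST"), stringsAsFactors = FALSE)
class(df2$abbr)  # "character"
```

Since R 4.0 the default has flipped to FALSE, but at the time of this thread passing it explicitly was the safe habit.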

mdlincoln commented 7 years ago

stringsAsFactors!!! shakes fist

I'll take a look in the next day or two and then post my notes.

mdlincoln commented 7 years ago

I'll post my full comments right after this, but re: the differences in the plot - it's not an issue of label layout. On my machine at least, the Trudeau data is producing inverses of the plots you've posted. The points are all in the same relative positions, but CA has somehow decided to invert the signs on its multidimensional scaling. I've posted the exact results I get from re-running your code (I even added a set.seed(100) command - although I don't think CA is stochastic, so that shouldn't affect anything?) The Harper visualizations are just fine.

Is it possible we are running slightly different R package versions? I've posted the report of my devtools::session_info() as well.

# import the libraries:
library(FactoMineR)
library(factoextra)
#> Loading required package: ggplot2

# read the csv files

set.seed(100)

harper_df <- read.csv("http://programminghistorian.github.io/ph-submissions/assets/correspondence-analysis-in-R/HarperCPC.csv", stringsAsFactors=FALSE)

harper_table <- table(harper_df$abbr, harper_df$membership)

harper_table <- harper_table[,colSums(harper_table) > 1]
CA_harper <- CA(harper_table)


trudeau_df <- read.csv("http://programminghistorian.github.io/ph-submissions/assets/correspondence-analysis-in-R/TrudeauCPC.csv", stringsAsFactors=FALSE)
trudeau_table <- table(trudeau_df$abbr, trudeau_df$membership)
trudeau_table <- trudeau_table[,colSums(trudeau_table) > 1]
CA_trudeau <- CA(trudeau_table)


fviz_ca_biplot(CA_harper, repel=TRUE)

fviz_ca_biplot(CA_trudeau, repel=TRUE)


#include only the desired committees
# HESA: Health, JUST: Justice, FEWO: Status of Women, 
# INAN: Indigenous and Northern Affairs, FINA: Finance
# FAAE: Foreign Affairs and International Trade
# IWFA: Violence against Indigenous Women

harper_df2 <- harper_df[which(harper_df$abbr %in% 
                                c("HESA", "JUST", "FEWO", "INAN", "FINA", "FAAE", "IWFA")),]
harper_table2 <- table(harper_df2$abbr, harper_df2$membership)

# remove the singles again
harper_table2 <- harper_table2[, colSums(harper_table2) > 1] 
CA_Harper2 <- CA(harper_table2)


trudeau_df2 <- trudeau_df[which(trudeau_df$abbr %in% 
                                  c("HESA", "JUST", "FEWO", "INAN", "FINA", "FAAE", "ESPE")),]
trudeau_table2 <- table(trudeau_df2$abbr, trudeau_df2$membership)
trudeau_table2 <- trudeau_table2[, colSums(trudeau_table2) > 1] # remove the singles again
CA_trudeau2 <- CA(trudeau_table2)
#> Error in eigen(crossprod(X, X), symmetric = TRUE): infinite or missing values in 'x'

trudeau_df3 <- trudeau_df[which(trudeau_df$abbr %in% 
                                  c("HESA", "CIMM", "FEWO", "ETHI", "FINA", "HUMA", "ESPE")),]
trudeau_table3 <- table(trudeau_df3$abbr, trudeau_df3$membership)
trudeau_table3 <- trudeau_table3[, colSums(trudeau_table3) > 1] # remove the singles again
CA_trudeau3 <- CA(trudeau_table3)

Session info:

``` r
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value
#>  version  R version 3.4.1 (2017-06-30)
#>  system   x86_64, darwin16.5.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  tz       America/Los_Angeles
#>  date     2017-07-21
#> Packages -----------------------------------------------------------------
#>  package       * version  date       source
#>  assertthat      0.2.0    2017-04-11 CRAN (R 3.4.1)
#>  backports       1.1.0    2017-05-22 CRAN (R 3.4.1)
#>  base          * 3.4.1    2017-07-10 local
#>  bindr           0.1      2016-11-13 CRAN (R 3.4.1)
#>  bindrcpp        0.2      2017-06-17 CRAN (R 3.4.1)
#>  bitops          1.0-6    2013-08-17 CRAN (R 3.4.1)
#>  cluster         2.0.6    2017-03-10 CRAN (R 3.4.1)
#>  colorspace      1.3-2    2016-12-14 CRAN (R 3.4.1)
#>  compiler        3.4.1    2017-07-10 local
#>  datasets      * 3.4.1    2017-07-10 local
#>  devtools        1.13.2   2017-06-02 CRAN (R 3.4.1)
#>  digest          0.6.12   2017-01-27 CRAN (R 3.4.1)
#>  dplyr           0.7.1    2017-06-22 CRAN (R 3.4.1)
#>  evaluate        0.10.1   2017-06-24 CRAN (R 3.4.1)
#>  factoextra    * 1.0.4    2017-01-09 CRAN (R 3.4.1)
#>  FactoMineR    * 1.36     2017-06-15 CRAN (R 3.4.1)
#>  flashClust      1.01-2   2012-08-21 CRAN (R 3.4.1)
#>  ggplot2       * 2.2.1    2016-12-30 CRAN (R 3.4.1)
#>  ggpubr          0.1.4    2017-06-28 CRAN (R 3.4.1)
#>  ggrepel         0.6.5    2016-11-24 CRAN (R 3.4.1)
#>  glue            1.1.1    2017-06-21 CRAN (R 3.4.1)
#>  graphics      * 3.4.1    2017-07-10 local
#>  grDevices     * 3.4.1    2017-07-10 local
#>  grid            3.4.1    2017-07-10 local
#>  gtable          0.2.0    2016-02-26 CRAN (R 3.4.1)
#>  htmltools       0.3.6    2017-04-28 CRAN (R 3.4.1)
#>  knitr           1.16     2017-05-18 CRAN (R 3.4.1)
#>  labeling        0.3      2014-08-23 CRAN (R 3.4.1)
#>  lattice         0.20-35  2017-03-25 CRAN (R 3.4.1)
#>  lazyeval        0.2.0    2016-06-12 CRAN (R 3.4.1)
#>  leaps           3.0      2017-01-10 CRAN (R 3.4.1)
#>  magrittr        1.5      2014-11-22 CRAN (R 3.4.1)
#>  MASS            7.3-47   2017-02-26 CRAN (R 3.4.1)
#>  memoise         1.1.0    2017-04-21 CRAN (R 3.4.1)
#>  methods       * 3.4.1    2017-07-10 local
#>  munsell         0.4.3    2016-02-13 CRAN (R 3.4.1)
#>  pkgconfig       2.0.1    2017-03-21 CRAN (R 3.4.1)
#>  plyr            1.8.4    2016-06-08 CRAN (R 3.4.1)
#>  purrr           0.2.2.2  2017-05-11 CRAN (R 3.4.1)
#>  R6              2.2.2    2017-06-17 CRAN (R 3.4.1)
#>  Rcpp            0.12.12  2017-07-15 CRAN (R 3.4.1)
#>  RCurl           1.95-4.8 2016-03-01 CRAN (R 3.4.1)
#>  rlang           0.1.1    2017-05-18 CRAN (R 3.4.1)
#>  rmarkdown       1.6      2017-06-15 CRAN (R 3.4.1)
#>  rprojroot       1.2      2017-01-16 CRAN (R 3.4.1)
#>  scales          0.4.1    2016-11-09 CRAN (R 3.4.1)
#>  scatterplot3d   0.3-40   2017-04-22 CRAN (R 3.4.1)
#>  stats         * 3.4.1    2017-07-10 local
#>  stringi         1.1.5    2017-04-07 CRAN (R 3.4.1)
#>  stringr         1.2.0    2017-02-18 CRAN (R 3.4.1)
#>  tibble          1.3.3    2017-05-28 CRAN (R 3.4.1)
#>  tools           3.4.1    2017-07-10 local
#>  utils         * 3.4.1    2017-07-10 local
#>  withr           1.0.2    2016-06-20 CRAN (R 3.4.1)
#>  XML             3.98-1.9 2017-06-19 CRAN (R 3.4.1)
#>  yaml            2.1.14   2016-11-12 CRAN (R 3.4.1)
```
mdlincoln commented 7 years ago

You have done a good job of addressing many of the reviewers' comments. I still think you could clarify both the introduction and the interpretation section a bit more, particularly in order to address Sandra's critiques.

I'll first offer some general comments, and then some section-by-section notes. Please don't be disheartened by the length of this review! Most of the information is already present in the lesson, but I think some further reorganization and reframing of key concepts will really perfect this tutorial.


General comments


I'll structure the remainder of my comments by subheading.

Introduction

I think we can tighten the introduction more. My suggestion is to take the first sentence of line 2 and make it your opening line, and then immediately present types of categories. Eg:

Correspondence analysis (CA) produces a two- or three-dimensional plot based on relationships among two or more categories of data. These categories could be "members" and "clubs," "words" and "books," or "countries" and "trade agreements." CA is a method to calculate and visualize the nature and strength of these relationships, helping identify which values of one category correspond to which values of another.

That bolded language is my own text, which you could use directly, or reword as you like. I think variations on that core idea belong not only in the intro, but should be sprinkled evenly throughout the text. It's the biggest takeaway of what CA is, and why someone would want to use it.

In paragraph 2, you note that CA can relate two or more categories of data. This lesson just looks at two at a time - a good choice, I think! But it would be useful to note explicitly that this lesson will only look at 2-dimensional CA. You might add a footnote stating that, while you won't be exploring n-dimensional CA in the lesson, it is mathematically permissible. (And if you know of a link somewhere that someone could go to read more about n-dimensional CA, by all means add it in.)

For the sake of standardization, I think you can replace all but the first mention of "correspondence analysis" with "CA" throughout the lesson.

Pre-requisites

I still agree with Taylor that it isn't accurate to say that a familiarity with chi squared tests is helpful for this lesson - you can probably remove this. I think it may be unnecessarily off-putting to some readers.

What is Correspondence Analysis

This is going to be one of the most important sections of the whole lesson, so I want to be particularly attentive to how you introduce CA.

I'm nervous about foregrounding the spatial example first, as it's not the most intuitive way of using CA. I understand that you wouldn't be directly using spatial coordinates in that use of CA - but it took me a few reads through the lesson to understand what you meant.

Why not work through the 3 examples (or 2 of the 3) that you give in the introduction - members/clubs, words/books, countries/trade? Stick to basic relationship questions first (e.g. "Which clubs share the most common members?", "Do books cluster into discrete genres based on shared terms?") before getting into higher-order caveats and applications like language bias or inferring spatial relationships. Only after reading very basic explanations of what CA does in each of these contexts will users of this lesson be able to understand more complex tradeoffs and applications.

In fact, given that you end this section by saying "Perhaps it is easier to just show ..." - yes, in fact, it is! Show the CPC plot first, and explain the abstract concept: blue is one group, red the other, and the spatial arrangement suggests their correspondence. Then walk through the other examples, which will seem much more concrete after grasping the simple-to-understand MP/CPC membership relationship.

FWIW, I would consider entirely removing the pseudo-spatial-analysis example. While I get the gist of the approach, it seems a potentially tricky application of CA given that you'd not be encoding or testing against actual coordinates, or even shared boundaries. If there's some interesting published example of using CA in this way, maybe it's worth linking to as further reading, rather than offering it as an offhand example? This, and the extended caveat about language bias in the books/words example, are rather advanced CA issues that would fit better at the end of the lesson, rather than in the introduction. Stick to core concepts at the start, and once readers see you work through the CPC question, you can briefly outline these issues as opportunities for further consideration and investigation by the readers.

In general, a CA will plot the two most important factors.

If I understand this correctly, this statement is potentially misleading, as the axes are composite weightings of constituent variables, not "factors" in and of themselves. Can you clarify this?

The final paragraph is a great chance to reiterate the meaning of CA in the context of the CPC data: committees closer together have more similar membership, and (conversely) MPs closer together share more common committee posts. That's the core idea of CA - let's repeat it until we're blue in the face :)

Another small issue in the last paragraph: points being "in the same quadrant" of a two-dimensional plot is not the significant point here - rather, what is important is when they are close to one another. Saying "quadrants" may mislead a reader into thinking that the quadrants and axis intercepts have a special significance, when I don't believe they do - outside of the inertia property, which has bearing on distance from the axis rather than the literal quadrant a point falls in. (Please clarify if I'm wrong about this!)

Canadian Parliamentary Committees

This section is a good setup of the research question - it just needs a transition, either at the start of this section, or at the end of the preceding one, noting that you're now going to explain the research source and questions you'll be exploring for the remainder of the lesson.

The table is great - though I note that it doesn't display very nicely on our site right now! I'll make sure we update our CSS so the table looks handsome.

The Data

You present three different links, without noting that the reader doesn't actually have to download any of the files attached, as they'll directly connect to those links via R. At the end of the first paragraph, you also say "Here is a sample of the data for the first session of Stephen Harper's government:"... and then nothing? I think this just needs to be tidied up to fit the changes you made by adding the already-tabular data formats.

You also need to explain the tables more straightforwardly. Just note that the rows are committees, the columns MPs, and the cells show a 1 when the MP is a member, and a 0 when they are not.
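As a concrete version of that explanation, here is a toy base-R sketch (invented MP and committee names, not the real CPC data) showing how table() turns the raw membership rows into the incidence matrix that CA() expects:

```r
# Each row of the raw data pairs one committee with one MP who sits on it:
df <- data.frame(
  abbr       = c("HESA", "JUST", "HESA"),
  membership = c("MP Jones", "MP Jones", "MP Smith"),
  stringsAsFactors = FALSE
)

# table() cross-tabulates this: rows are committees, columns are MPs,
# and each cell is 1 when that MP is a member and 0 when they are not.
table(df$abbr, df$membership)
#        MP Jones MP Smith
#   HESA        1        1
#   JUST        1        0
```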

The sentence "Now that you have seen a correspondence analysis, it is time to show how to create and interpret the findings" seems like an artifact from the reorganization of the lesson, and should be deleted.

Setting Up R for Correspondence Analysis

I think you meant to say "Pop the data", not "them" (the libraries), into R.

Unless I'm mistaken, I believe you do not need to install curl for read.csv to pull from an HTTP URL - I tried it using just base R and could read the data fine. Remove this call.

Warn users that installing FactoMineR and its many dependencies for the first time may take five to ten minutes even on a fast machine, just so they know not to worry while it compiles all that C code.

In that vein, add a comment line to the code noting that you only need to run install.packages() the first time you try this lesson. When running code later, you only need to import the packages with library(). (Make this short - you don't need to explain the entire R package ecosystem.)
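Taken together, a sketch of how that setup block might read (the package names are the ones already used in the lesson, and the data URL is the ph-submissions asset path quoted elsewhere in this thread):

```r
# Install the packages the first time only. FactoMineR and its many
# dependencies may take five to ten minutes to compile, so don't worry
# if this step is slow:
# install.packages("FactoMineR")
# install.packages("factoextra")

# On every later run, just load the installed packages:
library(FactoMineR)
library(factoextra)

# Base R's read.csv() can read directly from an HTTP URL,
# so no call to curl is needed:
harper_df <- read.csv("http://programminghistorian.github.io/ph-submissions/assets/correspondence-analysis-in-R/HarperCPC.csv",
                      stringsAsFactors = FALSE)
```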

Correspondence Analysis of the Canadian Parliamentary Committees 2006 & 2016.

Quite clear - just take note of the issues about using backticks that I suggested above.

Interpreting the Correspondence Analysis

This could be a good section to do as I noted above, and highlight (literally! boldface!) explanatory value, p-value, and inertia - and end with a bulleted list of the terms and definitions, just to drive it home.

When you say "two steps" - what is a step here? A standard deviation?

When you show an excerpt of the summary() results, you include all these eigenvalues, but you don't talk about them. I'd remove them, and only show the line that discusses the p-value.

You write that the null hypothesis is that the "two factors are mutually-exclusive categories" - I think this needs to be reworded so it is easier to understand that the null hypothesis is that the two categories have no relationship to one another.

Did Trudeau Expand the Agenda for Women’s Equality in Parliament?

Explain a bit more fully that you are tightening the focus of your research question now. If I understand correctly, instead of looking for clustering in CPC membership writ large, which is what we started out doing, you will focus specifically on the interrelationships of a subset of seven different committees.

I note that you put 4 committees in the comments, but the desired committees found with the %in% operator number 7 - I think you need to update the comment?
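A sketch of what I mean, with one comment entry per committee so the comment and the %in% vector stay in sync (the committee abbreviations and column name here are invented placeholders, not the lesson's actual values):

```r
# Toy data frame standing in for the lesson's committee membership table.
harper_df <- data.frame(
  committee = c("FEWO", "JUST", "AGRI", "FINA"),
  member    = c("MP A", "MP B", "MP C", "MP D"))

# Keep only the seven committees of interest (hypothetical abbreviations):
committees_of_interest <- c("FEWO", "JUST", "FINA", "FAAE",
                            "HESA", "CIMM", "ETHI")
harper_subset <- harper_df[harper_df$committee %in% committees_of_interest, ]
# harper_subset now holds only the FEWO, JUST, and FINA rows
```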

Remove the text "(the same one as at the beginning!)" - for a moment I thought you were saying the graph was unchanged, when really all you meant was that this picture was used in the intro example.

Include the call summary(CAPHarper2) and its abbreviated results as a reminder to readers of how they access the p-value, which you discuss after showing the CA factor map.

When you say "Roughly, we could say that economic concerns fall to the right of the y-axis and social concerns fall to the left" you need to explain why that's interesting.

In paragraph 53, you say "That is a result in and of itself." I think it may be to you... but not to the reader! You need to explain explicitly that what you're trying to say is that (if I understand correctly) in the Trudeau government there is no overlap at all between the CPCs related to women's rights and equality, and the "powerful" CPCs of Foreign Affairs, Justice, and Finance - and by extension, that the members of the women's rights/equality committees are sequestered from real power centers in parliament.

In paragraph 54 you show the results of another filtering of the data - you must include the code to produce this.

Paragraph 56 still leaves me wondering what results you're drawing from the inertia. What is the different philosophy for CPC membership selection that you think is being suggested?

Analysis

This is where we start to really connect the dots between the visual & statistical results, and real world interpretation.

You call back to several charts in this section, but it isn't always clear which ones, since we've made a few. Perhaps insert the graphs again here, making sure the captions explain not only which government we're looking at, but also which subset of the committees we're seeing.

I think I get the gist of your analytical conclusions; however, the story you're telling feels a bit convoluted. Can you reorder this section so that we get a better flow from 1) what we see in the Harper gov't, to 2) what we see in the Trudeau gov't, and 3) what the differences (or lack of differences) between them suggest in historical context?

Paragraphs 62 and 63 look like they belong in the conclusion, not the specific analysis of the CPC data.

Conclusion

Paragraph 64 sounds a bit anodyne. How about reiterating the core goal of CA, which is examining interrelationships between categorical data?

The following paragraph lists the technical bits we did, but remind the user they also learned how to read the plots, and understand explanatory value, p-values, and inertia.

Appendix

I think this is quite clear - you may want to move, or even just copy, the definitions of inertia, explanatory value, p-values you give here into the body of the text, as I suggest above. SVD does not need to be in the body of the tutorial though - and I think it's just great to discuss it in the appendix here.

mdlincoln commented 7 years ago

@greebie sorry, I ought to have asked this in my previous comment: Do you think you would be able to send in revisions by August 25? I'll be away from PH duties August 6-20, so there's no enormous rush on these revisions - especially since I gave you a fair amount to do. Please let me know what you think.

greebie commented 7 years ago

@mdlincoln Yes, no worries. I might try to front-load the changes to get this off my mind. If that's the case, I will have no particular expectation that you should reply by the 20th. It's summer! R&R is still much more important than my soon-to-be masterpiece! :)

As for the graphs, this is common and is unlikely to have anything to do with your R version. In general, the way plots are produced for many network graphs is similar to Python dictionaries -- the items are processed in an effectively random order. Because of this, early decisions on matrix measurement determine what ends up with a positive value, and later items are processed relative to those decisions. So point inversions are to be expected, depending on what random seed gets selected (by default, the seed is derived from the current date and time).

There are three potential solutions. I like option 1 or 3, but am willing to do #2 if you feel it necessary.

  1. Do nothing (hoping that readers will be familiar with inversions that are common with network graphs).
  2. Define a random seed as part of the tutorial and confirm that we produce matched graphs. I don't like this option because it can create problems elsewhere. The most important is trying to explain random number generation (although this would make a great tutorial elsewhere! I might even take it on because it's an interesting subject).
  3. Use language like "you should get a graph that looks something like this ..." around the images.
mdlincoln commented 7 years ago

Use set.seed(), and just briefly explain that the CA is a semi-random process.
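E.g., a single line near the top of the lesson's code (the seed value itself is arbitrary; any fixed integer will do):

```r
# Fix the random seed so any randomized steps are reproducible and every
# reader should see the same plot orientation.
set.seed(189981)
```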

greebie commented 7 years ago

Hi Matthew,

I pushed some new changes that I think address the above.

I originally had a more obvious example using trade agreements and countries, and I revisited this approach in the new edit.

Also, because you said:

Saying "quadrants" may mislead a reader in to thinking that the quadrants and axis intercepts have a special significance, when I don't believe they do - outside of the inertia property, which has bearing on distance from the axis, rather than the literal quadrant a point falls in. (Please clarify if I'm wrong about this!)

That led me to believe that the purpose of the axes was not well explained, so I tried a slightly different approach. The quadrants and axes do have special significance. They represent the top two "dimensions" (I call them "reasons") why the data points diverge from the centre of the graph. The "reasons" are to be interpreted in the graph based on the axes and the quadrants.

I need to take one more look at this to ensure i have everything. After that I'll have a more substantive response to all your points. Cheers!

mdlincoln commented 7 years ago

Thanks for the update, @greebie. Will you be able to submit changes in the next two weeks, say, by September 9th?

greebie commented 7 years ago

Yes I will. Meant to have final details on friday, but life happened. :)

mdlincoln commented 7 years ago

Perfect - I won't be able to look at it before then, so take your time. I look forward to it!

greebie commented 7 years ago

Hi Matthew. Changes have been submitted. I believe I went through the details step by step with one major exception -- the issue about quadrants did not seem clear.

In general the quadrants matter very much, because each axis represents one facet (I call them "reasons" in the article) for why inertia occurs. I used the example of "members and clubs" in the paper to explain this better. Club members might be separated because some prefer sports clubs to art clubs, and others the reverse. If this were the case, there would be a rough separation between members of sports clubs on one side and members of art clubs on the other. A secondary reason might be that some members join clubs that meet on weekends while others join clubs that meet in the evenings during the week. That would show up on another dimension. The CA displays the two "reasons" that account for the largest overall inertia. These "reasons" are not explained by the data; they must be interpreted from the patterns in the graph (similar to factor analysis). Since the two dimensions provide two "separations", the quadrants can often (but not always) provide a four-way separation among the categories. In the members-and-clubs situation that would be 1. people who attend sports clubs on weekends, 2. people who attend sports clubs during the week, 3. people who attend art clubs on weekends, and 4. people who attend art clubs during the week (and the reverse for the club data points). In general, this gets more complex with more data points, but even then interesting patterns can still emerge from the data.
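For what it's worth, the club example can be demonstrated with a tiny invented matrix - a sketch only, with entirely made-up membership data:

```r
library(FactoMineR)  # install.packages("FactoMineR") if not yet installed

# Invented members-by-clubs membership matrix, for illustration only.
clubs <- data.frame(
  sports_weekend = c(1, 1, 0, 0),
  sports_weekday = c(1, 0, 0, 1),
  art_weekend    = c(0, 0, 1, 0),
  art_weekday    = c(0, 0, 1, 1),
  row.names      = c("Ann", "Bob", "Cam", "Dee"))

ca <- CA(clubs, graph = FALSE)
ca$col$coord  # club coordinates on the first two dimensions ("reasons")
```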

Basically, I think your overall advice helped me get to the point where this is more clear. What I did was 1) provide a more detailed example using more straight-forward examples & 2) improve the consistency of terms to avoid pitfalls among categories, elements, dimensions and datapoints.

Beyond that, I think I covered the majority of the points as written. Instead of a glossary, I provided endnotes to define the terms you suggested.

Thanks again for the advice and apologies on the delay in getting this out to you.

mdlincoln commented 7 years ago

Thank you for sending in these revisions. I very much like the clearer examples. I've committed a host of small formatting fixes and a few wording changes.

One major change I made: I still find the introduction of what CA is to be overcomplicated, particularly grafs 2 and 3. The axes of the plot are not "reasons" in the way a historian would understand them. They illustrate the relative co-incidence of categories, nothing more. This illustration may suggest the existence of possible clustering motivations or phenomena in the real-world subject. That's exactly why CA is useful - and you demonstrate this really well later in the lesson. But in and of themselves, the axes are just components of a reprojected mathematical space, so calling them "reasons" is confusing. The goal of this lesson should be to clarify the boundary between CA results and CA interpretation - and right now, the introduction still muddies the waters a bit. 59c96b7c48768fd852eaeadc4812f28fa452ff29 is my attempt at removing this confusing language.

Other to-do:

mdlincoln commented 7 years ago

(I should have also noted: after these last changes, I think we'll be good to publish! Just gotta find some nice header image for your lesson :)

greebie commented 7 years ago

Hi Matthew. Thanks for the help with this. I think I've addressed all the changes as requested. Thanks so much for your help in refining this to a publishable tutorial!

mdlincoln commented 7 years ago

Thanks @greebie. I'm going to open a Pull Request on the main repo now with the lesson and its files. Final discussion and checks can happen there.