Visualizing data with R and ggplot2

hawc2 commented 8 months ago

Programming Historian in English has received a proposal for a lesson, 'Visualizing data with R and ggplot2,' by @rogorido and @nabsiddiqui.

I have circulated this proposal for feedback within the English team. We have considered this proposal for:

Openness: we advocate for use of open source software, open programming languages and open datasets
Global access: we serve a readership working with different operating systems and varying computational resources
Multilingualism: we celebrate methodologies and tools that can be applied or adapted for use in multilingual research-contexts
Sustainability: we're committed to publishing learning resources that can remain useful beyond present-day graphical user interfaces and current software versions

We are pleased to have invited @rogorido and @nabsiddiqui to develop this Proposal into a Submission under the guidance of @semanticnoodles as editor.

The Submission package should include:

Lesson text (written in Markdown)
- For guidance, we recommend Sarah Simpkin's lesson Getting Started with Markdown
Figures: images / plots / graphs (if using)
Data assets: codebooks, sample dataset (if using)

We ask @rogorido and @nabsiddiqui to share their Submission package with our Publishing team by email, copying in @semanticnoodles.

We've agreed a submission date of April. We ask @rogorido and @nabsiddiqui to contact us if they need to revise this deadline.

When the Submission package is received, our Publishing team will process the new lesson materials, and prepare a Preview of the initial draft. They will post a comment in this Issue to provide the locations of all key files, as well as a link to the Preview where contributors can read the lesson as the draft progresses.

If we have not received the Submission package by April, @semanticnoodles will attempt to contact @rogorido and @nabsiddiqui. If we do not receive any update, this Issue will be closed.

Our dedicated Ombudspersons are Ian Milligan (English), Silvia Gutiérrez De la Torre (español), Hélène Huet (français), and Luis Ferla (português). Please feel free to contact them at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudspersons will have no impact on the outcome of any peer review.

semanticnoodles commented 8 months ago

I confirm @rogorido and @nabsiddiqui shared with me access to their repository containing all the required files, and that I handed them over to @anisa-hawes to allow the publishing team to generate the preview, thanks.

anisa-hawes commented 8 months ago

Hello Giulia @semanticnoodles, Igor @rogorido and Nabeel @nabsiddiqui,

Many thanks for sharing the lesson submission materials with me. I've now checked the Markdown file, and added some key elements of metadata. I've also checked the accompanying images and assets, ensuring each element meets our requirements.

You can find the key files here:

You can review a Preview of the lesson here:

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/visualizing-data-with-r-and-ggplot2

--

A few initial notes:

I've made a slight adjustment to the Header sizes used in the lesson. Our typesetting convention is that ## Header 2 is the largest.
I've added placeholder alt_text + captions for each of your images. We have committed to providing alt-text for all figure images, plots and graphs included in our lessons, so you'll need to add this as part of your revisions. These notes on Descriptive Alt text may be useful to you.
I've checked to ensure that you both have the Write access you'll need to edit your draft directly. We ask authors to work on their own files with direct commits: (we prefer you don't fork our repo, or use the Pull Request system in ph-submissions).
I imagine Giulia @semanticnoodles may have noted this too, but I noticed that you include both a .tsv and a .csv version of the dataset, although only the .csv appears to be used in the lesson. Is the .tsv alternative required too?

anisa-hawes commented 8 months ago

Hello again Igor @rogorido and Nabeel @nabsiddiqui.

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 2: Initial Edit.

In this Phase, your editor Giulia @semanticnoodles will read your lesson, and provide some initial feedback. Giulia will post feedback and suggestions as a comment in this Issue, so that you can revise your draft in the following Phase 3: Revision 1.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 1 <br> Submission
Who worked on this? : Publishing Manager (@anisa-hawes) 
All  Phase 1 tasks completed? : Yes
Section Phase 2 <br> Initial Edit
Who's working on this? : Editor (@semanticnoodles)  
Expected completion date? : April 20
Section Phase 3 <br> Revision 1
Who's responsible? : Authors (@rogorido + @nabsiddiqui) 
Expected timeframe? : ~30 days after feedback is received

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

rogorido commented 8 months ago

@anisa-hawes Thanks for your comments. As for the tsv file: no, it is not required. It can be deleted.

I'll add the alternative captions. Thanks.

rogorido commented 7 months ago

I added captions and alt texts (10a6a9e1b0c9fa794637837338bd7a61b7f6c5d7), but Nabeel should take a look at whether they read 'Englishly' enough...

semanticnoodles commented 7 months ago

Hello @rogorido and @nabsiddiqui,

here follows my preliminary feedback; I am aware it is quite extensive, but I believe these indications could help you strengthen your tutorial. If you need any clarification, please do not hesitate to ask!

Overall feedback

In general, your tutorial provides valuable guidance on navigating and producing a wide range of visualisations, effectively walking through the various features of ggplot2. The piece meets the accessibility and inclusivity goals of the Programming Historian fairly well, and in most cases the language is easy to understand and straightforward. However, some elements need further work, mostly falling under two intertwined aspects discussed in the following paragraphs.

Usability: Enhancing the logical structure of the lesson

In my opinion, this is the most critical point to consider. The tutorial lacks a cohesive element to tie its components together and the organisation of the content could benefit from a more linear and less convoluted approach. The case study you propose (sister cities) seems to be just a tool to obtain a series of visualisations. This is fair enough, but it could benefit from further methodological contextualisation and unpacking: the people following your tutorial may not be historians, nor have a clear understanding of the methods you are using -- although they may be familiar with R.

In terms of improving the overall content, I think there are two possible directions for you to consider: either revising the content to follow a visualisation task-based narrative or placing more emphasis on the structure of the case study. The first option would privilege the visualisation tasks (but still require some methodological support for the case study), while the second would require you to generate stronger and sharper research questions from the case study, to be answered (at least in part) by the visualisation tasks. I think @nabsiddiqui did a very good job of structuring the content in the lesson Data Wrangling and Management in R, so I would recommend keeping that in mind as a reference.

The title of the proposal could benefit from being more specific - or at least mentioning the context of application. The table of contents looks unbalanced: the headings and their actual wording could be better aligned with the content they cover, and the nesting could be more linear.

You give very clear information about the concept of the grammar of graphics - this is really the cornerstone of understanding how ggplot2 is designed. I really appreciate you explaining this and including many useful resources, although I think they could be arranged more organically, instead of including relatively short hints throughout the tutorial, as they tend to overshadow the walkthrough steps on several occasions.

Sustainability: Critically reviewing the data analysis narrative

The dataset looks more than adequate for the visualisation tasks you have set as objectives, but the data narrative and its wording could benefit from further tuning. What you offer in this lesson is mostly visualisation of data distributions and there is little statistical testing involved. As your topic is sister cities, it makes perfect sense to talk about relationships, although what you observe are mostly trends or tendencies that you could try to explain through further research; sometimes you clearly point that out and sometimes it looks rather implicit. I think this is just a matter of fine-tuning the language, nothing more.

Section-specific feedback

Para stands for paragraph number; please refer to the preview generated by @anisa-hawes

Introduction, Lesson Goals and Data

[ ] Para 1, line 2: there is an extra )
[ ] Lesson’s goals could be more specific (you could pick outcomes that have greater resonance than adding meaningful labels to plots)
[ ] No reference to the dataset is presented here (it comes from Wikidata, right?). Make sure you include at least a couple of words about it here.
[ ] Review the heading accordingly with the edits.

ggplot2: General Overview

[ ] This acts more like an introductory section, although it is nested under the previous one. Bring it to the same level as the previous or put it before it to give a more comprehensive introduction (or re-arrange it for better consistency, please).
[ ] A couple of words about the Tidyverse here would better contextualise the workflow.
[ ] Para 7 could be added to the Additional Resources section.
[ ] Para 8 could mention the arguments more strategically – review it for a better alignment with the walkthrough. You could even think of following the official layers featured in the introduction to ggplot2 vignette, adapting that to match with the elements you thoroughly explain.
[ ] Review the heading accordingly with the edits.

Sister cities in Europe

[ ] Please clarify your understanding of sister cities by giving a working definition. This would clarify the starting point of your research.
[ ] The rationale of your case needs some more unpacking; please add some context here, also about the provenance of your dataset.
[ ] The research questions here listed are somewhat aligned with the steps you propose. I would recommend you to review them for enhanced consistency.
[ ] Review the heading accordingly with the edits. Most importantly, from here on you start with the walkthrough. Make sure you clarify this by tuning the headings.

Loading Data with `readr`

[ ] If you referenced the tidyverse above you won’t need to explain tibbles extensively here. Please review this part for conciseness.
[ ] Including head(eudata) could support your explanation about the observations occurring in the dataset – this is also considered good practice in data science.
[ ] Para 16 could benefit the previous section.
[ ] Consider raising the level of this heading and review it accordingly.
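
To illustrate the `head(eudata)` suggestion above, here is a minimal sketch. It builds a small stand-in for the lesson's dataset instead of reading the real file. Only the object name `eudata` and the columns `dist` and `typecountry` are mentioned in this thread; the `origin` and `destination` columns and every value below are invented for illustration.

```r
library(readr)

# Stand-in for the lesson's sistercities.csv. The column names `origin` and
# `destination` and all values are illustrative assumptions, not real data.
csv_text <- "origin,destination,dist,typecountry
City A,City B,600,EU
City C,City D,850,EU
City E,City F,1300,EU
City G,City H,850,Non-EU"

# I() tells readr to treat the string as literal data rather than a file path.
eudata <- read_csv(I(csv_text), show_col_types = FALSE)

# Print the first three observations together with their column types.
head(eudata, 3)
```

Because `read_csv()` returns a tibble, the printed preview also shows each column's type, which may help readers check that `dist` was parsed as a number.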

Creating a bar graph

[ ] IMPORTANT: There is no typecountry column included in your dataset. I tested the walkthrough using the data contained in the eu column, just remember to send us the correct version of the dataset.
[ ] Paras 20-23 could be more focused on the walkthrough; bringing para 23 forward, to the point where the bar plot is obtained, could enhance clarity.
[ ] Para 30 could use a bit more details about the interpretation of the results. If you plan
[ ] Review the heading accordingly with the edits.
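
For reference while testing this section, a bar graph over a `typecountry` column can be sketched as follows. This assumes a data frame with one row per sister-city relationship; the category labels and counts are made up, and this is not necessarily the code used in the lesson.

```r
library(ggplot2)

# Toy data: the `typecountry` column name comes from the checklist above,
# but the category labels and the number of rows are assumptions.
toy <- data.frame(typecountry = c(rep("EU", 6), rep("Non-EU", 2)))

# geom_bar() counts the rows in each category by default.
p_count <- ggplot(toy, aes(x = typecountry)) +
  geom_bar()

# The same bars rescaled to proportions of all rows.
p_prop <- ggplot(toy, aes(x = typecountry)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  labs(y = "proportion")
```

Printing `p_count` or `p_prop` draws the plot. Whichever version a figure uses, its caption should state whether the y-axis shows counts or proportions.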

Other Geoms: Histograms, Distribution Plots and Boxplots

[ ] Para 31, penultimate line: comma missing space afterwards.
[ ] Para 33, please review this for clarity (here you should explain, once and for all, why you used log10, or move the explanation to another spot. Consider explaining why none of the methods is ideal)

This leads to an uninformative histogram. We can take log10(dist) as our variable or filter to exclude values above 5000kms. None of these methods is ideal, but as far as we know, we are operating with manipulated data making it less problematic
[ ] Para 36, please review it for clarity (the reason you employed ECDF is left implicit).
[ ] Para 41, same issue: you refer to ANOVA without explaining why you foresee it as a viable statistical test, cutting the paragraph short.
[ ] Review the heading accordingly with the edits.
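
The two options quoted from para 33, and the ECDF mentioned for para 36, can be compared side by side with a rough sketch like the one below. It uses synthetic distances, not the lesson's data; only the column name `dist` and the 5000 km cut-off come from the quoted paragraph.

```r
library(ggplot2)

# Synthetic distances in km (an assumption): mostly short, three very long.
toy <- data.frame(dist = c(seq(50, 3000, by = 50), 9000, 12000, 15000))

# Option 1: plot log10(dist), which compresses the long right tail.
p_log <- ggplot(toy, aes(x = log10(dist))) +
  geom_histogram(bins = 20)

# Option 2: drop distances above 5000 km before plotting.
short <- subset(toy, dist <= 5000)
p_cut <- ggplot(short, aes(x = dist)) +
  geom_histogram(bins = 20)

# ECDF: for each distance, the share of observations at or below it.
p_ecdf <- ggplot(toy, aes(x = dist)) +
  stat_ecdf()
```

The trade-off is visible here: the log transform keeps every row but changes the units on the x-axis, while the filter keeps kilometres but discards observations (3 of 63 in this toy example). Spelling that out may be one way to explain why neither method is ideal.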

Manipulating the Look of Graphs

[ ] This section would more logically follow the Other Geoms section. Evaluate how to make this and the following sections more cohesive.
[ ] Para 42 could be revised for clarity – especially the research question. Mind that you first performed the random subsampling and then explained it.
[ ] Para 45 does not add much information to the following steps. Instead of pointing out which elements you want to manipulate, consider laying out clearly the goal for your tasks.
[ ] Para 55, review for conciseness (sometimes less is more).
[ ] Review the heading accordingly with the edits.

Scales: Colors, Legends, and Axes

[ ] Para 65, please review for straightforwardness - advantage of using a continuous scale? Also a repetition in the last line (“represent the distance”).
[ ] Para 68, review for accuracy: as phrased, it reads as if ggplot2 does not use discrete colour scales at all.
[ ] Para 70, would better fit in the Additional Resources section.
[ ] Para 74, review for accuracy.
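
On the continuous versus discrete point raised for paras 65 and 68, a minimal sketch may help when rewording: ggplot2 picks a continuous colour scale for numeric variables and a discrete one for categorical variables, and both can be set explicitly. The data and the choice of palettes below are assumptions for illustration only.

```r
library(ggplot2)

# Toy data; all values are made up.
toy <- data.frame(
  x = 1:6,
  y = c(2, 4, 3, 5, 7, 6),
  dist = c(100, 400, 900, 1600, 2500, 3600),  # numeric, so a continuous scale
  typecountry = rep(c("EU", "Non-EU"), 3)     # categorical, so a discrete scale
)

# Continuous colour scale: a gradient mapped to the numeric `dist` values.
p_continuous <- ggplot(toy, aes(x, y, colour = dist)) +
  geom_point() +
  scale_colour_viridis_c()

# Discrete colour scale: one distinct colour per category.
p_discrete <- ggplot(toy, aes(x, y, colour = typecountry)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1")
```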

Faceting a Graph

[ ] This section would be more logically part of the Other Geoms section and use a title anticipating also the theme changes.
[ ] Para 75, review for clarity and conciseness (“split by categories [space time and so]” is not very straightforward. Consider explaining plainly what faceting is.)
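
A plain explanation of faceting could be paired with a small example along these lines: the same plot is drawn once per category, in separate panels that share axes. This is only a sketch with made-up values; `dist` and `typecountry` are the only names taken from this thread.

```r
library(ggplot2)

# Toy data; the distances and category labels are assumptions.
toy <- data.frame(
  dist = c(100, 300, 500, 700, 200, 400, 2000, 2500),
  typecountry = c(rep("EU", 6), rep("Non-EU", 2))
)

# facet_wrap() draws one histogram panel per value of `typecountry`.
p_facet <- ggplot(toy, aes(x = dist)) +
  geom_histogram(bins = 5) +
  facet_wrap(~ typecountry)
```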

Themes: Changing Static Elements

[ ] As with the previous one, this section would more logically follow the Other Geoms section.

Extending ggplot2 with Other Packages

[ ] Para 84, an extra comma prevents the link for Ridgeline plots from rendering.
[ ] As with the previous one, this section would more logically follow the Other Geoms section.

Additional Resources

[ ] Consider reviewing and incorporating other elements into this section, following more closely the tools used in the tutorial instead of pointing towards general-purpose resources. A critical list of resources would be more useful to your readers.

Format & style

Two quick comments on the form and style.

[ ] Please homogenise the use of capitalisation in the headings (with the exception of ggplot2, which is always lowercase, but you know that 😄)
[ ] Please homogenise the way you refer to R functions and arguments – using the code format or not, you choose. Consistency is the only requirement.

Thank you for the great work done so far!

rogorido commented 7 months ago

@semanticnoodles thanks for your extensive comments. I will have a look at the enhancements you're proposing in the next few days.

anisa-hawes commented 7 months ago

What's happening now?

Hello Igor @rogorido and Nabeel @nabsiddiqui. Your lesson has been moved to the next phase of our workflow which is Phase 3: Revision 1.

This Phase is an opportunity for you to revise your draft in response to @semanticnoodles's initial feedback. You can make direct commits to your file here: /en/drafts/originals/visualizing-data-with-r-and-ggplot2.md. @charlottejmc or I are here to help if you encounter any practical problems!

When both of you + Giulia are happy with the revised draft, we will move forward to Phase 4: Open Peer Review.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 2 <br> Initial Edit
Who worked on this? : Editor (@semanticnoodles) 
All Phase 2 tasks completed? : Yes
Section Phase 3 <br> Revision 1
Who's working on this? : Authors (@rogorido + @nabsiddiqui)  
Expected completion date? : May 17
Section Phase 4 <br> Open Peer Review
Who's responsible? : Reviewers (TBC) 
Expected timeframe? : ~60 days after request is accepted

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

semanticnoodles commented 6 months ago

Hello Igor @rogorido and Nabeel @nabsiddiqui, I hope you are doing well!

Just checking in with you about the draft revision (Phase 3 / Revision 1) as the deadline of the 17th of May has passed. If you need some extra time let me know approximately how much, so we can set up a new deadline -- and @anisa-hawes or @charlottejmc can update the Mermaid timeframe.

If you have doubts or need any clarification, please do not hesitate to keep in touch.

nabsiddiqui commented 6 months ago

Hello @semanticnoodles,

I have tried to rework a lot of the tutorial. I feel that changing some of the headings will make the flow more obvious. Let me know if it makes sense the way I have done it or if there should be additional changes. Here are the items I reviewed based on your list. The rest I will leave to @rogorido unless he has an objection:

Introduction, Lesson Goals and Data

[X] Para 1, line 2: there is an extra )
[X] Lesson’s goals could be more specific (you could pick outcomes that have greater resonance than adding meaningful labels to plots)
[X] No reference to the dataset is presented here (it comes from Wikidata, right?). Make sure you include at least a couple of words about it here.
[X] Review the heading accordingly with the edits.

ggplot2: General Overview

[X] This acts more like an introductory section, although it is nested under the previous one. Bring it to the same level as the previous or put it before it to give a more comprehensive introduction (or re-arrange it for better consistency, please).
[X] A couple of words about the Tidyverse here would better contextualise the workflow.
[X] Para 7 could be added to the Additional Resources section.
[X] Para 8 could mention the arguments more strategically – review it for a better alignment with the walkthrough. You could even think of following the official layers featured in the introduction to ggplot2 vignette, adapting that to match with the elements you thoroughly explain.
[X] Review the heading accordingly with the edits.

Sister cities in Europe

[X] Please clarify your understanding of sister cities by giving a working definition. This would clarify the starting point of your research.
[X] The rationale of your case needs some more unpacking; please add some context here, also about the provenance of your dataset.
[X] The research questions here listed are somewhat aligned with the steps you propose. I would recommend you to review them for enhanced consistency.
[X] Review the heading accordingly with the edits. Most importantly, from here on you start with the walkthrough. Make sure you clarify this by tuning the headings.

Loading Data with `readr`

[X] If you referenced the tidyverse above you won’t need to explain tibbles extensively here. Please review this part for conciseness.
[ ] Including head(eudata) could support your explanation about the observations occurring in the dataset – this is also considered good practice in data science.
[X] Para 16 could benefit the previous section.
[X] Consider raising the level of this heading and review it accordingly. (Felt it was better at this level)

Creating a bar graph

[ ] IMPORTANT: There is no typecountry column included in your dataset. I tested the walkthrough using the data contained in the eu column, just remember to send us the correct version of the dataset.
[ ] Paras 20-23 could be more focused on the walkthrough; bringing para 23 forward, to the point where the bar plot is obtained, could enhance clarity.
[X] Para 30 could use a bit more details about the interpretation of the results. If you plan
[X] Review the heading accordingly with the edits.

Other Geoms: Histograms, Distribution Plots and Boxplots

[X] Para 31, penultimate line: comma missing space afterwards.
[X] Para 33, please review this for clarity (here you should explain, once and for all, why you used log10, or move the explanation to another spot. Consider explaining why none of the methods is ideal)

This leads to an uninformative histogram. We can take log10(dist) as our variable or filter to exclude values above 5000kms. None of these methods is ideal, but as far as we know, we are operating with manipulated data making it less problematic
[X] Para 36, please review it for clarity (the reason you employed ECDF is left implicit).
[X] Para 41, same issue: you refer to ANOVA without explaining why you foresee it as a viable statistical test, cutting the paragraph short.
[X] Review the heading accordingly with the edits.

Manipulating the Look of Graphs

[X] This section would more logically follow the Other Geoms section. Evaluate how to make this and the following sections more cohesive.
[X] Para 42 could be revised for clarity – especially the research question. Mind that you first performed the random subsampling and then explained it.
[X] Para 45 does not add much information to the following steps. Instead of pointing out which elements you want to manipulate, consider laying out clearly the goal for your tasks.
[X] Para 55, review for conciseness (sometimes less is more).
[X] Review the heading accordingly with the edits.

Scales: Colors, Legends, and Axes

[X] Para 65, please review for straightforwardness - advantage of using a continuous scale? Also a repetition in the last line (“represent the distance”).
[X] Para 68, review for accuracy: as phrased, it reads as if ggplot2 does not use discrete colour scales at all.
[X] Para 70, would better fit in the Additional Resources section.
[X] Para 74, review for accuracy.

Faceting a Graph

[X] This section would be more logically part of the Other Geoms section and use a title anticipating also the theme changes.
[X] Para 75, review for clarity and conciseness (“split by categories [space time and so]” is not very straightforward. Consider explaining plainly what faceting is.)

Themes: Changing Static Elements

[X] As with the previous one, this section would more logically follow the Other Geoms section.

Extending ggplot2 with Other Packages

[X] Para 84, an extra comma prevents the link for Ridgeline plots from rendering.
[X] As with the previous one, this section would more logically follow the Other Geoms section.

Additional Resources

[ ] Consider reviewing and incorporating other elements into this section, following more closely the tools used in the tutorial instead of pointing towards general-purpose resources. A critical list of resources would be more useful to your readers.

Format & style

Two quick comments on the form and style.

[X] Please homogenise the use of capitalisation in the headings (with the exception of ggplot2, which is always lowercase, but you know that 😄)
[x] Please homogenise the way you refer to R functions and arguments – using the code format or not, you choose. Consistency is the only requirement.

Other

[ ] Change Title to be More Descriptive

anisa-hawes commented 5 months ago

Thank you, @nabsiddiqui!

@semanticnoodles will review these revisions and advise if we are ready to move onwards to the next Phase of the workflow (which will be Phase 4 Open Peer Review). Giulia is away this week, returning on June 3rd.

In the meantime, @charlottejmc and I can help with ensuring that functions and arguments are typographically consistent. These are aspects we always check as part of typesetting at Phase 6, but we'll do a quick scan now so that this isn't a distraction for Reviewers.

charlottejmc commented 5 months ago

Hello @nabsiddiqui and @semanticnoodles,

I've made some adjustments to add backticks to functions, arguments and other parts of code, trying to stay consistent with our house style.

semanticnoodles commented 5 months ago

Hello everybody, I am back! While I was away I got the chance to go through the tutorial and I can say you did upgrade the lesson quite a lot. Brilliant work @nabsiddiqui and @rogorido -- and many many thanks to @charlottejmc and @anisa-hawes for their support!

I will do another quick read-through, as I think I spotted a couple more small things to fix, but I believe it is now almost ready to move onwards to Phase 4. Sorry for the slight delay in my answer -- I will get back to you in a few hours.🖥

semanticnoodles commented 5 months ago

It took longer than expected (hours became days..). Nevertheless, if @rogorido and @nabsiddiqui can quickly fix the elements in the list below I believe we can move to the open peer review (Phase 4). The most urgent is the first item; the rest are simple formalities/typos.

[x] remember to upload the correct dataset containing the typecountry column
[x] paras 86-88 are missing list formatting
[x] Conclusion, paras 126/127 display several contracted forms, e.g. "you'll" -- please expand them.
[x] para 133 extra space before the dot
[x] paras 136/137 missing end dot
[x] para 141, typo: "epxlore" should read "explore"

Thank you for the patience!

rogorido commented 5 months ago

@semanticnoodles (and @nabsiddiqui): I have already corrected all typos (I hope). And I have the correct dataset. But my question is: where exactly should I upload it?

Many thanks for your work!

charlottejmc commented 5 months ago

Hello @rogorido, thank you for making these corrections.

You can replace the current sistercities.csv file in your lesson's associated assets folder, here.

If you prefer, however, you could send the file directly to me (publishing.assistant[@]programminghistorian.org) and I can upload it for you.

Thank you!

rogorido commented 5 months ago

@charlottejmc Thanks for your answer. I have uploaded it to the assets folder. I hope everything is OK now... (commit: 387fdd9)

anisa-hawes commented 5 months ago

Hello Igor @rogorido and Nabeel @nabsiddiqui,

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 4: Open Peer Review.

This phase is an opportunity for you to hear feedback from peers in the community.

Giulia @semanticnoodles will invite two reviewers to read your lesson/translation, test your code, and provide constructive feedback. In the spirit of openness, reviews will be posted as comments in this issue (unless you specifically request a closed review).

After both reviews, Giulia will summarise the suggestions to clarify your priorities in Phase 5: Revision 2.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 3 <br> Revision 1
Who worked on this? : Authors (@rogorido + @nabsiddiqui)
All  Phase 3 tasks completed? : Yes
Section Phase 4 <br> Open Peer Review
Who's working on this? : Reviewers (@justinwigard + @regan008)
Expected completion date? : August 31
Section Phase 5 <br> Revision 2
Who's responsible? : Authors (@rogorido + @nabsiddiqui)
Expected timeframe? : ~30 days after editor's summary

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

rogorido commented 5 months ago

@anisa-hawes Thanks. No problem with the comments being posted here.

semanticnoodles commented 4 months ago

Open Peer Review

During Phases 2 and 3, I provided initial feedback on this lesson, then worked with Igor @rogorido and Nabeel @nabsiddiqui to complete a first round of revisions. In Phase 4 Open Peer Review, we invite feedback from others in our community.

Welcome to Justin Wigard @justinwigard and Amanda Regan @regan008! By participating in this peer review process, you are contributing to the creation of a useful and sustainable technical resource for the whole community. Thank you ✨ Please read the lesson, test the code, and post your review as a comment in this issue by August 31st.

Reviewer Guidelines:

https://programminghistorian.org/en/reviewer-guidelines

A preview of the lesson:

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/visualizing-data-with-r-and-ggplot2

Notes:

All participants in this discussion are advised to read and be guided by our shared Code of Conduct
Members of the wider community may also choose to contribute reviews.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

Programming Historian in English is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience.

We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudsperson Dr Ian Milligan. Thank you for helping us to create a safe space.

nabsiddiqui commented 4 months ago

Wonderful! Thank you @semanticnoodles for organizing this.

Thank you also to @justinwigard and @regan008 for agreeing to serve as reviewers. I have enjoyed the research and scholarship of both of you and look forward to your comments. Please let me know if you need anything in the meantime.

regan008 commented 2 months ago

@semanticnoodles I'm running late with this, but I promise to get to it this week!

semanticnoodles commented 2 months ago

Hi @regan008, thanks for keeping us posted!

rogorido commented 2 months ago

Thanks, @regan008!

regan008 commented 2 months ago

@semanticnoodles Thanks again for asking me to review this -- it was a pleasure to read and engage with. @rogorido and @nabsiddiqui, I can't wait to assign this lesson to my own students. It is an excellent overview of the ggplot package and the concepts related to the Grammar of Graphics. Congratulations!

I highly recommend publication. I do think the lesson loses the reader slightly between paragraphs 56-60. I would recommend taking another look at those to see if they can be massaged to be a bit more in line with the skill level for the rest of the lesson. And lastly, I think perhaps in the geom section, the authors should briefly discuss line charts. I'm not sure the data will make that easy, but I think many readers will be historians who will want to look at change over time in their charts.

Here are some more detailed line-level comments:

Typo p9. Also genuine question – aren’t there more plotting packages than those 3? I’m thinking of something like Plotly’s package or Dygraphs. Do those build on top of ggplot, and that’s why you don’t include them?
P15 (point 5) – I might say that it can also be used to create maps and remove GIS from this sentence. I don’t think it does the sophisticated types of terrain maps that Geographic Information Systems implies.
P24 (point 7) – It might be useful here to make it clear that coordinate systems (I think) only apply when you are building maps. It's not immediately clear you are talking about maps in this point.
You’ve done a really good job of adding links out to other concepts like tidy data and exploratory data analysis. Your explanations give the reader just enough to understand and point them to where they can learn more.
Again, the tibble section is a masterclass in explaining a big concept in a simple way. Great job.
P39 – I think the figure caption for fig 1 is mixed up. This chart appears to show the count of locations, not the total percentage.
Maybe link out in p60 where ECDF is mentioned.
P56-60 – I had to read this several times to understand what exactly was being plotted in these charts because both axes are numeric. I think just a bit more description of what you are trying to visualize might help readers who are newer to data visualization. I get what you are doing here, but I think some further explanation would help beginners.
Overall, I think the ECDF section combined with the Histogram section really levels up this lesson. I see their utility, but I wonder if they are rocketing this to another level for readers? Perhaps you need one more basic example in here. As a historian, I think perhaps a line chart might be useful for those who want to display change over time?
Lastly, considering you don’t actually cover maps (which is fine, there’s a lot to cover here already), I think you need to explicitly state that in p15. I started the lesson thinking that it was likely a topic given that the data had coordinates.

Please let me know if you have any questions or need anything else from me. I look forward to promoting this lesson once it is live!

semanticnoodles commented 2 months ago

Thank you, Amanda @regan008, for your insightful review and enthusiasm! I will hold off on posting my wrap-up until both the reviews are published, but this is excellent!

justinwigard commented 2 months ago

Overall, I think this tutorial is excellent. I’m likewise hoping to assign this tutorial in my own classes, and even learned a few new ways to think about my own approach to working with data!

I highly recommend publication. I have divided my review into three brief sections based on the Programming Historian review guidelines: Surface, Functional, and Code. I know my review has a lot of little items, but they’re primarily lightweight and surface, aimed at just tightening up an already streamlined tutorial. Take or leave as needed, I trust y’all’s judgement.

One further point I wanted to highlight: while working through the code, I noticed that my own visualizations differed slightly from those in the tutorial, whether that was due to a slightly updated dataset or due to something on my side. I flagged those in the Functional section and the Code section primarily, just to get a second set of eyes on them – functionally, everything works, it’s just a few odd visual discrepancies! The attached Appendix files demonstrate what visualizations are happening on my side.

What a great submission. Please let me know if I can help further, and so looking forward to sharing it when things are completed!

Surface:

¶ 1: I like this opening move. I wonder if “communicate” might work as a more inviting phrase than “publish” findings?
-- Given the phrasing in ¶ 28, I do think that “communicate” or something broader than “publish” might be better here, especially as ¶ 28 talks about using an exploratory approach.
¶ 9: “there” might be able to be taken out
¶ 11: the closing parenthesis in “detailed below )that” just needs to be attached to the word “below”
¶ 16: Looks like an extra underscore is included.
¶ 19 or 21: Is there a way to make the connection that “geoms” are “Geometric Objects”, as in geoms (Geometric Objects)? I’m not sure if that should be spelled out more clearly in Aesthetics or in Geometric Objects, but it just struck me that “geoms” is used before Geometric Objects are introduced here.
¶ 24: Should “systems” be capitalized to match item #4, “Geometric Objects”?
¶ 42: “standardized grammar of grammar” should be changed to “standardized grammar of graphics” I think.
¶ 44: Has the tutorial on dplyr been mentioned? The phrase “again” is a little confusing here, and makes me think I’ve missed something.
¶ 49: I think there’s a sentence that was unfinished, potentially? “…the column for different bars, and We also added”
Throughout: is it “sister city” or “sister-city”? Both versions are used, so choosing one or the other would be helpful for readers, I think!
¶ 83: “We” can probably be uncapitalized, just to tidy this conclusion up.
¶ 95: “to represent the distance” is repeated twice here
¶ 113: “themes” in sentence two can be capitalized.
Why even apply themes? I understand the appeal, but readers might want to know more about this. There’s some strong explanations of why we should change size, shape, and other aesthetics of the data we’re visualizing, so I think you also have a great opportunity to teach us about why even consider themes.
¶ 118: Couple of small typos here that could be addressed to increase readability that much more.
-- “massive [amount of] extensions”?
-- “graphsridgeline plots” – think there just needs to be a comma separating those.

Functional:

¶ 26: Do the packages in tidyverse need to be spelled out in a brief list, since multiple tidyverse packages will be used in the tutorial?
¶ 26: Can you include a link to Wikidata’s home page? I love that y’all are connecting and using Wikidata, so I think general users might appreciate having a link to it. Might also be useful to include a short phrase explaining what Wikidata is for those unfamiliar.
¶ 30: Is there a way to highlight or offset the sentence on “Downloading the dataset” to be less skippable?
¶ 31: Have y’all thought about including a screenshot under eudata showing what the tibble should look like? Might not be needed, but I figured I’d check!
¶ 31: Additionally, upon loading in eudata, I have a tibble of 13,081 x 15, but the tutorial only shows 13,081 x 12. Is that to be expected?
¶ 40-44: I love the explanation of ggplot2’s syntax! Quite effective and conversational, a good strong point.
¶ 57: “Explore the help page”. I really like how this directs out to the help page so people can learn. But, could you give the reader a couple of sample binwidths that might change or modify the histogram in a meaningful way? What would be a better value here? That might help generate a targeted skill test or exploration.
¶ 66: I really like the analysis of Figure 6, but I have two questions:
-- 1) Is there a particular reason Figure 6 looks different than my own application of the code? I have copied and pasted as-is. The other figures up to this point have worked perfectly and look exactly as in the tutorial. (see attached Appendix A)
-- 2) Would it be helpful to provide a counter-example to Germany here? How should we read Portugal’s relationship, or Bulgaria, to sister-cities?
¶ 72, Figure 7: It looks as though I may be working with an outdated dataset, as my graph looks close to Figure 7, but does not match up quite right. Additionally, I received a warning that 930 rows are missing values, so I’m wondering if this has caused the discrepancy between figures 6 and 7. – See Appendix B.
¶ 102-108: This is maybe the only part of the tutorial that I’m a little confused on. Should part of 103, the explanation of the new plot p2 and its various functions, be moved up to about line 97, or maybe even just before that figure, at 95 or 96? The breakdown of Figure 14’s aesthetic modifications feels great at 103, but as a reader, it’s a little jarring to have an explanation of p2 so far after we create it in the sequence.
104: Really appreciate this sentiment about consistency across observed patterns strengthening confidence. A real strong rhetorical move and great explanation of why we’re doing all of these different approaches!

Code:

¶ 38: Should the + sign be on the line with the first bit of code? Copying it over as-is requires users to modify the code, so it may trip up some users.
Figures 8-onward: is “size = 3” necessary on these? When I ran the code, the points seemed a little overly large and like they may have muddied the graph. Again, could be how my own version of RStudio is set up, so, take it with a grain of salt.
-- See Appendix C.

Assorted thoughts

I really like how conversational the explanations are. Having learned ggplot2 and R not too long ago, I think this tutorial is doing a nice job of explaining how ggplot2 is powerful and easing readers into it.
Loading Data with readr: This section was so clear and helpful. There’s a nice balance of explaining basics and walking through some additional information.

Wigard_ProgrammingHistorian_AppendixA_2024

semanticnoodles commented 2 months ago

Thank you so much Justin @justinwigard for such a detailed review, it is fantastic. I cannot wait to get started wrapping up the brilliant points you and @regan008 raised!

nabsiddiqui commented 2 months ago

Thank you @justinwigard and @regan008. @rogorido and I will begin working on this soon.

rogorido commented 2 months ago

@justinwigard and @regan008 Thank you very much for the detailed corrections!

semanticnoodles commented 1 month ago

Hi @rogorido and @nabsiddiqui, here is my review/feedback summary (it took a while); thanks a million @regan008 and @justinwigard for all the food for thought and complementary feedback you provided! Both of you highly recommend the lesson for publication 🎉🎉🎉: @regan008 appreciates particularly the explanations about the tibbles and the Grammar of Graphics; on the other hand, @justinwigard appreciates the engaging tone of the lesson and the way it explains the potential of ggplot2.

Here is a quick recap of the core elements you highlight, which I recommend @rogorido & @nabsiddiqui go through carefully.

Notes on Amanda’s feedback

@regan008 makes some detailed comments about typos and potential clarifications (e.g., on plotting packages and coordinate systems), and suggests linking out where ECDF is mentioned and clarifying the contents of paras 56-60. She also notes that while maps are mentioned, the lesson does not cover them explicitly (this might be a chance to link to Using Geospatial Data to Inform Historical Research in R).

There may be an opportunity to use additional line charts, as @regan008 suggests, but this would require further transformations or brand new additions, e.g. using long/lat or population size between sister cities. The structure of the lesson works, and I would like you to prioritise the refinements she suggests rather than adding brand new extensions. She makes a good point, but please only add additional data filtering/visualisation if you have time to devote to the task.

Notes on Justin’s feedback

@justinwigard highlights a number of areas where the lesson is already strong, as well as offering thoughtful suggestions for improvement under the four sections he articulated. The minor typographical and grammatical suggestions, as well as the consistency of the sister cities spelling and the introduction of geoms, certainly require your attention.

On a functional level, he notes that some additional context could be helpful for readers unfamiliar with the tidyverse or Wikidata. He notes that providing counter-examples alongside some of the figures, like Figure 6, could help readers compare different cases, and suggests adding more references on the choice of binwidth (very often a rule of thumb, in my experience). He additionally suggests listing the tidyverse packages explicitly, including a link to Wikidata, and making the line about the dataset download more evident. He also suggests incorporating a screenshot to show how the tibble should appear after loading (I believe I previously suggested something similar, like running head(eudata); it might really be worth getting a screenshot). Many more technical insights from his side follow, and I suggest you look at them carefully.

Again, as I noted in Amanda's feedback, please focus on refinement/consolidation first, and then consider expanding your lesson further.

A few extras

Here are a few extra comments from my side, mostly technically oriented. Following @justinwigard's notes, I ran all the code to see if I could provide some extra technical feedback (using R version 4.3.0 [2023-04-21] on my RStudio version Cranberry Hibiscus, 2024.9.0.375).

The tibble size is in fact 13081 x 15, with the following colnames (I believe the index X could be removed from the dataset).

> colnames(eudata)
 [1] "X"                       
 [2] "origincityLabel"         
 [3] "origincountry"           
 [4] "originlat"               
 [5] "originlong"              
 [6] "originpopulation"        
 [7] "sistercityLabel"         
 [8] "destinationlat"          
 [9] "destinationlong"         
[10] "destinationpopulation"   
[11] "destination_countryLabel"
[12] "dist"                    
[13] "eu"                      
[14] "samecountry"             
[15] "typecountry"
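
Removing the index column mentioned above could look roughly like the following. This is a minimal sketch in base R on a made-up three-row stand-in; `eudata_toy` and its values are hypothetical, and only the column names come from the listing above.

```r
# Hypothetical stand-in for eudata: an index column X plus two real columns.
eudata_toy <- data.frame(
  X = 1:3,
  origincityLabel = c("City A", "City B", "City C"),
  dist = c(120.5, 842.0, 37.2)
)

# Keep every column except the index column X.
eudata_clean <- eudata_toy[, setdiff(colnames(eudata_toy), "X")]

colnames(eudata_clean)
```

With dplyr loaded, `select(eudata_toy, -X)` would be an equivalent way to write the same step.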

The overall code formatting for several chunks is a bit weird: if you could remove the extra spaces or check for returns that break the code (e.g. paras 38, 44, etc.), I believe it would make things easier for end users. I realise this is probably not your doing, and it might depend entirely on the style format packages/export to .md.
para 64: it’s eudata.filtered (eudata missing filtered).
para 71: the y axis goes up to 15 max and I also get the warning message `Warning: Removed 956 rows containing missing values (geom_point())`, and the same happens with the codeblock in para 73. My plots look just like the ones from @justinwigard.
Last but not least, please consider using a more specific title for this tutorial, like Visualizing Distributions and Relationships with R and ggplot2 (or something more task-specific).
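
As a rough illustration of where the row count in the warning noted for para 71 can come from: ggplot2 reports the rows it cannot draw, which includes rows with `NA` in a plotted column and rows that fall outside explicitly set axis limits. The sketch below counts both cases in base R on invented data; `toy`, its columns, and the limit of 15 are assumptions for illustration, not the lesson's code or data.

```r
# Invented data: two rows have a missing value, one row exceeds the y limit.
toy <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(3, NA, 7, 20, 9)
)
y_limit <- 15  # hypothetical upper axis limit

# Rows with a missing value in either plotted column.
has_na <- is.na(toy$x) | is.na(toy$y)

# Rows that are complete but fall above the axis limit.
out_of_range <- !has_na & toy$y > y_limit

# Total rows that a scatter plot with this limit would not draw.
n_dropped <- sum(has_na | out_of_range)
n_dropped
```

Because the lesson plots a random sample, the exact count in the warning may differ from one run to the next.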

A huge thank you for all your patience and hard work!🌟

anisa-hawes commented 1 month ago

Hello Igor @rogorido and Nabeel @nabsiddiqui,

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 5: Revision 2.

This phase is an opportunity for you to revise your draft in response to the peer reviewers' feedback.

Giulia @semanticnoodles has summarised their suggestions, but feel free to ask questions if you are unsure.

Please make revisions via direct commits to your file: /en/drafts/originals/visualizing-data-with-r-and-ggplot2.md. @charlottejmc and I are here to help if you encounter any difficulties.

When you and Giulia are all happy with the revised draft, the Managing Editor @hawc2 will read it through and provide additional feedback/suggestions as necessary before we move forward to Phase 6: Sustainability + Accessibility.

timeline
Section Phase 4: Open Peer Review
Who worked on this? : Reviewers (@justinwigard + @regan008)
All Phase 4 tasks completed? : Yes
Section Phase 5: Revision 2
Who's working on this? : Authors (@rogorido + @nabsiddiqui)
Expected completion date? : October 24
Section Phase 6: Sustainability + Accessibility
Who's responsible? : Publishing Team
Expected timeframe? : 7~21 days

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

rogorido commented 1 month ago

@semanticnoodles thanks for your review/feedback. We will make all corrections in the next few days.

rogorido commented 1 month ago

@regan008 and @justinwigard: Many thanks again for your comments and corrections. I have added many of them (cdfc89f413a9049cdd7a2120b410e9189238d1a8), and @nabsiddiqui and I still need to think about two or three of the changes you propose, which may have more profound consequences for the tutorial.

In any case, just some comments:

@justinwigard:

the differences between your graphs and the graphs in the tutorial come (as far as I can see) from the fact that we use sample_frac(), which takes a random sample out of the data. We should add a warning for the reader...
As for your question: “Would it be helpful to provide a counter-example to Germany here? How should we read Portugal’s relationship, or Bulgaria, to sister-cities?” No. We give the reader some hints for analysis, but this is not a tutorial about sister-city relationships; it is about using ggplot2 to analyze and visualize them.

@regan008:

as for other packages: plotly was created mainly for Python and nowadays has extensions for R, Julia, etc. As far as I know, it is not used very much in R in comparison to 'native' solutions like ggplot (see here the number of stars on GitHub, for instance). dygraphs is also rather an interface to the dygraphs JavaScript library and nothing 'R-native';
you are right: a line chart to show change over time would be the best for historians. Unfortunately it is not easy (if it is possible at all) to extract this kind of information from Wikidata for the data we are working with.
You are right about maps, GIS, etc. I have tried to make explicit that we do not cover maps in this lesson (maybe someone can point to a lesson about this topic in PH?).

In any case, we will keep working on some of your comments (@semanticnoodles). Many thanks again.

anisa-hawes commented 1 month ago

Thank you for your work so far, Igor @rogorido and Nabeel @nabsiddiqui ✨

Please let Giulia @semanticnoodles know when you feel you've completed the revisions. She will read through the draft again to confirm that she's satisfied with the suggestions integrated.

rogorido commented 1 month ago

@anisa-hawes yes will do it!

nabsiddiqui commented 1 month ago

Hello @rogorido, @anisa-hawes, and @semanticnoodles,

Igor and I have added our edits, and I believe that we are all set to move to the next stage now.

anisa-hawes commented 1 month ago

Thank you, @nabsiddiqui and @rogorido.

Giulia @semanticnoodles will read through your revisions later this week, and advise if she feels any further adjustments are needed.

After that, Alex will read it through and share additional feedback/suggestions as necessary.

When both Giulia and Alex are happy, we will move forward to Phase 6: Sustainability + Accessibility which will begin with copyediting 🙂

rogorido commented 1 month ago

@anisa-hawes OK, many thanks!

semanticnoodles commented 1 week ago

Hello @rogorido & @nabsiddiqui,

I apologise for the delay in posting this feedback. I have been going through the whole lesson again with @justinwigard's and @regan008's comments at hand. I think you have done a wonderful job of polishing the lesson, and we are almost ready for Phase 6! 🎉

Please review the following points and we will be ready to move on - looking forward to seeing this brilliant lesson of yours available to the PH audience!

General Comments

[ ] Missing rows warning: so that readers do not freak out when they encounter `Warning: Removed xyz rows containing missing values (geom_point())`, can you spend a line or so saying that they do not have to worry?
[x] Title: I understand that it might not be easy, but as I mentioned in my previous comment, I would like you to think about whether the title could be improved to be more informative for the PH audience – you are doing much more than teaching how to plot graphs here! Something like Exploring and Visualizing Data in R with ggplot2 might already make the difference, but you can consider referring to the grammar of graphics or anything that (and massive thanks @anisa-hawes for the brainstorming session on this):
- relates to your dataset.
- better clarifies the scope of the tools you are using.

Paragraph-specific comments

[x] ¶ 25: link missing a [ to be rendered
[x] ¶ 28-30: following @justinwigard observation, to make the dataset download less skippable can you put at the end of paragraph 28:

You can download the dataset at [this link](https://github.com/programminghistorian/ph-submissions/tree/gh-pages/assets/visualizing-data-with-r-and-ggplot2/sistercities.csv).

and then in paragraph 30 change the phrasing to:

Let’s go ahead and place the dataset in our project’s current working directory.
[x] ¶ 38: The paragraph seems a little messed up (trimmed?). Please check it.
[x] ¶ 39 (@regan008 ’s): I think the figure caption for Fig 1 is mixed up. This chart appears to show the count of locations not the total percentage.
[x] ¶ 44: instead of “tutorial” can you please use its full name (Data Wrangling and Management in R)?
[x] ¶ 49 (@justinwigard ’s): I think there’s a sentence that was unfinished, potentially? “…the column for different bars, and We also added”:

We passed a new parameter to the ggplot() command named fill, indicating the column for the bars. We also added…

Here I believe you meant something like “We mapped the origincountry column to the fill aesthetic in the ggplot() command, which defines the color range of the bars. We also added…”
[x] ¶ 64: in the code chunk it’s eudata.filtered (eudata missing filtered).
[x] ¶ 117: The Wallstreet Journal -> The Wall Street Journal

rogorido commented 1 week ago

@semanticnoodles Thanks a lot for your comments. We will work on your corrections and I hope we will be ready in 2-3 days.

nabsiddiqui commented 2 days ago

Hello @semanticnoodles. @rogorido and I have finished our edits. I have set a seed in the R code to allow for reproducibility. I have also updated the images to reflect the sample data the user will get due to the seed.
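
The effect of setting a seed before sampling can be sketched as follows. This is a minimal illustration, assuming the dplyr package is installed; the seed value 123, the 10% fraction, and the `toy` data frame are placeholders rather than the values used in the lesson.

```r
library(dplyr)

# Placeholder data: 100 numbered rows.
toy <- data.frame(id = 1:100)

# Fixing the seed before each draw makes sample_frac() return the same rows.
set.seed(123)
first_draw <- sample_frac(toy, 0.1)

set.seed(123)
second_draw <- sample_frac(toy, 0.1)

identical(first_draw$id, second_draw$id)
```

Without the `set.seed()` calls, two draws would generally differ, which is consistent with the discrepancies between the reviewers' plots and the tutorial figures reported earlier in this thread.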

For the title, we were thinking perhaps "From Historical Data to Visual Analytics: The Grammar of Graphics in Practice"? I don't know what would be needed to change the title since the folders are based on the title. I am sure @anisa-hawes can help. Look forward to moving this ahead.

anisa-hawes commented 2 days ago

Thank you, @nabsiddiqui. Yes, of course we can help with the practicalities of adjustments to any file and directory names.

However, I think what Giulia @semanticnoodles is aiming towards is finding a title that is more specific. Fundamentally, we want to help readers find lessons that meet their learning goals. A clear title facilitates discovery through search, and offers a quick, basic sense of what can be learned.

Reviewing our lesson directory, I think the most successful titles generally comprise:

a verb or a noun which defines the main learning activity, method or process: Transcribing, Analysing, Visualising, Mapping, Text Mining, Facial Recognition
the kind of data readers will handle in the lesson: YouTube comment data, historical photographs, OCR text files
the names of key tools, software libraries or programming languages readers will use: R, Python, Neo4j, OpenRefine, SPARQL.

The current title is: Visualizing Data with R and ggplot2. Giulia has suggested the subtle adjustment: Exploring and Visualizing Data in R with ggplot2.

I was wondering whether your title could clarify what kind of data readers are handling with these methods? The concept of Sister Cities is mentioned but what are you describing in general: demographic data? geographical/spatial data? ('mixed' data? - is the fact that you are selecting methods to visualise a range of different data types the key? 🤔)

My sense is that an effective lesson title is usually simple and succinct. So, I think I'd suggest avoiding the colon and compound structure (more often encountered in an expanded research article title) and focusing on providing straightforward keys to the lesson.

rogorido commented 5 hours ago

@anisa-hawes After talking with @nabsiddiqui, I think we will stick with the title proposed by Giulia.

hawc2 commented 2 hours ago

To @anisa-hawes' point, it would be nice to clarify what type of data this lesson teaches how to visualize. Would it be fair to label it "Demographic Data"?

nabsiddiqui commented 4 minutes ago

I think it is more mixed data since some of it is about the cities themselves and some of it is about the demographics of the city.

I like "Exploring and Visualizing Mixed Data in R with ggplot2".

@rogorido is this ok with you?
