programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Scalable Reading of Structured Data (PH/JISC/TNA) #419

Closed tiagosousagarcia closed 2 years ago

tiagosousagarcia commented 3 years ago

The Programming Historian has received the following tutorial on 'Scalable Reading of Structured Data' by @maxodsbjerg, Helle Strandgaard Jensen, Josephine Møller Jensen, Alexander Ulrich Thygensen. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/scalable-reading-of-structured-data

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as interim editor for the review process, until a permanent editor is assigned. The role of the editor is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.

Our dedicated Ombudsperson is (Ian Milligan - http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.

tiagosousagarcia commented 2 years ago

@maxodsbjerg, could I ask you to post the following on this thread, when you get a chance?

I the author|translator hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

maxodsbjerg commented 2 years ago

Yes of course!

I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

svmelton commented 2 years ago

Hi @tiagosousagarcia! Thanks again for setting up the ticket. I noticed that the preview seems a bit off—the lesson should be displaying like this one. Let me know if you would like any help troubleshooting!

drjwbaker commented 2 years ago

Thanks @svmelton for the note and @jenniferisasi for fixing!

drjwbaker commented 2 years ago

I've made some edits https://github.com/programminghistorian/ph-submissions/commit/e4b2b4eace8d2a947a4f7682f518afdc89e9bfba#diff-83b53202f002a488c2e8a75ab1de9e95e1871fc1a19b3c70287ef848974fbac7 on l1-l117 with comments below:

I'll pick up on the rest of the article later.

I note that you are writing in English as a second language, which will be taken into account during peer review. If the article passes through peer review, copyediting will focus on ensuring the article meets our Write for a Global Audience guidelines.

drjwbaker commented 2 years ago

Finished now! Next set of edits/comments:

drjwbaker commented 2 years ago

To summarise, there is a kernel of a good article here, but it needs to hold the hand of the reader a little more, especially as a) the tutorial is intended as introductory and b) the tutorial attempts to allow the reader to follow multiple pathways.

So, firstly, these pathways need to be made clearer.

And secondly - and perhaps more importantly - the article needs to assume less knowledge, either by pointing the reader to things that explain new terms/concepts, or being more explicit about what the reader should do: the latter is particularly acute in the 'Data and Prerequisites' section, at which point, as it stands, I can see a reader not knowing what they are being asked to do, suddenly confronted as they are with descriptions of R packages (I know there is a note in the aims section, but the reader needs more help here).

In addition to that, there are some inconsistencies in styling for in-paragraph mentions of code, variables, packages, and datasets that need attention.

@tiagosousagarcia: anything to add from your read-through?

tiagosousagarcia commented 2 years ago

@drjwbaker we have the images and they should be in the correct place in the repo, but they are not referenced in the .md -- I'll add them in the commit below where I think they should go (they still need captions though);

Otherwise, just a few extra notes:

drjwbaker commented 2 years ago

@maxodsbjerg Just to add, I appreciate these are a lot of changes to get to. Please don't feel there is a hurry here, as I know many people are already starting their festive leave period. Let's check in again in the new year, and should you have any queries, please ask me and/or @tiagosousagarcia.

maxodsbjerg commented 2 years ago

Thank you all for the edits/comments! I'll look into them in the new year.

drjwbaker commented 2 years ago

@tiagosousagarcia I note these images still aren't rendering in the preview. To be honest I'm not sure how to fix as I've an issue with another article at the moment https://github.com/programminghistorian/ph-submissions/issues/436#issuecomment-1004843172

This one that @amsichani is working on - code here - works perfectly if that is any help!

tiagosousagarcia commented 2 years ago

@drjwbaker I've noticed it pre-Christmas, but was hoping it was a case of delayed updating. I'll try to find where the bug is, but might need some help from the @programminghistorian/technical-team on this one

A bit more info on the issue -- essentially, it seems that the image location is not being correctly replaced by the slug. The generated preview has https://programminghistorian.github.io/ph-submissions/images/LEAVE%20BLANK/scalable-reading-of-structured-data-1.png as the address for the first figure, for example, even though the slug is indicated correctly on the .md file. On commit 00df8f6 I've removed all 'LEAVE BLANK' fields to see if it nudges it in the right direction.

solved with commit 5c06b46

maxodsbjerg commented 2 years ago

@drjwbaker @tiagosousagarcia Thanks again for your comments! We had a meeting yesterday in our group and look forward to addressing the comments. We divided the comments amongst us and plan on resolving them in the next couple of weeks.

How would you prefer that we work with the comments? Should we fork the .md file that you have been doing the light word editing on and ping you when we're done?

drjwbaker commented 2 years ago

@maxodsbjerg Thanks for your note. I think a fork will work. So if you are happy with that approach, please proceed.

anisa-hawes commented 2 years ago

Hello all,

Please note that this lesson's .md file has been moved to a new location within our Submissions Repository. It is now found here: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/en/drafts/originals

A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/scalable-reading-of-structured-data

Please let me know if you encounter any difficulties or have any questions.

Very best, Anisa

drjwbaker commented 2 years ago

@maxodsbjerg Just checking in to see how you are getting along with the pre peer-review edits.

maxodsbjerg commented 2 years ago

@drjwbaker It is all going very well. We have just a few edits left and I plan on finishing them this week.

drjwbaker commented 2 years ago

Fab. Thanks for the update.

maxodsbjerg commented 2 years ago

@drjwbaker @tiagosousagarcia We have finished the editing now. You'll find the updated markdown here: https://github.com/maxodsbjerg/ScalableReadingOfStructuredData/blob/main/20220117_PHedits_scalable-reading-of-structured-data.md

We've also collected your comments in a markdown-file and described what we did (the text in italic following your comment). You'll find it here: https://github.com/maxodsbjerg/ScalableReadingOfStructuredData/blob/main/20220210_PH-lesson_Scalable_Reading_edits.md

drjwbaker commented 2 years ago

Thanks so much for this @maxodsbjerg. I'm going to replace our version of the article with this one. Then we'll send it out for peer review. Note that there may be a slight delay here as @tiagosousagarcia is on leave.

drjwbaker commented 2 years ago

(and thanks so much for the commentary on our suggestions: the article is much tighter now. Great job!)

tiagosousagarcia commented 2 years ago

@inactinique and @martinmueller39 have kindly agreed to review this article. We can expect their reviews on the 1st and 15th of April respectively. If there are any questions, please feel free to post them on this ticket, or email me or @drjwbaker. Many thanks to our reviewers

inactinique commented 2 years ago

Dear authors,

Thank you for this very comprehensive tutorial.

General comments and suggestions

The code works fine (I tested it in an RStudio notebook). The prerequisites (software, experience) are well explained, though I think that you did not specify that a Twitter account is necessary to get the data with rtweet. The learning objectives are clearly defined, as well as the workflow you suggest following. The lesson is also well structured overall. Another strong point of your contribution is that you make the links with other lessons explicit and, in your last paragraph, you highlight the differences with the Beginner’s Guide to Twitter Data and explain how to overcome those differences.

I would suggest stating more clearly which versions of R and of the R libraries you used to write the code, as a change of version can slightly change the syntax of the code. That said, on my machine (macOS) it worked with the latest versions of R and of the packages you are using.
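One way to record this, as a minimal sketch rather than the authors' own method: the snippet below prints the running R version and the installed version of each package. The package vector is an assumption based only on the packages named in this thread (rtweet and ggplot2) and should be extended to match the lesson.

```r
# Packages named in this thread; an assumption, extend to match the lesson.
pkgs <- c("rtweet", "ggplot2")

# Look up the installed version of each package, or NA if it is not installed.
pkg_versions <- vapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    as.character(packageVersion(p))
  } else {
    NA_character_
  }
}, character(1))

# Print the R version string followed by the named vector of package versions.
cat(R.version.string, "\n")
print(pkg_versions)
```

Calling `sessionInfo()` at the end of a notebook gives a fuller record, including the operating system and all attached packages.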

The code’s easy to reproduce, even for Python-oriented and R-reluctant researchers like me :-). A few words on why R and not Python would seem useful to me, but not mandatory. More interesting would be a few explanations of how much data you can handle with your R code and whether there are strategies to adopt in case the dataset is too big (which won't happen with the way you are collecting tweets, but can easily happen with other ways of collecting data).

I would also suggest a ‘further reading’ section at the end -- that would make your contribution a bit stronger and more interesting to researchers who are using your lesson as a starting point.

Particular comments

Some typos:

There might be others, may I advise some proof-reading?

In paragraphs 53, 59, 66 and 68 (I might forget one), I would remove “(Output removed because of privacy reasons)” from the code cell, because it’s not code. Of course, it should be still stated that you removed the output for (very obvious) privacy reasons, but it should be in a text cell, not in a code cell.

You use ggplot2 once, but ggplot elsewhere. I would choose one of the two (or explain why you use both).

I really enjoyed reading this lesson. Thank you again to the authors.

Statement

I know at least one of the authors. We are both members of the board of the Journal of Digital History.

maxodsbjerg commented 2 years ago

@inactinique Thank you very much for your comments and sorry for the late reply.

I will get back to my colleagues and incorporate your very good comments and suggestions. Thanks again!

drjwbaker commented 2 years ago

@maxodsbjerg Just to note that you don't need to respond until both reviews have come in and I've had a chance to summarise them. But thanks for taking a look nevertheless.

maxodsbjerg commented 2 years ago

@drjwbaker Thanks for the clarification!

drjwbaker commented 2 years ago

@martinmueller39 Do you need some extra time to complete this review?

tiagosousagarcia commented 2 years ago

Dear @maxodsbjerg and authors,

Thank you for your patience through the peer-review process. Unfortunately, our second reviewer had to drop out at the last moment. Instead of delaying the process further, exceptionally, we decided to continue the process with some further editorial support. What follows, then, is a mix between a peer- and an editorial review.

General comments and suggestions

Thank you for writing a clear and well-defined tutorial that will, I believe, be of interest to many PH readers. A good manual on the kinds of work to be done with twitter data (and not just on how to do it) is valuable to many disciplines and researchers in the humanities, and I am sure it will be greatly appreciated.

The tutorial is well structured and is easy to follow along (both in terms of code and ease of reading). I've found a couple of typos and less clear points that I noted below in detail.

The clear definition of the workflow and the use of scalable reading more generally are a high point of the tutorial for me. There are myriad ways of technically doing scalable reading (the R method here being just one of them), but the why and wherefore of this method remain unchanged, and I think you did a stellar job putting that across.

There is still, I think, some space to improve the tutorial even further. I hope my high-level suggestions below will be of help in that regard.

The multi-author conundrum

To some extent, with collaborative papers, there is no way of escaping this, as different voices will express themselves differently. In the tutorial, however, I think there is a marked shift between sections that is, sometimes, a little distracting to the reader. This is sometimes shown in less visible ways to the final reader of the tutorial (for example, in the .md paragraphs are sometimes written in a single line, other times have line breaks) which is trivial, but other times there is a considerable shift in register and tone from section to section. Some consolidation work needs to be done here, I think.

Code before explanation

This is more crucial for longer, or more complex pieces of code (I've noted them down below) -- I think it would be a benefit for the reader to see the code block before the explanation of its steps, so that there is an anchor to refer back to. Otherwise, the reader might be a little lost as to what exactly the explanation is referring to.

Conclusion/Next Steps/Further Reading

I get a sense that the tutorial ends quite abruptly and openly. I would prefer to have a short, one-paragraph conclusion recapping the work that has been done and pointing the reader to the next steps in the scalable reading method. In other words, we have the distant reading aspect, but not the close reading one. I'm not suggesting, of course, that you need to include a close reading example, but as it stands, the reader is left with the impression that there are three completely independent distant reading approaches which bear no relationship to each other. You've done some of that work throughout the tutorial, but I think a final (very short) section that recaps those points of connection and points the reader to where next to take the research would be very positive.

Line edits

(it's a long list, but most of these are quite small!)

drjwbaker commented 2 years ago

Thanks to @tiagosousagarcia for pulling this together.

@maxodsbjerg: to confirm, these are the last set of edits we suggest as part of the submission and peer review process, after which we will recommend to the Managing Editor that the article be staged for publication.

On 'The multi-author conundrum', as we have not made any specific recommendations here, can I suggest that we approach this in two stages: first, the authors give the piece a good read and make changes you feel unify the voice of the article; second, when the article is passed to copy editing, this can be flagged in advance as something to which special attention is paid.

If you are happy with this, would it be possible to complete final edits by 19 May?

maxodsbjerg commented 2 years ago

@tiagosousagarcia Thanks for your comments and suggestions. I look forward to working with them

@drjwbaker - I'm sorry, but the latest development in this thread went completely under my radar, so we won't be able to complete final edits by 19 May. I've made arrangements with the other authors and we will have the final edits completed by 25 May.

drjwbaker commented 2 years ago

25 May is perfect. Thanks @maxodsbjerg.

maxodsbjerg commented 2 years ago

@drjwbaker We addressed all the comments but one, and you can find the updated markdown here: https://github.com/maxodsbjerg/ScalableReadingOfStructuredData/blob/main/20220523_ScalableReadingOfStructuredData.md

The one we didn't fix was this one: #Data and Prerequisites: could not a third option be to offer a base dataset that users can get from PH to get started without having to follow other lessons? We have difficulty seeing how this is possible, since Twitter's policy is not to share hydrated tweets, only Tweet IDs. We initially planned something similar to your suggestion, but couldn't find any suitable open datasets that were compliant with Twitter's rules.

drjwbaker commented 2 years ago

@maxodsbjerg and team: many thanks. On..

Data and Prerequisites: could not a third option be to offer a base dataset that users can get from PH to get started without having to follow other lessons?

..thank you for the explanation.

@tiagosousagarcia (when you have some time as I know you are busy!) unless you have any final comments, I'm going to suggest we move to the next stage of the editorial workflow https://programminghistorian.org/en/editor-guidelines#recommend-publication---editorial-checklist and then inform @svmelton that this is ready for copyediting.

tiagosousagarcia commented 2 years ago

No further comments from me -- thank you @maxodsbjerg and everyone!

drjwbaker commented 2 years ago

@maxodsbjerg: we need bios for the team. I've started these below. Could you please check and edit if needed (plus add orcids if you have them - I could only find one for Helle)






tiagosousagarcia commented 2 years ago

@maxodsbjerg -- when you have the time, could you check and edit if needed the bios above?

maxodsbjerg commented 2 years ago

@tiagosousagarcia - Once again I'm sorry for my tardiness. We have the following short bio for all of us:

Max Odsbjerg Pedersen is an Information Specialist at Aarhus University Library at the Royal Danish Library
Josephine Møller Jensen is an MA student in the Department of History and Classical Studies, Aarhus University
Victor Harbo Johnston is an MA student in the Department of History and Classical Studies, Aarhus University
Alexander Ulrich Thygesen is a Ph.D. student in the Department of German and Roman Languages, Aarhus University
Helle Strandgaard Jensen is Associate Professor of Contemporary Cultural History in the Department of History and Classical Studies, Aarhus University

And I have the following ORC-ID: 0000-0001-9215-5605

Let me know if you need anything else.

tiagosousagarcia commented 2 years ago

@svmelton, we are ready to recommend this article for publication. The lesson files can be found at:

lesson - ph-submissions/en/drafts/originals/scalable-reading-of-structured-data.md
images - ph-submissions/images/scalable-reading-of-structured-data/
gallery icons - ph-submissions/gallery/scalable-reading-of-structured-data.png and ph-submissions/gallery/originals/scalable-reading-of-structured-data-original.png

Author bios are as follows:

---
- name: Max Odsbjerg Pedersen 
  team: false
  orcid: 0000-0001-9215-5605
  bio:
      en: |
          Max Odsbjerg Pedersen is an Information Specialist at Aarhus University Library at the Royal Danish Library
---

---
- name: Josephine Møller Jensen
  team: false
  orcid: 
  bio:
      en: |
          Josephine Møller Jensen is an MA student in the Department of History and Classical Studies, Aarhus University
---

---
- name: Victor Harbo Johnston
  team: false
  orcid: 
  bio:
      en: |
          Victor Harbo Johnston is an MA student in the Department of History and Classical Studies, Aarhus University
---

---
- name: Alexander Ulrich Thygesen
  team: false
  orcid:
  bio:
      en: |
          Alexander Ulrich Thygesen is a Ph.D. student in the Department of German and Roman Languages, Aarhus University
---

---
- name: Helle Strandgaard Jensen
  team: false
  orcid: 0000-0002-8623-9586
  bio:
      en: |
          Helle Strandgaard Jensen is Associate Professor of Contemporary Cultural History in the Department of History and Classical Studies, Aarhus University
---

Let me know if there's anything missing!

rivaquiroga commented 2 years ago

Hi, everyone!

Is it possible that the authors save the sesamestreet_data object as a csv so we can archive it? I know it is not needed for following the lesson, but it is relevant for its sustainability:

tiagosousagarcia commented 2 years ago

Hi @rivaquiroga -- we did consider that during the review, but apparently twitter has some strict rules about privacy that prevent us from doing so, according to the authors (@maxodsbjerg et al)

rivaquiroga commented 2 years ago

We don’t need all the 90 columns. For replicating the plots we just need the dataframes as they were before being piped into the ggplot() function. For example, for the first plot the three variables needed are date, has_sesame_ht and n. There is no private data there.
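A minimal sketch of what that could look like for the first plot, using base R only. The `plot_data` object, its values, and the output file name are hypothetical stand-ins; only the three column names come from the comment above.

```r
# Hypothetical stand-in for the data frame just before it is piped into ggplot().
# Column names follow the comment above; the values are invented for illustration.
plot_data <- data.frame(
  date = as.Date(c("2021-12-01", "2021-12-01", "2021-12-02")),
  has_sesame_ht = c(TRUE, FALSE, TRUE),
  n = c(12L, 30L, 7L)
)

# Write only these aggregated columns; no tweet text or user details are included.
out_file <- file.path(tempdir(), "scalable-reading-figure-1.csv")  # hypothetical file name
write.csv(plot_data, out_file, row.names = FALSE)

# Read the file back to confirm the archived copy has the expected columns.
archived <- read.csv(out_file)
```

The same pattern could be repeated once per figure, so that each plot has its own small CSV holding nothing beyond the counts it displays.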

tiagosousagarcia commented 2 years ago

Ah, good point, of course -- @maxodsbjerg, could we get copies of the data used for each figure so that we can translate them?

tiagosousagarcia commented 2 years ago

another small request, @maxodsbjerg -- I've noticed during the translation of the lesson that we still don't have captions for the figures: would you mind adding them here? I can add them to the .md directly

svmelton commented 2 years ago

Thanks, all! @tiagosousagarcia has this lesson been through copyediting already?

tiagosousagarcia commented 2 years ago

@svmelton - not yet! I was under the impression that happened after the recommendation? I may be wrong, I often am. In any case, I can get in touch with @anisa-hawes later today to arrange it

svmelton commented 2 years ago

No worries @tiagosousagarcia! @anisa-hawes do you have capacity to copyedit this lesson?

drjwbaker commented 2 years ago

We have project budget to pay for copyediting.

maxodsbjerg commented 2 years ago

@tiagosousagarcia Just wanted to let you know that we will be meeting in our Twitter project group on Monday, and then we will sort out the data-sharing issue and create some captions for the figures.