programminghistorian / ph-submissions

The repository and website hosting the peer review process for new Programming Historian lessons
http://programminghistorian.github.io/ph-submissions

Crowdsourced-Data Normalization with Python and Pandas #301

Closed walshbr closed 3 years ago

walshbr commented 4 years ago

The Programming Historian has received the following proposal for a lesson on 'Crowdsourced-Data Cleaning with Python and Pandas' by @hburns2. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/crowdsourced-data-normalization-with-pandas

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our Ombudsperson. Thank you for helping us to create a safe space.


[Permission to Publish]

The editor must also ensure that the author or translator post the following statement to the Submission ticket.

I the author|translator hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

walshbr commented 4 years ago

This is great @hburns2. I think it will be a great addition for people working with data in general, and with large sets of tabular data in particular. What follows might seem like a lot, but each of these points is largely just asking for clarification or expansion. I don't think this needs any large structural changes or anything. The main points below that I'd highlight are:

  1. consider upgrading this to the latest Pandas release now so that we can freeze in place after the 1.0 release (it looks like they've changed a lot even since last August). That will help the long-term sustainability of the lesson, even though you're right to still point them to a particular Pandas installation in pip.
  2. deepen the reflection on the case study as you're going through it.
  3. offer a concluding section.

Paragraph by paragraph comments follow. I think we usually suggest as a starting point for negotiation about a month for getting revisions back to us. Would August 15th work as a goal? Happy to negotiate. And I'm around if you have any questions about anything below - I tried to go through and double check that I hadn't accidentally left in any notes to myself, but always a chance I might have missed something.

Also note that I've now uploaded the lesson and the CSV to this repository at

/lessons/crowdsourced-data-cleaning-with-pandas.md
/assets/crowdsourced-data-cleaning-with-pandas/Menu.csv

so any future changes to those files should occur there. I also had to edit a few things and updated some metadata for the lesson itself. So just make sure you're working with the most recent version there.

hburns2 commented 4 years ago

I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

hburns2 commented 4 years ago

Apparently pandas has been updated again to 1.0.5, so I've updated everything accordingly

hburns2 commented 4 years ago

I have completed the recommended edits!

walshbr commented 4 years ago

Got it! I'll do another read for any remaining things before looking for reviewers as soon as I can.

walshbr commented 4 years ago

@hburns2 - finished reading! This looks good to me. I have a list of recommended edits I'll share here. There should be far fewer of them than before and they are largely less substantive. The only two suggestions that are slightly longer:

There's nothing here that would radically change things, though, so I think this should be ready for review once you can do a sweep through for them. For next steps, can you suggest a date by which you'd be able to take care of these things? I can start looking for reviewers and loop them into this thread, and then I'd ask that you just ping us in the thread here so we're aware of when the file is ready to be read by them.

hburns2 commented 4 years ago

@walshbr Thank you so much for your feedback and suggestions! Looking it over, the updates do not seem too time-intensive. I will aim to have these completed by August 29.

For the Jupyter Notebook, is there a specific place I should add it to the repository?

walshbr commented 4 years ago

Cool! You could put the notebook in the assets folder for the lesson next to the CSV -

https://github.com/programminghistorian/ph-submissions/tree/gh-pages/assets/crowdsourced-data-cleaning-with-pandas

Also FWIW I updated the link to the download for your CSV, so you might check it out in the file now to see how to correctly link internally to the notebook file when you upload it.

hburns2 commented 4 years ago

@walshbr I have hopefully addressed all edits and the lesson is ready to be read

walshbr commented 4 years ago

Thanks @hburns2! Have one reviewer willing to read and just looking for another now.

walshbr commented 4 years ago

Just a note that @mkudzia and @brandontlocke are going to be acting as reviewers for the lesson. They've agreed to try to get things in before October 3rd or so but let me know if you have any questions.

mkudzia commented 4 years ago

Hey there @hburns2 , just finished reviewing your awesome lesson! I have both some general comments at the beginning, and then more specific ones (mostly with paragraph numbers included) below. Please do not hesitate to let me know if any of the comments don't make sense/need clarification, if you disagree, etc. Overall, great work!

Overall comments:

Specific Comments:

walshbr commented 4 years ago

Thanks for the helpful review @mkudzia! @hburns2 - at this stage we'll wait for the other review to come in before going forward so that the text of the lesson doesn't change while @brandontlocke is reviewing it.

brandontlocke commented 4 years ago

Hi @hburns2! Thanks for putting this together. This was fun to do and I learned quite a bit. Great job! Please don't hesitate to ping me if you want clarification on anything, or if you have questions.

General

Why use crowdsourcing

Clarity/copyediting stuff

Exploring the NYPL Historical Menu Dataset

Most of these comments are based on an assumption that this is for absolute beginners. I know there is usually some explanation at the top about expectations (as there is here with the concepts of data cleaning and tidy data) , so some of these may be rendered moot by the expectation of basic Python familiarity.

Conclusion

walshbr commented 4 years ago

Thanks @brandontlocke and @mkudzia for the two thorough reviews!

@hburns2 - I don't think I have anything to add, but we can chat in thread here with the two of them if you had anything you wanted to bounce off of them before working on revisions. If you don't have any concerns, the next step would be for us to settle on a date by which you would have revisions back. Again, this date can be flexible if things come up for you. I think we usually aim for about a month - would mid-November work for you to aim for?

After that round of revisions from you I'll do another read, but any suggestions for revision I have there are usually pretty minimal.

hburns2 commented 3 years ago

@brandontlocke and @mkudzia Thank you so much for your thoughts and insight! I greatly appreciate it. I'll take a look and get back to you with any questions that I have.

@walshbr That sounds great. Would sometime around November 21 work? Mid-October through early November are my busiest months for university instruction and, due to this, I anticipate needing slightly more time to ensure I thoroughly address the feedback.

walshbr commented 3 years ago

Whatever works for you works for me @hburns2! We can say November 21st if you want and then see how things go. If your other obligations get in the way we can certainly be flexible.

walshbr commented 3 years ago

@hburns2 - thanks for this! In response to your email question - I think the date section works fine as is. I had a couple local suggestions for how you might deal with the fact you don't have space to go into detail about particular things. But I think it does a better job now of concluding than it did before. The word count limit on lessons is more of a relative guideline than a fixed rule; it mainly helps keep the labor that goes into translating manageable. If you needed a little more space because you wanted to work a bit more on that section, that'd be fine. But, again, I think it works fine as is with a couple suggested tweaks below. Once you have a chance to incorporate these changes (make sure you're working from the latest version on this repository) it should be ready for copyediting. So just check in with me and I'll ping @svmelton for her to pass it along to copyeditors at that time.

Should note that only one of these is really a substantial suggestion. Most are just local moments.

hburns2 commented 3 years ago

@walshbr sorry for the delay! My computer malfunctioned, but we're up and running again. I think everything has been addressed and it's ready for copyedits. I updated the Jupyter Notebook with edits, as well. Thank you so much for your feedback and insight! It's greatly appreciated

hburns2 commented 3 years ago

Oh! I also changed the name of the lesson and swapped out most iterations of "cleaning" for "normalizing"

walshbr commented 3 years ago

Thanks @hburns2! We're on break until after January 5th, but I'll double check everything before passing off to @svmelton for copyediting then.

walshbr commented 3 years ago

All looks good to me! I caught a couple small things, but I think those will come out in copyediting. @svmelton this is ready to send along to the copyeditor. I think this should be everything but let me know if I'm missing anything?

assets/crowdsourced-data-normalization-with-pandas - asset files (no image files)
lessons/crowdsourced-data-normalization-with-pandas.md - the lesson file
gallery/crowdsourced-data-normalization-with-pandas.png - the modified avatar
gallery/originals/crowdsourced-data-normalization-with-pandas.png - the original avatar

The bio -

- name: Halle Burns
  team: false
  orcid: 0000-0003-2346-2876
  bio:
      en: |
          Halle Burns is an assistant professor in the University Libraries at the University of Nevada, Las Vegas. She is also a certified instructor with The Carpentries, an international organization that teaches coding and data science skills to researchers.

svmelton commented 3 years ago

Thanks @walshbr! I'll get to work on finding a copyeditor.

svmelton commented 3 years ago

Hi @hburns2! Copyedits are in and can be found via this Dropbox link. You have discretion about accepting wording changes, etc. Please let me know if you have any questions!

hburns2 commented 3 years ago

@svmelton I have completed merging the copyedits with the lesson file. GitHub should now reflect the most recent version of the lesson

svmelton commented 3 years ago

Thank you @hburns2! I'll complete my piece in the next few days, and we can get this published!

hburns2 commented 3 years ago

@svmelton sounds great! I caught a few final spelling errors I didn't catch in my last read-through. So NOW I'm done messing with it.

svmelton commented 3 years ago

And the lesson is now live! Thanks to @hburns2 for all your work, to @walshbr for editing, and to @brandontlocke and @mkudzia for reviewing.

walshbr commented 3 years ago

Yay congratulations! I just tweeted the lesson out - please do promote so folks can enjoy your fine work. Thanks to @brandontlocke and @mkudzia for their great reviews and to @hburns2 for a lovely lesson.

mkudzia commented 3 years ago

Woohoo! Congrats @hburns2 this is so exciting.