walshbr closed this issue 3 years ago
This is great @hburns2. I think it will be a great addition for people working with data generally, but especially for those working with large sets of tabular data. What follows might seem like a lot, but each of these points is largely just asking for clarification or expansion. I don't think this needs any large structural changes or anything. The main points below that I'd highlight are:
Paragraph by paragraph comments follow. I think we usually suggest as a starting point for negotiation about a month for getting revisions back to us. Would August 15th work as a goal? Happy to negotiate. And I'm around if you have any questions about anything below - I tried to go through and double check that I hadn't accidentally left in any notes to myself, but always a chance I might have missed something.
Also note that I've now uploaded the lesson and the CSV to this repository at
/lessons/crowdsourced-data-cleaning-with-pandas.md
/assets/crowdsourced-data-cleaning-with-pandas/Menu.csv
so any future changes to those files should occur there. I also had to edit a few things and updated some metadata for the lesson itself. So just make sure you're working with the most recent version there.
I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.
Apparently pandas has been updated again to 1.0.5, so I've updated everything accordingly
I have completed the recommended edits!
Got it! I'll do another read for any remaining things before looking for reviewers as soon as I can.
@hburns2 - finished reading! This looks good to me. I have a list of recommended edits I'll share here. There should be far fewer of them than before and they are largely less substantive. The only two suggestions that are slightly longer:
There's nothing here that would radically change things, though, so I think this should be ready for review once you can do a sweep through for them. For next steps, can you suggest a date by which you'd be able to take care of these things? I can start looking for reviewers and loop them into this thread, and then I'd ask that you just ping us in the thread here so we're aware of when the file is ready to be read by them.
@walshbr Thank you so much for your feedback and suggestions! Looking it over, the updates do not seem too time-intensive. I will aim to have these completed by August 29.
For the Jupyter Notebook, is there a specific place I should add it to the repository?
Cool! You could put the notebook in the assets folder for the lesson next to the CSV -
Also FWIW I updated the link to the download for your CSV, so you might check it out in the file now to see how to correctly link internally to the notebook file when you upload it.
@walshbr I have hopefully addressed all edits and the lesson is ready to be read
Thanks @hburns2! Have one reviewer willing to read and just looking for another now.
Just a note that @mkudzia and @brandontlocke are going to be acting as reviewers for the lesson. They've agreed to try to get things in before October 3rd or so but let me know if you have any questions.
Hey there @hburns2 , just finished reviewing your awesome lesson! I have both some general comments at the beginning, and then more specific ones (mostly with paragraph numbers included) below. Please do not hesitate to let me know if any of the comments don't make sense/need clarification, if you disagree, etc. Overall, great work!
Overall comments:
Specific Comments:
Thanks for the helpful review @mkudzia! @hburns2 - at this stage we'll wait for the other review to come in before going forward so that the text of the lesson doesn't change while @brandontlocke is reviewing it.
Hi @hburns2! Thanks for putting this together. This was fun to do and I learned quite a bit. Great job! Please don't hesitate to ping me if you want clarification on anything, or if you have questions.
out of every line in it? I think it would be more satisfying to see it appear from nowhere.

Most of these comments are based on an assumption that this is for absolute beginners. I know there is usually some explanation at the top about expectations (as there is here with the concepts of data cleaning and tidy data), so some of these may be rendered moot by the expectation of basic Python familiarity.
I think it may be helpful to either explain why we're keeping all of the lines we're writing in the script and re-running them every time, or to explain that we're doing things iteratively, so we may run the script to show us all the columns, then, once we know all of the names, we'll remove that and write a line that will drop some. That being said, I did appreciate the check-ins that showed the full script to compare to what I had.
p27 & p29 - links to menu.csv and the Jupyter file are broken due to relative linking
p33 It may be helpful to clarify that the ellipses represent columns that aren't being printed right now, and/or explain that by default it prints the beginning and the end and cuts out the middle. It may be confusing to people who are new to Python and wonder where the 20 columns are.
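A small sketch of the truncation behavior being described here (using a hypothetical 30-column frame, not the lesson's Menu.csv):

```python
import pandas as pd

# When a DataFrame has more columns than the display limit, pandas prints
# only the first and last few columns and elides the middle with "...".
df = pd.DataFrame({f"col{i}": [0] for i in range(30)})

pd.set_option("display.max_columns", 20)  # a typical default limit
print(df)  # middle columns are replaced with "..."

# Raising the limit shows every column (wrapped across lines as needed):
pd.set_option("display.max_columns", None)
print(df)
```

Pointing readers at `pd.set_option("display.max_columns", None)` could be one way to let them confirm all 20 columns really are there.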
p41 It may be helpful to clarify that each time we run this, we're re-importing the data all over again. That would explain why, even though I've just dropped 6 columns, when I re-run it, it shows me that I have 20 columns.
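To illustrate the re-import point, a minimal sketch (the CSV content here is a hypothetical stand-in for the lesson's Menu.csv):

```python
import io
import pandas as pd

# A stand-in for Menu.csv, kept in a string so the sketch is self-contained.
csv_text = "id,name,notes\n1,Breakfast,x\n2,Dinner,y\n"

# Every run of the script starts by re-reading the raw data, so any columns
# dropped on a previous run reappear here.
menu = pd.read_csv(io.StringIO(csv_text))
print(menu.shape)  # all three columns again, regardless of earlier runs

# drop() only changes this run's in-memory DataFrame, not the file itself.
menu = menu.drop(columns=["notes"])
print(menu.shape)
```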
p50 - "Or, for instance, it might be that researchers are transcribing menus chronologically, therefore you have records for every menu but not data to possess said records." - I'm not sure what this means.
p60 - I'm a little confused — doesn't it make more sense to remove the rows with missing data, then remove unnecessary columns? In other words, if some things are close to that 25% null line, doesn't it make more sense to remove the rows that are completely empty, and then determine which columns are no longer needed? If there's a reason to do this that I'm missing, it may be good to explain the rationale for the order, just to demonstrate the thinking process that goes into this.
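A sketch of why the ordering could matter, with hypothetical data (the 25% threshold echoes the lesson's discussion; these column names are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row is entirely empty, and one column is all null.
df = pd.DataFrame({
    "dish":   ["Soup", "Stew", "Tart", None],
    "price":  [1.0, 2.0, 3.0, np.nan],
    "source": [np.nan, np.nan, np.nan, np.nan],
})

# Dropping the fully empty rows first removes the all-NaN row...
rows_kept = df.dropna(how="all")

# ...which changes each column's share of nulls before any 25% test runs.
null_share = rows_kept.isnull().mean()
cols_kept = rows_kept.loc[:, null_share <= 0.25]
```

In this toy case, removing the empty row first pushes `dish` and `price` back under the 25% line, so the order of operations changes which columns survive.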
p63 - I think the concept of creating menu as a subset should be explained closer to when it's first used in p56. We've learned that we're creating a new variable, but there's not a clear reason why. I think explaining that it's a subset we're going to work on would be appropriate somewhere in p56-p59, as we're first using this new concept.
p69 typo in "If the dates in question show are in a standardized specific order, the function to_datetime() can be used."
Cell 7 in the Jupyter Notebook is commented out, and also doesn't work at that point. I'm guessing it's just a relic from development that didn't get removed.
Cell 10 (I think?) has several lines put together, so you don't see the result of menu.shape
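For context on that last point, a minimal illustration of why only the final expression in a cell gets shown (the `menu` frame here is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical stand-in for the lesson's `menu` DataFrame.
menu = pd.DataFrame({"name": ["Breakfast", "Dinner"], "event": ["a", "b"]})

# A Jupyter cell auto-displays only its last expression, so a bare
# `menu.shape` earlier in a multi-line cell produces no visible output.
menu.shape

# Printing it explicitly makes the result show up wherever the line sits.
print(menu.shape)
```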
Thanks @brandontlocke and @mkudzia for the two thorough reviews!
@hburns2 - I don't think I have anything to add, but we can chat in thread here with the two of them if you had anything you wanted to bounce off of them before working on revisions. If you don't have any concerns, the next step would be for us to settle on a date by which you would have revisions back. Again, this date can be flexible if things come up for you. I think we usually aim for about a month - would mid-November work for you to aim for?
After that round of revisions from you I'll do another read, but any suggestions for revision I have there are usually pretty minimal.
@brandontlocke and @mkudzia Thank you so much for your thoughts and insight! I greatly appreciate it. I'll take a look and get back to you with any questions that I have.
@walshbr That sounds great. Would sometime around November 21 work? Mid-October through early November are my busiest months for university instruction and, due to this, I anticipate needing slightly more time to ensure I thoroughly address the feedback.
Whatever works for you works for me @hburns2! We can say November 21st if you want and then see how things go. If your other obligations get in the way we can certainly be flexible.
@hburns2 - thanks for this! In response to your email question - I think the date section works fine as is. I had a couple of local suggestions for how you might deal with the fact that you don't have space to go into detail about particular things. But I think it does a better job of concluding now than it did before. The word count limit on lessons is more of a relative guideline than a fixed rule; it mainly helps account for the labor that goes into translating. So if you needed a little more space because you wanted to work a bit more on that section, that'd be fine. But, again, I think it works fine as is with a couple of suggested tweaks below. Once you have a chance to incorporate these changes (make sure you're working from the latest version on this repository) it should be ready for copyediting. So just check in with me and I'll ping @svmelton for her to pass it along to copyeditors at that time.
Should note that only one of these is really a substantial suggestion. Most are just local moments.
[x] p 2- “That being said, there are some relatively universal steps that can be performed, such as the ones discussed in this tutorial.” I think there is a missing sentence you want here to transition back into why you want to do this work. Something like…”That being said, data normalization is especially useful and necessary for X reason. To carry out, there are some relatively universal steps……” I think this is especially important because then you spend a good deal of time complicating the topic and showing why you might not want to do this.
[x] p3 - it might be worth noting up front that you will refer to the process as data cleaning (since it seems like you will) even though you note the objections to the term. Or you could switch across the board to “data normalization” in the title as well. The latter feels like slightly more work, but might make more sense given the opening remarks you have. And it might not actually be that difficult if you did a find/replace and checked the context around each instance to make sure it worked. I might suggest that given the reviewers' remarks as well, but your call! Let me know if you'd like to discuss further. It seems like you make a good case for using normalization instead but then walk it back as is.
[x] P8 - “crowdsourcing tasks such as transcription” seems grammatically in conflict with “provides”. So I might just modify the first word of the quote to be “provide” since it corresponds with crowdsourcing tasks.
[x] p9 ish - I made some modifications to the header levels you used in the opening sections. The title doesn’t need to appear again, as that will occur in the page anyway; accordingly the top-level header on the page will be an H2 tag (2 #’s in markdown). Some other suggestions (take them or leave them) that might help the structure of the piece feel clearer upfront - I might make “things to consider” and “best practices” both H3’s, which would give a uniform structure to the opening section. I’d also suggest getting rid of either “getting started” or “prerequisites.” Since structurally that is just a section with only one thing inside of it, I would collapse that level and call it either “getting started” or “prerequisites.”
[x] I think it might be worth a sentence somewhere in the getting started section to the effect of “While this is, technically, a programming lesson, the amount of technical work in data normalization can be simplified or complicated greatly by a number of human factors. So these best practices will reduce the amount of time you have to spend normalizing later.” (As a way of tying the advice to the programming later one). Maybe at p 14?
[x] P 19 - I might add a transitional sentence at the beginning of this paragraph that moves from the best practices stuff to the case study. Honestly, you might just move these two sentences from the previous paragraph down to start the NYPL section - “However, no matter how strict your guidelines or your submission protocols, variability inevitably will be present in your collected data. That being said, there are ways to identify and normalize those instances.” And follow that with “The NYPL possesses a digitized collection of approximately 45,000 menus, dating from the 1840s to today, **and offers a good case study in how to correct some of these inevitable issues**” (bold is my addition)
[x] P43 - I might clarify for people that they’re not actually modifying the Menu.csv file (all these things are just happening internal to Python). I can imagine a new Python programmer being afraid. Also is the reason to do this dropping that it improves performance?
[x] P45 - might be worth a parenthetical here saying (though you might have removed some of the old print statements if you took my advice)
[x] P77 - I think you might want to make clear that the coerce method isn’t something you’re asking them to use here. Could imagine someone reading quickly, seeing python, and not distinguishing the thing you bring up in passing here as something not to include in their code. I think maybe another sentence caveat describing how that is out of scope and pointing them elsewhere to learn more for it could be good!
[x] At p 82 you still have the line in that is intentionally producing an error, so you probably want to remove it (or maybe comment it out and indicate that it was the line that intentionally produced the error?). That way the code will run when you add another line in the next section.
[x] At p 83/84 you probably can remove the print statement around the .to_csv function. Also the parenthetical needs to be closed.
[x] It might be worth a sentence early on, when you describe the case study and research questions, saying that actual analysis is out of scope for the lesson. You’re really just getting things to the point where you could then do follow-up work with them (and that’s fine!).
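On the p77 note above, a minimal sketch of the coerce behavior being mentioned in passing (hypothetical date strings, not the lesson's data), in case it helps to see what a pointer-elsewhere would be pointing at:

```python
import pandas as pd

# Hypothetical date strings: two well-formed, one malformed.
dates = pd.Series(["1900-01-05", "1900-02-11", "not a date"])

# By default, to_datetime raises on unparseable values. Passing
# errors="coerce" instead converts them to NaT (pandas' missing-time
# value) rather than stopping with an error.
parsed = pd.to_datetime(dates, errors="coerce")
print(parsed)
```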
@walshbr sorry for the delay! My computer malfunctioned, but we're up and running again. I think everything has been addressed and it's ready for copyedits. I updated the Jupyter Notebook with edits, as well. Thank you so much for your feedback and insight! It's greatly appreciated
Oh! I also changed the name of the lesson and swapped out most iterations of "cleaning" for "normalizing"
Thanks @hburns2! We're on break until after January 5th, but I'll double check everything before passing off to @svmelton for copyediting then.
All looks good to me! I caught a couple small things, but I think those will come out in copyediting. @svmelton this is ready to send along to the copyeditor. I think this should be everything but let me know if I'm missing anything?
assets/crowdsourced-data-normalization-with-pandas - asset files (no image files)
lessons/crowdsourced-data-normalization-with-pandas.md - the lesson file
gallery/crowdsourced-data-normalization-with-pandas.png - the modified avatar
gallery/originals/crowdsourced-data-normalization-with-pandas.png - the original avatar
The bio -
- name: Halle Burns
  team: false
  orcid: 0000-0003-2346-2876
  bio:
    en: |
      Halle Burns is an assistant professor in the University Libraries at the University of Nevada, Las Vegas. She is also a certified instructor with The Carpentries, an international organization that teaches coding and data science skills to researchers.
Thanks @walshbr! I'll get to work on finding a copyeditor.
Hi @hburns2! Copyedits are in and can be found via this Dropbox link. You have discretion about accepting wording changes, etc. Please let me know if you have any questions!
@svmelton I have completed merging the copyedits with the lesson file. GitHub should now reflect the most recent version of the lesson
Thank you @hburns2! I'll complete my piece in the next few days, and we can get this published!
@svmelton sounds great! I caught a few final spelling errors I didn't catch in my last read-through. So NOW I'm done messing with it.
And the lesson is now live! Thanks to @hburns2 for all your work, to @walshbr for editing, and to @brandontlocke and @mkudzia for reviewing.
Yay congratulations! I just tweeted the lesson out - please do promote so folks can enjoy your fine work. Thanks to @brandontlocke and @mkudzia for their great reviews and to @hburns2 for a lovely lesson.
Woohoo! Congrats @hburns2 this is so exciting.
The Programming Historian has received the following proposal for a lesson on 'Crowdsourced-Data Cleaning with Python and Pandas' by @hburns2. This lesson is now under review and can be read at:
http://programminghistorian.github.io/ph-submissions/lessons/crowdsourced-data-normalization-with-pandas
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.
Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.
Anti-Harassment Policy
This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.
[Permission to Publish]
The editor must also ensure that the author or translator posts the following statement to the Submission ticket.