Crowdsourced-Data Normalization with Python and Pandas

walshbr commented 4 years ago

The Programming Historian has received the following proposal for a lesson on 'Crowdsourced-Data Cleaning with Python and Pandas' by @hburns2. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/crowdsourced-data-normalization-with-pandas

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our Ombudsperson. Thank you for helping us to create a safe space.

[Permission to Publish]

The editor must also ensure that the author or translator post the following statement to the Submission ticket.

I the author|translator hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

walshbr commented 4 years ago

This is great @hburns2. I think it will be a great addition for people working with data in general but also large sets of tabular data in general. What follows might seem like a lot, but each of these points is just largely asking for clarification or expansion. I don't think this needs any large structural changes or anything. The main points below that I'd highlight are:

consider upgrading this to the latest Pandas release now so that we can freeze in place after the 1.0 release (it looks like they've changed a lot even since last August). That will help the sustainability of the lesson longterm, even though you're right to still point them to a particular Pandas installation in pip.
deepen the reflection on the case study as you're going through it.
offer a concluding section.

Paragraph by paragraph comments follow. I think we usually suggest as a starting point for negotiation about a month for getting revisions back to us. Would August 15th work as a goal? Happy to negotiate. And I'm around if you have any questions about anything below - I tried to go through and double check that I hadn't accidentally left in any notes to myself, but always a chance I might have missed something.

Also note that I've now uploaded the lesson and the CSV to this repository at

/lessons/crowdsourced-data-cleaning-with-pandas.md /assets/crowdsourced-data-cleaning-with-pandas/Menu.csv

so any future changes to those files should occur there. I also had to edit a few things and updated some metadata for the lesson itself. So just make sure you're working with the most recent version there.

[x] p1 - “the masses” - you might just take care that the ways you refer to the crowd don’t suggest they’re lacking agency or thought. I think the terminology is better later on in the lesson, so just this instance was the only one that I noticed.
[x] p2 - might link to the pandas library
[x] - similarly link to the NYPL?
[x] - “working with” I’d add “In this lesson, you will work with……and learn”
[x] p4 - I know penguin watch is on zooniverse, but it might be worth mentioning zooniverse explicitly as well given the degree to which it has become a platform for related projects.
[x] p5 - I think this paragraph should be expanded. Can you give more examples of tasks that require human involvement? The two examples so far are, I think, of micro tasking. You might include a macro tasking project as a way of complicating the survey and show the distinct things groups of people can do.
[x] p6 - the transition here could maybe use a couple sentences on how crowdsourcing isn’t the best option sometimes, given that the previous section was framed as “Why use?” Maybe propose the question “why not”
[x] At some point, it’d be great if you could address, even in a few sentences, the labor and ethics of using crowdsourcing. Do individuals deserve to be compensated for their time? Why or why not? Are you treating the users as engaged individuals or as mere labor? How do you credit them for their time? Might be slightly out of scope, but those are often the issues I find come up when talking about why or why not to crowsource.
[x] p10 - again, some scaffolding might help here just about “why have guidelines at all?”
[x] p12 - there’s probably a gender neutral way to refer to this quality other than “manpower.” Work hours? Resources?
[x] p 13 - the quality control paragraph might deserve to be its own bullet point rather than being a subpoena of manpower? Mostly thinking about the content but also the fact that it is a numbered list with only one thing in it. You might also nod to how the whole point of the lesson you’re writing is that you can only control the quality so much.
[x] Sometimes you are using “subheader -“ and sometimes “subheader:” So you might want to be consistent
[x] Under “other guidelines” - now might be a good time to link to metadata standards like Dublin Core as a means of ensuring consistency and interoperability among many different kinds of projects.
[x] p 14 - when you describe OCR, you might just nod to the fact that it usually introduces a number of errors that have to be hand corrected anyway. And when you mention “writing a universal code” I think you mean writing a universal code for an OCR engine to use to ensure accuracy? Would be worth clarifying.
[x] p 14 - given that humanists will be reading this, it might be worth giving just a tad more context for the dataset. In other words, why would people care about it? What kinds of questions or historical knowledge can historical menus reveal? addendum - you get to this in the next paragraph. So might not be as crucial, but might just be worth a sentence.
[x] p16 - I think I would start this paragraph by saying something like “the datasets offered by What’s on the Menu” are XYZ. And you might also link to where people can find the datasets if they want to learn more.
[x] Note that I updated the link to your csv to be the copy stored in our assets folder. So any changes you might need to make to it should be made to that file on this repository.
[x] p17 - you might note that duplicate data is the kind of error you might run into with crowdsourced data.
[x] p19 - I just noticed that Pandas is currently on version 1.0.4 (a lot of releases since August!)- it’s probably worth upgrading the code blocks to the latest version just for the fact that their release page notices significant functionality differences. That will help us keep the lesson sustainable in the longterm and mean you won’t have to rewrite it any time soon. But I would still keep the pip install pointed to the particular version that you end up using in the lesson - that’s a good call.
[x] p19 - FWIW all of our lessons are now in Python 3. And since the sunset date for Python 2 has passed, it might be worth instructing people to go ahead install Python 3 rather than building a workaround for it.
[x] p21- you might offer some links for data cleaning / tidy data so that those who aren’t familiar can learn more.
[x] p22 - for accessibility purposes, you should only have one header one on the page. So I’d check to make sure that is the case (the getting started header is a second header one) and modify the remaining headers accordingly.
[x] P22 - the why Pandas section should probably be expanded or combined with either the previous or following section. Probably too short to stand alone.
[ ] Exploring the NYPL Historical Menu Dataset
[x] There are a few structural things to keep in mind for each piece code in the lesson. In each instance where you have code block, the surrounding text should clarify whether you are having the reader type something in terminal or in a file meant to be run in Python. And for the latter, you should give a filename and instruct the user to run that file. You’ll also want to make sure that you give instructions for how to download the CSV and that the CSV should be in the same folder as the Python file.
[x] p. 25 - I would give the output of the code and comment on it, and generally would also encourage that for every subsequent code block.
[x] p.25- related to my last comment - the df.head() function will only display the data if it’s run like that in the python interpreter. If you’re having them save this into a text file, they’ll need a print statement around df.head() I believe. Same goes for all the other examples to come about outputting data to the screen.
[x] p.26 - for the df.columns piece I’d probably make it an inline code block with single backticks, rather than its own standalone code block.
[x] p29 - it could be helpful to start the case study section by setting up the research questions you’re interested in for this.
[x] p.29 - I follow everything you’re saying, but just note that this section could be confusing depending on whether you’re talking about using a Python file (in which case you might want to offer the user the current state of the file at various stages) or in the interpreter.
[x] p.30 - df.shape is a new command, so you might say what it is, offer the output, and describe more what happened.
[ ] P32 - you might just show a screenshot of the dataset or something to show that it has duplicate rows in it before you go about removing them.
[x] During this section on duplicate data it might be worth connecting back to the opening section on crowdsourcing. Is there something particular about crowdsourcing data collection that means duplicate data is likely to occur?
[x] P37 - you might offer some check to show the user that the duplicates are deleted, maybe by printing out the shape of the data frame again to show that it is a different size now.
[x] P38- to tie this back to your research case study, you might offer some reason why you would want to start processing the data before it was complete.
[x] P40- This sort of elides what the code df.isnull().sum() is doing. Specifically the .sum() part.
[x] P43 - I think this would be confusing because, unless I’m mistaken, you’d need to do menu.shape() and not df.shape() right? So just being explicit about what code should be run would be helpful.
[x] In general, during this section I think it would be helpful to further the thought experiment by thinking through what research question you might be working through that would call upon these methods. Why might you want to drop things that have more than 50% null values? Why not? Helping to play out those scenarios will help humanists realize why the methods might be useful.
[x] p46 - just playing around with this, it looks like the .dropna() function won’t modify the original data frame. So if I run menu.dropna() it will indeed drop the rows. But if I then call menu.shape() afterwards I still get the old data frame with the rows in it containing null values. So probably worth clarifying that for the reader.
[x] p46- probably worth re-stating that you would call menu.head() to see the first five rows. Someone using pandas for the first time is unlikely to remember.
[x] P51 - to someone new to programming, the distinction you’re making between datatype and date is likely to be difficult to follow. It might help to give an example. Also, for what it’s worth, you sort of gloss over how to clean and regularize the dates, which is what I thought you were going to talk about. Is there a way to do that programmatically? Or is that best done by hand and by guidelines?
[x] p52 - this is probably likely to be interesting and also confusing to new users, so I’d expand the piece on how the date range is limited.
[x] I think it’d be worth concluding the lesson somehow; maybe reflecting on what the reader has learned, further steps that might be taken from here, why this was important for their research, what they learned about the dataset, etc. etc. One way to do this might be to reference Rawson and Muñoz on "Against Cleaning." They also use the NYPL dataset, and that discussion would complicate some of what you're saying about why data cleaning is necessary. Since we're writing articles for humanists and not technical documentation for software engineers, I think the readership will appreciate that sort of commentary.

hburns2 commented 4 years ago

I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

hburns2 commented 4 years ago

Apparently pandas has been updated again to 1.0.5, so I've updated everything accordingly

hburns2 commented 4 years ago

I have completed the recommended edits!

walshbr commented 4 years ago

Got it! I'll do another read for any remaining things before looking for reviewers as soon as I can.

walshbr commented 4 years ago

@hburns2 - finished reading! This looks good to me. I have a list of recommended edits I'll share here. There should be far fewer of them than before and they are largely less substantive. The only two suggestions that are slightly longer:

When you reiterate the current state of the code that should be done in code blocks and more frequently.
I think you could tie the functions back to the research questions as you're going.

There's nothing here that would radically change things, though, so I think this should be ready for review once you can do a sweep through for them. For next steps, can you suggest a date by which you'd be able to take care of these things? I can start looking for reviewers and loop them into this thread, and then I'd ask that you just ping us in the thread here so we're aware of when the file is ready to be read by them.

[x] P1 - “data collected in this manner inevitably contain variability.” Missing a word or something
[x] P2 - “In this lesson, you will with Python and the pandas library with a dataset from the New York Public Library and will learn the fundamentals of data cleaning and identify common issues when utilizing crowdsourced data.” Missing word and serial comma
[x] P6 - might think about a concluding sentence here
[x] P9 - you might make the subheader here a question as well, so as to be consistent with the others in this sequence.
[x] P15 - under Submission Guidlines / Special Characters you might just give an example of a special character.
[x] P21 - do you assume here that people are already in their virtual environment with it activated? Might be worth saying so or giving the commands for it.
[x] P25 - I might just gloss each of those terms briefly.
[x] P26 - if one of those is the question you actually go on to examine, it might be worth listing it as a statement rather than a question and saying so.
[x] P28 - I think I’d give them a filename to use besides filename.py, but that’s just me. I’d change it here and elsewhere
[x] P29- Did you wind up using the Jupyter notebook file? It doesn’t look like it is linked or in the repository.
[x] P33 - I might note whether the NAN is expected as part of the data or an error. And also maybe note whether that is because this is the nature of spreadsheets or something about crowdsourced data in particular.
[x] P34 - it’s probably a good idea to periodically give the state of the code as it currently is, since you’re adding things to it as you go. And I’d also reiterate the way to run the file by saying something to the effect of “run your file again with python pand.py” to keep people grounded in the way to do things. Just because you don’t know how technically proficient the reader will be. Update as I read further - I see that you do this, but I’d perhaps do it a bit more frequently and use code blocks instead of screenshots
[x] P42 - note that there is no screenshot here. But also it would be better to just write out the code block rather than using a screenshot anyway, as that will be more sustainable and easier to edit in the future if you/we have changes to make.
[x] P55 - My one substantive thought throughout was that you might tie some of these functions you’re introducing specifically to the ways they are tied to particular needs or questions from the case study. This might be a good opportunity to do so. Is there a reason you could suggest for why these columns are null-free but the others are not? Seems tied to the fact that this is crowdsourced data.
[x] P57 - I’d reiterate the research questions here as you reference them in passing.
[x] P62 - maybe offer a sentence about a use case where this might be true?
[x] P67 - ditto here on code block instead of screenshot
[x] P73 - Is there a different way you would recommend dealing with dates outside the bounds of what pandas can do?

hburns2 commented 4 years ago

@walshbr Thank you so much for your feedback and suggestions! Looking it over, the updates do not seem too time-intensive. I will aim to have these completed by August 29.

For the Jupyter Notebook, is there a specific place I should add it to the repository?

walshbr commented 4 years ago

Cool! You could put the notebook in the assets folder for the lesson next to the CSV -

https://github.com/programminghistorian/ph-submissions/tree/gh-pages/assets/crowdsourced-data-cleaning-with-pandas

Also FWIW I updated the link to the download for your CSV, so you might check it out in the file now to see how to correctly link internally to the notebook file when you upload it.

hburns2 commented 4 years ago

@walshbr I have hopefully addressed all edits and the lesson is ready to be read

walshbr commented 4 years ago

Thanks @hburns2! Have one reviewer willing to read and just looking for another now.

walshbr commented 4 years ago

Just a note that @mkudzia and @brandontlocke are going to be acting as reviewers for the lesson. They've agreed to try to get things in before October 3rd or so but let me know if you have any questions.

mkudzia commented 4 years ago

Hey there @hburns2 , just finished reviewing your awesome lesson! I have both some general comments at the beginning, and then more specific ones (mostly with paragraph numbers included) below. Please do not hesitate to let me know if any of the comments don't make sense/need clarification, if you disagree, etc. Overall, great work!

Overall comments:

Maybe consider re-framing from data cleaning to data normalizing? There's an increasing move away from "cleaning" framing (though I do realize that there also might be good reasons for leaving the phrasing as-is). You kind of address it obliquely in your conclusion, but I would love to see you be more explicit about that - at least that it's a conversation - at the beginning, since it's a really important consideration for humanists.
- For extra awesomeness it would be great to see you discuss the trade-offs; what does normalized data allow you to do that non-normalized data doesn't?
Is there a way to talk about the benefits of cleaning up your Python file as you go, once you no longer need some of your old print statements, etc.?
I really like the way you check in periodically to show us what our code should look like at various points along the way. So helpful!
I think this is a challenging lesson to have written, both in terms of the amount of ground you're covering with respect to content (what crowdsourcing is, crowdsourcing ethics, challenges of crowdsourced data, python and pandas), and in general I think you're doing a great job. I noticed that it seems like your assumption is that readers are more familiar with python (or will go back and gain familiarity using a different tutorial) and that you're focusing on folks who are beginners to crowdsourcing. If that is indeed your intention, it might be useful to state that explicitly toward the top of the tutorial. That said, I'll have some suggestions on places where you might elaborate a bit on your python instructions.

Specific Comments:

Some of the links in the introductory sections are less self-explanatory than they could be; in particular there's a reference in P10 to "Simper's article," but that hasn't been previously referenced so we don't know who/what that is. Maybe there was a previous reference that got edited out? I can make other specific notes if that's helpful, but this section could use a general once-over to make sure the references are clear.
In P11, you title that section "guidelines," but this actually seems more like a best practices overview. It also seems like it's a mix of both suggestions for researchers and for crowdsource participants. Could you clarify this section a bit? Maybe the section aimed at researchers could be better described as "best practices for researchers" and the participant section as "best practices for submitters/for submission guidelines"?
Nice methodology note at the end of P19 letting us know you duplicated a row so we could practice de-duping!
In the "prerequisites" section (P21), I suggest making two changes:
- Explicitly adding "familiarity with Python, and with either the command line or Jupyter notebooks," and
- Changing the third sentence to "Should you have no previous experience working with Python, consider starting with 'Python Introduction and Installation'[https://programminghistorian.org/en/lessons/introduction-and-installation], as you'll need to have Python installed on your computer and be familiar with some basics. You will need to create a virtual Python 3 environment for this lesson."
It would be nice between P21 and P22 to include a reminder about how to start and stop a virtual environment (I always have to look this up, as I don't use virtual environments all that often).
In P33, could you tell us why null is listed as NaN? This is purely to satisfy my curiosity. I appreciated that at the end of the paragraph you note that a column containing mostly or all null values is a red flag!
In P37, can you tell us how we'd drop rows if we wanted to? Would that be axis=2?
Oh bummer, we were finally getting into the real meat of addressing data issues but then we kind of stop cold at the end. Specifically, I'd love more info on dealing with inconsistent date/time data once we get to P71-75!
- You mention the error dealing with the incorrectly inputted year for entry number 13112, and say "while we are then able to programmatically find that date and replace it", you don't actually say how to do that!
- Would you consider adding, or linking to, instructions on how to do this? I went into the spreadsheet and searched for the entry number in order to fix it manually, but I actually really want to know how to fix this programmatically! Especially since that's not the only incorrect date and I don't love the idea of going through one at a time and fixing those manually, and I'd also like to end the lesson with working code.
I think P77 might make more sense in the introduction, and then you can reintroduce this topic in P78 as part of a conclusion, but ideally you wouldn't introduce brand new conceptual material in the conclusion. That said, you could move this paragraph to the intro section, and replace it with a recap of what we've learned in the lesson.

walshbr commented 4 years ago

Thanks for the helpful review @mkudzia! @hburns2 - at this stage we'll wait for the other review to come in before going forward so that the text of the lesson doesn't change while @brandontlocke is reviewing it.

brandontlocke commented 4 years ago

Hi @hburns2! Thanks for putting this together. This was fun to do and I learned quite a bit. Great job! Please don't hesitate to ping me if you want clarification on anything, or if you have questions.

General

I really love your inclusion of some more meta and critical perspectives on both crowdsourcing and data 'cleaning' in this lesson. I think this really strengthens the piece and makes it a lot stronger and more useful than a more simple pandas how-to. I do have some feedback on those specific bits in later sections.
The Python/pandas portion of it is very clearly written and easy to follow. I did both the .py and Jupyter versions, and I really appreciate having both options. Is it possible to share a Jupyter notebook that doesn't already have the out of every line in it? I think it would be more satisfying to see it appear from nowhere.
I think the ending is a little unsatisfying in two ways, but it's an easy fix. First, by ending on an error, I'm wondering: did all dates except that one error get converted? Or does an error mean that it just stopped when it hit a problem? Is there a way to either correct and re-run the date conversion, or just drop the line and then finish converting to datetime? I think the second way that it's unsatisfying is that I want to see the finished project, and have some idea of how I can actually use it. I'd love to be able to write the new dataset to a CSV, and then be able to open it and see the difference.

Why use crowdsourcing

I think the Squirrel Census, Penguin Watch, and Netflix Prize are good and notable examples, but, given the venue and expected audience, I am wondering if some cultural heritage examples may be more appropriate? Perhaps something like Transcribe Bentham or one of the Library of Congress's crown projects would be of more interest to this audience, or would at least provide some balance and examples of different types of crowdsourcing.
I also really appreciate the discussion of ethics in crowdsourcing, and the ways that it may just create a different type of labor. These are super important, as some seem to view crowdsourcing as a free labor windfall. I do think this could be bolstered by some discussion of crowdsourcing as a method of engagement for cultural heritage orgs, with the side effect of creating good data, which is the framing Trevor Owens introduces in "Crowdsourcing Our Cultural Heritage" edited by Mia Ridge. The book also has some great examples/arguments for material and immaterial rewards systems in several of the case studies (Blaser, Lascarides/Vershbow, Dafis/Hughes/James).
I really appreciate the thorough explanation of the datasets available from the site. It really helps to put everything into context, and I think also tells us a lot about the complexity of a project like this - there's lots more data to organize!

Clarity/copyediting stuff

p5 I think the opening sentences could be strengthened a bit. The phrases 'have been developed' and 'come into fruition' are not particularly active nor descriptive - have they eneabled things that wouldn't have been possible before? Have they achieved something special?
p6 The sentence about 'On the Ethics of Crowdsourced Research' is a little awkward. I think it would sound better if it were some variation of "Vanessa Williamson, in her article 'On the Ethics of Crowdsourced Research,' argues/declares....."

Exploring the NYPL Historical Menu Dataset

Most of these comments are based on an assumption that this is for absolute beginners. I know there is usually some explanation at the top about expectations (as there is here with the concepts of data cleaning and tidy data) , so some of these may be rendered moot by the expectation of basic Python familiarity.

I think it may be helpful to either explain why we're keeping all of the lines we're writing in the script and re-running them every time, or to explain that we're doing things iteratively, so we may run the script to show us all the columns, then, once we know all of the names, we'll remove that and write a line that will drop some. That being said, I did appreciate the check-ins that showed the full script to compare to what I had.
p27 & p29 -link to menu.csv and Jupyter file are broken due to relative linking
p33 It may be helpful to clarify that the ellipses represent columns that aren't being printed right now and/or explain that by default it prints the beginnings and the ends and cuts out the middle. It may be confusing to people new to Python and are wondering where the 20 columns are
p41 It may be helpful to clarify that each time we run this, we're re-importing the data all over again. That would explain why, even though I've just dropped 6 columns, when I re-run it, it shows me that I have 20 columns.
p50 - "Or, for instance, it might be that researchers are transcribing menus chronologically, therefore you have records for every menu but not data to possess said records." - I'm not sure what this means.
p60 - I'm a little confused — doesn't it make more sense to remove the rows with missing data, then remove unnecessary columns? In other words, if some things are close to that 25% null line, doesn't it make more sense to remove the rows that are completely empty, and then determine which columns are no longer needed? If there's a reason to do this that I'm missing, it may be good to explain the rationale for the order, just to demonstrate the thinking process that goes into this.
p63 - I think the concept of creating menu as a subset should be explained closer to when it's first used in p56. We've learned that we're creating a new variable, but there's not a clear reason why. I think explaining that it's a subset we're going to work on would be appropriate somewhere in p56-p59, as we're first using this new concept.
p69 typo in"If the dates in question show are in a standardized specific order, the function to_datetime() can be used."
Cell 7 in the Jupyter Notebook is commented out, and also doesn't work at that point. I'm guessing it's just a relic from development that didn't get removed.
Cell 10 (I think?) has several lines put together, so you don't see the result of menu.shape

Conclusion

p77 I don't think it's true that the procedures for cleaning in social or natural sciences (as opposed to humanities) are necessarily simple! While the natural and social sciences may have a more developed practice for doing this, but I don't think that means that those same issues regularly come up or deserve the same amount of thought. While they do frame it for the humanities, I think Rawson and Muñoz show how cleaning flattens distinctions, and I think that's still applicable to fields outside of the humanities, even if it is more applicable to the humanities than others. I do think it's a good distinction to make, as you have, that what you're doing here is largely just removing unnecessary columns, normalizing dates, and removing duplicates—issues that are fairly clear cut. In the case of the Rawson and Muñoz article, the issue of normalizing instances of particular dishes is where things get much more complex, and algorithmic or coded corrections can cause a great deal of issues or flattening.
I've gone back and forth about putting 'Against Cleaning' in the intro or at the conclusion. Since we don't really change data in the way that Rawson & Muñoz discuss, I think it makes sense here.

walshbr commented 4 years ago

Thanks @brandontlocke and @mkudzia for the two thorough reviews!

@hburns2 - I don't think I have anything to add, but we can chat in thread here with the two of them if you had anything you wanted to bounce off of them before working on revisions. If you don't have any concerns, the next step would be for us to settle on a date by which you would have revisions back. Again, this date can be flexible if things come up for you. I think we usually aim for about a month - would mid-November work for you to aim for?

After that round of revisions from you I'll do another read, but any suggestions for revision I have there are usually pretty minimal.

hburns2 commented 3 years ago

@brandontlocke and @mkudzia Thank you so much for your thoughts and insight! I greatly appreciate it. I'll take a look and get back to you with any questions that I have.

@walshbr That sounds great. Would sometime around November 21 work? Mid-October through early November are my busiest months for university instruction and, due to this, I anticipate needing slightly more time to ensure I thoroughly address the feedback.

walshbr commented 3 years ago

Whatever works for you works for me @hburns2! We can say November 21st if you want and then see how things go. If your other obligations get in the way we can certainly be flexible.

walshbr commented 3 years ago

@hburns2 - thanks for this! In response to your email question - I think the date section works fine as is. I had a couple local suggestions for how you might deal with the fact you don't have space to go into detail about particular things. But I think it does a better job now of concluding than it did before. The word count limit on lessons is more of a relative guideline than a fixed rule that helps take care about the labor that goes into translating, in particular, but if you needed a little more space because you wanted to work a bit more on that section that'd be fine. But, again, I think it works fine as is with a couple suggested tweaks below. One you have a chance to incorporate these change (make sure you're working from the latest version on this repository) it should be ready for copyediting. So just check in with me and I'll ping @svmelton for her to pass it along to copyeditors at that time.

Should note that only one of these is really a substantial suggestion. Most are just local moments.

[x] p 2- “That being said, there are some relatively universal steps that can be performed, such as the ones discussed in this tutorial.” I think there is a missing sentence you want here to transition back into why you want to do this work. Something like…”That being said, data normalization is especially useful and necessary for X reason. To carry out, there are some relatively universal steps……” I think this is especially important because then you spend a good deal of time complicating the topic and showing why you might not want to do this.
[x] p3 - it might be worth noting up from that you will refer to the process as data cleaning (since it seems like you will) even though you note the objections to the term. Or you could switch across the board to “data normalization” in the title as well. the latter feels like slightly more work, but might make more sense given the opening remarks you have. And it might not actually be that difficult if you did a find/replace and checked the context around each instance to make sure it worked. I might suggest that given the reviewers' remarks as well, but your call! Let me know if you'd like to discuss further. It seems like you make a good case for using normalization instead but then walk it back as is.
[x] P8 - “crowdsourcing tasks such as transcription” seems grammatically in conflict with “provides”. So I might just modify the first word of the quote to be “provide” since it corresponds with crowdsourcing tasks.
[x] p9 ish - I made some modifications to the header levels you used in the opening sections. the title doesn’t need to appear again, as that will occur in the page anyway. accordingly the top-level header on the page will be a H2 tag (2 #’s in markdown). Some other suggestions (take them or leave them) that might help the structure of the piece feel clearer upfront - I might make “things to consider” and “best practices” both H3’s, which would give a uniform structure to the opening suction. I’d also suggest getting rid of either “getting started” or “prerequisites.” Since structurally that is just a section with only one thing inside of it I think I would just collapse that level and just call it either “getting started” or “prerequisites”
[x] I think it might be worth a sentence somewhere in the getting started section to the effect of “While this is, technically, a programming lesson, the amount of technical work in data normalization can be simplified or complicated greatly by a number of human factors. So these best practices will reduce the amount of time you have to spend normalizing later.” (As a way of tying the advice to the programming later one). Maybe at p 14?
[x] P 19 - I might add a transitional sentence at the beginning of this paragraph that moves from the best practices stuff to the case study. Honestly, you might just move these two sentences from this previous paragraph down to start the NYPL section - “However, no matter how strict your guidelines or your submission protocols, variability inevitably will be present in your collected data. That being said, there are ways to identify and normalize those instances.” And follow that with “The NYPL possesses a digitized collection of approximately 45,000 menus, dating from the 1840s to today, and offers a good case study in how to correct some of these inevitable issues” (bold is my addition)
[x] P43 - I might clarify for people that they’re not actually modifying the Menu.csv file (all these things are just happening internal to Python). I can imagine a new Python programmer being afraid. Also is the reason to do this dropping that it improves performance?
[x] P45 - might be worth a parenthetical here saying (though you might have removed some of the old print statements if you took my advice)
[x] P77 - I think you might want to make clear that the coerce method isn’t something you’re asking them to use here. Could imagine someone reading quickly, seeing python, and not distinguishing the thing you bring up in passing here as something not to include in their code. I think maybe another sentence caveat describing how that is out of scope and pointing them elsewhere to learn more for it could be good!
[x] At p 82 you still have the line in that is intentionally producing an error, so you probably want to remove it (or maybe comment it out and indicate that it was the line that intentionally produced the error?). That way the code will run when you add another line in the next section.
[x] At p 83/84 you probably can remove the print statement around the .to_csv function. Also the parenthetical needs to be closed.
[x] It might be worth a sentence early on, when you describe the case study and research questions, saying that actual analysis is out of scope for the lesson. You’re really just getting things to the point where you could then do follow-up work with them (and that’s fine!).

hburns2 commented 3 years ago

@walshbr sorry for the delay! My computer malfunctioned, but we're up and running again. I think everything has been addressed and it's ready for copyedits. I updated the Jupyter Notebook with edits, as well. Thank you so much for your feedback and insight! It's greatly appreciated

hburns2 commented 3 years ago

Oh! I also changed the name of the lesson as swapped out most iterations of "cleaning" for "normalizing"

walshbr commented 3 years ago

Thanks @hburns2! We're on break until after January 5th, but I'll double check everything before passing off to @svmelton for copyediting then.

walshbr commented 3 years ago

All looks good to me! I caught a couple small things, but I think those will come out in copyediting. @svmelton this is ready to send along to the copyeditor. I think this should be everything but let me know if I'm missing anything?

assets/crowdsourced-data-normalization-with-pandas - asset files no image files lessons/crowdsourced-data-normalization-with-pandas.md - the lesson file gallery/crowdsourced-data-normalization-with-pandas.png - the modified avatar gallery/originals/crowdsourced-data-normalization-with-pandas.png - the original avatar

The bio -

- name: Halle Burns

  team: false

  orcid: 0000-0003-2346-2876

  bio:

      en: |

          Halle Burns is an assistant professor in the University Libraries at the University of Nevada, Las Vegas. She is also a certified instructor with The Carpentries, an international organization that teaches coding and data science skills to researchers.

svmelton commented 3 years ago

Thanks @walshbr! I'll get to work on finding a copyeditor.

svmelton commented 3 years ago

Hi @hburns2! Copyedits are in and can be found via this Dropbox link. You have discretion about accepting wording changes, etc. Please let me know if you have any questions!

hburns2 commented 3 years ago

@svmelton I have completed merging the copyedits with the lesson file. GitHub should now reflect the most recent version of the lesson

svmelton commented 3 years ago

Thank you @hburns2! I'll complete my piece in the next few days, and we can get this published!

hburns2 commented 3 years ago

@svmelton sounds great! I caught a few final spelling errors I didn't catch in my last read-through. So NOW I'm done messing with it.

svmelton commented 3 years ago

And the lesson is now live! Thanks to @hburns2 for all your work, to @walshbr for editing, and to @brandontlocke and @mkudzia for reviewing.

walshbr commented 3 years ago

Yay congratulations! I just tweeted the lesson out - please do promote so folks can enjoy your fine work. Thanks to @brandontlocke and @mkudzia for their great reviews and to @hburns2 for a lovely lesson.

mkudzia commented 3 years ago

Woohoo! Congrats @hburns2 this is so exciting.

programminghistorian / ph-submissions