Closed drjwbaker closed 2 years ago
@thomjur and @rcmapp have agreed to review this. Thanks Thomas and Rennie! The plan is to have both reviews in before 13 January, after which @mjlavin80 I'll summarise the reviews for your attention (though, having published with us before, you know how this works!)
Sounds great! Thanks, everyone, for agreeing to work on this lesson.
Thanks for acknowledging @mjlavin80 (as well as for contributing to PH!)
@drjwbaker @mjlavin80 Just a quick question: the images are not displayed when I follow the above-mentioned link. Could you please add them? Or is this due to a problem on my side? Thanks!
@thomjur Sorry for catching this late. The images are at https://github.com/programminghistorian/ph-submissions/tree/gh-pages/images/linear-and-logistic-regression and I've edited the code to fit our author guidelines, but I can't figure out why they still aren't rendering.
@anisa-hawes: any idea what I am doing wrong?
solved with commit 5c06b46
@tiagosousagarcia: fab. Thanks.
Hello @tiagosousagarcia and @drjwbaker ... These two images still aren't showing on the Preview for me ... I will take a look and see if another tweak is needed...
@anisa-hawes it seems that the escaped quotation marks in the figure caption are playing havoc with the transformations. A workaround could be to use “ instead of ", as in this lesson https://programminghistorian.org/en/lessons/interactive-text-games-using-twine (figure 5).
Thank you, @tiagosousagarcia. This is solved!
All figure images are showing correctly now, @thomjur. Please let me know if you notice anything else amiss!
@mjlavin80 @drjwbaker @rcmapp First of all, thank you very much for giving me the opportunity to review this interesting tutorial. I very much enjoyed reading it, and, as a trained historian with a growing interest in quantitative methods and computational analysis but without any deeper knowledge of LR or LogReg, I assume I am part of the “ideal” readership. I learned a lot, and I am confident that this tutorial will, once it is finished, be an important contribution to the Programming Historian.
In the following parts, I will focus on those aspects that I found problematic or where I simply had questions (and only to a lesser extent on the several passages that I found convincing – of which there were quite a few).
- […] s() returns seconds, as far as I can see) Sure, days consist of multiple seconds (^_^), but I found this confusing.
- […] CountVectorizer + TfidfTransformer when sklearn’s TfidfVectorizer does everything for you (including the selection of the most frequently appearing k words)? This could make the tutorial much shorter; yet, I can also understand if you say that this decision has been made to render things more explicit, which might even be better for learning purposes.
- […] random_state parameter is correct here
Again, I really liked your tutorial, and I am confident that it will be of great help for many historians and scholars in the humanities once it is ready! I am looking forward to the discussion!
Thanks for your thorough, thoughtful, and positive review @thomjur.
@mjlavin80: quick reminder that I don't expect you to respond directly, rather I'll summarise actions from the reviews once they have both come in. However, one thing to perhaps consider now is the point about the lesson dataset and conclusion. It may be worth spending some time summarising in the conclusion what the approach enabled you to do and then guiding the reader towards better understanding the types of scenarios in which they might choose to take a comparable approach (expanding perhaps on the introductory comments, which I note are more 'traditional' quantitative history than your - very interesting - text based example). And in anticipation of the lesson being translated across our publications, this will have the secondary benefit of helping editors/translators localise your article during translation, if that is deemed appropriate (given our international approach).
@drjwbaker I'll start thinking about both of those points!
Now expecting review from @rcmapp end of w/c 31 January. Thanks again to @rcmapp for fitting this in at a busy time!
Hello all,
Please note that this lesson's .md file has been moved to a new location within our Submissions Repository. It is now found here: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/en/drafts/originals
A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/linear-and-logistic-regression
Please let me know if you encounter any difficulties or have any questions.
Very best, Anisa
Review of Linear and Logistic Regression Lesson for Programming Historian By @rcmapp
INITIAL COMMENTS I’m genuinely impressed by the coherence and extent of this lesson: it moves skillfully from concepts to descriptions of discrete steps to math to code. It adheres to the PH model very well as far as I can tell. Although I don’t perform text analytics regularly myself, I’m familiar with the concepts of text analytics and the purposes for which humanists use them. Thus it’s important to say that I’m offering the following comments from the point of view of someone who has organized and attended a lot of technical workshops for humanists, rather than as someone who regularly performs this kind of analysis myself. I will follow the PH Reviewer Guidelines in my commentary, but I won’t have tested the lesson's computational methods.
AUDIENCE
GETTING READY
SKIMMABILITY
PAYOFF
WORKFLOW
SUSTAINABILITY
INTEGRATING WITH THE PROGRAMMING HISTORIAN
(reposting from https://github.com/programminghistorian/ph-submissions/issues/458)
Now the reviews are in, I've had some time to synthesise them. Huge thanks to @thomjur and @rcmapp for their hard work, constructive commentary, and insight. My instinct is that we have a strong article here, and that it will be strengthened further as a result of peer review.
So, @mjlavin80 my job now per our editor guidelines is to offer you a clear path to completing the article. We ask that you complete the suggested revisions in 4 weeks (so deadline of 18 March) with commentary on the revisions you have made and justification for those you haven't.
I suggest you focus on the following:
Do those sound reasonable asks in the next 4 weeks?
@drjwbaker Thanks for synthesizing these reviews so quickly! And thank you to both reviewers @rcmapp and @thomjur! All of these suggestions sound just right to me. Have you given any thought to the comments about splitting the lesson into two articles, or a Part I and II? Based on the nature and extent of the requested changes, that's what I would prefer, but I will do whatever you think is best. A deadline of March 18 should work.
@mjlavin80 For two reasons I have a preference for not splitting it. Reasons:
Does that make sense?
@drjwbaker as long as you think it hangs together, I'll defer to your judgement. Thanks!
I do. Keep me posted if 18 March is looking unlikely.
@drjwbaker I have pushed my revisions to this lesson. Thanks again to you and both reviewers, @rcmapp and @thomjur! Here is a summary of the changes I have made (responses inline).
I have addressed this item in three ways. First, I have refined the regression examples in the introduction. Second, I have added a paragraph to the dataset description in which I explain why I think this dataset is a good choice for the lesson. Third, I have revised the description of the simple linear regression example (pairing a book review's year of publication with its word count) to be clearer about why that example is being used.
I have expanded the lesson goals to make explicit mention of section headings to follow. I have also broken the goals into a numbered list so they are easier to read and separate from one another. Regarding the conclusion, I have done as requested and summarized what the lesson covers, with an eye toward underlining the rationale behind what I have covered. Lastly, I decided to remove the adjective "high-level" from the lesson. I suspect that this term can mean different things to different people, which may have been causing some confusion.
I have changed the phrase 'how regression fits into data analysis research design' to 'how linear and logistic regression models make predictions'. I think this is the most efficient way to resolve the confusion around this phrase.
I have added the lesson-files.zip file to assets > linear-logistic-regression and linked to that file in the lesson text. The ultimate location of lesson-files.zip (and the link's href) will have to be updated when the lesson is published.
Here I revised the "Suggested Prior Skills" section to be more explicit about which foundational skills might be found in which lesson. I've also added references to a few other lessons. It should now be clear that there isn't a very strong join between this lesson and “Analyzing Documents with TF-IDF” or “Understanding and Using Common Similarity Measures for Text Analysis.”
My revision of "Suggested Prior Skills" addresses this concern as well.
Now paragraph 41. I've addressed this by explaining the value of using regression to assess relationship and linearity. I think it's a good example to use because it demonstrates forming a hypothesis of how two variables might be related, and exploring it with a linear regression. What's more, the two variables are easily understood. In general, I don't think it's problematic to run a linear regression in a context like this one, as long as it's clear that the results don't suggest a strong linear relationship.
I don't know of a good resource to link here, but I'm open to suggestions.
Now paragraph 57. This function converts the data to seconds, but the ratio remains the same. To avoid confusion, I've changed the function to divide "days so far this year" by "days in year" instead of "seconds so far this year" by "seconds in this year".
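As a rough illustration of the revised calculation described above, here is a minimal sketch. The helper name fraction_of_year is hypothetical and is not taken from the lesson, and counting only fully elapsed days (so 1 January gives 0.0) is an assumption; the lesson's own function may differ on both points.

```python
from datetime import date
import calendar


def fraction_of_year(d: date) -> float:
    """Return the share of the year that has elapsed before date d.

    Hypothetical helper: divides "days so far this year" by "days in year".
    """
    days_in_year = 366 if calendar.isleap(d.year) else 365
    # tm_yday is 1-based, so subtract 1 to count only fully elapsed days
    # (an assumption; using tm_yday directly is an equally valid convention).
    days_so_far = d.timetuple().tm_yday - 1
    return days_so_far / days_in_year


# Example: express a review date as a decimal year.
review_date = date(1905, 7, 2)
decimal_year = review_date.year + fraction_of_year(review_date)
```

Because every day has the same number of seconds, dividing seconds elapsed by seconds in the year gives the same ratio; working in days simply makes the intent easier to read.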
When I used a path library for my last lesson, this decision seemed to cause more problems than it fixed. Ideally, someone proficient enough to do this lesson should be able to figure out their own path.
Now paragraph 68. TfidfVectorizer doesn't work if the data is already in CSV format, which is why I am using DictVectorizer instead of CountVectorizer for this lesson.
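To make the distinction concrete: CountVectorizer and TfidfVectorizer tokenize raw text, whereas DictVectorizer accepts features that have already been counted. The sketch below uses invented term counts and assumes the counts are then passed to TfidfTransformer, as the reviewer's comment suggests; it is an illustration, not the lesson's actual code.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Invented term counts, standing in for rows already read from a CSV file.
term_counts = [
    {"book": 3, "novel": 1},
    {"book": 1, "history": 2},
]

# DictVectorizer turns a list of {term: count} dictionaries into a
# sparse document-term matrix, with one column per distinct term.
vectorizer = DictVectorizer()
count_matrix = vectorizer.fit_transform(term_counts)

# TfidfTransformer re-weights the raw counts as TF-IDF scores.
tfidf_matrix = TfidfTransformer().fit_transform(count_matrix)

# Column order of the matrix (sorted alphabetically by default).
feature_names = vectorizer.feature_names_
```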
Now paragraph 78. I'm pretty sure I'm right, but I added some text to clarify what I mean in case I was causing the confusion. :)
This is a very big topic, and the methods I would discuss wouldn't be very relevant to the lesson because they aren't really useful with computational text analysis. They are much more applicable with a small number of variables.
Now paragraph 103. I've revised to clarify this point.
Now paragraph 136. It is true that the majority of the reviews with female labels are found in the lowest range bucket, and that the split of m-f labels in each bucket is what's most significant. As the value increases, the probability of an f label increases. To my eye, the use of "absolute numbers" increases the clarity of the example.
para. 124ff.: I am a little puzzled when it comes to the mathematical explanation of logistic regression, and I am uncertain whether this part is necessary for the PH audience (since, if I remember correctly, there were no explanations either of how to get the line of best fit). Maybe I am misunderstanding something here (I am not a statistician), but the sigmoid function appears somewhat out of the blue and then there are several parts where I am not sure if they are correct. For instance, “e represents the natural log” and “Calculate the natural log of that product (e^(a+ bXi))” to me, e^(a+ bXi) here is the exponential function (e^x^), so the inverse of the natural logarithm. Also, “Multiply the variable’s coefficient (x) by the predictor value”, but the coefficient is b, isn’t it? All in all, please make sure that this part is correct, although having finished half of a BA study program in computer sciences (again: ^_^), I don’t feel comfortable evaluating the mathematical details here. Also, as I’ve already said, I am not sure if these details are really necessary here. I think a more general explanation of how to do classifications via thresholds might also do the trick (plus some links to further reading).
Now paragraph 140. Yes, it's correct to say that this equation expresses the inverse of the natural logarithm. Maybe just a product of hurrying on my end, or a case of getting natural log and exponent mixed up at the time. I believe the a + bXi part of the formula is correct, meaning coefficient * variable value, plus intercept. I'm also not sure if these details are helpful or not. Happy to defer to the judgment of @drjwbaker on whether to cut or condense this whole section. For now, I've cleaned up the description and added a couple of (hopefully) clarifying sentences.
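For readers following the point about e and the natural log, a minimal sketch of the standard logistic (sigmoid) function with a single predictor may help. The function names and the 0.5 threshold are illustrative assumptions, not the lesson's code; a is the intercept, b the coefficient, and x the predictor value.

```python
import math


def predict_probability(a: float, b: float, x: float) -> float:
    """Apply the logistic function to the linear term a + b*x."""
    z = a + b * x  # intercept plus coefficient times predictor value
    # Equivalent to e**z / (1 + e**z); e**z is the exponential function,
    # i.e. the inverse of the natural logarithm.
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability: float, threshold: float = 0.5) -> int:
    """Turn a probability into a class label using a threshold (assumed 0.5)."""
    return 1 if probability >= threshold else 0


p = predict_probability(a=0.0, b=1.0, x=2.0)
# Taking the natural log of the odds recovers the linear term a + b*x.
log_odds = math.log(p / (1.0 - p))
```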
Typos
- para. 40: “aref” (done)
- para. 50: “… review WILL have another.” (done)
- para 102: “… values are outside..” (redundant “are”); “abut 22%” (done)
- para.167: “all”, “converted” (done)
- para. 202: “the fact that this reviews ARE ambiguous” (done)
I have added mentions of pandas, matplotlib, and seaborn libraries to the "Before You Begin" section of the lesson.
I have renumbered the endnotes and fixed the markdown so they now work as expected.
@mjlavin80 Thanks so much for your hard work revising the article. Working through the edits now.
- para. 34ff.: When explaining how LR works (which is well done), I think you do not say anything about how the actual line of best fit/formula is calculated, which is fine since this shouldn’t become too mathematical, but a link or short comment might be helpful.
I don't know of a good resource to link here, but I'm open to suggestions.
@mjlavin80: is this now a different paragraph number?
@drjwbaker I believe that would be para 44 now, just before the visual with error lines.
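On how a line of best fit is calculated: for a single predictor, ordinary least squares has a closed-form solution, sketched below. This is a generic textbook illustration with made-up numbers, not the lesson's code; the function name fit_line is hypothetical. scikit-learn's LinearRegression minimizes the same sum of squared errors.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```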
@mjlavin80 Okay. I've gone through the revisions. I am satisfied that the article addresses the substantive and minor points raised in peer review, and that where you haven't adjusted based on peer review your rationale for doing so is robust.
Which is to say that I'm happy to proceed to the next stage of publication https://programminghistorian.org/en/editor-guidelines#recommend-publication---editorial-checklist (pinging @svmelton as ME for info)
However, and this is mea culpa, I was in error when I said:
@mjlavin80 For two reasons I have a preference for not splitting it. Reasons:
What we have is under our word limit https://programminghistorian.org/en/author-guidelines#step-1-proposing-a-new-lesson
Our word limit is 8k https://programminghistorian.org/en/author-guidelines#step-1-proposing-a-new-lesson This is 20k. We have a word limit both for pedagogical reasons and to support translation/localisation work. So I think we should abide by this.
Looking at what we have, a few things strike me:
I then see three actions here.
Again, many apologies for spotting this so late. Not sure how I miscounted so badly. Hopefully the path I've described can work without too much additional work on your part.
Thank you very much for the revision! Just a quick question @drjwbaker : Are we supposed to go through the changes and (potentially) add further comments? Just an "economical" question because if you've decided that the article is fine as is, I would rather wait and read the final article(s). :-)
@thomjur Thanks for the question. I should have clarified. You and @rcmapp don't need to contribute anything further. Many thanks again for your constructive and insightful peer review.
@drjwbaker I am happy with the proposed split and the nature of the split as you described above. Thank you for offering to work on the first round of edits required for the split! For anything that needs to be written from scratch as opposed to edited or moved around, I would be happy to draft those, that way I can honestly say I wrote the whole thing. How would you feel about doing the other steps first and then pointing me to paragraph locations where text is needed?
@mjlavin80 Thank you! And apologies again. What I'll do is make this split, then point out the areas I think need the work.
Hello @drjwbaker. I'm really happy to help you with this. Please let me know when the two lessons are (roughly) ready. I can set aside time for copyediting, keeping in view the necessary cross-references.
Happy to meet you, @mjlavin80, and looking forward to supporting the publication of these lessons!
Thanks, @thomjur for asking this! (I was just about to dive in, but now I can eat lunch.) I am excited to read the final article, and also to refer our graduate students in the DH Certificate program to it.
Thanks @drjwbaker for managing this peer review process; I have enjoyed the process and would look forward to reviewing again. Congratulations, @mjlavin80 ! I think people will really appreciate and use your work.
Our messages crossed. Thanks Anisa!
Thank you, @drjwbaker. The build failures appear to represent places where the deployment was interrupted or did not complete. Clicking into the crosses, several report: Error: The operation was canceled.
I think perhaps these are cases where you implemented a further change before the previous build check was complete. But the most recent builds are successful and are fully ticked, so all is well.
I think we just need to update the link in /en/drafts/originals/linear-regression so that it points to the new assets file at https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/linear-regression/lesson-files.zip. I can do that.
Right. That makes sense.
Right, I've made the split so we have https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/linear-regression.md and https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/logistic-regression.md.
A run down of the main things I've done:
- Created two new lessons with new titles and slugs (note that I've left the original in place for now just in case!)
- Created /images/ folders for each new lesson
- Created an /assets/ folder for linear-regression (logistic-regression doesn't have the dataset section so doesn't need it)
- Adjusted the abstract for both to say they are one of two lessons.
- Adjusted the intro to both to say they are one of two lessons.
- Added series logic into the metadata for both lessons.
- Divided up the endnotes. As I've left the intros to both largely the same, those initial endnotes remain.
- Added a short note at the end of linear-regression pointing readers to logistic-regression.
@mjlavin80 Are you happy to fettle from there? (as there may be the odd line in the lessons that cross-refer that you know better than I where to find!) If so, send me the commit numbers on any changes you make so I can see the diffs. We can then move to publishing.
@Anisa: if I could lean on you for a little help please? I'm not sure if all the crosses at https://github.com/programminghistorian/ph-submissions/commits/gh-pages are a problem. The split lessons seem to have staged correctly at http://programminghistorian.github.io/ph-submissions/en/drafts/originals/linear-regression and https://programminghistorian.github.io/ph-submissions/en/drafts/originals/logistic-regression so I think we are okay?
@drjwbaker yes, I'm happy to work from these versions! Thanks again!
Fab. Thanks for your persistence @mjlavin80. When you are happy with the drafts, I'll work with @anisa-hawes to proceed towards publication (hopefully with no more editor-induced hitches!)
@mjlavin80 Just checking in on this. Would it be useful to set a deadline here, or do you have one already in mind?
@drjwbaker I was actually just working on it. I looked over everything and just pushed new versions, but all of my changes were relatively small. I expanded the note in Part 2 that directs readers to go to Part 1 for the prep and data sections, but the stitch work you added to the conclusion pointing backwards to Part 1 looks good to me. I think that's everything I was supposed to look at, but do let me know if I've missed anything!
Thanks so much @mjlavin80. I'll aim to check before the weekend so we can get this moving towards publication before my annual leave (w/c 18 April)
@svmelton I'm delighted to report that these articles are ready to go. Many thanks to @mjlavin80 for their hard work here.
I say "articles" because we made a split based on word length. So we have two articles: Linear Regression analysis with scikit-learn and Logistic Regression analysis with scikit-learn, which are encoded as a sequence, and both of which are - IMO - rich additions to our English language journal.
Per https://programminghistorian.org/en/editor-guidelines#5-inform-the-managing-editor-of-your-recommendation-to-publish I provide below all the info you need as Managing Editor to check these before publication. If there is anything I've missed (because as you know I haven't done this particular bit of PH labour for a while!) do let me know.
Linear Regression analysis with scikit-learn
Logistic Regression analysis with scikit-learn
Hello @drjwbaker. Shall I move the pre-split file linear-and-logistic-regression.md to our new inactive folder inside /en/drafts/originals/?
I realise that it's useful for the project's memory to keep that as a record, but I'd like to try and avoid any confusion about which files are which within the repo.
Yes please. Good idea.
@drjwbaker @svmelton @anisa-hawes It looks like my previous bio statement is out of date. It lists my affiliation as the University of Pittsburgh, which is not true anymore. Here is a new version: "Matthew J. Lavin is an Assistant Professor of Data Analytics specializing in Humanities Analytics at Denison University. His scholarship focuses on book history, cultural analytics, and turn-of-the-twentieth-century U.S. literature and culture." Also, my current Twitter handle is @HumanitiesData if you are still keeping track of those.
Thank you for letting us know, @mjlavin80. Here's Matthew's updated bio including orcID, @svmelton:
name: Matthew J. Lavin
team: false
orcid: 0000-0003-3867-9138
bio:
  en: |
    Matthew J. Lavin is an Assistant Professor of Data Analytics specializing in Humanities Analytics at Denison University. His scholarship focuses on book history, cultural analytics, and turn-of-the-twentieth-century U.S. literature and culture.
@anisa-hawes I do. It's https://orcid.org/0000-0003-3867-9138
Thanks, all! I believe the only thing that's left before the final publication steps is for @anisa-hawes to complete copyedits. Just a heads up that she'll be on leave next week but should be able to complete them once she's back.
Okay! Did I miss a step at https://programminghistorian.org/en/editor-guidelines or has copyediting not been built into that workflow yet?
The Programming Historian has received the following tutorial on 'Linear and Logistic Regression' by @mjlavin80. This lesson is now under review and can be read at:
http://programminghistorian.github.io/ph-submissions/en/drafts/originals/linear-regression
http://programminghistorian.github.io/ph-submissions/en/drafts/originals/logistic-regression
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. @svmelton and the English team have already read through the lesson and provided feedback, to which the author has responded.
Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.
Our dedicated Ombudsperson is (Ian Milligan - http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.
Anti-Harassment Policy
This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.
Permission to Publish
@mjlavin80: please could you post the following statement to the Submission ticket.