Linear and Logistic Regression

drjwbaker commented 3 years ago

The Programming Historian has received the following tutorial on 'Linear and Logistic Regression' by @mjlavin80. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/linear-regression

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/logistic-regression

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. @svmelton and the English team have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.

Our dedicated Ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.

Permission to Publish

@mjlavin80: please could you post the following statement to the Submission ticket.

I the author|translator hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

drjwbaker commented 3 years ago

@thomjur and @rcmapp have agreed to review this. Thanks Thomas and Rennie! The plan is to have both reviews in before 13 January, after which, @mjlavin80, I'll summarise the reviews for your attention (though, having published with us before, you know how this works!).

mjlavin80 commented 2 years ago

Sounds great! Thanks, everyone, for agreeing to work on this lesson.

drjwbaker commented 2 years ago

Thanks for acknowledging @mjlavin80 (as well as for contributing to PH!)

thomjur commented 2 years ago

@drjwbaker @mjlavin80 Just a quick question: the images are not displayed when I follow the above-mentioned link. Could you please add them? Or is this due to a problem on my side? Thanks!

drjwbaker commented 2 years ago

@thomjur Sorry for catching this late. The images are at https://github.com/programminghistorian/ph-submissions/tree/gh-pages/images/linear-and-logistic-regression and I've edited the code to fit our author guidelines, but I can't figure out why they still aren't rendering.

@anisa-hawes: any idea what I am doing wrong?

tiagosousagarcia commented 2 years ago

solved with commit 5c06b46

drjwbaker commented 2 years ago

@tiagosousagarcia: fab. Thanks.

anisa-hawes commented 2 years ago

[Two screenshots attached: 2022-01-05 at 14:04:11 and 2022-01-05 at 14:26:31]

Hello @tiagosousagarcia and @drjwbaker ... These two images still aren't showing on the Preview for me ... I will take a look and see if another tweak is needed...

tiagosousagarcia commented 2 years ago

@anisa-hawes it seems that the escaped quotation marks in the figure caption are playing havoc with the transformations. A workaround could be to use “ instead of ", as in this lesson https://programminghistorian.org/en/lessons/interactive-text-games-using-twine (figure 5).

anisa-hawes commented 2 years ago

Thank you, @tiagosousagarcia. This is solved!

All figure images are showing correctly now, @thomjur. Please let me know if you notice anything else amiss!

thomjur commented 2 years ago

@mjlavin80 @drjwbaker @rcmapp First of all, thank you very much for giving me the opportunity to review this interesting tutorial. I very much enjoyed reading it, and as a trained historian with a growing interest in quantitative methods and computational analysis, but without any deeper knowledge of LR or LogReg, I assume I belong to the “ideal” readership. I learned a lot, and I am confident that this tutorial will, once it is finished, be an important contribution to the Programming Historian.

In the following parts, I will focus on those aspects that I found problematic or where I simply had questions (and only to a lesser extent on the several passages that I found convincing – of which there were quite a few).

Major Issues

[x] This is more of a general comment, but one of the major problems that I encountered was your dataset. From my personal experience, I know how difficult it is to find a suitable dataset, and if I interpret your approach correctly, one major reason for choosing this dataset is that you have already worked with it before. I think this is fine, and I also think that it works really well in the context of logistic regression. Yet, the part on linear regression (both the text length and the TF-IDF parts) might still benefit from some additional explanation why you want to apply LR with this dataset, particularly in the beginning. Why would one want to analyze this type of (historical) data, that stems from a rather short period of time and is “closed” (or are there any unknown reviews from NYT out there, with fragmentary dating?), with linear regression?
[x] The next issue is somewhat related to the above-mentioned one and concerns the introduction and the missing conclusion. The introduction is very general and includes exactly those examples that I already know from other introductions to LR (and that are not related to our discipline); and using LR makes very much sense in these cases; yet, it would be very helpful for the reader if you could also explain why using methods for estimating real estate prices can also be valuable tools for historians (throughout the text, but particularly in the introduction). Or maybe better choose fitting examples right from the start. Also, there is no conclusion. It would be good to wrap things up, summarize what you’ve done in your tutorial, and offer potential perspectives of where to continue from here.
[x] Although I found your examination of non-binary data interesting (particularly from a research perspective), I, personally, had the impression that the tutorial was a bit too long and that the second part on LogReg could be slightly shorter and maybe be restricted to the m/f analysis (I know why this could be problematic, so please take it as a personal comment; also, I very much liked the analysis here, for instance the evaluation of the most important terms, etc.)

Minor Issues

[x] para. 31ff.: I know that this is an example to illustrate how LR works, but there is an obvious cluster (1920-1924) of longer book reviews, which might lead to wrong results when predicting the avg. word count of books between 1912–1920, which is a huge part of your data that “only” includes the early 20th cent.; consequently, is this really a good example (you also doubt that in your text)? Perhaps you could mention why this is problematic.
[x] para. 34ff.: When explaining how LR works (which is well done), I think you do not say anything about how the actual line of best fit/formula is calculated, which is fine since this shouldn’t become too mathematical, but a link or short comment might be helpful.
[x] para. 46: Does the function operate on the number of days or on the number of seconds in a year? (s() returns seconds, as far as I can see.) Sure, days consist of multiple seconds (^_^), but I found this confusing.
[x] para. 54: maybe refer to one of the various Python path modules here to avoid problems with Win/Linux paths (this is really not essential)
[x] para. 57ff.: This also concerns the data preparation in the logistic regression part; maybe you are doing this for better explanations, but why are you using CountVectorizer + TfidfTransformer when sklearn’s TfidfVectorizer does everything for you (including the selection of the k most frequently appearing words)? This could make the tutorial much shorter; yet, I can also understand if you say that this decision has been made to render things more explicit, which might even be better for learning purposes.
[x] para. 66: I doubt that the explanation of the random_state parameter is correct here
[x] para. 92: Please elaborate further on the list of measures to avoid multicollinearity
[x] para. 93: This one is confusing and somehow underlines my previous “critique.” I also do not really understand the inverse influence here (date shapes the TF-IDF score, potentially, yes, but how and why?). I am sure that you can clarify all this, it’s most likely me who is stuck. 😊
[x] para. 123 + visualization: I am not certain why you are using absolute numbers here since you are mainly talking about the fact that the two upper bins include a higher proportion of female labels. Also, at least according to the visualization, is this statement true (?): “It’s also the case that the majority of the reviews with female labels are found in this range.” To me, this sounds like absolute numbers, but it does not seem that there are more than 500 female-labeled entries in the upper two bins?
[x] para. 124ff.: I am a little puzzled when it comes to the mathematical explanation of logistic regression, and I am uncertain whether this part is necessary for the PH audience (since, if I remember correctly, there were no explanations either of how to get the line of best fit). Maybe I am misunderstanding something here (I am not a statistician), but the sigmoid function appears somewhat out of the blue, and then there are several parts where I am not sure if they are correct. For instance, “e represents the natural log” and “Calculate the natural log of that product (e^(a+ bXi))”: to me, e^(a+ bXi) here is the exponential function (e^x), so the inverse of the natural logarithm. Also, “Multiply the variable’s coefficient (x) by the predictor value”, but the coefficient is b, isn’t it? All in all, please make sure that this part is correct; despite having finished half of a BA study program in computer science (again: ^_^), I don’t feel comfortable evaluating the mathematical details here. Also, as I’ve already said, I am not sure if these details are really necessary here. I think a more general explanation of how to do classifications via thresholds might also do the trick (plus some links to further reading).
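
To make the last item concrete, here is a minimal sketch (not the lesson's code) of how a one-variable logistic model turns a predictor value into a probability and then into a class label via a threshold. The intercept `a`, the coefficient `b`, the predictor value, and the 0.5 threshold are made-up values chosen only for illustration. The sketch also shows that e^x (`math.exp`) and the natural logarithm (`math.log`) undo each other, which is the distinction raised above.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, intercept, coefficient, threshold=0.5):
    """Return (probability, label) for one predictor value x."""
    log_odds = intercept + coefficient * x   # a + b * X_i
    probability = sigmoid(log_odds)          # equals e^(a+bX) / (1 + e^(a+bX))
    label = 1 if probability >= threshold else 0
    return probability, label

# Hypothetical fitted values, for illustration only
a, b = -2.0, 4.0
prob, label = predict_label(0.75, a, b)

# exp and log are inverses: log(e^3) gives back 3
round_trip = math.log(math.exp(3.0))
```

With these invented numbers the log-odds are 1.0, so the probability is above 0.5 and the label is 1. Moving the threshold changes the label without changing the probability.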

Typos

[x] para. 40: “aref”
[x] para. 50: “… review WILL have another.”
[x] para 102: “… values are outside..” (redundant “are”); “abut 22%”
[x] para.167: “all”, “converted”
[x] para. 202: “the fact that this reviews ARE ambiguous”

Again, I really liked your tutorial, and I am confident that it will be of great help for many historians and scholars in the humanities once it is ready! I am looking forward to the discussion!

drjwbaker commented 2 years ago

Thanks for your thorough, thoughtful, and positive review @thomjur.

@mjlavin80: quick reminder that I don't expect you to respond directly; rather, I'll summarise actions from the reviews once they have both come in. However, one thing to perhaps consider now is the point about the lesson dataset and conclusion. It may be worth spending some time summarising in the conclusion what the approach enabled you to do, and then guiding the reader towards a better understanding of the types of scenarios in which they might choose to take a comparable approach (expanding perhaps on the introductory comments, which I note are more 'traditional' quantitative history than your - very interesting - text-based example). And in anticipation of the lesson being translated across our publications, this will have the secondary benefit of helping editors/translators localise your article during translation, if that is deemed appropriate (given our international approach).

mjlavin80 commented 2 years ago

@drjwbaker I'll start thinking about both of those points!

drjwbaker commented 2 years ago

Now expecting review from @rcmapp end of w/c 31 January. Thanks again to @rcmapp for fitting this in at a busy time!

anisa-hawes commented 2 years ago

Hello all,

Please note that this lesson's .md file has been moved to a new location within our Submissions Repository. It is now found here: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/en/drafts/originals

A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/linear-and-logistic-regression

Please let me know if you encounter any difficulties or have any questions.

Very best, Anisa

drjwbaker commented 2 years ago

Review of Linear and Logistic Regression Lesson for Programming Historian By @rcmapp

INITIAL COMMENTS

I’m genuinely impressed by the coherence and extent of this lesson: it moves skillfully from concepts to descriptions of discrete steps to math to code. It adheres to the PH model very well as far as I can tell. Although I don’t perform text analytics regularly myself, I’m familiar with the concepts of text analytics and the purposes for which humanists use them. Thus it’s important to say that I’m offering the following comments from the point of view of someone who has organized and attended a lot of technical workshops for humanists, rather than as someone who regularly performs this kind of analysis myself. I will follow the PH Reviewer Guidelines in my commentary, but I won’t have tested the lesson's computational methods.

AUDIENCE

"Does the author address a consistent model reader throughout the lesson?"
- I don’t like to start with my strongest criticism because this lesson has so many strengths, but I think this is the area where the lesson needs the most work. Here are the stated goals: “to provide a high-level overview of linear and logistic regression, to describe how regression fits into data analysis research design, and to walk through running both algorithms in Python using the scikit-learn library.” By “high-level” I understand you to mean that you are providing a bird’s-eye overview that would be helpful for someone trying to understand the overarching principles as well as the specific steps (which seem masterfully captured to me). There is something of a mismatch between the central goals of the lesson and the way that the lesson’s framing commentary goes on to guide the user. For example:

“high-level overview of linear and logistic regression”: In paragraph 26, your high-level overview doesn’t quite deliver: you say “Arguably, what’s most important when learning about linear and logistic regression is obtaining high level intuition for what’s happening when you fit a model and predict from it, and that’s what we’ll focus on here.” I would like to see more conclusions drawn around this idea.

Are some concepts or steps over-explained while others are under-explained?
- I found the math and statistics explanations to be excellent. This is no small accomplishment! I really haven’t seen a more comprehensible or succinct explanation.
- “data analysis research design”: When you stated this as a goal I got quite excited because it’s a really important concept and I’d like to learn more. After many conversations with would-be computational humanists, I’ve learned that the biggest anxiety for those who are beginning to delve into computational methods of analysis is whether they will be worth the trouble. True, your CA article offers an example of a research outcome, but within this lesson I don’t think you deliver fully on this broad promise.
Does the audience seem to match at least vaguely with other Programming Historian lessons? How is it new?
- This lesson is definitely well past the skills of a beginner, and yet it’s not intimidating or less manageable than other lessons. The use cases are clear and easy to understand. I think it’s just right for PH.

GETTING READY

What software / programming languages are required?
- All good, except you don’t mention the pandas library until paragraph 39; you should probably bring it up earlier.
What prerequisite skills are needed?
- Here’s a confusing statement from the “Preparation” section (no paragraph number) in terms of both audience and getting ready: “To understand the steps used to produce the lesson dataset, it would be helpful to have some familiarity with concepts such as how text files are typically normalized, tokenized, and represented in vector space. However, these details are not needed to complete the lesson.” I don’t see how someone without an understanding of the concepts and principles of data preparation could get through this lesson, honestly. Skipping that discussion is a good decision in a more advanced lesson, but you should be frank about it, and perhaps send readers to some good descriptions of the ideas behind that vital process.
What familiarity or experience is needed?
- see number 2
What data are needed? Is the dataset readily available?
- The dataset is well described but I can’t see any reference to a zipped set of text files, although the other PH lessons you cite (“Analyzing Documents with TF-IDF” and “Understanding and Using Common Similarity Measures for Text Analysis”) do offer zip files. I referred to your CA article and the github repository mentioned in the footnotes to that article, but I couldn’t see it. This may be because of my own lack of technical proficiency, but I think having the dataset provided in an analogous way to the two other PH lessons would make sense.

SKIMMABILITY

Are there clearly defined learning objectives or sets of skills to be learned listed near the top of the lesson?
- Yes. I’ve already covered this under audience.
Are there useful secondary skills to be gained / practiced from the lesson?
- I believe so
Do screenshots and other diagrams illustrate crucial steps / points of the lesson?
- Yes, these seem very meticulous.
Do sections and section headings provide clear signage to the reader?
- Yes, very well done.

PAYOFF

Does the tutorial suggest why the explained tools or techniques are useful in a general way?
- Yes—I really like the way this lesson builds on the CA article. This helps a lot.
Does the tutorial suggest how a reader could apply the concepts (if not concrete steps) of the lesson to their own work?
- This ties in with the earlier issue of data research design. I think a paragraph that offers some more typical research applications could be helpful.

WORKFLOW

Should a long lesson be divided into smaller lessons?
- I’d like to see this divided into a paired set of lessons, with the resulting opportunity to offer more contextualization about data research design.
Are there logical stopping points throughout the lesson?
- Yes—it’s very logically structured.
If datasets are required, are they available to download at various points throughout the lesson (or different versions of them as the tutorial may require)?
- See “Getting Ready” number 4

SUSTAINABILITY

Are all software versions and dependencies listed in the submission? Are these assets the most recent versions? If the lesson uses older software versions, does the author note why?
- Not qualified to assess
If you have expertise in the specific methodology or tool(s) for the lesson, is the methodology generally up-to-date?
- Not qualified to assess
What are the data sources for the submission? Are they included in a way that does not rely heavily on third-party hosting?
- See issues with missing zip file, above
What kinds of other external links does the submission use? Are these current or are there other, more recent or appropriate, resources that could be linked to?
- Addressed above in the dataset section

INTEGRATING WITH THE PROGRAMMING HISTORIAN

Does the lesson build upon an existing lesson and explain how?
- It does build on other lessons but without much explanation
Does the lesson tie into existing lessons and have appropriate links?
- Same as number 1

drjwbaker commented 2 years ago

(reposting from https://github.com/programminghistorian/ph-submissions/issues/458)

drjwbaker commented 2 years ago

Now the reviews are in, I've had some time to synthesise them. Huge thanks to @thomjur and @rcmapp for their hard work, constructive commentary, and insight. My instinct is that we have a strong article here, and that it will be strengthened further as a result of peer review.

So, @mjlavin80, my job now per our editor guidelines is to offer you a clear path to completing the article. We ask that you complete the suggested revisions in 4 weeks (so a deadline of 18 March), with commentary on the revisions you have made and justification for those you haven't.

I suggest you focus on the following:

Making clear what the pay-off is for using these methods on a dataset of this nature (that is, both this specific dataset and comparable datasets other humanists may have or be considering assembling)
Revise the introduction and conclusion so as to a) guide the reader (per my comment at https://github.com/programminghistorian/ph-submissions/issues/436#issuecomment-1011112053) and b) provide - as promised in the intro - a high-level summary of what we have learnt.
Either drop the mention of 'how regression fits into data analysis research design' in the intro, or signpost further down - perhaps by using this exact phrase - where the article is helping readers understand how to do that.
Make it obvious how to get to the dataset.
Make the join with “Analyzing Documents with TF-IDF” and “Understanding and Using Common Similarity Measures for Text Analysis” more explicit.
On the passage "To understand the steps used to produce the lesson dataset, it would be helpful to have some familiarity with concepts such as how text files are typically normalized, tokenized, and represented in vector space. However, these details are not needed to complete the lesson" provide some links to resources that would help readers understand these concepts better. 'Vector space', in particular, is a little jargony without context. @rcmapp's comments should help here.
Attend to the minor issues, focusing on those noted by @thomjur at https://github.com/programminghistorian/ph-submissions/issues/436#issuecomment-1010306176

Do those sound like reasonable asks for the next 4 weeks?

mjlavin80 commented 2 years ago

@drjwbaker Thanks for synthesizing these reviews so quickly! And thank you to both reviewers, @rcmapp and @thomjur! All of these suggestions sound just right to me. Have you given any thought to the comments about splitting the lesson into two articles, or a Part I and II? Based on the nature and extent of the requested changes, that's what I would prefer, but I will do whatever you think is best. A deadline of March 18 should work.

drjwbaker commented 2 years ago

@mjlavin80 I have a preference for not splitting it, for three reasons:

What we have is under our word limit https://programminghistorian.org/en/author-guidelines#step-1-proposing-a-new-lesson
If we can fettle the framing, it hangs together.
We tend to find that multiple articles can be difficult for authors to manage. For example, you may not have time after the publication of part 1 to work on part 2 right away, and then may find yourself unable to come back to it later.

Does that make sense?

mjlavin80 commented 2 years ago

@drjwbaker as long as you think it hangs together, I'll defer to your judgement. Thanks!

drjwbaker commented 2 years ago

I do. Keep me posted if 18 March is looking unlikely.

mjlavin80 commented 2 years ago

@drjwbaker I have pushed my revisions to this lesson. Thanks again to you and both reviewers, @rcmapp and @thomjur! Here is a summary of the changes I have made (responses inline).

Make clear what the pay-off is for using these methods on a dataset of this nature (that is, both this specific dataset and comparable datasets other humanists may have or be considering assembling)

I have addressed this item in three ways. First, I have refined the regression examples in the introduction. Second, I have added a paragraph to the dataset description in which I explain why I think this dataset is a good choice for the lesson. Third, I have revised the description of the simple linear regression example (pairing a book review's year of publication with its word count) to be clearer about why that example is being used.

Revise the introduction and conclusion so as to a) guide the reader (per my comment at Linear and Logistic Regression #436 (comment)) and b) provide - as promised in the intro - a high-level summary of what we have learnt.

I have expanded the lesson goals to make explicit mention of section headings to follow. I have also broken the goals into a numbered list so they are easier to read and separate from one another. Regarding the conclusion, I have done as requested and summarized what the lesson covers, with an eye toward underlining the rationale behind what I have covered. Lastly, I decided to remove the adjective "high-level" from the lesson. I suspect that this term can mean different things to different people, which may have been causing some confusion.

Either drop the mention of 'how regression fits into data analysis research design' in the intro, or signpost further down - perhaps by using this exact phrase - where the article is helping readers understand how to do that.

I have changed the phrase 'how regression fits into data analysis research design' to 'how linear and logistic regression models make predictions'. I think this is the most efficient way to resolve the confusion around this phrase.

Make it obvious how to get to the dataset.

I have added the lesson-files.zip file to assets > linear-logistic-regression and linked to that file in the lesson text. The ultimate location of lesson-files.zip (and the link's href) will have to be updated when the lesson is published.

Make the join with “Analyzing Documents with TF-IDF” and “Understanding and Using Common Similarity Measures for Text Analysis” more explicit.

Here I revised the "Suggested Prior Skills" section to be more explicit about which foundational skills might be found in which lesson. I've also added references to a few other lessons. It should now be clear that there isn't a very strong join between this lesson and “Analyzing Documents with TF-IDF” or “Understanding and Using Common Similarity Measures for Text Analysis.”

On the passage "To understand the steps used to produce the lesson dataset, it would be helpful to have some familiarity with concepts such as how text files are typically normalized, tokenized, and represented in vector space. However, these details are not needed to complete the lesson" provide some links to resources that would help readers understand these concepts better. 'Vector space', in particular, is a little jargony without context. @rcmapp's comments should help here.

My revision of "Suggested Prior Skills" addresses this concern as well.

Attend to the minor issues, focusing on those noted by @thomjur at Linear and Logistic Regression #436

para. 31ff.: I know that this is an example to illustrate how LR works, but there is an obvious cluster (1920-1924) of longer book reviews, which might lead to wrong results when predicting the avg. word count of books between 1912–1920, which is a huge part of your data that “only” includes the early 20th cent.; consequently, is this really a good example (you also doubt that in your text)? Perhaps you could mention why this is problematic.

Now paragraph 41. I've addressed this by explaining the value of using regression to assess relationship and linearity. I think it's a good example to use because it demonstrates forming a hypothesis of how two variables might be related, and exploring it with a linear regression. What's more, the two variables are easily understood. In general, I don't think it's problematic to run a linear regression in a context like this one, as long as it's clear that the results don't suggest a strong linear relationship.

para. 34ff.: When explaining how LR works (which is well done), I think you do not say anything about how the actual line of best fit/formula is calculated, which is fine since this shouldn’t become too mathematical, but a link or short comment might be helpful.

I don't know of a good resource to link here, but I'm open to suggestions.
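
For readers following this exchange, a minimal sketch of the usual calculation may help in place of a link. It assumes the line of best fit is found by ordinary least squares, which is what scikit-learn documents for its `LinearRegression` class; for a single predictor the slope and intercept then have a closed form. The function name `fit_line` and the year/word-count values are invented for illustration and are not taken from the lesson.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of co-deviations divided by sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up, perfectly linear data: word count grows by 10 per year
years = [1905, 1910, 1915, 1920, 1925]
word_counts = [600, 650, 700, 750, 800]
slope, intercept = fit_line(years, word_counts)
predicted_1918 = intercept + slope * 1918
```

Real data will not fall exactly on a line; the same formulas then give the line that minimizes the sum of squared vertical distances between the points and the line.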

para. 46: Does the function operate on the number of days or on the number of seconds in a year? (s() returns seconds, as far as I can see.) Sure, days consist of multiple seconds (^_^), but I found this confusing.

Now paragraph 57. This function converts the data to seconds, but the ratio remains the same. To avoid confusion, I've changed the function to divide "days so far this year" by "days in year" instead of "seconds so far this year" by "seconds in this year".
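
A minimal sketch of the "days so far this year" over "days in year" idea, for readers who want to see it spelled out. The function name `decimal_year` is hypothetical, and this is not the lesson's actual function; it only illustrates the ratio described above using the standard library.

```python
from datetime import date

def decimal_year(d):
    """Convert a date to a float such as 1920.5 (year plus fraction elapsed)."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    days_so_far = (d - start).days      # 0 on January 1
    days_in_year = (end - start).days   # 365, or 366 in a leap year
    return d.year + days_so_far / days_in_year

new_year = decimal_year(date(1920, 1, 1))
mid_leap_year = decimal_year(date(1920, 7, 2))
```

For whole-day dates, multiplying both numerator and denominator by 86,400 seconds leaves the ratio unchanged, which is why working in seconds or in days gives the same result.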

para. 54: maybe refer to one of the various Python path modules here to avoid problems with Win/Linux paths (this is really not essential)

When I used a path library for my last lesson, this decision seemed to cause more problems than it fixed. Ideally, someone proficient enough to do this lesson should be able to figure out their own path.
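
For readers who do want a platform-independent path, a minimal sketch with the standard library's `pathlib` follows. The folder and file names are hypothetical placeholders, not the lesson's actual file layout.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# Hypothetical folder and file names, for illustration only
parts = ("lesson-files", "metadata.csv")

# Path picks the right separator for the operating system it runs on
csv_path = Path(*parts)

# The "pure" classes show what each platform's version would look like
posix_form = str(PurePosixPath(*parts))
windows_form = str(PureWindowsPath(*parts))
```

The resulting `csv_path` can be passed directly to functions that accept path-like objects, such as `open()` or `pandas.read_csv()`.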

para. 57ff.: This also concerns the data preparation in the logistic regression part; maybe you are doing this for better explanations, but why are you using CountVectorizer + TfidfTransformer when sklearn’s TfidfVectorizer does everything for you (including the selection of the k most frequently appearing words)? This could make the tutorial much shorter; yet, I can also understand if you say that this decision has been made to render things more explicit, which might even be better for learning purposes.

Now paragraph 68. TfidfVectorizer doesn't work if the data is already in CSV format, which is why I am using DictVectorizer instead of CountVectorizer for this lesson.
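
To illustrate the distinction, here is a minimal sketch assuming the term counts have already been computed (for example, read from CSV files) and that DictVectorizer is followed by TfidfTransformer, as the reviewer's description suggests. TfidfVectorizer expects raw text documents, so it does not apply to pre-counted data. The terms and counts below are invented and are not from the lesson dataset.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up term counts for two documents (term -> frequency)
term_counts = [
    {"war": 3, "novel": 1},
    {"novel": 2, "poetry": 4},
]

# DictVectorizer turns a list of dicts into a document-term matrix
vectorizer = DictVectorizer()
count_matrix = vectorizer.fit_transform(term_counts)

# TfidfTransformer reweights an existing count matrix as TF-IDF scores
tfidf_matrix = TfidfTransformer().fit_transform(count_matrix)

feature_names = vectorizer.feature_names_  # column order of the matrices
dense_counts = count_matrix.toarray()
dense_tfidf = tfidf_matrix.toarray()
```

Each row of `dense_tfidf` corresponds to one document and each column to one term in `feature_names`.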

para. 66: I doubt that the explanation of the random_state parameter is correct here

Now paragraph 78. I'm pretty sure I'm right, but I added some text to clarify what I mean in case I was causing the confusion. :)
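
For reference, scikit-learn documents `random_state` in `train_test_split` as controlling the shuffling applied to the data before the split, with an integer value giving reproducible output across calls. The sketch below, which uses made-up data and an arbitrary seed, shows that behaviour; it does not reproduce the lesson's explanation.

```python
from sklearn.model_selection import train_test_split

# Made-up data: ten one-feature rows with alternating labels
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]

# The same integer seed yields the same shuffle, hence the same split
split_a = train_test_split(X, y, test_size=0.2, random_state=31)
split_b = train_test_split(X, y, test_size=0.2, random_state=31)

X_train, X_test, y_train, y_test = split_a
same_split = split_a == split_b
```

Leaving `random_state` unset means the rows are shuffled differently on each run, so reported scores can vary slightly from one run to the next.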

para. 92: Please elaborate further on the list of measures to avoid multicollinearity

This is a very big topic, and the methods I would discuss wouldn't be very relevant to the lesson because they aren't really useful with computational text analysis. They are much more applicable with a small number of variables.

para. 93: This one is confusing and somehow underlines my previous “critique.” I also do not really understand the inverse influence here (date shapes the TF-IDF score, potentially, yes, but how and why?). I am sure that you can clarify all this, it’s most likely me who is stuck. 😊

Now paragraph 103. I've revised to clarify this point.

para. 123 + visualization: I am not certain why you are using absolute numbers here since you are mainly talking about the fact that the two upper bins include a higher proportion of female labels. Also, at least according to the visualization, is this statement true (?): “It’s also the case that the majority of the reviews with female labels are found in this range.” To me, this sounds like absolute numbers, but it does not seem that there are more than 500 female-labeled entries in the upper two bins?

Now paragraph 136. It is true that the majority of the reviews with female labels are found in the lowest range bucket, and that the split of m-f labels in each bucket is what's most significant. As the value increases, the probability of an f label increases. To my eye, the use of "absolute numbers" increases the clarity of the example.

para. 124ff.: I am a little puzzled when it comes to the mathematical explanation of logistic regression, and I am uncertain whether this part is necessary for the PH audience (since, if I remember correctly, there were no explanations either of how to get the line of best fit). Maybe I am misunderstanding something here (I am not a statistician), but the sigmoid function appears somewhat out of the blue and then there are several parts where I am not sure if they are correct. For instance, “e represents the natural log” and “Calculate the natural log of that product (e^(a+ bXi))” to me, e^(a+ bXi) here is the exponential function (e^x^), so the inverse of the natural logarithm. Also, “Multiply the variable’s coefficient (x) by the predictor value”, but the coefficient is b, isn’t it? All in all, please make sure that this part is correct, although having finished half of a BA study program in computer sciences (again: ^_^), I don’t feel comfortable evaluating the mathematical details here. Also, as I’ve already said, I am not sure if these details are really necessary here. I think a more general explanation of how to do classifications via thresholds might also do the trick (plus some links to further reading).

Now paragraph 140. Yes, it's correct to say that this equation expresses the inverse of the natural logarithm. Maybe just a product of hurrying on my end, or a case of getting natural log and exponent mixed up at the time. I believe the a + bXi part of the formula is correct, meaning coefficient * variable value, plus intercept. I'm also not sure if these details are helpful or not. Happy to defer to the judgment of @drjwbaker on whether to cut or condense this whole section. For now, I've cleaned up the description and added a couple of (hopefully) clarifying sentences.
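To make the point under discussion concrete, here is a minimal sketch of the standard single-predictor logistic model, assuming the usual form in which the probability is e^(a + bXi) / (1 + e^(a + bXi)). The function names and the sample intercept and coefficient are hypothetical; this is not the lesson's code.

```python
import math


def predict_probability(intercept: float, coefficient: float, x: float) -> float:
    """Logistic (sigmoid) function applied to the linear term a + b * x."""
    linear_term = intercept + coefficient * x  # a + b * X_i
    # Algebraically the same as e^(a + bX) / (1 + e^(a + bX)).
    return 1.0 / (1.0 + math.exp(-linear_term))


def log_odds(probability: float) -> float:
    """Natural log of the odds; the inverse of the logistic function."""
    return math.log(probability / (1.0 - probability))


def classify(probability: float, threshold: float = 0.5) -> int:
    """Assign class 1 when the probability meets the threshold, else class 0."""
    return 1 if probability >= threshold else 0


# Hypothetical values: intercept a = -1.0, coefficient b = 0.5, predictor x = 3.0.
p = predict_probability(-1.0, 0.5, 3.0)
label = classify(p)
```

Taking `log_odds(p)` recovers the linear term a + bXi, which is the sense in which the exponential step is the inverse of the natural logarithm.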

Typos

para. 40: “aref” (done) para. 50: “… review WILL have another.” (done) para 102: “… values are outside..” (redundant “are”); “abut 22%” (done) para.167: “all”, “converted” (done) para. 202: “the fact that this reviews ARE ambiguous” (done)

I have added mentions of pandas, matplotlib, and seaborn libraries to the "Before You Begin" section of the lesson.
I have renumbered the endnotes and fixed the markdown so they now work as expected.

drjwbaker commented 2 years ago

@mjlavin80 Thanks so much for your hard work revising the article. Working through the edits now.

drjwbaker commented 2 years ago

para. 34ff.: When explaining how LR works (which is well done), I think you do not say anything about how the actual line of best fit/formula is calculated, which is fine since this shouldn’t become too mathematical, but a link or short comment might be helpful.

I don't know of a good resource to link here, but I'm open to suggestions.

@mjlavin80: is this now a different paragraph number?

mjlavin80 commented 2 years ago

@drjwbaker I believe that would be para 44 now, just before the visual with error lines.

drjwbaker commented 2 years ago

@mjlavin80 Okay. I've gone through the revisions. I am satisfied that the article addresses the substantive and minor points raised in peer review, and that where you haven't adjusted based on peer review your rationale for doing so is robust.

Which is to say that I'm happy to proceed to the next stage of publication https://programminghistorian.org/en/editor-guidelines#recommend-publication---editorial-checklist (pinging @svmelton as ME for info)

However, and this is mea culpa, I was in error when I said:

@mjlavin80 For two reasons I have a preference for not splitting it. Reasons:

What we have is under our word limit https://programminghistorian.org/en/author-guidelines#step-1-proposing-a-new-lesson

Our word limit is 8k https://programminghistorian.org/en/author-guidelines#step-1-proposing-a-new-lesson This is 20k. We have a word limit both for pedagogical reasons and to support translation/localisation work. So I think we should abide by this.

Looking at what we have, a few things strike me:

the article has grown through peer review as that process demanded more qualification of concepts and ideas. I suspect we can find some efficiencies/redundancies in copyediting.
the topic is complex, demanding both detailed explanation and refreshers of that information.
as far as I can see (and I may be wrong) the two main parts of the article are not co-dependent (though they use the same dataset) so the article could sensibly be divided into two parts, one on 'linear' followed by one on 'logistic'. They could clearly be marked as a two part series. See how this is done here https://programminghistorian.org/en/lessons/viewing-html-files
we can adjust the title to reflect a split: something like "Linear Regression analysis with scikit-learn" and "Logistic Regression analysis with scikit-learn".
The intro can be split across the two parts, with some adjustment of language to point from part 1 to 2, and part 2 to 1 as appropriate.
Part 2 will need to clearly start with a little text stating to go to part 1 to learn how to prep, load in the data, etc.
The conclusion can stay where it is, with some adjustment of language pointing backwards to part 1.

I then see three actions here.

First, could you please indicate whether or not you are happy with this split and the nature of the split I describe above. If you are not or there are problems with my logic, perhaps we need to have a brief call to discuss.
Second, if you are happy, I will work on the split, moving text around and adding text per 4-7 above. I will make it super clear any text I've added/moved so that you can check any text before we move to publication (because your name will be listed as the author!)
Third, we can then work on staging both files to go through the publication process together. @anisa-hawes: with apologies for asking, if you have a little spare capacity, I may need a little help moving this along.

Again, many apologies for spotting this so late. Not sure how I miscounted so badly. Hopefully the path I've described can work without too much additional work on your part.

thomjur commented 2 years ago

Thank you very much for the revision! Just a quick question @drjwbaker : Are we supposed to go through the changes and (potentially) add further comments? Just an "economical" question because if you've decided that the article is fine as is, I would rather wait and read the final article(s). :-)

drjwbaker commented 2 years ago

@thomjur Thanks for the question. I should have clarified. You and @rcmapp don't need to contribute anything further. Many thanks again for your constructive and insightful peer review.

mjlavin80 commented 2 years ago

@drjwbaker I am happy with the proposed split and the nature of the split as you described above. Thank you for offering to work on first round of edits required for the split! For anything that needs to be written from scratch as opposed to edited or moved around, I would be happy to draft those, that way I can honestly say I wrote the whole thing. How would you feel about doing the other steps first and then pointing me to paragraph locations where text is needed?

drjwbaker commented 2 years ago

@mjlavin80 Thank you! And apologies again. What I'll do is make this split, then point out the areas I think need the work.

anisa-hawes commented 2 years ago

Hello @drjwbaker. I'm really happy to help you with this. Please let me know when the two lessons are (roughly) ready. I can set aside time for copyediting, keeping in view the necessary cross-references.

Happy to meet you, @mjlavin80, and looking forward to supporting the publication of these lessons!

drjwbaker commented 2 years ago

Right, I've made the split so we have https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/linear-regression.md and https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/logistic-regression.md.

A run down of the main things I've done:

Created two new lessons with new titles and slugs (note that I've left the original in place for now just in case!)
Created /images/ folders for each new lesson
Created an /assets/ folder for linear-regression (logistic-regression doesn't have the dataset section so doesn't need it)
Adjusted the abstract for both to say they are one of two lessons.
Adjusted the intro to both to say they are one of two lessons.
Added series logic into the metadata for both lessons.
Divided up the endnotes. As I've left the intros to both largely the same, those initial endnotes remain.
Added a short note at the end of linear-regression pointing readers to logistic-regression.

@mjlavin80 Are you happy to fettle from there? (as there may be the odd line in the lessons that cross-refer that you know better than I where to find!) If so, send me the commit numbers on any changes you make so I can see the diffs. We can then move to publishing.

@anisa: if I could lean on you for a little help please? I'm not sure if all the crosses at https://github.com/programminghistorian/ph-submissions/commits/gh-pages are a problem. The split lessons seem to have staged correctly at http://programminghistorian.github.io/ph-submissions/en/drafts/originals/linear-regression and https://programminghistorian.github.io/ph-submissions/en/drafts/originals/logistic-regression so I think we are okay?

rcmapp commented 2 years ago

Thank you very much for the revision! Just a quick question @drjwbaker : Are we supposed to go through the changes and (potentially) add further comments? Just an "economical" question because if you've decided that the article is fine as is, I would rather wait and read the final article(s). :-)

Thanks, @thomjur for asking this! (I was just about to dive in, but now I can eat lunch.) I am excited to read the final article, and also to refer our graduate students in the DH Certificate program to it.

Thanks @drjwbaker for managing this peer review process; I have enjoyed the process and would look forward to reviewing again. Congratulations, @mjlavin80 ! I think people will really appreciate and use your work.

drjwbaker commented 2 years ago

Hello @drjwbaker. I'm really happy to help you with this. Please let me know when the two lessons are (roughly) ready. I can set aside time for copyediting, keeping in view the necessary cross-references.

Happy to meet you, @mjlavin80, and looking forward to supporting the publication of these lessons!

Our messages crossed. Thanks Anisa!

anisa-hawes commented 2 years ago

Thank you, @drjwbaker. The build failures appear to represent places where the deployment was interrupted/or did not complete. Clicking into the crosses, several report: Error: The operation was canceled. I think perhaps these are cases where you implemented a further change before the previous build check was complete. But the most recent builds are successful and are fully ticked, so all is well.

I think we just need to update the link in /en/drafts/originals/linear-regression so that it points to the new assets file at https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/linear-regression/lesson-files.zip. I can do that.

drjwbaker commented 2 years ago

Right. That makes sense.

mjlavin80 commented 2 years ago

Right, I've made the split so we have https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/linear-regression.md and https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/logistic-regression.md.

A run down of the main things I've done:

Created two new lessons with new titles and slugs (note that I've left the original in place for now just in case!)

Created /images/ folders for each new lesson

Created an /assets/ folder for linear-regression (logistic-regression doesn't have the dataset section so doesn't need it)

Adjusted the abstract for both to say they are one of two lessons.

Adjusted the intro to both to say they are one of two lessons.

Added series logic into the metadata for both lessons.

Divided up the endnotes. As I've left the intros to both largely the same, those initial endnotes remain.

Added a short note at the end of linear-regression pointing readers to logistic-regression.

@mjlavin80 Are you happy to fettle from there? (as there may be the odd line in the lessons that cross-refer that you know better than I where to find!) If so, send me the commit numbers on any changes you make so I can see the diffs. We can then move to publishing.

@Anisa: if I could lean on you for a little help please? I'm not sure if all the crosses at https://github.com/programminghistorian/ph-submissions/commits/gh-pages are a problem. The split lessons seem to have staged correctly at http://programminghistorian.github.io/ph-submissions/en/drafts/originals/linear-regression and https://programminghistorian.github.io/ph-submissions/en/drafts/originals/logistic-regression so I think we are okay?

@drjwbaker yes, I'm happy to work from these versions! Thanks again!

drjwbaker commented 2 years ago

Fab. Thanks for your persistence @mjlavin80. When you are happy with the drafts, I'll work with @anisa-hawes to proceed towards publication (hopefully with no more editor-induced hitches!)

drjwbaker commented 2 years ago

@mjlavin80 Just checking in on this. Would it be useful to set a deadline here, or do you have one already in mind?

mjlavin80 commented 2 years ago

@drjwbaker I was actually just working on it. I looked over everything and just pushed new versions, but all of my changes were relatively small. I expanded the note in Part 2 that directs readers to go to Part 1 for the prep and data sections, but the stitch work you added to the conclusion pointing backwards to part 1 looks good to me. I think that's everything I was supposed to look at, but do let me know if I've missed anything!

drjwbaker commented 2 years ago

Thanks so much @mjlavin80. I'll aim to check before the weekend so we can get this moving towards publication before my annual leave (w/c 18 April)

drjwbaker commented 2 years ago

@svmelton I'm delighted to report that these articles are ready to go. Many thanks to @mjlavin80 for their hard work here.

I say "articles" because we made a split based on word length. So we have two articles: Linear Regression analysis with scikit-learn and Logistic Regression analysis with scikit-learn which are encoded as a sequence, and both of which are - IMO - rich additions to our English language journal.

Per https://programminghistorian.org/en/editor-guidelines#5-inform-the-managing-editor-of-your-recommendation-to-publish I provide below all the info you need as Managing Editor to check these before publication. If there is anything I've missed (because as you know I haven't done this particular bit of PH labour for a while!) do let me know.

Linear Regression analysis with scikit-learn

https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/linear-regression.md
https://github.com/programminghistorian/ph-submissions/tree/gh-pages/images/linear-regression
https://github.com/programminghistorian/ph-submissions/blob/gh-pages/gallery/linear-regression.png
no bio required as @mjlavin80 has published with us before

Logistic Regression analysis with scikit-learn

https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/logistic-regression.md
https://github.com/programminghistorian/ph-submissions/tree/gh-pages/images/logistic-regression
https://github.com/programminghistorian/ph-submissions/blob/gh-pages/gallery/logistic-regression.png
no bio required as @mjlavin80 has published with us before

anisa-hawes commented 2 years ago

Hello @drjwbaker. Shall I move the file linear-and-logistic-regression.md pre-split to our new inactive folder inside /en/drafts/originals/?

I realise that it's useful for the project's memory to keep that as a record, but I'd like to try and avoid any confusion about which files are which within the repo.

drjwbaker commented 2 years ago

Yes please. Good idea.

mjlavin80 commented 2 years ago

@drjwbaker @svmelton @anisa-hawes It looks like my previous bio statement is out of date. It lists my affiliation as the University of Pittsburgh, which is not true anymore. Here is a new version: "Matthew J. Lavin is an Assistant Professor of Data Analytics specializing in Humanities Analytics at Denison University. His scholarship focuses on book history, cultural analytics, and turn-of-the-twentieth-century U.S. literature and culture." Also, my current Twitter handle is @HumanitiesData if you are still keeping track of those.

anisa-hawes commented 2 years ago

Thank you for letting us know, @mjlavin80. Here's Matthew's updated bio including orcID, @svmelton:

  name: Matthew J. Lavin
  team: false
  orcid: 0000-0003-3867-9138
  bio:
      en: |
          Matthew J. Lavin is an Assistant Professor of Data Analytics specializing in Humanities Analytics at Denison University. His scholarship focuses on book history, cultural analytics, and turn-of-the-twentieth-century U.S. literature and culture.

mjlavin80 commented 2 years ago

@anisa-hawes I do. It's https://orcid.org/0000-0003-3867-9138

svmelton commented 2 years ago

Thanks, all! I believe the only thing that's left before the final publication steps is for @anisa-hawes to complete copyedits. Just a heads up that she'll be on leave next week but should be able to complete them once she's back.

drjwbaker commented 2 years ago

Okay! Did I miss a step at https://programminghistorian.org/en/editor-guidelines or has copyediting not been built into that workflow yet?
