Review Ticket: Introduction to Stylometry w Python

acrymble commented 6 years ago

The Programming Historian has received the following tutorial on 'Introduction to stylometry with Python' by @fdlaramee . This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/introduction-to-stylometry-with-python

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I will first read through the lesson and provide feedback, to which the author will respond.

Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

acrymble commented 6 years ago

Thanks for this submission @fdlaramee. It's very well written and I think will make an incredibly valuable contribution to our tutorials.

I have one major comment before I send this for review:

Everything from 'Acquiring the Federalist Papers' to 'Preparing the Data for Analysis' is actually downloading the files. Since readers are coming here to learn how to do stylometric analysis rather than learning how to download files from Project Gutenberg using Python, can we just give them the 85 files as a zip (more sustainable and in line with our data archiving policies) and get to the Stylometry?

This will also mean the lesson is shorter and less imposing (though I would say I'd classify this as 'advanced' at the moment - which is fine).

fdlaramee commented 6 years ago

Hi @acrymble ; sorry for not responding earlier, I just returned from the Bahamas where Internet connectivity was spotty at best.

I have to double-check the Gutenberg terms of service to see whether we can distribute the files stripped of their legal boilerplate (or maybe just put one copy of it in a separate file) but otherwise that is fine by me. I should be able to adjust the lesson's text accordingly in a day or two.

fdlaramee commented 6 years ago

Hi @acrymble,

I am in the process of revising the lesson text to remove the file manipulation sections. I will assemble a zip archive of the 85 files (plus the original Gutenberg ebook) and link to it from a new subsection; where should I store this archive?

Thanks, FDL

acrymble commented 6 years ago

ph-submissions/assets/introduction-to-stylometry-with-python/

I've created a new folder for you.

fdlaramee commented 6 years ago

Thanks. Everything should be good to go whenever you are ready.

acrymble commented 6 years ago

Thank you @fdlaramee I will look for reviewers.

acrymble commented 6 years ago

@fdlaramee just by way of an update, I have two people who have agreed to review your lesson. I expect those to appear here by or before 7 March 2018.

fdlaramee commented 6 years ago

Duly noted. Thanks @acrymble.

fbkarsdorp commented 6 years ago

Hi @acrymble and @fdlaramee,

It's my pleasure to review your lesson on stylometry with Python. I've enjoyed reading the tutorial very much, and I believe it is well on its way to becoming an excellent primer on computational stylometry with Python. Overall, I like the casual tone and style of the tutorial, which fits the other lessons of the Programming Historian well. The lesson is well-structured, providing sufficient background information, pointers for further reading, and, importantly, abundant textual cues to help guide the reader through the materials. In what follows, I provide some comments and suggestions regarding the use case and embedding in the literature, the employed methods, implementations, coding style, and some minor details.

Use Case and Scholarly Embedding

Of course, the use case of the Federalist Papers is well-known and perhaps the go-to case for illustrating stylometry and computational stylometry. This can be considered both a strength and a weakness of the tutorial. A weakness, because it might bore some of the readers, precisely because it is so well-known. This could be turned into a virtue, however -- and I believe you're already doing that, @fdlaramee -- since it allows readers to better focus on the more technical sides of the case. Yet, it would be interesting for those unfamiliar with stylometry if you provided or described some other use cases as well. There are so many other intriguing case studies described in the literature. It would be a shame not to mention some of them. One suggestion is to mention some other work at the very end of the tutorial. In its current form, the tutorial ends rather abruptly, though I like the further reading section, which already addresses some other cases, of course. My guess is that readers might want the concluding section and further reading section to have a little more meat.

While the tutorial cites many important papers in the field, I believe that some crucial work is missing. Juola (2006), for example, might be added, as well as Koppel et al. (2009) and Stamatatos (2009), because of the focus on the importance of function words in stylometry. Also, I believe Stover et al. (2016) are not the first to describe the imposter method and the work by researchers such as Moshe Koppel is more appropriate to cite here. See e.g. Koppel & Winter (2014) 'Determining if two documents are written by the same author'.

Methodology

I focus on the Burrows's Delta section, since I believe this is the most important methodological section of the tutorial. You've done a great job describing the details of Delta, without dumbing it down too much. It could be improved, however, by adding a bit more about the 'why' besides the 'how'. For example, there's a great body of literature on why the n most frequent words are so powerful for the task of authorship attribution. However, the tutorial merely states that we should use the n most frequent words, without further information. Additionally, I think the readers might want to know a bit more about the parameters used. What n is good, for example? In which contexts? And why? It would be interesting if you could show the users the effect of choosing a particular value. What happens if you only use the 10 most common words? What happens if you use the complete vocabulary?
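One way to let readers explore the effect of n is to make it a parameter of the Delta computation. The following is a minimal sketch under stated assumptions, not the lesson's code: the names `tokenize` and `burrows_delta`, the regex tokenizer, and the toy corpora are all hypothetical, and words whose frequency does not vary across authors are simply skipped, which is a simplification. It needs at least two candidate authors and a non-empty test text.

```python
import re
import statistics
from collections import Counter


def tokenize(text):
    # Crude stand-in for a real tokenizer: lowercase runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def burrows_delta(corpora, test_text, n):
    """Return {author: delta} from the n most frequent words of the corpus."""
    tokens = {author: tokenize(text) for author, text in corpora.items()}
    overall = Counter(tok for toks in tokens.values() for tok in toks)
    features = [word for word, _ in overall.most_common(n)]

    # Relative frequency of each feature word in each author's subcorpus.
    freqs = {}
    for author, toks in tokens.items():
        counts = Counter(toks)
        freqs[author] = {w: counts[w] / len(toks) for w in features}

    # Sample standard deviation of each feature across authors.
    stdevs = {w: statistics.stdev([f[w] for f in freqs.values()]) for w in features}
    usable = [w for w in features if stdevs[w] > 0]  # skip zero-spread words

    test_counts = Counter(tokenize(test_text))
    test_len = sum(test_counts.values())

    deltas = {}
    for author, f in freqs.items():
        # The shared mean cancels out, so the z-score difference reduces to
        # the frequency difference divided by the standard deviation.
        diffs = [abs(test_counts[w] / test_len - f[w]) / stdevs[w] for w in usable]
        deltas[author] = sum(diffs) / len(diffs)
    return deltas


# Made-up toy data, only to show how the scores move with n.
corpora = {
    "A": "the cat and the dog and the bird and the fish",
    "B": "a cat or a dog or a bird or a fish",
}
test_text = "the horse and the cow and the pig"

for n in (1, 4, 100):
    print(n, burrows_delta(corpora, test_text, n))
```

Running the same loop over the real corpus with, say, 10, 30 and the full vocabulary size would show readers how stable the attribution is as n changes.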

Also, it is not clear to me why you would use part of speech tagging in this context, especially when the vocabulary is limited to the 30 most common words. These are probably all function words which are unambiguous with regard to their part of speech class. If you decide to keep the part of speech tagging, you might want to consider comparing the results against those obtained without tagging the corpus. Since part of speech tagging is rather uncommon in computational stylometry (to my knowledge), and it limits the applicability of the methods presented to other languages and domains, I would recommend leaving it out.

Implementational Details

In this section, I make some suggestions on how to improve some of the coding blocks and implementations. The function read_files_into_string in paragraph 22 is a bit verbose and might be simplified. Also, the concatenation of the files merges the last word of one paper with the first one of the next, which pollutes our data. Here's a suggestion to improve it using a safer with-statement and str.join to concatenate the files:

def read_files_into_string(filenames):
    strings = []
    for filename in filenames:
        with open(f'data/federalist_{filename}.txt') as f:
            strings.append(f.read())
    return '\n'.join(strings)
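The word-merging problem is easy to see on toy data. This is an illustrative sketch with made-up strings, assuming the files do not end with a trailing newline:

```python
# Two toy "papers"; the contents are made up for illustration.
papers = ["the end of the first paper", "Beginning of the second"]

glued = "".join(papers)     # no separator: "paper" and "Beginning" fuse
joined = "\n".join(papers)  # newline separator keeps the two words apart

print(glued.split()[5])     # paperBeginning
print(joined.split()[5:7])  # ['paper', 'Beginning']
```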

The repetition of the different paper collection names is a bit confusing and error-prone. What if you change the initial declaration of the papers lists to a dictionary structure in which the collection names are used as keys and the lists as values:

papers = {
    'Madison': [10, 14, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48],
    'Hamilton': [1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 21, 22, 23, 24, 
                 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 59, 60,
                 61, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 
                 78, 79, 80, 81, 82, 83, 84, 85],
    'Jay': [2, 3, 4, 5],
    'Shared': [18, 19, 20],
    'Disputed': [49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 62, 63],
    'TestCase': [64]
}

You could then simply iterate over the keys of the dictionary to read the files of its corresponding values. Something like:

federalist_by_author = {}  # don't use dict()
for author, files in papers.items():
    federalist_by_author[author] = read_files_into_string(files)

Similarly, you can employ the keys of the dictionary to print the first n characters of each file:

for author in papers:
    print(federalist_by_author[author][:100])

And the same looping style can be used in other code blocks further down the tutorial.

In the code block belonging to paragraph 28 you lowercase the word tokens, which is explained further down. I completely understand the rationale, but since you're only counting word lengths at this stage it might confuse the reader. I would recommend moving the lowercasing to the section on Kilgarriff's Chi-Squared.
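A small illustration of why lowercasing is irrelevant for word lengths but matters once words are counted. The token list is made up, and the length claim holds for these ASCII tokens:

```python
from collections import Counter

tokens = ["The", "people", "and", "the", "People", "of", "the", "States"]
lowered = [t.lower() for t in tokens]

# Word lengths are identical with or without lowercasing ...
print([len(t) for t in tokens] == [len(t) for t in lowered])  # True

# ... but frequency counts are not: "The" and "the" are separate keys.
print(Counter(tokens)["the"])   # 2
print(Counter(lowered)["the"])  # 3
```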

The implementations of Chi Squared as well as some of the details in calculating Delta are a little verbose, which might not always make them easier to comprehend. Perhaps you might want to consider employing NumPy or SciPy for some of these computations.
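Even without extra packages, the chi-squared computation can be kept fairly compact with `collections.Counter`. The sketch below is an assumption about how such a version might look, not the lesson's implementation: the function name `chi_squared`, its signature, and the default n=500 are hypothetical, and both token lists are assumed to be non-empty.

```python
from collections import Counter


def chi_squared(author_tokens, disputed_tokens, n=500):
    """Kilgarriff-style chi-squared between two token lists (lower = closer)."""
    author_counts = Counter(author_tokens)
    disputed_counts = Counter(disputed_tokens)
    joint_counts = author_counts + disputed_counts

    # Share of the joint corpus that comes from the candidate author.
    author_share = len(author_tokens) / (len(author_tokens) + len(disputed_tokens))

    chisq = 0.0
    for word, joint_count in joint_counts.most_common(n):
        # Expected counts if both texts drew words from the same distribution.
        expected_author = joint_count * author_share
        expected_disputed = joint_count * (1 - author_share)
        chisq += (author_counts[word] - expected_author) ** 2 / expected_author
        chisq += (disputed_counts[word] - expected_disputed) ** 2 / expected_disputed
    return chisq


same = "the cat and the dog".split()
other = "a bird or a fish".split()
print(chi_squared(same, same))   # identical texts
print(chi_squared(same, other))  # texts with no shared words
```

A NumPy version would replace the loop with array arithmetic, at the cost of one more dependency for the reader to install.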

Coding style

[ ] While not forbidden by PEP 8, I believe it's generally accepted that CamelCase is reserved for class names alone. Normal variables and function names should be lower_case_with_underscores.
[ ] I would recommend formatting all code blocks using some kind of linter, which adheres to, for example, the PEP 8 style guide for Python code (see https://www.python.org/dev/peps/pep-0008/). In its current form, many code blocks are a bit unpythonic (e.g. the use of surrounding spaces for function arguments and indexes, the use of backslashes, and so on and so forth). For automatic formatting, I can recommend using yapf, see https://github.com/google/yapf.

Minor details

[ ] In paragraph 6, you might want to include a one-liner, with which readers can install all required packages before they start, i.e. pip install matplotlib nltk.
[ ] NLTK's tokenizer and part of speech tagger require downloading some data packages. Perhaps you can include something like python -m nltk.downloader punkt averaged_perceptron_tagger to help the reader install all further requirements.
[ ] Stefan Evart -> Stefan Evert in paragraph 47.

acrymble commented 6 years ago

Thanks so much @fbkarsdorp for this detailed review. We're still awaiting one more, so @fdlaramee please hold off on any revision at this stage.

arojascastro commented 6 years ago

Hello all,

Since "Members of the wider community are also invited to offer constructive feedback", I decided to read this tutorial and share some observations -- not necessarily changes. I hope to be constructive enough. This is an unsolicited review, by the way. The editor did not ask me to contribute.

First of all, let me make clear that my review focuses on the content from a semantic, pragmatic or political point of view. I did not test the tutorial, but overall I think it is a good tutorial: it is well written and it is not too long.

As a Spanish literary historian, my point of view will be shaped by my Hispanic background. On the other hand, as part of the PH board, I would like to highlight some issues that are related to diversity and language independence. As you may know, it is very likely that this tutorial will be translated into Spanish. Consequently, in order to make the process easier, we should take into account a few key aspects. My intention is to point out some issues that can be improved if we consider that the target audience may be non-American.

In summary, on one hand, the tutorial has an American outlook -- that is, a vision of history, research, politics, etc. within USA boundaries. Nothing wrong with that, but our audience is international. To be fair, the expression should be "national outlook", following Ulrich Beck, whose main thesis is that historians should adopt a cosmopolitan outlook in contrast to the national outlook, but here the national is the American. On the other hand, the tutorial has a Male Gaze that could be softened with a few small changes in order to make it more diverse and welcoming for women and other non-male researchers. This Male Gaze does not treat women as sexual objects like in the movies (I am borrowing the expression from Laura Mulvey) but simply ignores them or makes them invisible.

The American outlook

The author is using Project Gutenberg as a source. That is good because it is a multilingual platform. However, I found this tutorial very, very American. First of all, I did not know about the Federalist Papers at all. It is true that I am a Spanish literary historian (XVII Century), but to me this issue sounds very local.

Where can we see these local traces?

The American outlook is visible in paragraph # 1 when the author mentions Hemingway. We love Hemingway, and it is a clear example, but considering the main resource used to illustrate the methodology is very American, why not try to give a non-American / non-English example in the introduction?
The American outlook is also visible in paragraph # 9: "Alexander Hamilton, first Secretary of the Treasury of the United States (and unlikely 21st-century cultural phenomenon!)". I do not understand what the author means by unlikely 21st-century cultural phenomenon - I am missing something, I am sorry!
The American outlook is also visible in paragraph # 11. I did not understand the problem at all. There are many references that I do not know and some context is missing.
The American outlook is also visible in paragraph # 15: "Since then, the authorship of the Federalist has remained a common test case for machine learning algorithms." Well, we need to put this into context: it may be a common test in America or for historians focused on XVIII century national American history. I had never heard of it before reading this tutorial.
I like the notice in yellow after paragraph # 51 about NLTK and language independence. However, can the author provide more details considering that this tutorial will be translated into Spanish? Maybe some examples, links, bibliographic references? Did the author test his method with texts in languages other than English? Should the code be modified in order to deal with diacritics, tildes, non-Latin characters, etc.? How accurate are the results?
Finally, all bibliographic references are in English and, presumably, are American or English authors. There is a lot of work on stylometrics in Spanish and French literature, and I am sure other languages. For instance, an article about delta in Spanish: http://revistacaracteres.net/revista/vol5n1mayo2016/entendiendo-delta/ or this one http://parnaseo.uv.es/Lemir/Revista/Revista20/09_Rosa_Javier_de_la.pdf

To sum up, the author should not assume the American is universal, or that everybody must know the referents, problems, etc. It would be good if the method was tested with non-English texts to find out if there is any problem with tildes and other characters. It is recommended to add bibliography in languages other than English and mention a few relevant projects that are not focused on American history.

The Male Gaze

At the PH we are making a great effort to be diverse in terms of board members. However, I think we need to improve our content from a gender point of view. In this tutorial we can find the male gaze in the following places:

The Male Gaze is found in paragraph # 1. Again, Hemingway... why not mention a female writer - e.g. Virginia Woolf? Or better, say that stylometric methods have been employed to compare and analyze the style of women's and men's texts. There is a lot of research on this like http://culturalanalytics.org/2018/02/the-transformation-of-gender-in-english-language-fiction or this https://academic.oup.com/dsh/article-abstract/31/4/746/2748261?redirectedFrom=fulltext
The Male gaze is also visible in paragraph # 63. The tutorial only contains one reference to a woman in the Acknowledgements: Susan Dalton. Well, maybe there are more, but this was not evident to me.

To sum up, it would be good to add some bibliographic references to female authors and use a female writer to illustrate the concept of style.

Other things

The author may want to provide this link to Zotero stylometry group: https://www.zotero.org/groups/643516/stylometry_bibliography/items?
The author may want to say that Burrows's Delta is also available through a GUI as part of the Stylo package. It is true that it is implemented in R, but it is very easy to use.

acrymble commented 6 years ago

Thanks so much @arojascastro for these comments on language and cultural independence. These will be important issues to address, especially because we'll ultimately want to translate this lesson into Spanish and other languages.

I'm still waiting a couple more days for our last review. I'll summarise everything here early next week so the author can move forward and we can get to publication.

acrymble commented 6 years ago

The other review by @jkrybicki has been posted into a new ticket: #154. For the sake of simplicity I'll move it here:

In general, the lesson makes a lot of sense - especially if it is supposed to be all-Python. Nothing wrong with that, especially if it is aimed at PROGRAMMING historians rather than those who just need a ready-made tool for analysis without learning to code - in which latter case I would strongly recommend stylo, a package for R (naturally, since I am one of its authors). It is a good if perhaps somewhat ambitious idea to start off with the Federalist Papers. I am much less happy about Mendenhall, whose method is highly dubious from both linguistic and statistical points of view. Perhaps simple t-testing of most-frequent-word usage would work better? Or - even better - add Burrows's Zeta instead (perhaps as third method after his Delta). BTW, it's Burrows, not Burrow (he's not some kind of Hobbit, just an Aussie), hence Burrows's, not Burrow's! :-) In the introduction to the lesson, I would also suggest a stronger emphasis on the idiosyncratic use of most frequent words (rather than a more general "vocabulary"); and perhaps also a comment on how those frequent words yield the signals of genre and chronology as well as authorship; for the same reason, the link between stylometry for authorship and stylometry for distant reading should be mentioned. Overall, though: good stuff!

Thanks so much to our three reviewers. I've learned a lot about stylometry in the process. I'll just digest these three sets of recommendations and try to synthesise a clear path forward for our author. I'll try to do this asap.

acrymble commented 6 years ago

Thank you to everyone for their reviews. I’ve read through them all, as well as the paper. I think the reviews are fairly straightforward. I recommend @fdlaramee that you address each one in a separate comment. You might find it easier to create a tick-box list to work through:

[ ] first thing
[ ] second thing

I would suggest you address @arojascastro first, since those suggestions are about making sure the examples are easy to translate and appeal to an international audience. I agree with him and @fbkarsdorp that it would be really fantastic to offer a second (even hypothetical) example at the end that might give an international or female perspective. Given the fact that you're using Hemingway for a very specific purpose, I am not sure it makes sense to drop that reference. But looking for ways to increase references to international work should not be difficult.

In terms of the more linguistic suggestions, there are a number of suggestions from both @fbkarsdorp and @jkrybicki for adding intellectual context for the reader so they understand the various tests and their limits better. Addressing the query about the use of Mendenhall should be a priority.

The suggested changes to code by @fbkarsdorp will have to be tested and should only be implemented where you're comfortable. You won't be able to change the way our code blocks display - that would be up to our technical team to implement.

Otherwise, I will leave it to you to respond to all three reviewers. And I look forward to seeing this published soon. So that we can keep the momentum going, I'd ask that you submit your revisions by 13 April 2018. If that timeline won't suit, please let me know.

fdlaramee commented 6 years ago

Thank you all for the work you have put into this. I will be reviewing the comments and responding as soon as I can, hopefully by Thursday night. I will also be evaluating the amount of time required to implement the various suggestions in order to see whether April 13 can work; I will be at ESSHC for a week (without a computer) and giving a day-long series of talks on April 11, so it might be a bit tight. I'll let you know soon. Thanks again.

fdlaramee commented 6 years ago

Hi everyone,

This is a response to @arojascastro's review; the others will follow shortly.

First of all, thank you for bringing these issues to the forefront. I must admit that I was somewhat bemused by the "American outlook" comment at first, since I am French Canadian, have never lived in the United States, and only learned English by watching TV as a teenager, but upon reflection this is probably a function of U.S. cultural dominance in this part of the world!

Regarding the use of Hemingway as an example: I am not a literary scholar by any stretch of the imagination and only used him because I happened to know that he was famous for relatively low lexical richness. Is it also the case for Virginia Woolf? Are you aware of a famous Spanish, French, or maybe Russian-language author who is significantly below average in this metric?

Regarding Hamilton as a 21st-century cultural phenomenon: This is a reference to the immensely successful hip hop musical written by Lin-Manuel Miranda, which garnered just about every award imaginable and is sold out for months if not years in advance. A North American cultural artifact, to be sure, but one with interesting diversity implications since the entire (original) cast is made up of African American and Latino actors. A link to the musical's Wikipedia page might help, or would you prefer that I cut the reference entirely?

Regarding language independence in the code: I will add a few lines about this. As a rule, since NLTK works with arbitrary strings, the only restriction regarding the tokenizer and other basic functions is that languages in which there is no physical separation between words (such as written Chinese, I believe) may be problematic. I have used NLTK's tokenizer with French texts and even to process lists of numbers in the past without any trouble; at most, it may be necessary to specify the character encoding (UTF-8, etc.) but that is mostly an issue related to the files being processed rather than the software itself. Part-of-speech tagging, however, is language-dependent, as I mention somewhere in the text, and some language-dependent data models may be of poor quality or absent.
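The encoding point can be illustrated with the standard library alone. This is a hedged sketch, not NLTK: the regular expression is a stand-in tokenizer, and the sample sentence and temporary file are made up for the demonstration.

```python
import re
import tempfile
from pathlib import Path

sample = "¿Dónde está el señor Núñez? L'été à Montréal."

# Write and read back with an explicit encoding instead of relying on the
# platform default, which is where accented characters usually get mangled.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(sample, encoding="utf-8")
    text = path.read_text(encoding="utf-8")

# In Python 3, \w matches Unicode letters, so accents stay inside tokens.
tokens = re.findall(r"\w+", text.lower())
print(tokens)
```

The same pattern would not segment languages written without spaces between words, which matches the caveat about written Chinese above.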

Regarding paragraph #11: I do not understand your comment, I am sorry. What is it that troubles you?

Regarding mentioning references in other languages and projects not about American history: Will do!

Regarding gender parity, the Zotero group and the Stylo package: Thanks for the references you provided; I will include them in the text.

Regarding the Federalist as a common test case for machine learning: I have seen several recent papers that use the Federalist as a benchmark, comparing a new ML algorithm's classification performance with the state of the art using the Federalist. It is basically the default test case for text-based work performed in English. Are you aware of a similarly famous example in another language, to which I could link?

Thanks again!

Best regards, FDL

fdlaramee commented 6 years ago

Hi everyone,

This is a second response, this time to @fbkarsdorp's review.

First of all, thanks for the thorough review. Much appreciated!

Regarding adding other use cases to the Further Reading section: With pleasure! I was afraid that I was already running long (and pushing it with a rather lengthy reference section) but I will gladly speak of a few other cases. Do you have any favourites in mind?

Regarding additional references to Koppel, etc.: Thanks for providing these. I was already aware of some of them, but not all. Will add them.

Regarding the choice of the N parameter to select N common words in Delta: I will look into the literature, with which I am not familiar. As far as I knew, 30 was reasonable, and a larger number would be appropriate for larger corpora, but that's about it. Any suggestions?

Regarding POS-tagging: This is because I am a native French speaker, and French has several very common small words that can belong to more than one part of speech. For example, "la" and "le" can both translate to "the" (article) and "it" (pronoun), "la" being the feminine form and "le" the masculine. I suggest adding a sentence to say that POS-tagging may not be necessary in all languages and/or for all values of N, but leaving the code as is for demonstration purposes. Does that work for you?

Regarding your code snippets: That works for me; I'll try them out.

Regarding "unpythonic" coding styles: Ah, this old bugbear... Having been 'raised' on C, Pascal and Smalltalk in the late 1980s (yes, I am that old) I am not a fan of using the lowercase/underscore notation for both functions and variables in a language in which one can pass a function name as an argument; too confusing for my taste. I also have a lot of difficulty reading code that has no spacing between brackets and values, etc., but that's probably a consequence of my visual impairment. I don't feel strongly about any of it; if you or @acrymble do, I will make the changes.

Regarding NumPy or SciPy: I was wary of introducing yet more packages into what already seemed like a rather involved tutorial. I'd rather not mess with this.

Regarding pip installs and NLTK data packages: Will do! (Never noticed that the tokenizer needed anything!)

Thanks again and best regards, FDL

fdlaramee commented 6 years ago

Hi all,

This is the third and final response, this time to @jkrybicki's review. (The reviewer's handle does not seem to be active in this thread; @acrymble, can you make sure that the response reaches its destination? Thanks!)

First, thanks for the review. Much appreciated!

Regarding the Stylo R package: Yes, I will definitely add a link. Thanks!

Regarding a stronger emphasis on the idiosyncratic nature of common word usage: Yes, excellent point. I will add the appropriate language.

Regarding "the link between stylometry for authorship and stylometry for distant reading": Can you elaborate? I'm not sure that I follow you, here.

Regarding Mendenhall: I agree that his method is a very blunt instrument that should not be taken too seriously. However, besides its historical value as (AFAIK) the first attempt at stylometry, I chose it as a pedagogical device. Mendenhall's technique is conceptually easy to grasp for someone with little or no background on the subject, it can be programmed with a handful of lines of code, and it yields graphical results that are easy (and kind of fun) to read. Thus, it provides an easily achievable milestone for the reader, as an entry point into a difficult topic. For these reasons, and because the other methods are rather math-intensive and dry-looking, I would like to keep it in the tutorial -- with perhaps some words of caution to the reader, that they should view it as "training wheels" rather than as a serious source of evidence?
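For readers of this thread, the "handful of lines" claim can be made concrete. The following is a minimal sketch of a word-length distribution, not the lesson's code: the function name `word_length_distribution`, the regex tokenizer, and the text-based bar chart (used here instead of a plotting library) are all assumptions for illustration.

```python
import re
from collections import Counter


def word_length_distribution(text):
    """Return {word_length: share_of_tokens}, a Mendenhall-style curve."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters; stand-in tokenizer
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    return {length: count / total for length, count in sorted(lengths.items())}


curve = word_length_distribution("We the People of the United States")
for length, share in curve.items():
    # One '#' per 2.5% of tokens, as a crude text plot.
    print(f"{length:2d} {'#' * round(share * 40)}")
```

Comparing two such dictionaries for different authors is the whole method, which is also why it is such a blunt instrument.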

If Mendenhall absolutely needs to go, I am not sure which of the two alternate methods you recommend I should choose. On one hand, a t-test would be conceptually quite similar to the chi-squared method and may not add a great deal of value. On the other hand, Zeta (which I do not know at all) would be likely to increase the tutorial's already uncomfortably high level of complexity. Any advice from the group?

Thanks again, and best regards, FDL

fdlaramee commented 6 years ago

Hi @acrymble,

After analyzing the reviews in detail, I believe that an April 13 delivery for the revised tutorial will be comfortably achievable.

The only potential caveat: if Mendenhall's method needs to be replaced by Zeta, of whose existence I was unaware until I read @jkrybicki's review, the amount of literature research, programming and explanatory writing required is unknown. If Zeta is at all similar to Delta, the process should not take more than two or three days, but finding those two or three days between now and April 13 could be difficult. I will reassess the situation once I have heard back from the reviewer and/or from you on this matter.

Best, FDL

arojascastro commented 6 years ago

Thanks for your response.

Regarding the use of Hemingway: I would say Woolf's style is quite the opposite. I only know one similar case: Eliot, according to Moretti et al. You can check Pamphlet 11 (section 5), where they argue that in one page of Adam Bede the vocabulary is very repetitive (George Eliot was a woman). It is not exactly the same example and it might be more obscure. So you can keep the use of Hemingway and add Eliot's example. Otherwise, you can simply say that stylometry has been used to compare women's and men's texts in a lot of research. For instance: The limits of distinctive words: Re-evaluating literature’s gender marker debate

Regarding Hamilton: I did not know him and I still think it's a very obscure reference. A link to Wikipedia may help, but does it add any value to your tutorial or is it just a joke?

Regarding language independence: I appreciate your clarification. I suggest you include it in your tutorial and expand it a little, because that information will help Spanish readers a lot once we publish the translation.

Regarding paragraphs 11 and 12: the problem is that I have never heard of these papers. I just read the Wikipedia page in Spanish and there is nothing about the authorship problem that you are posing, so I do not think a link will help when we translate the lesson into Spanish. I think some general context would help readers who do not know anything about it, but if you cannot provide it, I guess the Spanish editor may be able to add some notes.

Regarding the Federalist as a common test: again, I have never heard of it; I work on stylometry applied to the Spanish language. Maybe you just need to add "in English" and then the sentence would be more accurate.

Hope this helps. Thanks.

acrymble commented 6 years ago

Thanks @arojascastro and @fdlaramee. If I can just jump in on two of the points:

I understand the purpose of your Hemingway reference. So I would say just leave that one alone because it is a specific example in support of a point.

I do agree with Antonio that the joke about Hamilton is culturally specific (as a fellow-Canadian, I have no idea what you're talking about), and I don't think it will age well, though we hope this tutorial will remain relevant for many years.

fdlaramee commented 6 years ago

Hi @acrymble

The revised lesson and revised figures are in the repository. I have performed all of the changes that I promised to the reviewers (including many new references, links to additional case studies, changes to the code style, etc.) and a few more (I ended up removing part-of-speech tagging from the code as recommended by F. Karsdorp). Having received no objections to my proposals from the reviewers, I believe that this will be satisfactory. I hope so, anyway.

FYI, I will be away from the computer between March 31 and April 8 inclusively, as I will be travelling to ESSHC and taking a few days off in London before the conference.

Thanks again for everything; looking forward to the next steps in the process.

Best, FDL

acrymble commented 6 years ago

Thanks @fdlaramee, were there any areas you chose not to address, so I can assess the changes?

fbkarsdorp commented 6 years ago

Hi @fdlaramee, I'm looking forward to reading the new version of the tutorial!

fdlaramee commented 6 years ago

@acrymble : I did not replace Mendenhall's characteristic curves with another method, for the reasons I outlined in my reply to the reviewer above: pedagogical and historical interest, easy milestone yielding interesting graphical results, etc. I did mention several other methods (including Zeta) in the Further Reading section, as I briefly introduced a selection of other case studies from the literature. I also made the fact that Mendenhall's method is not very reliable more explicit in the text.

I believe that's about it.

Thanks, FDL

fdlaramee commented 6 years ago

@fbkarsdorp Thanks, I hope you'll like it!

acrymble commented 6 years ago

I'm just checking in on a few things, but I'll be in touch in a few days (hopefully).

acrymble commented 6 years ago

Just to keep you informed @fdlaramee I have asked for a comment from one of our reviewers, which we should have in a few days (I hope!)

fbkarsdorp commented 6 years ago

Hi @acrymble and @fdlaramee,

Sorry for the delay. I reread the tutorial and it has been improved quite a lot. Most of the concerns raised by the reviewers have been addressed adequately, and the tutorial has grown into a more comprehensive intro to computational stylometry with Python.

I have a couple of final, minor comments that you could take into account when finalizing the manuscript:

The R package 'Stylo' is mentioned at the very end of the tutorial as a possible alternative for those with passing familiarity with R. Since 'Stylo' is the de facto software package for doing computational stylometric analyses, I think you should give the authors a little more credit. It's probably fair to say that given the GUI provided by the package in combination with the extensive tutorials, users aren't required to have any knowledge of R.
Since you dropped the part-of-speech tagging section, it is no longer required to download the NLTK's average perceptron models in paragraph 10.
The code blocks have improved quite a lot, and they look much more Pythonic. Some remaining PEP8 violations are: a. Lines exceeding the 80 or 90 character limit; b. Lack of whitespace in lists: authors = ["Hamilton","Madison","Disputed","Jay","Shared"] should be authors = ["Hamilton", "Madison", "Disputed", "Jay", "Shared"]. (Btw, I'd prefer to use tuples in these contexts, since the lists won't be updated) c. Remaining whitespace before parentheses of function calls (e.g. any in [token for token in tokens if any (c.isalpha() for c in token)]) d. Since the new jupyter lab has been released recently and it will be the future standard, you should mention how to produce inline plots with the lab as well. Instead of %matplotlib inline or %matplotlib notebook, you should use %matplotlib ipympl.
The use of f-strings requires users to install Python 3.6+. That's a good choice, I'd say, but you should mention that in paragraph 7.
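
To make the PEP8 and f-string points above concrete, here is a minimal sketch of what the corrected style could look like. It is not the lesson's actual code: the helper name keep_word_tokens and the sample token list are assumptions made up for illustration, and only the authors collection and the list comprehension come from the comments above.

```python
import sys

# f-strings need Python 3.6 or later, so fail early with a clear message.
if sys.version_info < (3, 6):
    raise RuntimeError("This sketch uses f-strings and needs Python 3.6+")

# A tuple rather than a list, since the collection is never updated,
# with a space after each comma as PEP 8 recommends.
authors = ("Hamilton", "Madison", "Disputed", "Jay", "Shared")


def keep_word_tokens(tokens):
    """Keep only tokens that contain at least one alphabetic character.

    Hypothetical helper name, used here only to wrap the comprehension.
    """
    # No whitespace between `any` and its opening parenthesis.
    return [token for token in tokens if any(c.isalpha() for c in token)]


# Made-up sample input, not taken from the Federalist corpus.
sample_tokens = ["The", "Federalist", ",", "No.", "10", "!"]
word_tokens = keep_word_tokens(sample_tokens)

for author in authors:
    print(f"Author label: {author}")
print(f"{len(word_tokens)} of {len(sample_tokens)} sample tokens contain letters")
```

The version check is optional, but it turns a confusing SyntaxError on older interpreters into a readable message, provided it sits in a file that does not itself contain f-strings before the check runs.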

Other than that, great job!

acrymble commented 6 years ago

Thanks @fbkarsdorp for taking another look.

@fdlaramee I think those last minor changes are clear enough, but please let me know if you have any questions. To that, I'd add a request that you go through the piece and make sure that the first instance of any technical term (anything you wouldn't expect an everyday-working-historian to understand inherently) is linked to Wikipedia.

After those last steps are complete, I can move into a copy-edit and put this into the publication queue.

I know you said you were away, so can you give me a timeline so we can get this out as soon as possible?

fdlaramee commented 6 years ago

Thanks @fbkarsdorp.

@acrymble : I am back from Northern Ireland so I should be able to do this within a couple of days. If there are any specific terms that you would want me to link to Wikipedia, please let me know, just to make sure that my assessment of the readers' prior knowledge is accurate.

acrymble commented 6 years ago

For example:

stylometry
authorship attribution
library (code)
diacritics
tokenize
corpus

Things like that. Words you wouldn't expect your typical undergraduate history student to know off the top of their head.

fdlaramee commented 6 years ago

Got it. Expect the pre-copyedit version sometime between Friday and Sunday.

acrymble commented 6 years ago

Ok thanks. You're under no obligation to do this over the weekend though. I'll look forward to the next step!

fdlaramee commented 6 years ago

Hi @acrymble. The work is complete.

Thanks, FDL

acrymble commented 6 years ago

Thank you @fdlaramee I will take a look and if everything is ok, we can move to the final steps.

acrymble commented 6 years ago

Possible images for the icon:

https://www.flickr.com/photos/britishlibrary/11171695653/
https://www.flickr.com/photos/britishlibrary/11305956214/
https://www.flickr.com/photos/britishlibrary/11305898666/
https://www.flickr.com/photos/britishlibrary/11305842776/
https://www.flickr.com/photos/britishlibrary/11305680695/

acrymble commented 6 years ago

I've gone through and done copyediting on the lesson. I've added some more links, and tried to cut the word count a little (thinking of our translators in future). This is now ready for publication, pending the following tasks:

For the author (@fdlaramee)

[x] Please do a full re-read of the text, making sure you are happy with any changes I made. In particular, I revised lines 138-140 to make them clearer, but I need you to check they are still accurate.
[x] line 251 - you suggest readers are welcome to change the number of features to see if that will change the results. This is problematic, because of course it will change the results, but perhaps not meaningfully. Can you revise this (tersely) to say you should know what you're doing if you're going to make changes to these values?
[x] Lines 247 and 344 - you describe the formulas for your tests 2 & 3 very well, but can we please have the actual mathematical formulas written in mathematica notation? Historians need to get used to seeing this stuff and understanding that's what their code is doing for them.
[x] I need a 1-line bio for you.
[x] if you have a preference on the image icon (see above comment), please express it.
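
The point about the number of features can be illustrated with a rough sketch of a chi-squared comparison between two token lists, restricted to the most common words of their joint corpus. This is an assumption-laden illustration, not the lesson's code: the function name chi_squared_distance, the n_features parameter and its default of 500, and the toy inputs are all hypothetical.

```python
from collections import Counter


def chi_squared_distance(tokens_a, tokens_b, n_features=500):
    """Sum (observed - expected)^2 / expected over the n_features most
    common words of the joint corpus, for both token lists.

    Hypothetical helper; the default of 500 features is an arbitrary choice.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("Both token lists must be non-empty")

    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    joint_counts = counts_a + counts_b

    # Share of the joint corpus contributed by the first token list.
    share_a = len(tokens_a) / (len(tokens_a) + len(tokens_b))

    chisq = 0.0
    for word, joint_count in joint_counts.most_common(n_features):
        # Expected counts if both texts used the word at the same rate.
        expected_a = joint_count * share_a
        expected_b = joint_count * (1 - share_a)
        chisq += (counts_a[word] - expected_a) ** 2 / expected_a
        chisq += (counts_b[word] - expected_b) ** 2 / expected_b
    return chisq


# Toy inputs: identical usage gives no divergence, disjoint usage does.
same = chi_squared_distance(["the", "of", "the"], ["the", "of", "the"])
different = chi_squared_distance(["the", "the"], ["of", "of"])
print(same, different)
```

Because every extra feature adds non-negative terms to the sum, changing n_features will almost always change the raw statistic, which is why the value should only be altered by someone who knows how to interpret the difference.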

Then it's over to me to move everything to the live site.

fdlaramee commented 6 years ago

Thanks @acrymble. I should be able to go through the text within a day or two.

For the icon: maybe https://www.flickr.com/photos/britishlibrary/11305680695/ ? As long as it isn't the cats, I'll be happy.

For the bio: how about "François Dominic Laramée is a doctoral candidate in history at the Université de Montréal, in Canada. He holds master's degrees in computer science and U.S. history and is a former video game designer, TV personality and screenwriter."

How would you want the formulas to be rendered? Do I produce JPEG files or does the Programming Historian system have the capability to render MathML or something like it? Is "mathematica notation" an actual thing or a typo? (I haven't written formulas on a computer since the early 1990s so I'll have to research a method no matter what.)

-- FDL

acrymble commented 6 years ago

I would say an image is fine for the formula. Assuming this is correct, I believe that's the chi-squared test?: http://www.statisticshowto.com/wp-content/uploads/2013/09/chi-square-formula.jpg

You can upload these as figures. Note that I've converted your figures 6-9 into text, so this would replace those figure numbers.
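
For reference, the chi-squared statistic discussed here is usually written as below. This is the standard textbook form, offered on the assumption that it matches the linked image; the symbols are the conventional ones and are not taken from the lesson itself.

```latex
\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
```

Here O_i is the observed count for feature i, E_i is the count expected under the hypothesis being tested, and n is the number of features compared.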

fdlaramee commented 6 years ago

Sorry @acrymble , I have been caught in a couple of emergencies. I should be able to complete the work tomorrow.

fdlaramee commented 6 years ago

Hi @acrymble,

I am done reviewing the copyediting. I have fixed a few dangling commas and pronouns that referred to clauses that you cut, that sort of thing. Nothing substantial. I did change one line, the one that introduces the chi-squared results; it was fine as a figure subtitle but once it was inserted into the main text it duplicated something said two paragraphs later. I replaced it with something that flows more naturally.

I have inserted the equations you requested, along with short explanations of the symbols found in them.

I also revised line 251 as you requested.

That should be it, I think. When do you expect to publish?

Best, FDL

acrymble commented 6 years ago

List of files to be moved to main site:

/lessons/introduction-to-stylometry-with-python.md
/images/introduction-to-stylometry-with-python/stylometry-python-1.jpg
/images/introduction-to-stylometry-with-python/stylometry-python-2.jpg
/images/introduction-to-stylometry-with-python/stylometry-python-3.jpg
/images/introduction-to-stylometry-with-python/stylometry-python-4.jpg
/images/introduction-to-stylometry-with-python/stylometry-python-5.jpg
/images/introduction-to-stylometry-with-python/stylometry-python-6.jpg
/images/introduction-to-stylometry-with-python/stylometry-python-7.jpg
/images/introduction-to-stylometry-with-python/stylometry-python-8.jpg
/assets/introduction-to-stylometry-with-python/stylometry-federalist.zip

https://www.flickr.com/photos/britishlibrary/11305680695/ for icon

acrymble commented 6 years ago

Thanks so much @fdlaramee. I've submitted a pull request to get this published. We'll just have another editor check it over to make sure we have all of the files and metadata. I've set the publication date as 21 April 2018.

Thanks for working through this. I think it makes an important contribution. I'll be in touch once we've got it live, so we can promote it.

acrymble commented 6 years ago

I'm pleased to announce that this lesson has now been published and is available on the live Programming Historian site. https://programminghistorian.org/lessons/introduction-to-stylometry-with-python

Thank you to @fdlaramee and to our reviewers @arojascastro @fbkarsdorp and @jkrybicki. This lesson opens up an exciting new area of work for us.

To ensure the lesson reaches as many people as possible, we'd be grateful for your help promoting it. Here are some suggestions for how you can help with that:

* Tweet at least 3 times about the lesson (with a link).
* Retweet our tweets about the lesson (‘liking’ does not help spread the word)
* Promote the lesson in presentations or publications about your research
* Link to it in blog posts when relevant
* Add it to lists of resources in relevant repositories (e.g., Wikipedia, community groups, etc.).

People don’t find lessons on their own. The hard work is done, so let’s make sure it was worth it!

fdlaramee commented 6 years ago

Excellent, will do! Thanks everyone.
