Review Ticket for Data Analysis in Python through the HTRC Feature Reader

ianmilligan1 commented 8 years ago

The Programming Historian has received the following tutorial on 'Data Analysis in Python through the HTRC Feature Reader' by @organisciak and @borice. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/text-mining-with-extracted-features

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback which should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @acrymble or @amandavisconti if you feel there's a need for an ombudsperson to step in (as I am not going to act as an ombudsperson on a lesson I am actively editing!).

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to requests for clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

amandavisconti commented 8 years ago

I'm not formally reviewing, just trying to work through the lesson for fun. The intro text is great, and I especially like the "possibilities" section with links to the cited work!

Testing on a Mac (El Capitan 10.11.6), I ran into problems with both the graphical and command-line Anaconda installers, poked around on StackExchange for a bit, and haven't yet figured out the issue (it's likely particular to my computer setup, but I'm reporting it in case another Mac user also has trouble; I'll update if I figure out the problem.)

organisciak commented 8 years ago

@amandavisconti, what's the error that you're getting? Is it in installing Anaconda, or the Feature Reader library? We tried many different ways to simplify the installation as much as possible, if there are still glitches, I'd love to address that pre-emptively. Let me know here, or via email (my username at Gmail).

We recently switched over from pip instructions to conda (because the latter favours already built distributions, making things easier for Windows users), so my suspicion is that maybe there's a problem there?

DigitalHumanitiesSlack commented 8 years ago

@organisciak The error is with the Anaconda installer, not the Feature Reader library; the graphical display installer gives an error message that suggests I contact the Anaconda project maintainers, and I'll check on the command-line error message when I'm home. (Will also update when I figure out the issue in case it's useful for others.)

Thanks for writing this lesson—I really like the introduction and am looking forward to reading the rest!

(Edit: This is from @amandavisconti—forgot to log out while doing Slack-GitHub integration stuff.)

ianmilligan1 commented 8 years ago

Thanks @amandavisconti @DigitalHumanitiesSlack for looking into this! It worked on OS X but I don't have access to a Windows box (something I really need to get at some point..).

Anyways, just a quick update @organisciak – I've reached out to two peer reviewers, just waiting on their responses. My experience with academic August is that sometimes we can all be a bit delayed (myself included!), so patience is probably necessary. 😄

ianmilligan1 commented 8 years ago

Update: two reviewers have accepted and agreed to review the lesson.

sgsinclair commented 8 years ago

This is probably discussed somewhere in the PH world, but I find it a bit awkward to reference particular parts of a document, short of annotating a PDF view or referencing a blame view. Advice?

ianmilligan1 commented 8 years ago

I don't think you're alone in that problem, actually, which speaks to the hypothesis comments that @amandavisconti coordinated a few months ago.

The quick and easy solution might be to reference line numbers from the edit view, or you could use hypothesis to mark up the document.

I'll ping @wcaleb just for a future conversation to be had... I wonder if in the future, line numbers or something could appear in the ph-submissions site preview.

sgsinclair commented 8 years ago

I think this will make a fabulous addition to the PH family for several reasons, including how it explains many key concepts of working with Pandas and how it provides useful and generative examples of interacting with HTRC. The structure and pace of the document are generally excellent, though I'd suggest that the authors have another couple of passes reading it as though they were encountering for the first time several of the more technical components. Linking even more to external documentation on Python, libraries and even HTRC may be a good way of addressing relative newcomers while not weighing down the notebook too much.

I've embedded several comments into the attached PDF (apologies, but that seemed like the most expedient way to annotate) and I'd be happy to discuss or develop any of the comments (apologies in advance for their brevity).

Some other bigger picture comments not in the PDF:

the notebook provides great info on working with the sample dataset or (presumably, I didn't try) the full dataset, but practically, for someone following along at home, the big gap seems to me to be in creating a subset of HTRC based on certain criteria (especially in anticipation of better workset/data-capsule capabilities)
this may be promptly discarded by the PH editors as irrelevant, but the guide doesn't seem to me especially geared toward historians, a few more choice conceptual and code examples might go a long way to address this
this is mentioned above and in some comments, but I'd emphasize again that at times we go from zero to sixty very quickly for the technical aspects, which may be somewhat unavoidable if anything's to be accomplished, but I'd still suggest looking for ways of more gradually ramping up and of providing links to helpful resources for understanding what's happening
this too is mentioned in the comments, but I had trouble getting things to work on Windows and even on Mac I was reminded of how finicky some things can be with multiple version of Python, permissions for pip/3, etc.; I'm mostly just worried about people who get frustrated and give up before they even really start...

I've noted above and in the PDF comments quite a few things came to mind as I was reading, things that might be called constructive criticism, but I want to reiterate that there's lots to love here and that I'm looking forward to using this guide with my students!

Data Analysis in Python through the HTRC Feature Reader _ Programming Historian SGS.pdf

ianmilligan1 commented 8 years ago

Thanks for this @sgsinclair – very much appreciated. The annotated PDF works swimmingly, and I think all of the comments look manageable and well-advised to me. The four big picture comments are also very helpful – points 1, 3, and 4 I'll leave to your thoughts.

As for point 2, I think that has been one tension we've had as the mandate of the Programming Historian continues to grow. Perhaps we could recast a bit with a few examples from literary history? I would be happy to chat and think about some examples we can run.

I'll hold off on more comments at this point as we're still waiting on one more external peer review. Once I receive that, I'll read both reviews, summarize commonalities, and we'll move forward into the revision process.

Again, thanks @sgsinclair for your thoughtful and comprehensive review!

cderose commented 7 years ago

Hi, everyone! I second all of the enthusiastic comments that have already been posted. It was a pleasure to work through this lesson; I look forward to recommending it to colleagues once it's up and running on PH.

Here's a PDF with notes I took while working through the lesson: DataAnalysisInPython_cderose.pdf I'll be glad to elaborate on or clarify any of them.

For a couple more general thoughts:

Depending on the target audience, some additional breakdown of what specific code is doing and why you would take x approach over y could be helpful (I've noted the places in the PDF). Very early beginners will still be able to work through the lesson because of how clearly it's arranged - they just might fall into copy-and-paste mode without being conscious of what the code is doing and why you'd use it and when. If beginner-intermediate users are more the target, then as is works.
As has already come up, I also had trouble with the Anaconda installation and went the pip install route instead. Here's what "source activate" led to in Terminal (I'm working on a MacBook Pro):

If in addition to the pip install there are other alternatives or troubleshooting recommendations you could add, that would be especially great for beginners who might get anxious when the code doesn't immediately work and there's no obvious next step, as @sgsinclair points out. If you do come up with other alternatives, I'd be happy to try them out or brainstorm different roadblocks people might encounter in setting things up so that you can preemptively address them.

Such a terrific lesson - can't wait to see it on PH!

ianmilligan1 commented 7 years ago

Thanks so much @cderose for your insightful comments, both in the prose above as well as in the attached PDF. I think they'll all serve to enhance the lesson's flow and readability, and thanks for catching little things here and there. And the source activate error trace is very helpful.

Really appreciate it, again.

OK. At this point we'll close "open review," and I will begin synthesizing the two reviews for you Peter.

ianmilligan1 commented 7 years ago

I think these were two very positive peer reviews, @organisciak, with good concrete suggestions for way to improve the lessons.

My recommendation at this stage is that you go through and incorporate the in-line suggestions from both @cderose and @sgsinclair. I've read them all and they're worth responding to, and I don't think they should be too onerous. And if you want a Milligan reference, feel free to ping me.

Both reviewers noted in different ways how sometimes the lesson skipped over technical things quite quickly. Both Catherine and Stéfan did an excellent job of showing specific locations in the PDFs.

The one outstanding issue is then the choice of examples. As you go through, why don't we try to brainstorm a few quick historical questions we can ask. Maybe we can work through the notebook together, and try to think of a few examples.

So, at this stage, why don't you incorporate the suggestions as you're able Peter. Ideally, this would be completed within the next three weeks. And then we can touch base, have a quick chat about any specific historical examples or other outstanding issues, before beginning the process of moving this over to the main repo.

organisciak commented 7 years ago

Great. Thanks everybody! I agree that there are solid recommendations here. This message was buried in my inbox during some travel, so I'm turning to the updates now. @capitanu and I will update as best as we can, then follow-up with @ianmilligan1.

Also, looks like more tinkering is in order on how to most easily install the software. We switched from pip instructions to conda instructions because building low-level dependencies was tricky on Windows, but clearly overlooked new sticking points there. Post revision, it will be worth double-checking whatever we figure out: I very much want to make it smooth.

ianmilligan1 commented 7 years ago

Fantastic, thanks @organisciak. Please do keep me posted.

Also – @sgsinclair – we ended up implementing line numbers on the submissions repo, so thanks so much for the suggestion. Please check it out here. I think this will make our reviewing process moving forward.

sgsinclair commented 7 years ago

Fancy numbering! Out of curiosity, is there a reason why headers don't get numbers? (I'm just thinking of how reviewers may want to refer to those as well.)

ianmilligan1 commented 7 years ago

@sgsinclair I'll ask your question in this issue here, as feedback is really useful!

organisciak commented 7 years ago

@ianmilligan1, I uploaded a new version of the lesson (https://github.com/programminghistorian/ph-submissions/commit/968143b1b2aacaf826a5245093840aa2aa5c46a4#diff-3e93430), responding to the notes from @sgsinclair and @cderose. To the reviewers, your comments were phenomenal, and I believe the lesson is better for it.

What's the next step?

ianmilligan1 commented 7 years ago

Great, thanks @organisciak. I'll review the diff this week, hopefully by Thursday, and get back to you with any last minute suggestions. Once we're both happy, I'll begin the process of migrating it over to the live Programming Historian repo. You'll get a chance to proof the lesson, and then we'll be off to the races.

Be in touch soon. Can't wait to get this out to the public!

organisciak commented 7 years ago

Thanks, Ian. I earlier neglected to format the images in the Jekyll format, so I corrected that tonight.

I have three formatting notes/questions:

How do I control the image size in Jekyll? Some of our images are large because they were taken on my hi-dpi monitor. In my draft, I controlled the size with 'width' in an HTML tag, so that the resolution wasn't lost on better screens. For Jekyll, I tried resizing the internal DPI of the images, but it made no difference.
Is there a way to render formulas? I have things like $n^{th}$ showing up.
The line numbering overlaps with the blockquotes that I use for asides, such as line 31: http://programminghistorian.github.io/ph-submissions/lessons/text-mining-with-extracted-features

ianmilligan1 commented 7 years ago

For Jekyll questions, I'll ping you over to @wcaleb – any thoughts on images and formulas, Caleb?

For line numbering, they won't be in the final version on our main repo (although you'll still have a chance to proof before we connect it to the main site!).

ianmilligan1 commented 7 years ago

I've had a chance to go through your diff, @organisciak, comparing it with the comments by both @sgsinclair and @cderose. You've done a fantastic job and I think you should be commended with the care and attention you've made to the suggestions.

I've made two small suggestions which I've pushed forward (a link to JSON and your username).

Both reviewers ran into the same issue with the conda install -c organisciak htrc-feature-reader command. I have tried to make it a bit more explicit that they should swap out the username. I did so with this language:

Now, you need to type one command:

conda install -c USERNAME htrc-feature-reader

This command installs the HTRC Feature Reader and its necessary dependencies. We specify -c so that the installation command knows to look for the program under the USERNAME channel, the username of the library author. You will need to change USERNAME to your own username, which you can find by typing whoami on a Mac or echo %USERNAME% on Windows. For example, the author uses:

conda install -c organisciak htrc-feature-reader

Feel free to push back on this. I've put the language into the repo but you can revert. 😄

organisciak commented 7 years ago

Actually, it should be my username! Stefan mentioned that it might cause confusion and I clearly didn't do a good enough job addressing that. I'll give it another try. The alternative: maybe @borice and I should start an 'htrc' account on Anaconda to host the library?

The problem the reviewers ran into was that one of the dependencies, PySolr, isn't in the default conda channel, and I hadn't built a version for Mac OS. I believe I've fixed this. Now that I've spent more time figuring out conda, I'm confident that it is the better way to prepare things for novice users, but it took some learning on my part to offer it properly.

ianmilligan1 commented 7 years ago

Ahhh, this makes sense. Yeah, my instinct would be to swap out the usernames (I guess I didn't when I installed this way back when however). OK. Let me roll that back.

Maybe an HTRC account for Anaconda would make more sense, if it is trivial to implement?

organisciak commented 7 years ago

When you installed it, it was probably when I had the instructions as pip, so no worries. Turns out it was trivial to add an organization so I made it conda install -c htrc htrc-feature-reader :)

organisciak commented 7 years ago

Updated. https://github.com/programminghistorian/ph-submissions/commit/b3a1274cf7975a4409518f283725e83b04aa0ef2

ianmilligan1 commented 7 years ago

Great, works like a charm.

Migrated proof over to: http://programminghistorian.org/lessons/text-mining-with-extracted-features

We'll figure out figures & formulas soon. Can you give a bit more detail on what you might want done with figures - you think some are too big?

Also @organisciak @borice: could you send me short bios, either here or via e-mail. Just a few lines if possible.

wcaleb commented 7 years ago

With the images, I think the easiest way to make them smaller would be to actually change the size of the image file. Our site uses Jekyll templates to generate the images, so it's not easy to tweak the size on a lesson by lesson basis.

As for formulas, would you be able to format the math in LaTeX as described here? I think that our Markdown parser would be able to handle that syntax if you can get your formulas into it.

organisciak commented 7 years ago

Ah, two dollar signs, slightly different than the Pandoc markdown that Jupyter follows. I'll also bulk resize the images.

Should updates continue in this repo, to be merged into programminghistorian/jekyll later?

ianmilligan1 commented 7 years ago

Aye, please do them here and I will merge them. Just ping me when you're happy with both.

organisciak commented 7 years ago

Updated with f2b2a18. The resized images won't look great on higher resolution screens (Macbooks and Surfaces), so I kept the originals in case the template is ever updated.

ianmilligan1 commented 7 years ago

Awesome. Let me move this over and we'll see how it looks on the main site!

ianmilligan1 commented 7 years ago

OK proof is living here: http://programminghistorian.org/lessons/text-mining-with-extracted-features.

Can you take a look and make sure it looks good to you? And also – short bios would be great, and I can add them to the system.

organisciak commented 7 years ago

We found a few errors, the new commit should be the last.

Bios: Peter Organisciak is an information scientist, at the HathiTrust Research Center as a postdoctoral research associate. Boris Capitanu is a research programmer at the HathiTrust Research Center.

Images What do you think of this pamphlet box:

ianmilligan1 commented 7 years ago

Thanks for the bios! I'll review and move over in a bit. And picture looks great, let me tinker with it.

(May take me a day or two as some short-deadline work has just appeared on my plate..)

borice commented 7 years ago

Hi Ian,

I didn't respond to Peter in time when he asked me for a bio as I've been out with my little girl for a checkup at the doctor. Could you please use the following for my bio:

Boris Capitanu is a Research Programmer at the HathiTrust Research Center. His research interests include large scale data analysis, machine learning, data mining, and educational technologies. Boris holds a B.S. and M.S. in Computer Science from University of Illinois at Urbana-Champaign, and a MBA+MHRIR from the same institution.

Thank you, Boris

On Nov 23, 2016, at 4:41 PM, Ian Milligan notifications@github.com wrote:

Thanks for the bios! I'll review and move over in a bit. And picture looks great, let me tinker with it.

(May take me a day or two as some short-deadline work has just appeared on my plate..)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/programminghistorian/ph-submissions/issues/29#issuecomment-262647705, or mute the thread https://github.com/notifications/unsubscribe-auth/AAaT87Nw-qQb2h0XTFnyUGfu-2CxV0-Jks5rBMEZgaJpZM4JZCtG.

ianmilligan1 commented 7 years ago

Great, thanks @borice & @organisciak.

I've put a new proof up at http://programminghistorian.org/lessons/text-mining-with-extracted-features (you may need to do a hard refresh). I've also taken your suggested image, Peter, and converted it into a 200x200 greyscale. Look good?

https://github.com/programminghistorian/jekyll/blob/70a3a250c8ac9c7e2c251f64361173092899131c/gallery/text-mining-with-extracted-features.png

organisciak commented 7 years ago

Looks great!

On Thu, Nov 24, 2016, 7:43 AM Ian Milligan notifications@github.com wrote:

Great, thanks @borice https://github.com/borice & @organisciak https://github.com/organisciak.

I've put a new proof up at http://programminghistorian.org/lessons/text-mining-with-extracted-features (you may need to do a hard refresh). I've also taken your suggested image, Peter, and converted it into a 200x200 greyscale. Look good?

https://github.com/programminghistorian/jekyll/blob/70a3a250c8ac9c7e2c251f64361173092899131c/gallery/text-mining-with-extracted-features.png

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/programminghistorian/ph-submissions/issues/29#issuecomment-262791576, or mute the thread https://github.com/notifications/unsubscribe-auth/AADj72rq1-_tNcZKJQ0NUNXYYlGovfVzks5rBaKFgaJpZM4JZCtG .

ianmilligan1 commented 7 years ago

Hurray!

I have done a soft launch, so you can see your entry here (http://programminghistorian.org/lessons/) - under distant reading. The lesson is also fully live.

That said, I won't launch publicity until Monday (Friday of the Thanksgiving weekend in November is not an ideal launch time) so if you catch anything, let me know. 😄

organisciak commented 7 years ago

Awesome. That's also when HathiTrust is publicizing the full collection Extracted Features Dataset which we soft launched in October, so it may get more eyes on the lesson.

On Fri, Nov 25, 2016 at 1:27 PM Ian Milligan notifications@github.com wrote:

Hurray!

I have done a soft launch, so you can see your entry here ( http://programminghistorian.org/lessons/) - under distant reading. The lesson is also fully live.

That said, I won't launch publicity until Monday (Friday of the Thanksgiving weekend in November is not an ideal launch time) so if you catch anything, let me know. 😄

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/programminghistorian/ph-submissions/issues/29#issuecomment-263019180, or mute the thread https://github.com/notifications/unsubscribe-auth/AADj73qLZPUFWu6Q_Fm_Ym1jd_1CZRmVks5rB0SlgaJpZM4JZCtG .

ianmilligan1 commented 7 years ago

🎉 Hurray, that's awesome timing.

programminghistorian / ph-submissions

Review Ticket for Data Analysis in Python through the HTRC Feature Reader #29

Anti-Harassment Policy