Closed norvig closed 3 years ago
In what format was the book originally formatted? Do you still have those original files somewhere? Original source -> something great is probably easier than original source -> pdf -> conversion -> something great...
It is a sad, sad story ... The original files were in LaTeX, but the publisher (Morgan Kaufmann) hired a design company to do final page layout/design/formatting, so I never had the final-final files. I got careless with my near-final files: during one transfer of machines the files were lost because I didn't realize that I needed to do tar --dereference to follow symbolic links. I wasn't too worried because I assumed I could always get the files from Morgan Kaufmann if I needed them; however, when Morgan Kaufmann was bought by Elsevier, they too managed to lose the files.
That is indeed a pretty sad story... Someone on Hacker News mentioned that there was a clearer PDF version available for ACM members (https://news.ycombinator.com/item?id=16470513). Would it be possible to upload that version instead of the version that is online right now? Might make the conversion a bit easier...
Can we use this issue to coordinate assigning the chapters to people and porting to Markdown? I'll volunteer for a chapter.
I'll volunteer for a chapter, but please establish the Markdown guidelines first (e.g. how heading levels will correspond to the typographic levels in the book).
Thank you for volunteering! I think we need to
Hello,
In the interest of potentially averting duplicate work, I thought I'd let you know that I am intent on making a "modern" mobile-friendly HTML version of PAIP. (I am not yet fully committed to this project, because it would indeed delay a very important project of mine again. I can't make any promises yet, I'm still in the exploration phase.)
I completed a very similar project a few months ago: starting from a much more primitive HTML version, I made a modern public domain version of chapters 5 and 6 of AMOP (the CLOS MOP specification), mostly by hand: https://clos-mop.hexstreamsoft.com/
So, given my current infrastructure, this new modern HTML conversion project should be relatively straightforward (hopefully!), although the volume is far, far more... voluminous. I'll go at this lone-wolf style, per my usual modus operandi.
I hope this information is deemed to be of some utility (and not too spammy).
edit: Oh geez, this book is around 1000 pages?? I didn't quite realize how massive this book is... I could eventually tackle such a project, but absolutely not anytime soon. I'm officially relegating my PAIP conversion project to the indeterminate future, then. Sorry for the noise.
Just wanted to weigh in that I think @norvig is right, a better PDF version would make conversion easier... Also: three hours developing a semi-automatic tool will definitely outweigh three hours of manually correcting/hacking parts of the original together.
Also: if we end up with a good "ground-truth" version of the book I see a nice OCR formatting challenge for AI algorithms ;)
I hacked together a Python (3.6+) script to separate out different chapters from the text file. I wonder if this script can be used as the starting point for automated spell-checking, parsing for footnotes and code snippets, etc. in the future.
The script contains a dictionary that maps pdf page number to chapters. Excluding front-matter, preface, appendix, bibliography and index, there are 25 chapters in total. Considering the voluminous-ness of the book, I wonder if we could start another issue for people to claim their favorite chapters to work on.
Might I suggest we use GitHub-flavored Markdown for the purpose of this project? It seems to be one of the more popular flavors of Markdown, and we would have a variety of tools available to convert GFM to other document formats.
edit: I used the form feed character in the text file to delineate page boundaries. A recent pull request seems to have replaced them with whitespace. Hmm..
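For anyone who wants to try the same idea, here is a minimal sketch of page-based chapter splitting. It is not the script mentioned above; it only assumes that pages are separated by form feed characters and that a dictionary maps the first PDF page of each chapter to a chapter name. The function names, the sample text, and the page numbers are all made up for illustration.

```python
from typing import Dict, List

FORM_FEED = "\f"  # assumed page delimiter in the text file


def split_pages(text: str) -> List[str]:
    """Split the full text into pages at form feed characters."""
    return text.split(FORM_FEED)


def group_by_chapter(pages: List[str], chapter_starts: Dict[int, str]) -> Dict[str, str]:
    """Group pages into chapters.

    chapter_starts maps a 1-based page number to the chapter that begins
    on that page. Pages before the first listed start are collected under
    the hypothetical name "front-matter".
    """
    chapters: Dict[str, List[str]] = {}
    current = "front-matter"
    for number, page in enumerate(pages, start=1):
        # Switch chapters when this page is listed as a chapter start.
        current = chapter_starts.get(number, current)
        chapters.setdefault(current, []).append(page)
    return {name: "\n".join(parts) for name, parts in chapters.items()}


# Hypothetical five-page sample; the real page numbers would come from the PDF.
sample = "title page\fpreface\fChapter 1 text\fmore chapter 1\fChapter 2 text"
starts = {3: "chapter01", 5: "chapter02"}
chapters = group_by_chapter(split_pages(sample), starts)
```

Note that this only works while the form feeds are still in the file, so any cleanup pass that strips control characters would have to run after the split.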
@rmeertens I checked out the ACM pdf -- it is different than the ones we've seen before, and at 91MB almost twice as big. Not sure if it is better.
I did upload an alternative txt file, PAIP-alt.txt, which we can investigate to see if it is better. If it is, we want to make sure to capture the work that was done on the previous PAIP.txt.
What I saw in converting a few PDF files to text for a wiki a few years back makes me think that it will be hard to develop good tools to undo the non-systematic errors. But I'm willing to wait, no problem.
Does someone want to take a more careful look at PAIP.txt and PAIP-alt.txt and weigh in on which is better? Or if they make different errors, any good ideas for a way to use both?
@norvig Here are the primary difficulties I see:
- PAIP.txt has more spelling errors / wrong characters.
- PAIP-alt.txt does not have empty newlines after paragraphs, which might be gnarly when converting to Markdown.
We could consider either of the solutions below:
1. Use PAIP.txt primarily, and run a script to detect spelling errors; when an error is detected, go to the corresponding location in PAIP-alt.txt, fetch the alternative word there, and prompt the editor for verification.
2. Use PAIP-alt.txt primarily, but iterate through PAIP.txt, looking for paragraph breaks. As far as I can see, the content of the lines in both files appears to match up, so we can insert paragraph breaks at appropriate locations in PAIP-alt.txt.
The first approach seems to be more labor intensive (we still need to manually confirm if the script finds the right word in the alternative file), so maybe we can use the second approach, and figure out how to work from there.
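A rough sketch of the second approach, in case it helps the discussion. It assumes the strongest version of the claim above: that every non-blank line in PAIP.txt corresponds, in order, to exactly one line in PAIP-alt.txt. That has not been verified for the whole book, and the function name and sample lines are made up.

```python
from typing import List


def transfer_paragraph_breaks(reference_lines: List[str], alt_lines: List[str]) -> List[str]:
    """Copy blank-line paragraph breaks from the reference text into the alt text.

    Assumes the non-blank reference lines and the alt lines match up
    one-to-one and in the same order.
    """
    alt_iter = iter(alt_lines)
    result: List[str] = []
    for line in reference_lines:
        if line.strip() == "":
            # Paragraph break in the reference: emit one blank line, never two in a row.
            if result and result[-1] != "":
                result.append("")
        else:
            replacement = next(alt_iter, None)
            if replacement is None:
                break  # alt text ran out early; the files do not line up
            result.append(replacement)
    # Keep any alt lines the reference did not account for.
    result.extend(alt_iter)
    return result


# Made-up sample: the reference has OCR errors but good paragraph breaks.
reference = ["Tbe first paragraph", "ends here.", "", "Second paragrapb."]
alt = ["The first paragraph", "ends here.", "Second paragraph."]
merged = transfer_paragraph_breaks(reference, alt)
```

If the two files drift apart anywhere (an extra header line, a dropped line), every break after that point lands in the wrong place, so a real tool would probably need to re-synchronise per page or compare line contents fuzzily.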
When I did some visual skimming, PAIP-alt.txt didn't seem to offer significant benefits. Paragraph breaks were mentioned. These also affect chapter breaks and page headers. Anything that is not English has to be fixed manually, I fear. "2 - f 2 = " or " 2 -f 2 =" are both bad. " 2 2 + " is slightly better than " 22+ " but both need manual intervention. Same goes for code blocks. Here's an example where somebody mistakenly removed the 4 from PAIP.txt:
> (+ 2 2)
4
>
vs.
> (+ 2 2)
>
But further down both use > (+ 2 2) ^ 4 instead of > (+ 2 2) ⇒ 4.
Another example where both variants need manual intervention:
> 'John =^ JOHN
> '(John Q Public) =^ (JOHN Q PUBLIC)
> '2 2
> . => .
> ' ( + 2 2) =.> (+ 2 2)
> (+ 2 2) =^ 4
> John ^ Error: ]OHN is not a bound variable
> (John Q Public) ^ Error: JOHN is not a function
vs.
> 'John => JOHN
> '(John Q Public) => (JOHN Q PUBLIC)
> '2 2
> . => .
> '(+ 2 2) =.> (+ 2 2)
> (+ 2 2) => 4
> John ^ Error: ]OHN is not a bound variable
> (John Q Public) ^ Error: JOHN is not a function
That's why I conclude that PAIP-alt.txt is not necessarily better and furthermore, since a lot of manual intervention is required, simple tools might not suffice.
I think my process would consist of tons of interactive search and replace operations in Emacs, some keyboard macros, and a lot of actual reading and formatting and carefully comparing code with what the book says in addition to a lot of talking with others about what markup to use for quotes at chapter beginnings and stuff like that. Things that can't necessarily be automated and shared. Perhaps we can share some common search and replace operations once we have converted a chapter or two. I think we need this experience, first. Perhaps I should just pick a chapter and time myself, see how much time it takes per page?
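As one example of a search and replace operation that could be shared, here is a deliberately conservative sketch for the garbled arrows in the REPL examples quoted above. It assumes that only lines starting with "> " are REPL lines and rewrites only the forms =^ and =.>; a bare ^ is left alone because it is too ambiguous to fix blindly. The function name is made up, and the output uses the ASCII => rather than ⇒.

```python
import re

# Garbled renderings of the arrow seen in the samples above: "=^" and "=.>".
GARBLED_ARROW = re.compile(r"=\^|=\.>")


def normalize_arrows(line: str) -> str:
    """Rewrite garbled arrows to "=>" on REPL lines only."""
    if not line.startswith("> "):
        return line  # not a REPL line, leave untouched
    return GARBLED_ARROW.sub("=>", line)


fixed = normalize_arrows("> 'John =^ JOHN")
untouched = normalize_arrows("> John ^ Error: ]OHN is not a bound variable")
```

Everything this leaves behind (the bare ^, the ]OHN, the missing results) would still need a human with the book open.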
GitHub flavoured Markdown, # for chapter headings like Chapter 1: Introduction to Lisp and ## for section headings like 1.1 Symbolic Computation. Create a new file per chapter, named PAIP-chapter01.md and the like.
I'm going to wait for the weekend before trying any of that but this would be my proposal.
I couldn't get miyuchina's script to work, and I opted to instead add chapter headers, in a fork. It's still in one file, so ## seemed appropriate for chapters; I verified that string wasn't used elsewhere in the book. Apologies if I'm jumping the gun before there's a consensus or anything; I woke up and couldn't sleep, so I was fiddling around.
@pronoiac pull request #7 got rid of the form feed characters, which broke my script. I was using form feeds to identify page breaks, and using page numbers to separate chapters. Perhaps that was the issue you encountered?
I'd be interested in seeing PAIP converted to HTML5/epub format like that of SICP http://sarabander.github.io/sicp/ The github project is at https://github.com/sarabander/sicp . It seems to have been converted from texinfo format but I like the formatting/style of the result.
@miyuchina: It wouldn't run, it likely was missing a module. I started working on line numbers when I realized it was simpler to annotate the text.
I then noticed Sublime Text was silently making changes - eating control characters or confused by an unusual encoding? I might try to straighten that up.
I'm going to take a look at chapter 5 and model it on chapter 1 in the docs subdirectory.
It took me about 3h to do 24 pages.
I had some formatting questions as I went through the text.
(variable . value)
→ (variable . value)
What do you think? Agree, disagree? We will have to put up a consolidated rule set eventually.
@kensanata good work on chapter 5! The code highlighting plugin has been changed so you should use lisp to highlight the code block instead of emacs.
I'm curious, @nticaric and @kensanata, how different is the text in the chapters from the text within PAIP.txt?
I can see how to reference pages and chapters within one Markdown file, and how to fix up the links for a file per chapter, generated from that source file. Note, this handles linking, but there could be more discussion about appearances: should the page numbers be visible?
I'd thought I'd read a good argument for having links for page numbers, maybe in @Hexstream's comment, since edited? Looking at a Markdown cheatsheet, we could add anchors with something like:
156 ELIZA-DIALOG WITH A MACHINE <a id="page-156"></a>
And use it with See [page 156](#page-156) for more details.
If we do the same with chapters, then splitting up the files is straightforward, and it's easier to figure out which file a certain page number is in and reference the appropriate chapter Markdown file.
This takes some processing / automation to split the book into chapters, but it seems like the easiest links to write.
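Here is a small sketch of what that processing could look like for the page anchors. It assumes page headers have the shape shown above (a page number, a space, then an upper-case running title); headers with the number at the end of the line, if there are any, would need a second pattern. The function names are made up.

```python
import re

# Assumed header shape: page number, a space, then an upper-case running title.
PAGE_HEADER = re.compile(r"^(\d{1,4}) [A-Z][A-Z :,'\-]*$")


def add_page_anchor(line: str) -> str:
    """Append an HTML anchor to a line that looks like a page header."""
    match = PAGE_HEADER.match(line)
    if match is None:
        return line
    return f'{line} <a id="page-{match.group(1)}"></a>'


def page_link(page: int) -> str:
    """Build the Markdown link that targets a page anchor in the same file."""
    return f"[page {page}](#page-{page})"


header = add_page_anchor("156 ELIZA-DIALOG WITH A MACHINE")
link = page_link(156)
```

Once the book is split into one file per chapter, the link target would also need the chapter file name in front of the #page-... fragment.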
For chapter 5, I removed the page headers and the page breaks. I believe we don't need the text to recreate the book, and I don't need a website that recreates the book. With the text, we can reflow it, create an ebook without page breaks, and many more interesting things besides, and in all those uses, the page breaks are an annoyance. That's why I think we should create anchors for the things the text links to and link to them, and get rid of cross-references by page number. That's what I think we should do.
@pronoiac My comment you're thinking of about page number links and stuff is this one. :)
@kensanata: It's not hard to remove the page headers and breaks as we add some anchors.
I agree that it would be nice, down the line, if a reference to another page goes right to the appropriate paragraph. To me, it feels more like a "day 2" thing, after we have the book in readable Markdown, with less mangled text. I have misgivings about, in the first pass, manually adding anchors for referenced subject matter:
Having a source file could make it easier to enforce consistency. Are we done with broad changes? Like, is exercise markup settled? What about footnotes?
For the purposes of tracking changes, I'm also not keen on removing trailing spaces (in a chapter, but not in the book), or making each paragraph one line. Seeing what's been changed can help other editors figure out what to do on other chapters.
I think the 3–4h I spent on this project are the limit of what I will do, unfortunately. I suspect that the easiest solution to the entire problem would be if we could find 20+ other volunteers to do something like it and one final editor to go through all the chapters and decide how cross-references should work. With all the talk of automation being the preferred solution I’m curious to see the tools people have been working on. Is there a place where I can look at recent developments and progress other than the commit history of this project?
@kensanata: Apologies if I’ve steamrolled over you, or disregarded your opinion; it wasn’t my intent. You fixed up a chapter in great time.
Locally, I've added markers for chapters and pages, and written scripts to make a file per chapter.
I haven't yet written scripts to merge an edited chapter back into the text, or handle footnotes.
I’ve done parts of chapters 10 and 11. Fixing up code - and some assembler - feels very slow. I think I'll kick the assembler down the road for now.
No worries! :) I’m just happy somebody is working on the issue. :)
@Hourann: that's great news! It looks like the Safari version is (mostly) text, instead of scanned bitmaps.
I'm in favor of a new issue for ripping and converting the Safari version. I'd start work on it, but I don't have an active Safari account.
@hakano @pronoiac I can access the book in safari in fact. I just wonder if it is legal to crawl the book from safari and post it here. I have crawled it and saved in html format.
The legal aspect makes me wary of talking about it here. Personally, I wouldn’t be comfortable openly sharing html or an epub I’d downloaded. Once I’d converted it into Markdown and checked for fingerprinting, I'd feel free to share.
@Hourann, @hakano, wanna compare downloads?
Just for visibility, there's a cleaner source available, and a separate issue to track work with it.
The .txt version has a lot of errors; I got it from the default Save as other / ...Text menu item in Acrobat. An automated tool could rejoin the lines that end in hyphens, and perhaps find missing spaces, as in programmingpractices and anunfortunate. Other errors would require significant human labor to clean up.
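A minimal sketch of the hyphen-rejoining idea, under the assumption that a line ending in a letter followed by a hyphen is a word broken across lines. That assumption is wrong for genuine compounds that happen to break at the hyphen, so the output would still need a review pass. The function name and sample lines are made up.

```python
import re
from typing import List

# A letter followed by a hyphen at the very end of the line.
WORD_HYPHEN = re.compile(r"[A-Za-z]-$")


def rejoin_hyphenated(lines: List[str]) -> List[str]:
    """Move the first word of a line up when the previous line ends in a hyphenated break.

    Expects lines without trailing newline characters (e.g. from str.splitlines()).
    """
    result: List[str] = []
    for line in lines:
        previous = result[-1].rstrip() if result else ""
        if WORD_HYPHEN.search(previous) and line.strip():
            # Drop the hyphen and glue the first word of this line onto the previous one.
            head, _, rest = line.lstrip().partition(" ")
            result[-1] = previous[:-1] + head
            if rest:
                result.append(rest)
        else:
            result.append(line)
    return result


joined = rejoin_hyphenated(["a good implemen-", "tation of the idea"])
```

Finding the missing spaces is harder, since it needs a word list to decide where a run of letters should be split, so it is left out of this sketch.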