A counted words list to help with proof reading

DavidHaslam commented 7 years ago

FIO: The attached Zip file contains a counted words list for the verse text only, excluding notes.

merged.vpl.words.count.txt.zip

It may be of use for proof reading, etc.

Notes:

Hyphens and the apostrophe were retained as parts of words.
Bracketed text was not excluded (e.g. the Prologues for Ecclesiasticus).
Spellings were not standardized in 1611 as they later became in modern English.
Letter case was preserved.
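
For anyone wanting to reproduce a list along these lines without TextPipe, here is a minimal Python sketch that follows the notes above. The regular expression, the function names and the sort order are assumptions made for illustration; this is not the filter that produced the attached file.

```python
import re
from collections import Counter

# Letters or digits, optionally joined by internal hyphens or apostrophes.
WORD = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def count_words(text):
    """Count word forms exactly as spelled; letter case is preserved."""
    return Counter(WORD.findall(text))


def format_counts(counts):
    """Return 'count<TAB>word' lines, zero-padded, sorted by word."""
    ordered = sorted(counts, key=lambda w: (w.casefold(), w))
    return [f"{counts[w]:05d}\t{w}" for w in ordered]


# Hypothetical sample text, not a quotation from the 1611 edition.
sample = "the Temple, and the temple; Beer-sheba"
for line in format_counts(count_words(sample)):
    print(line)
```

Because nothing is case-folded, Temple and temple are counted as separate entries, which is what makes the later capitalisation comparison possible.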

DavidHaslam commented 7 years ago

I was surprised to see the digit 1 at the top of the list. I have not yet located it in the text.

DavidHaslam commented 7 years ago

Technical tip: Although the hyphen-minus (U+002D) is matched by the PCRE class [[:punct:]], the hyphen U+2010 is not, at least when that class covers only ASCII punctuation.

To generate the words list, I temporarily replaced the hyphen/minus by the hyphen, so that these were not removed when I used the PCRE class [[:punct:]] in a subsequent search/replace pattern.

A similar method was used to retain the sole apostrophe.
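
The swap trick can be sketched in Python as follows. Python's re module has no POSIX bracket classes, so string.punctuation (the ASCII punctuation set) stands in for [[:punct:]] here. The stand-in character for the apostrophe is an assumption, as the comment above does not say which one was used.

```python
import string

HYPHEN = "\u2010"       # HYPHEN, outside the ASCII punctuation set
APOS_STANDIN = "\u02bc"  # assumed stand-in: MODIFIER LETTER APOSTROPHE

# Map every ASCII punctuation character to a space.
STRIP_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def strip_punct_keep_hyphen_apostrophe(text):
    """Remove ASCII punctuation but keep hyphen-minus and apostrophe."""
    # 1. Hide the two characters we want to keep.
    protected = text.replace("-", HYPHEN).replace("'", APOS_STANDIN)
    # 2. Strip everything that is still ASCII punctuation.
    stripped = protected.translate(STRIP_TABLE)
    # 3. Put the original characters back.
    return stripped.replace(HYPHEN, "-").replace(APOS_STANDIN, "'")


print(strip_punct_keep_hyphen_apostrophe("Beer-sheba, God's; (yea)").split())
```

This only works if the text does not already contain the stand-in characters, so it is worth checking for them first.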

DavidHaslam commented 7 years ago

Windows users may like to know that I generated the counted words list by means of a bespoke TextPipe filter.

DavidHaslam commented 7 years ago

A further contrast with modern editions: the KJV of 1611 capitalised a greater number of common nouns. Here's one such example:

00079   temple
00255   Temple
00010   temples
00003   Temples

Contrast this with a modern KJV (albeit without the DC books):

000204  temple
000009  temples

But here's the same for just the DC books:

000131  temple
000003  temples

The noun temple never occurs at the start of a sentence, so it was a useful candidate for the comparison.
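
Other candidates for this kind of comparison can be found by grouping the counted list on the lower-cased form and keeping only the words that occur in more than one casing. A rough sketch, using the 1611 figures quoted above as input (the function name is made up for illustration):

```python
from collections import defaultdict


def case_variants(counts):
    """Group counts by lower-cased form; keep forms with several casings."""
    groups = defaultdict(dict)
    for word, n in counts.items():
        groups[word.lower()][word] = n
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


# The four 1611 counts quoted in this comment.
kjv_1611 = {"temple": 79, "Temple": 255, "temples": 10, "Temples": 3}

for key, forms in case_variants(kjv_1611).items():
    print(key, forms, "total:", sum(forms.values()))
```

Words that can begin a sentence will show up too, so the output still needs a manual check before drawing conclusions about capitalised nouns.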

DavidHaslam commented 7 years ago

The counted words list can be pasted into Excel™ and filtered on the Count column to browse through [say] all the hapax legomena.

NB. Take care after pasting. The words false and true will have become Booleans.
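
As an alternative to Excel, the list can be filtered with a few lines of Python, which keeps every word as plain text, so false and true are not converted. This sketch assumes each line is a zero-padded count, then whitespace or a tab, then the word; the sample lines are hypothetical apart from the temple count.

```python
def parse_counted_list(lines):
    """Turn 'count word' lines into (count, word) pairs; skip other lines."""
    entries = []
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        entries.append((int(parts[0]), parts[1].strip()))
    return entries


def hapax_legomena(entries):
    """Return the words whose count is exactly 1."""
    return [word for count, word in entries if count == 1]


# Hypothetical sample; a real run would read the unzipped text file.
sample_lines = ["00079   temple", "00001\ttrue", "00001   Jesus", ""]
print(hapax_legomena(parse_counted_list(sample_lines)))
```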

It's already become apparent that there are instances where two words were joined together in the HTML.

We can begin to list these here:

beconfounded
himcontinually
intheir
Lordcame
maydeclare
mineenemies
ofsorrows
preciousstones
shallbreake
shalloffer
shallsay
thedeepe
thereofshall
withthe

This is different to merely observing varieties of spellings in Early Modern English. Of course, it's conceivable that this exercise will just uncover printers' mistakes from 1611.
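
One way to hunt for more joined words is to test whether a rare word splits cleanly into two words that are already in the list. The following is only a heuristic sketch (the function name and the minimum part length are assumptions), and genuine compounds will also be flagged, so each hit still has to be checked against the page images.

```python
def split_candidates(word, vocabulary, min_len=2):
    """Return every (left, right) split where both halves are known words."""
    return [
        (word[:i], word[i:])
        for i in range(min_len, len(word) - min_len + 1)
        if word[:i] in vocabulary and word[i:] in vocabulary
    ]


# A tiny vocabulary for illustration; in practice use the frequent words
# from the counted list.
vocabulary = {"be", "confounded", "in", "their", "the", "deepe"}

for suspect in ["beconfounded", "intheir", "thedeepe"]:
    print(suspect, split_candidates(suspect, vocabulary))
```

Running this only over the words with a low count keeps the number of false positives manageable.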

DavidHaslam commented 7 years ago

Even in a culture where spellings had not been standardised, it's still possible to observe some obvious printers' mistakes.

Beeer-sheba
breehren
chldren
fifteeene
Ind
looosed
monrning

Is it fair to say that a triple vowel is more likely to be a printer's mistake than a spelling variation?

Extend the list as more come to light.
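
The triple-vowel idea is easy to test mechanically. A small sketch using a back-reference, so that only the same vowel three times in a row is flagged (the pattern is an illustration, not something already applied to the whole list):

```python
import re

# The same vowel three times in a row, ignoring letter case.
TRIPLE_VOWEL = re.compile(r"([aeiou])\1\1", re.IGNORECASE)


def has_triple_vowel(word):
    """True if the word contains one vowel repeated three times running."""
    return TRIPLE_VOWEL.search(word) is not None


suspects = ["Beeer-sheba", "breehren", "chldren", "fifteeene", "looosed"]
print([w for w in suspects if has_triple_vowel(w)])
```

Of the words listed above, this catches Beeer-sheba, fifteeene and looosed; the other mistakes need a different test.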

DavidHaslam commented 7 years ago

To modern readers, surely the most surprising hapax legomenon in the whole KJV_1611 must be the word Jesus?

Upon closer inspection, this appears to be due to an HTML transcription error!

[The Prologue of the Wisdome of Jesus the sonne of Sirach]

should read:

[The Prologue of the Wisdome of Iesus the sonne of Sirach]

It appears that the transcribers were unaware that the italic capital letter I in the 1611 typeface looks much like a J. Agreed? @lb42

btw. The prologue titles should be styled in the TEI with italics.

lb42 commented 7 years ago

By all means list these obvious typos somewhere. If you provide a reference to show where they occur that would be helpful. Even more so if you can indicate whether it is a printing error (i.e. present in the original source) or a transcription error (i.e. not present in the original source).

For example:

Lordcame Jonah 3.1 fix
beconfounded Psa 69.6 fix
Beeer-sheba Gen 21.32 sic
breehren Acts 15.33 sic

etc.

DavidHaslam commented 7 years ago

That's what I intended, and it'll be tab-delimited for readability.

It's a three-stage process. The first stage is complete: I browsed through all the words with Count=1. The next stage is to locate them, and the last is to examine the page images. It may take a while.

lb42 commented 7 years ago

You can locate them very quickly using "grep" or whatever the Windows equivalent is!

DavidHaslam commented 7 years ago

Well, it's easy enough just with having the concatenated [xml] file open in Notepad++ to search for anything. The search/replace UI is very powerful.
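
For a long list of suspect words, a small script can do the searching in one pass. This is a rough grep-like sketch that reports line numbers only; it assumes nothing about the XML structure, so turning a line number into a verse reference is left as a manual step. The function name and sample lines are hypothetical.

```python
import re


def locate(words, lines):
    """Map each word to the 1-based line numbers where it occurs whole."""
    patterns = {
        # Reject matches that sit inside a longer word or hyphenated form.
        w: re.compile(r"(?<![\w'-])" + re.escape(w) + r"(?![\w'-])")
        for w in words
    }
    hits = {w: [] for w in words}
    for lineno, line in enumerate(lines, start=1):
        for w, pattern in patterns.items():
            if pattern.search(line):
                hits[w].append(lineno)
    return hits


# Hypothetical lines standing in for the concatenated XML file.
sample = ["<ab>alpha Lordcame beta</ab>", "<ab>the Lord came</ab>"]
print(locate(["Lordcame", "Lord"], sample))
```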

DavidHaslam commented 7 years ago

In the original 1611 text of the KJV, there were only 12 words that contained the letter J.

All of them had the ij digraph.

Abijah
Abijam
Ahijah
Aijalon
Aijeleth
Baijth
Elijah
Hodijah
Iehouah-ijreh
Irijah
Tobijah
Urijah

The hyphenated one in Genesis 22:14 was the real surprise. Modern editions have "Jehovah-jireh".
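
The same observation gives a quick check for transcription errors: any word in the list that contains the letter J without the ij digraph is worth a second look. A minimal sketch with made-up function names:

```python
def words_with_j(words):
    """Words containing the letter j in either case, sorted."""
    return sorted(w for w in words if "j" in w.lower())


def j_without_ij(words):
    """Words containing j but not the ij digraph: likely modernised forms."""
    return [w for w in words_with_j(words) if "ij" not in w.lower()]


# A few words from this thread, including two modern spellings.
sample = ["Abijah", "Iehouah-ijreh", "Iesus", "Jesus", "Jehovah-jireh"]
print(j_without_ij(sample))
```

Here Jesus and Jehovah-jireh are flagged, while Abijah and Iehouah-ijreh pass.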

lb42 / KJV_1611

A counted words list to help with proof reading #15