openscriptures / morphhb

Open Scriptures Hebrew Bible
https://hb.openscriptures.org

Questions regarding the words list file Words.csv #29

Open DavidHaslam opened 7 years ago

DavidHaslam commented 7 years ago

I just examined the file Words.csv which has 64587 lines.

This CSV file has two columns: the first appears to be the frequency of occurrence, the second the actual Hebrew word (complete with diacritics). The file is sorted on column one in ascending order.

What's puzzling are the last 136 lines of the file, where the first column contains a single letter rather than a four-digit integer (padded with leading zeros where required).
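One way to inspect this split is to separate the rows by the shape of their first column. The following is only a minimal sketch: it assumes a plain two-column, comma-separated layout as described above, the function name `split_rows` is hypothetical, and the sample rows are made up for illustration rather than copied from Words.csv.

```python
import csv
import io

def split_rows(csv_text):
    """Separate rows with an all-digit first column from all other rows."""
    numeric, lettered = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        key, word = row[0].strip(), row[1].strip()
        if key.isdigit():
            numeric.append((key, word))
        else:
            lettered.append((key, word))
    return numeric, lettered

# Illustrative rows only (assumed layout): two numeric keys and one letter key.
sample = "0001,\u05D0\u05B8\u05D1\n0430,\u05D0\u05B1\u05DC\u05B9\u05D4\u05B4\u05D9\u05DD\nb,\u05D1\u05B0\u05BC\n"
numeric, lettered = split_rows(sample)
print(len(numeric), "numeric rows;", len(lettered), "letter-keyed rows")
```

Run against the real file, the length of the second list could be compared with the 136 lines mentioned above.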

A further question is prompted by a brief examination of the Hebrew words.

None of the words contains a maqaf, which means that there are no compound words (usually proper names) in the list. Here's a character frequency analysis of Words.csv obtained using BabelPad.

Words.csv.character.frequency.txt https://github.com/openscriptures/morphhb/files/748781/Words.csv.character.frequency.txt
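A comparable character frequency count can be sketched without BabelPad. This is an approximation of that kind of analysis, not a reproduction of the attached report; the function names are hypothetical, and the only fixed fact used is that the maqaf is U+05BE in Unicode.

```python
from collections import Counter
import unicodedata

MAQAF = "\u05BE"  # HEBREW PUNCTUATION MAQAF

def char_frequencies(text):
    """Count every non-whitespace character in the text."""
    return Counter(ch for ch in text if not ch.isspace())

def report(counts):
    """Format counts as tab-delimited lines: code point, Unicode name, count."""
    lines = []
    for ch, n in counts.most_common():
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append(f"U+{ord(ch):04X}\t{name}\t{n}")
    return lines

# Illustrative input: alef, maqaf, bet (not taken from Words.csv).
counts = char_frequencies("\u05D0\u05BE\u05D1")
print("\n".join(report(counts)))
print("contains maqaf:", counts[MAQAF] > 0)
```

Feeding the second column of Words.csv through `char_frequencies` and checking `counts[MAQAF]` would confirm whether the file really has no maqaf.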

When performing the statistical analysis of words found in any particular text, it's preferable to classify each compound word as an item to be counted.

DavidHaslam commented 7 years ago

FIO. Seeing as MapM is already a single XML file for the whole work, I just extracted and counted all the compound words found therein.

File format is tab delimited text (like CSV but without the quotation marks or commas).

MAPM.maqaf.words.count.txt https://github.com/openscriptures/morphhb/files/748889/MAPM.maqaf.words.count.txt
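The extraction described here can be sketched roughly as follows. It assumes the text has already been pulled out of the XML into a plain string with whitespace between tokens, which is a simplification; the function names are hypothetical, and this is not the script used to produce the attached file.

```python
from collections import Counter

MAQAF = "\u05BE"       # HEBREW PUNCTUATION MAQAF
SOF_PASUQ = "\u05C3"   # verse-final punctuation, stripped from token edges

def count_maqaf_tokens(text):
    """Count whitespace-delimited tokens that contain at least one maqaf."""
    tokens = (tok.strip(SOF_PASUQ) for tok in text.split())
    return Counter(tok for tok in tokens if MAQAF in tok)

def to_tsv(counts):
    """Render counts as tab-delimited lines, most frequent first."""
    return "\n".join(f"{tok}\t{n}" for tok, n in counts.most_common())

# Illustrative text: one maqaf-joined token appearing twice, one plain token.
text = "\u05D0\u05BE\u05D1 \u05D2 \u05D0\u05BE\u05D1\u05C3"
print(to_tsv(count_maqaf_tokens(text)))
```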

DavidHaslam commented 7 years ago

For Biblical Hebrew, this kind of analysis might be worthwhile repeating with the cantillation points first removed, leaving only the letters and vowel signs. The list of unique words would then be much shorter.
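Removing the cantillation points can be sketched with a single regular expression. The sketch assumes that the marks to drop are exactly the code points U+0591 to U+05AF, the range the Unicode Hebrew block gives to accents; vowel points, meteg (U+05BD) and maqaf (U+05BE) sit outside that range and are left alone. The function name is hypothetical.

```python
import re

# Assumed range: Hebrew accents (cantillation marks) at U+0591..U+05AF.
CANTILLATION = re.compile("[\u0591-\u05AF]")

def strip_cantillation(word):
    """Remove cantillation marks, keeping letters, vowel points and meteg."""
    return CANTILLATION.sub("", word)

# Two illustrative forms of one word that differ only in their accent.
forms = [
    "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05B4\u05C1\u0596\u05D9\u05EA",  # with tipcha
    "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05B4\u05C1\u0591\u05D9\u05EA",  # with etnachta
]
unique = {strip_cantillation(f) for f in forms}
print(len(forms), "accented forms ->", len(unique), "unique form")
```

Collapsing forms this way is what shortens the list of unique words.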

DavidTroidl commented 7 years ago

This list already has the cantillation stripped. The only extra mark remaining is the meteg. This list, along with the WlcWordList, was intended to record submitted parsing data for each form, to provide a suggestion list on the parsing website. However, it never made it into the database, for technical reasons. The lines at the end record different forms for the prepositional prefixes, by their letter codes. The numbers in the first column are Strong numbers for those forms, to help distinguish words of the same spelling.

DavidTroidl commented 7 years ago

The maqqef is a mark that does tie words together, but does not in general create "compound words". In the OSHB we separate out the maqqef into a <seg> element between the words.
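Because the maqqef sits in its own element, a reader of the XML can choose either view: each word counted separately, or maqqef-joined runs counted as one item. The sketch below assumes a simplified fragment with `<w>` elements and a `<seg type="x-maqqef">` element and no XML namespace; the real OSHB files carry a namespace and extra attributes, so the tag handling here is an assumption and would need adjusting. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed fragment: two words joined by a maqqef segment.
sample = (
    '<verse osisID="Gen.1.2">'
    '<w>\u05E2\u05B7\u05DC</w>'
    '<seg type="x-maqqef">\u05BE</seg>'
    '<w>\u05E4\u05B0\u05BC\u05E0\u05B5\u05D9</w>'
    '</verse>'
)

def words_and_joined(verse_xml):
    """Return (separate words, maqqef-joined runs) for one verse element."""
    root = ET.fromstring(verse_xml)
    words, joined = [], []
    glue = False  # True when the previous element was a maqqef segment
    for el in root:
        if el.tag == "w":
            text = el.text or ""
            words.append(text)
            if glue and joined:
                joined[-1] += text
            else:
                joined.append(text)
            glue = False
        elif el.tag == "seg" and el.get("type") == "x-maqqef":
            if joined:
                joined[-1] += el.text or ""
            glue = True
    return words, joined

words, joined = words_and_joined(sample)
print(len(words), "separate words;", len(joined), "joined item")
```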

DavidTroidl commented 7 years ago

See previous response.

DavidHaslam commented 7 years ago

I guess a lot depends on how precisely "compound words" is defined. As all these words containing at least one MAQAF are proper names (either people or places), the common-sense approach would be to think of them as complete words (albeit hyphenated in most translations) rather than as two or more words separated by a punctuation mark.

After all, if you wish to search for Beer–sheba, you'd expect to enter that as the search string rather than searching for the Boolean Beer AND sheba, even though the latter should work just as well.

FIO. The list of words in the KJV Sword module containing the EN DASH character (U+2013), which was used where the Hebrew has a MAQAF.

KJV.diatheke.endash.words.count.txt https://github.com/openscriptures/morphhb/files/750837/KJV.diatheke.endash.words.count.txt

NB. My database lists possessives separately.

DavidHaslam commented 7 years ago

Aside: I heartily approve of the fact that the OSHB has a special <seg ...> element for each MAQAF. The WLC text at tanach.us doesn't and is weaker on that account.

DavidTroidl commented 7 years ago

Maqqef is a very common symbol, used first in Genesis 1:2, joining two words that have nothing to do with proper names. Beer Sheba is an example of two separate names joined by a maqqef. Chedorlaomer, on the other hand, is mostly used as a single word, but once is divided by a maqqef. The maqqef seems to fall in more with the conjunctive accents, which have to do with the cadence of the verse as much as with semantic connections between words. A full study of the usage of the maqqef is certainly beyond the simple task of reporting the text of the Hebrew Bible, for use by students at various levels.

DavidHaslam commented 7 years ago

Interesting observations, especially about cadence.

Aside: I wonder how consistently the KJV translators used the 'hyphen' where the Hebrew has a MAQAF?

Unlikely they added a 'hyphen' where there's no MAQAF, but there must be many places where they didn't use a hyphen.

But that's outside the scope of this project.

eyaler commented 3 years ago

@DavidTroidl what is the source of this file?

DavidTroidl commented 3 years ago

There is an XSL Transform that extracts the words from the WordList.xml, listing the vowel form and the augmented Strong number. These together constitute a unique key.
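The unique-key claim can be checked mechanically once the rows are loaded. This is a small sketch under the assumption that each row is a (Strong number, vowel form) pair; the function name and the placeholder rows are hypothetical and not taken from the file.

```python
from collections import Counter

def duplicate_keys(rows):
    """Return (form, strong) pairs that occur more than once in the rows."""
    counts = Counter((form, strong) for strong, form in rows)
    return [key for key, n in counts.items() if n > 1]

# Placeholder rows: the same spelling under two Strong numbers stays distinct.
rows = [("0001", "formA"), ("0002", "formA"), ("0001", "formB")]
print(duplicate_keys(rows))  # an empty list means every key is unique
```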

eyaler commented 3 years ago

Thanks! AFAICT, the words at the end appear in the Bible but have no Strong number. How come? Is it correct to assume that Words.csv has all words in the Bible (including all inflections and prefix combinations)?