Comparing word frequency analyses with the HunKar diatheke plain text output of 2011-09-23

DavidHaslam commented 4 years ago

I had providentially retained my outputs of the earlier HunKar module done on 2011-09-23, so I was able to compare both old & new diatheke outputs as well as both the derived word frequency analyses.

The former inevitably reports many differences that are merely due to the systematic replacement of some punctuation marks, such as hyphen - by en dash –, viz.

U+002D  -   2,562   HYPHEN-MINUS

by

U+002D  -   2,461   HYPHEN-MINUS
U+2013  –   100 EN DASH

and also the following improvement:

U+0022  "   28  QUOTATION MARK
U+0027  '   4   APOSTROPHE

by

U+2019  ’   4   RIGHT SINGLE QUOTATION MARK
U+201D  ”   14  RIGHT DOUBLE QUOTATION MARK
U+201E  „   14  DOUBLE LOW-9 QUOTATION MARK

Setting these aside in order to focus on differences at word level still leaves a significant number of word-level differences!

These are best examined initially by the comparison of word frequency counts.
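For anyone who wants to reproduce such counts without WinMerge, here is a minimal sketch in Python. It assumes each line of the diatheke plain output has the shape "Book chapter:verse: text" (as in the verses quoted in this thread), and the tokenisation rule (runs of letters, optionally joined or ended by hyphens) is my own assumption, not necessarily what the original analysis used. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape of the diatheke plain output: "Book C:V: text".
VERSE_RE = re.compile(r"^(?P<ref>.+? \d+:\d+): (?P<text>.*)$")
# Assumption: a word is a run of letters, optionally joined or ended by hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*-?")


def word_frequencies(lines):
    """Count words in all lines that look like verses; other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = VERSE_RE.match(line.strip())
        if match:
            counts.update(WORD_RE.findall(match.group("text")))
    return counts


def changed_counts(old, new):
    """Return {word: (old_count, new_count)} for words whose counts differ.

    Both arguments are Counters, so a missing word counts as 0.
    """
    words = sorted(set(old) | set(new))
    return {w: (old[w], new[w]) for w in words if old[w] != new[w]}
```

Usage would be along the lines of `changed_counts(word_frequencies(open(old_path, encoding="utf-8")), word_frequencies(open(new_path, encoding="utf-8")))`, where both paths point to diatheke exports of the old and new module.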

I used WinMerge to generate a patch file; see the attached Zip file.

HunKar.diatheke.word.frequency.diff.zip

There are 2146 differences!

Many of these are places where the space between two words is now missing!

Each of these should be reviewed and fixed where necessary.
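As a rough aid for that review, the following Python sketch lists words that occur only in the new output and that can be split into two words known from the old output. This is a heuristic of my own, not part of the original analysis: it will report some legitimate compounds, and it will miss fused words whose parts never occurred separately. The function name and the `min_part` parameter are hypothetical.

```python
def fused_candidates(old_words, new_words, min_part=1):
    """Map each new-only word to the ways it splits into two old words."""
    old = set(old_words)
    found = {}
    for word in sorted(set(new_words) - old):
        splits = [
            (word[:i], word[i:])
            for i in range(min_part, len(word) - min_part + 1)
            if word[:i] in old and word[i:] in old
        ]
        if splits:
            found[word] = splits
    return found
```

Raising `min_part` to 2 or 3 should reduce noise from one-letter words such as "a".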

krisek commented 4 years ago

I see now that the low quality of this source (very weird, since they are the official publisher of the text) strikes back. I think I need to completely revise this activity: this source has section titles and references, but as we can see, its text quality is really low. Other sources have better text quality, but no section titles.

krisek commented 4 years ago

Can you re-run diatheke after the module rebuild? Hopefully it will look better this time.

DavidHaslam commented 4 years ago

Updated

@krisek

The attached Zip file contains the updated analyses for a module built from the latest XML file.

Analysis2.zip

I have also included the log output of osis2mod and the output of emptyvss.

NB. I have not yet repeated any comparisons.

krisek commented 4 years ago

Thanks a lot, it immediately pointed out a pretty huge bug: the last verse in every chapter was missing. I fixed it. What are the commands you run? osis2mod is clear, but I could do the rest too, so that you don't have to re-run for every update.

DavidHaslam commented 4 years ago

Unexpected characters include:

U+0060  `   4   GRAVE ACCENT
U+2022  •   3   BULLET

The former are in these verses:

Isaiah 28:23: Vegyétek füleitekbe és halljátok szavam`, figyeljetek és hallgassátok beszédem`!
Isaiah 32:9: Ti gondtalan asszonyok, keljetek fel, halljátok szavam`, és ti elbizakodott leányzók, vegyétek füleitekbe beszédem`!

The latter are in these verses:

I Kings 12:21: • És mikor megérkezett Roboám Jeruzsálembe, összegyűjté Júda egész házát és Benjámin nemzetségét, száznyolczvanezer válogatott hadra való férfiút, hogy hadakozzanak az 
Matthew 5:48: • Legyetek azért ti tökéletesek, miként a ti mennyei Atyátok tökéletes.   
James 4:8: • Közeledjetek az Istenhez, és közeledni fog hozzátok. Tisztítsátok meg kezeiteket, ti bűnösök, és szenteljétek meg szíveiteket ti kétszívűek.

@krisek Please review these locations.
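Locations like these can be found with a short script. The sketch below is in Python and assumes the diatheke plain output uses lines of the form "Book chapter:verse: text", as in the verses quoted above; the function names are hypothetical.

```python
import unicodedata


def locate_chars(lines, targets):
    """Return {char: [(reference, occurrences), ...]} for each target character."""
    hits = {ch: [] for ch in targets}
    for line in lines:
        # The first ": " separates the reference from the verse text.
        ref, sep, text = line.partition(": ")
        if not sep:
            continue  # not a verse line
        for ch in targets:
            n = text.count(ch)
            if n:
                hits[ch].append((ref, n))
    return hits


def describe(ch):
    """Format a character as code point, glyph, and Unicode name."""
    return f"U+{ord(ch):04X}\t{ch}\t{unicodedata.name(ch, 'UNKNOWN')}"
```

For the two characters above, the call would be `locate_chars(lines, "`•")`.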

DavidHaslam commented 4 years ago

The counts of left and right parentheses do not match:

U+0028  (   220 LEFT PARENTHESIS
U+0029  )   219 RIGHT PARENTHESIS

The unmatched locations need to be tracked down.
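One possible way to track them down is sketched below in Python. It assumes lines of the form "Book chapter:verse: text" and keeps a running stack across verses, because a parenthetical may legitimately span more than one verse. The function name is hypothetical, and the reported references are only starting points for manual review.

```python
def unmatched(lines, opener="(", closer=")"):
    """Return (unclosed, stray): references of openers never closed,
    and references of closers that had no opener before them."""
    open_refs = []  # references of openers still waiting for a closer
    stray = []      # references of closers with nothing left to close
    for line in lines:
        ref, sep, text = line.partition(": ")
        if not sep:
            continue  # not a verse line
        for ch in text:
            if ch == opener:
                open_refs.append(ref)
            elif ch == closer:
                if open_refs:
                    open_refs.pop()
                else:
                    stray.append(ref)
    return open_refs, stray
```

The same function can be pointed at other pairs, for example `unmatched(lines, "„", "”")` for the Hungarian double quotation marks.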

DavidHaslam commented 4 years ago

The counts of double quotation marks do not match:

U+201D  ”   12  RIGHT DOUBLE QUOTATION MARK
U+201E  „   13  DOUBLE LOW-9 QUOTATION MARK

The unmatched locations need to be tracked down.

DavidHaslam commented 4 years ago

Word frequency anomalies:

1   Ben-Hadad
1   Ben-Hadád

One is without the acute accent.

1   Benjamin
130 Benjámin

Ditto!

1   Beszéljetek
4   Beszéljétek

Ditto!

There are probably many more examples.
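A way to list further candidates is to group words that become identical once diacritics are removed. The Python sketch below does this with Unicode decomposition; it takes a mapping of word to count (such as the word frequency list) and the function names are hypothetical. Accented and unaccented vowels are distinct letters in Hungarian, so most groups will be legitimate pairs of different words; the output is only a list for human review.

```python
import unicodedata
from collections import defaultdict


def strip_marks(word):
    """Remove combining marks, e.g. 'Benjámin' -> 'Benjamin'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def accent_variants(counts):
    """Group words that differ only in diacritics; keep groups of 2 or more."""
    groups = defaultdict(dict)
    for word, n in counts.items():
        groups[strip_marks(word)][word] = n
    return {key: forms for key, forms in groups.items() if len(forms) > 1}
```

Sorting the groups so that those with a count of 1 on one side come first would put pairs like the ones above at the top.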

DavidHaslam commented 4 years ago

There are 36 words that end with a hyphen/minus.

1   Búza-
2   alsó-
6   arany-
1   atya-
3   be-
1   dob-
2   egy-
3   ezüst-
3   fa-
1   faolaj-
1   fel-
1   fige-
1   fiú-
1   gyapjú-
5   jobb-
1   jog-
5   ki-
2   kő-
1   mogyoró-
1   méreg-
1   nyár-
1   nőstény-
1   paizs-
3   réz-
1   szőlő-
1   szőlőtő-
1   trombita-
2   tulok-
1   tölgy-
1   tűz-
4   vas-
1   véres-
1   árpa-
7   égő-
2   ércz-
9   étel-

@krisek Check each location for possible missing spaces.
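To get the locations rather than just the counts, something like the following Python sketch could be used. It assumes lines of the form "Book chapter:verse: text" and treats a word as "ending with a hyphen" when the hyphen is not immediately followed by another letter; both the pattern and the function name are my own assumptions.

```python
import re

# Letters, then a hyphen that is NOT followed by another letter.
TRAILING_HYPHEN = re.compile(r"[^\W\d_]+-(?![^\W\d_])")


def trailing_hyphen_words(lines, context=12):
    """Return (reference, word, following text) for each word ending in '-'."""
    hits = []
    for line in lines:
        ref, sep, text = line.partition(": ")
        if not sep:
            continue  # not a verse line
        for m in TRAILING_HYPHEN.finditer(text):
            hits.append((ref, m.group(0), text[m.end():m.end() + context]))
    return hits
```

The few characters of following text make it quicker to judge each case by eye.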

DavidHaslam commented 4 years ago

The Sword utilities come bundled with Xiphos. My usual procedure is to run the following Windows CMD file called ExportMod.cmd from a subdirectory.

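@echo off
rem Analyse a SWORD module; %1 is the module name, e.g. HunKar
rem The folder ..\Export\%1 must already exist
rem Plain text export of the whole Bible (Gen-Rev)
..\xiphos\diatheke -b %1 -f plain -k "Gen-Rev" >..\Export\%1\%1.diatheke.txt
rem Raw module content in IMP format
..\xiphos\mod2imp %1                           >..\Export\%1\%1.raw.imp.txt
rem List of empty verses
..\xiphos\emptyvss %1                          >..\Export\%1\%1.emptyvss.txt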

Parameter %1 is the first command line parameter, thus:

ExportMod HunKar

generates all 3 output files in a suitable Export folder, one that I create manually beforehand in Windows Explorer.

Notes:

In my Sword path, I have a symbolic link to the Xiphos bin path, made ages ago using the Windows mklink command.
I also have subst drive S: mapped to my Sword path.
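For anyone not on Windows, the same three exports could be driven from Python. This is only a sketch: the command-line arguments are copied from the CMD file above, while `tool_dir`, `export_root`, and both function names are hypothetical, and it assumes the three utilities are installed under those names in `tool_dir`.

```python
import subprocess
from pathlib import Path


def export_commands(module, tool_dir, export_root):
    """Build (argv, output file) pairs mirroring ExportMod.cmd."""
    tools = Path(tool_dir)
    out = Path(export_root) / module
    return [
        ([str(tools / "diatheke"), "-b", module, "-f", "plain", "-k", "Gen-Rev"],
         out / f"{module}.diatheke.txt"),
        ([str(tools / "mod2imp"), module], out / f"{module}.raw.imp.txt"),
        ([str(tools / "emptyvss"), module], out / f"{module}.emptyvss.txt"),
    ]


def export_module(module, tool_dir, export_root):
    """Run each tool and redirect its standard output to the target file."""
    for argv, target in export_commands(module, tool_dir, export_root):
        # Unlike the CMD file, create the export folder if it is missing.
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(target, "wb") as fh:
            subprocess.run(argv, stdout=fh, check=True)
```

A call such as `export_module("HunKar", "../xiphos", "../Export")` would then correspond to `ExportMod HunKar`.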

DavidHaslam commented 4 years ago

@krisek

Further to your repair for the last verse in each chapter...

The Analysis2.zip file has been updated and replaced in the earlier comment.

krisek commented 4 years ago

U+0060    `   4   GRAVE ACCENT
U+2022    •   3   BULLET

fixed in 0421aa8

The counts of left and right parentheses do not match:
U+0028    (   220 LEFT PARENTHESIS
U+0029    )   219 RIGHT PARENTHESIS

fixed in a6a7ef9

The counts of double quotation marks do not match:
U+201D    ”   12  RIGHT DOUBLE QUOTATION MARK
U+201E    „   13  DOUBLE LOW-9 QUOTATION MARK

fixed in c532a5e

krisek commented 4 years ago

1 Benjamin 130 Benjámin

It is like this in all the online sources I reviewed. I think it reflects that the Greek (New Testament) and Hebrew (Old Testament) written forms might differ.

Confirmed in printed version too:

[image: benjamin]

krisek commented 4 years ago

1 Beszéljetek 4 Beszéljétek

These are two different forms of the same word (formed through agglutination) with different meanings. It's okay as it is.

krisek commented 4 years ago

There are 36 words that end with a hyphen/minus.
1 Búza-
2 alsó-
6 arany-

These are valid forms in Hungarian (used in lists).

example

És vőn Jákób zöld nyár-, mogyoró- és gesztenye-vesszőket, és meghántá azokat fehéresen csíkosra, hogy látható legyen a vesszők fehére.

Here the meaning is "nyárvesszőket, mogyoróvesszőket és gesztenyevesszőket" (branches of poplar, almond, and chestnut trees), but the form used in the text is the correct one: only the first parts of the compounds are listed, and the shared last part is written once.

Besides these, there are many other places where we use hyphens: in lists (at most 2 elements), to question a specific word in a sentence (by adding the -e suffix), and so on.

krisek commented 4 years ago

1 Ben-Hadad 1 Ben-Hadád

This is as per the printed version.

[image: ben-hadad]

krisek commented 4 years ago

I think I have fixed everything in this issue now; let's open a dedicated issue if anything is left.

krisek / HunKar, issue #5