Open kristoffer-paulsson opened 2 years ago
After betacoding and normalizing ("NFD") my full corpus, I ran character statistics on it and got these results:
  0x20 SPACE 6972952
! 0x21 EXCLAMATION MARK 197
" 0x22 QUOTATION MARK 6121
$ 0x24 DOLLAR SIGN 1
% 0x25 PERCENT SIGN 101
& 0x26 AMPERSAND 1730
( 0x28 LEFT PARENTHESIS 3251
) 0x29 RIGHT PARENTHESIS 5513
> 0x3e GREATER-THAN SIGN 2811
? 0x3f QUESTION MARK 513
@ 0x40 COMMERCIAL AT 3
J 0x4a LATIN CAPITAL LETTER J 146
V 0x56 LATIN CAPITAL LETTER V 182
[ 0x5b LEFT SQUARE BRACKET 6999
\ 0x5c REVERSE SOLIDUS 35
] 0x5d RIGHT SQUARE BRACKET 6900
^ 0x5e CIRCUMFLEX ACCENT 847
` 0x60 GRAVE ACCENT 39
a 0x61 LATIN SMALL LETTER A 2
e 0x65 LATIN SMALL LETTER E 2
i 0x69 LATIN SMALL LETTER I 5
j 0x6a LATIN SMALL LETTER J 41
o 0x6f LATIN SMALL LETTER O 1
u 0x75 LATIN SMALL LETTER U 4
v 0x76 LATIN SMALL LETTER V 727
{ 0x7b LEFT CURLY BRACKET 1
| 0x7c VERTICAL LINE 491
} 0x7d RIGHT CURLY BRACKET 2
~ 0x7e TILDE 1
¨ 0xa8 DIAERESIS 1
« 0xab LEFT-POINTING DOUBLE ANGLE QUOTATION MARK 3
¯ 0xaf MACRON 338
´ 0xb4 ACUTE ACCENT 1
· 0xb7 MIDDLE DOT 135045
» 0xbb RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK 3
½ 0xbd VULGAR FRACTION ONE HALF 1
× 0xd7 MULTIPLICATION SIGN 18
ʹ 0x2b9 MODIFIER LETTER PRIME 1
˘ 0x2d8 BREVE 221
˙ 0x2d9 DOT ABOVE 1
̀ 0x300 COMBINING GRAVE ACCENT 2105332
́ 0x301 COMBINING ACUTE ACCENT 3446570
̄ 0x304 COMBINING MACRON 8359
̆ 0x306 COMBINING BREVE 114
̈ 0x308 COMBINING DIAERESIS 12189
̓ 0x313 COMBINING COMMA ABOVE 2232978
̔ 0x314 COMBINING REVERSED COMMA ABOVE 854795
͂ 0x342 COMBINING GREEK PERISPOMENI 1494888
ͅ 0x345 COMBINING GREEK YPOGEGRAMMENI 313026
͵ 0x375 GREEK LOWER NUMERAL SIGN 5
Α 0x391 GREEK CAPITAL LETTER ALPHA 90672
Β 0x392 GREEK CAPITAL LETTER BETA 17410
Γ 0x393 GREEK CAPITAL LETTER GAMMA 14794
Δ 0x394 GREEK CAPITAL LETTER DELTA 29999
Ε 0x395 GREEK CAPITAL LETTER EPSILON 38267
Ζ 0x396 GREEK CAPITAL LETTER ZETA 8129
Η 0x397 GREEK CAPITAL LETTER ETA 14171
Θ 0x398 GREEK CAPITAL LETTER THETA 14500
Ι 0x399 GREEK CAPITAL LETTER IOTA 21623
Κ 0x39a GREEK CAPITAL LETTER KAPPA 41941
Λ 0x39b GREEK CAPITAL LETTER LAMDA 21959
Μ 0x39c GREEK CAPITAL LETTER MU 26943
Ν 0x39d GREEK CAPITAL LETTER NU 9204
Ξ 0x39e GREEK CAPITAL LETTER XI 2444
Ο 0x39f GREEK CAPITAL LETTER OMICRON 16269
Π 0x3a0 GREEK CAPITAL LETTER PI 38122
Ρ 0x3a1 GREEK CAPITAL LETTER RHO 13761
Σ 0x3a3 GREEK CAPITAL LETTER SIGMA 29499
Τ 0x3a4 GREEK CAPITAL LETTER TAU 18872
Υ 0x3a5 GREEK CAPITAL LETTER UPSILON 2684
Φ 0x3a6 GREEK CAPITAL LETTER PHI 14899
Χ 0x3a7 GREEK CAPITAL LETTER CHI 5537
Ψ 0x3a8 GREEK CAPITAL LETTER PSI 476
Ω 0x3a9 GREEK CAPITAL LETTER OMEGA 2031
α 0x3b1 GREEK SMALL LETTER ALPHA 4394439
β 0x3b2 GREEK SMALL LETTER BETA 220648
γ 0x3b3 GREEK SMALL LETTER GAMMA 613742
δ 0x3b4 GREEK SMALL LETTER DELTA 979886
ε 0x3b5 GREEK SMALL LETTER EPSILON 3902672
ζ 0x3b6 GREEK SMALL LETTER ZETA 77105
η 0x3b7 GREEK SMALL LETTER ETA 1476161
θ 0x3b8 GREEK SMALL LETTER THETA 524929
ι 0x3b9 GREEK SMALL LETTER IOTA 3872677
κ 0x3ba GREEK SMALL LETTER KAPPA 1324722
λ 0x3bb GREEK SMALL LETTER LAMDA 1203541
μ 0x3bc GREEK SMALL LETTER MU 1210710
ν 0x3bd GREEK SMALL LETTER NU 3624622
ξ 0x3be GREEK SMALL LETTER XI 152161
ο 0x3bf GREEK SMALL LETTER OMICRON 3915347
π 0x3c0 GREEK SMALL LETTER PI 1373014
ρ 0x3c1 GREEK SMALL LETTER RHO 1551043
ς 0x3c2 GREEK SMALL LETTER FINAL SIGMA 1665989
σ 0x3c3 GREEK SMALL LETTER SIGMA 1332228
τ 0x3c4 GREEK SMALL LETTER TAU 3087762
υ 0x3c5 GREEK SMALL LETTER UPSILON 1909859
φ 0x3c6 GREEK SMALL LETTER PHI 332022
χ 0x3c7 GREEK SMALL LETTER CHI 371646
ψ 0x3c8 GREEK SMALL LETTER PSI 52316
ω 0x3c9 GREEK SMALL LETTER OMEGA 1258769
ϲ 0x3f2 GREEK LUNATE SIGMA SYMBOL 8
Ϲ 0x3f9 GREEK CAPITAL LUNATE SIGMA SYMBOL 27
0x1f5c N/a 1
᾽ 0x1fbd GREEK KORONIS 1
῀ 0x1fc0 GREEK PERISPOMENI 4
‐ 0x2010 HYPHEN 1682
— 0x2014 EM DASH 6568
‘ 0x2018 LEFT SINGLE QUOTATION MARK 739
’ 0x2019 RIGHT SINGLE QUOTATION MARK 243630
“ 0x201c LEFT DOUBLE QUOTATION MARK 3493
” 0x201d RIGHT DOUBLE QUOTATION MARK 3178
† 0x2020 DAGGER 830
… 0x2026 HORIZONTAL ELLIPSIS 10
⊤ 0x22a4 DOWN TACK 222
⌎ 0x230e TOP RIGHT CROP 1097
⌏ 0x230f TOP LEFT CROP 1096
〈 0x3008 LEFT ANGLE BRACKET 3
〉 0x3009 RIGHT ANGLE BRACKET 3
� 0xfffd REPLACEMENT CHARACTER 9
It seems that j, J, v, V, ?, &, # could have better support; many occurrences of them are left unconverted. Well done so far, but perfection is needed.
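For anyone who wants to reproduce statistics like these, here is a minimal sketch of how such a count might be produced. It is not the script used for the numbers above; the function names `char_stats` and `format_stats` are made up for illustration, and only Python's standard `unicodedata` and `collections` modules are used.

```python
import unicodedata
from collections import Counter


def char_stats(text):
    """Count code points in `text` after NFD (decomposed) normalization."""
    return Counter(unicodedata.normalize("NFD", text))


def format_stats(counts):
    """Return one 'glyph 0xHEX NAME count' row per code point, sorted by code point."""
    rows = []
    for ch, n in sorted(counts.items()):
        # Unassigned code points have no name, so fall back to "N/a".
        name = unicodedata.name(ch, "N/a")
        rows.append(f"{ch} {hex(ord(ch))} {name} {n}")
    return rows


if __name__ == "__main__":
    # Hypothetical sample text; a real run would read the corpus files instead.
    sample = "μῆνιν ἄειδε θεά"
    for row in format_stats(char_stats(sample)):
        print(row)
```

Because the text is decomposed first, accents and breathings are counted as separate combining code points rather than as part of precomposed letters, which matches the shape of the table above.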
Thanks for the issue.
I'm not surprised that there are some combinations missing, as it is hard to get an exhaustive list. Let me take a look and try to resolve some of these.
Completely agree that perfection is needed here!
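Here is what I see as initial issues from your comment:
- 'j' is completely unsupported right now. Support should be easy to add.
- 'v' and '*v' are also completely unsupported.
- There's no support for '?'. I'll have to add that in too. It's a combining character, so it's just more work to look up all the characters it can legally combine with.
- No support for '#' characters.
- No '%' support. These are apparently escape characters.
Just to be clear, there's no casing distinction in this library for input. So J and j are treated identically (there's no '*j' or 'J'), and by the same token there's no 'V'.
There may be more issues but these are easy to start with.
Some of the ones I don't see any immediate issues with but will have to investigate (or more examples would be helpful):
- '&' has some support. Maybe I'm missing some macron combinations.
- '!' is weird to see in the output.
- There's a lot of parens in the output, that seems fishy.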
If you are looking to convert this much real text at high quality, we'll probably need some more back and forth on this. Please let me know if you'd like to spin up an email or Gitter chat to make this easier.
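Since more examples would help, one way to collect them might be to scan the converted output for lines that still contain raw Beta Code symbols. The sketch below is only an illustration: `find_leftovers` and the `SUSPECTS` set are hypothetical names, and the symbol list is an assumption based on the characters mentioned in this thread.

```python
# Assumption: symbols from this thread that should not survive conversion.
SUSPECTS = frozenset("jJvV?&#%!")


def find_leftovers(lines, suspects=SUSPECTS):
    """Yield (line_number, leftover_symbols, line) for lines with suspect symbols.

    Line numbers start at 1 so they can be quoted directly in a bug report.
    """
    for number, line in enumerate(lines, start=1):
        found = sorted(set(line) & suspects)
        if found:
            yield number, found, line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical converted output; a real run would iterate over a file object.
    converted = ["λόγος\n", "vαι #1\n"]
    for number, found, line in find_leftovers(converted):
        print(number, "".join(found), line)
```

Passing an open text file works as well, since iterating over a file yields its lines.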
Let's make things easier by spinning up a chat; more people could perhaps get involved over time.
On 9 October 2022 at 22:47:36 +02:00, Matias Grioni wrote:
Thanks for the issue.
I'm not surprised that there are some combinations missing, as it is hard to get an exhaustive list. Let me take a look and try to resolve some of these.
Completely agree that perfection is needed here!
Here is what I see as initial issues from your comment:
- 'j' is completely unsupported right now. Support should be easy to add.
- 'v' and '*v' are also completely unsupported.
- There's no support for '?'. I'll have to add that in too. It's a combining character so just more work to look up all the characters it can combine with legally.
- No support for '#' characters.
- No '%' support. These are apparently escape characters.
Just to be clear there's no casing distinction in this library for input. So J and j are treated identically (there's no '*j' or 'J'), and by the same token there's no 'V'.
There may be more issues but these are easy to start with.
Some of the ones I don't see any immediate issues with but will have to investigate (or more examples would be helpful):
- '&' has some support. Maybe I'm missing some macron combinations.
- '!' is weird to see in the output.
- There's a lot of parens in the output, that seems fishy.
If you are looking to convert so much real text we'll probably also need some more back and forth on this if you want high quality. Please let me know if you'd like to spin up an email or gitter chat to make this easier.
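On the point about '?' being a combining character: if the output is kept decomposed (NFD), as in the corpus described in this issue, it may be enough to append the combining mark after the base letter and let normalization put the marks in canonical order, rather than enumerating every legal combination. A minimal sketch, assuming '?' corresponds to U+0323 COMBINING DOT BELOW (an assumption to check against the Beta Code specification) and using a hypothetical helper name `add_underdot`:

```python
import unicodedata

# Assumption: Beta Code '?' marks an underdot, i.e. U+0323 COMBINING DOT BELOW.
DOT_BELOW = "\u0323"


def add_underdot(base):
    """Append a combining dot below to `base` and return the NFD form.

    NFD normalization reorders combining marks canonically, so the dot
    below ends up before any marks that sit above the letter.
    """
    return unicodedata.normalize("NFD", base + DOT_BELOW)


if __name__ == "__main__":
    dotted = add_underdot("ά")  # alpha with acute, then underdot
    print([hex(ord(ch)) for ch in dotted])
```

Whether this fits the library depends on how it builds its output; if it emits precomposed characters from a lookup table, the table approach described above would still be needed.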
Hi, I recently wrote version 0.1 of perseus-converter. Using my own converter, I successfully exported the whole Perseus Digital Library to UTF-8 text files, normalized and decomposed.
I noticed that not all betacode is properly restored; please look at https://github.com/kristoffer-paulsson/koine-corpora/blob/main/koine/_elegy-and-iambus-volume-ii.txt on rows 4, 9, 14, 176 and 177 for an example. Could you please consider implementing the combinations that may be missing?
There may also be features described in https://en.wikipedia.org/wiki/Beta_Code that are not implemented yet.