matgrioni / betacode

A small python package to flexibly convert from betacode to unicode and back.
MIT License

Not all betacode signs and combinations implemented. #14

Open kristoffer-paulsson opened 2 years ago

kristoffer-paulsson commented 2 years ago

Hi, I recently wrote version 0.1 of perseus-converter. Using my own converter, I successfully exported the whole Perseus Digital Library to UTF-8 text files, normalized and decomposed.

I noticed that not all betacode is properly restored. For an example, please look at rows 4, 9, 14, 176 and 177 of https://github.com/kristoffer-paulsson/koine-corpora/blob/main/koine/_elegy-and-iambus-volume-ii.txt. Could you please consider implementing the combinations that may be missing?

Maybe there are also missing implementations described in https://en.wikipedia.org/wiki/Beta_Code

kristoffer-paulsson commented 2 years ago

After converting my full corpora from betacode and normalizing them ("NFD"), I ran character statistics and got the following counts:

0x20     SPACE              6972952
0x21  !  EXCLAMATION MARK   197
0x22  "  QUOTATION MARK     6121
0x23  #  NUMBER SIGN        112
0x24  $  DOLLAR SIGN        1
0x25  %  PERCENT SIGN       101
0x26  &  AMPERSAND          1730
0x28  (  LEFT PARENTHESIS   3251
0x29  )  RIGHT PARENTHESIS  5513

It seems that j, J, v, V, ?, & and # could have better support; many occurrences of them are left unconverted. Well done so far, but perfection is needed here.
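A count like the one above can be reproduced with the Python standard library alone. The following is a minimal sketch, not the script that was actually used: the function names `char_stats` and `format_stats` and the `ascii_only` flag are hypothetical. It decomposes the text with NFD, counts every code point, and can restrict the report to ASCII characters, which in converted Greek text are the likely leftovers of unconverted betacode.

```python
import unicodedata
from collections import Counter


def char_stats(text, ascii_only=False):
    """Count code points in `text` after NFD normalization.

    With ascii_only=True, keep only code points below 0x80, i.e. the
    characters that may be unconverted betacode in Greek output.
    """
    decomposed = unicodedata.normalize("NFD", text)
    counts = Counter(decomposed)
    if ascii_only:
        counts = Counter({ch: n for ch, n in counts.items() if ord(ch) < 0x80})
    return counts


def format_stats(counts):
    """Yield 'char 0xNN NAME count' rows sorted by code point."""
    for ch, n in sorted(counts.items()):
        # unicodedata.name() has no name for control characters,
        # so a placeholder is supplied as the default.
        name = unicodedata.name(ch, "<unnamed>")
        yield f"{ch} {ord(ch):#04x} {name} {n}"


if __name__ == "__main__":
    # Hypothetical sample: a Greek word followed by leftover ASCII signs.
    sample = "\u1f04\u03b5\u03b9\u03b4\u03b5 (!) #"
    for row in format_stats(char_stats(sample, ascii_only=True)):
        print(row)
```

Running the same function over each output file separately, rather than over the whole corpus, would also show which works contain the suspicious characters.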

matgrioni commented 2 years ago

Thanks for the issue.

I'm not surprised that there are some combinations missing, as it is hard to get an exhaustive list. Let me take a look and try to resolve some of these.

Completely agree that perfection is needed here!

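Here is what I see as the initial issues from your comment:

  • 'j' is completely unsupported right now. Support should be easy to add.
  • 'v' and '*v' are also completely unsupported.
  • There's no support for '?'. I'll have to add that in too. It's a combining character, so it's just more work to look up all the characters it can legally combine with.
  • No support for '#' characters.
  • No '%' support. These are apparently escape characters.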

Just to be clear, there's no casing distinction in this library for input. So J and j are treated identically (there's no '*j' or 'J'), and by the same token there's no 'V'.

There may be more issues but these are easy to start with.
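For combining signs such as '?' (dot below), one possible shortcut is to emit the Unicode combining character and let NFC normalization compose whatever precomposed forms exist, instead of listing every legal combination by hand. The sketch below illustrates that idea only. It is not how this library is implemented, the names `COMBINING` and `attach_diacritics` are hypothetical, and the table covers just the Beta Code diacritics whose Unicode equivalents are unambiguous.

```python
import unicodedata

# Beta Code diacritic sign -> Unicode combining character.
# Partial table, for illustration only.
COMBINING = {
    ")": "\u0313",   # smooth breathing (COMBINING COMMA ABOVE)
    "(": "\u0314",   # rough breathing (COMBINING REVERSED COMMA ABOVE)
    "/": "\u0301",   # acute accent
    "\\": "\u0300",  # grave accent
    "=": "\u0342",   # circumflex (COMBINING GREEK PERISPOMENI)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript (COMBINING GREEK YPOGEGRAMMENI)
    "&": "\u0304",   # macron
    "?": "\u0323",   # dot below
}


def attach_diacritics(base, marks):
    """Append the combining forms of `marks` to `base` and compose with NFC.

    `base` is an already converted Greek letter; `marks` is the run of
    Beta Code diacritic signs that followed it in the input.
    Raises KeyError for a sign that is not in COMBINING.
    """
    decomposed = base + "".join(COMBINING[m] for m in marks)
    # NFC reorders the marks canonically and composes what it can;
    # marks with no precomposed form stay as combining characters.
    return unicodedata.normalize("NFC", decomposed)


if __name__ == "__main__":
    alpha = "\u03b1"
    for marks in (")/", "?", ")?"):
        result = attach_diacritics(alpha, marks)
        print(marks, [f"U+{ord(ch):04X}" for ch in result])
```

The trade-off is that output for rare combinations contains combining characters rather than single precomposed code points, which is fine for NFD-normalized corpora but may matter to consumers that expect precomposed text.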

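Some of the ones I don't see any immediate issues with but will have to investigate (or more examples would be helpful):

  • '&' has some support. Maybe I'm missing some macron combinations.
  • '!' is weird to see in the output.
  • There are a lot of parens in the output; that seems fishy.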

If you are looking to convert this much real text, we'll probably also need some more back and forth on this if you want high-quality output. Please let me know if you'd like to spin up an email or gitter chat to make this easier.
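To collect concrete examples for the cases above, the converted files can be scanned for ASCII characters that should not survive conversion, reporting each hit with its position and some context. This is a rough sketch under assumptions: the set of suspect characters is a guess based on the signs discussed in this thread, and `SUSPECT` and `find_leftovers` are hypothetical names, not part of this package.

```python
import re

# ASCII letters and signs that probably indicate unconverted betacode
# when found in Greek output. Adjust the set to the corpus at hand.
SUSPECT = re.compile(r"[A-Za-z!#$%&()?]")


def find_leftovers(lines, context=10):
    """Yield (line_number, column, char, snippet) for each suspect character.

    Line numbers and columns are 1-based; `snippet` is the match with up
    to `context` characters on either side.
    """
    for lineno, line in enumerate(lines, start=1):
        for match in SUSPECT.finditer(line):
            start = max(match.start() - context, 0)
            end = match.end() + context
            yield lineno, match.start() + 1, match.group(), line[start:end]


if __name__ == "__main__":
    # Hypothetical sample lines; real use would iterate over a file.
    sample = ["abc", "(x)"]
    for hit in find_leftovers(sample, context=2):
        print(hit)
```

Pairing each hit with the corresponding betacode source line would make it easier to tell a missing combination apart from punctuation that is legitimately present in the text.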

kristoffer-paulsson commented 2 years ago

Let's make things easier by spinning up a chat; more people could perhaps get involved over time.

On 9 October 2022 at 22:47:36 +02:00, Matias Grioni wrote:

Thanks for the issue. I'm not surprised that there are some combinations missing, as it is hard to get an exhaustive list. Let me take a look and try to resolve some of these. Completely agree that perfection is needed here!

Here is what I see as the initial issues from your comment:

  • 'j' is completely unsupported right now. Support should be easy to add.
  • 'v' and '*v' are also completely unsupported.
  • There's no support for '?'. I'll have to add that in too. It's a combining character, so it's just more work to look up all the characters it can legally combine with.
  • No support for '#' characters.
  • No '%' support. These are apparently escape characters.

Just to be clear, there's no casing distinction in this library for input. So J and j are treated identically (there's no '*j' or 'J'), and by the same token there's no 'V'. There may be more issues but these are easy to start with.

Some of the ones I don't see any immediate issues with but will have to investigate (or more examples would be helpful):

  • '&' has some support. Maybe I'm missing some macron combinations.
  • '!' is weird to see in the output.
  • There are a lot of parens in the output; that seems fishy.

If you are looking to convert this much real text, we'll probably also need some more back and forth on this if you want high-quality output. Please let me know if you'd like to spin up an email or gitter chat to make this easier.


matgrioni commented 2 years ago

Sure. You can join the betacode room here and we can discuss there.