Improved handling of locales & invariant

bclothier commented 3 years ago

Is your feature request related to a problem? Please describe. I've always found VB*'s support for locales to be weak, with a bolted-on feeling. Currently we are limited to:

Setting letter-case sensitivity with Option Compare Text and Option Compare Binary
In Access databases only, use whatever collations the host file is using with Option Compare Database
Using StrConv and undocumented methods for comparing in a different locale.

The Option Compare statement does not allow us to use different locales or to handle different aspects of comparison (e.g. we can specify that we want a case-insensitive comparison, but we cannot specify that we want an accent-insensitive one). Furthermore, as revealed in twinbasic/twinbasic#193 , the Like operator doesn't handle locales. Finally, there is a tendency to assume that the Windows settings should govern formatting, which creates more problems when building an application meant to be used in different locales.

Describe the solution you'd like We should be able to specify at the module level whether we want to use a specific locale or more broadly control those elements:

Lettercase sensitivity (we already have that with the Option Compare [Text | Binary])
Accent sensitivity
Kana sensitivity (applies to Japan locales)
Width sensitivity (applies to East Asia locales for non-Unicode code pages, I think)

As a possible example we could allow the syntax:

Option Compare Text Case Sensitive, Accent Insensitive

(only allowed on Text, not on Binary)

For a good example of different collations, this SQL Server article on the collation is a good start.

Finally, we should allow for invariant locale and be able to use it as the default to provide the best experience across different computers. Here's a page from .NET discussing the invariant culture. For a brand new project, that will usually greatly simplify the handling of strings and formats within the code and allow it to work independently of the Windows settings.

Several internal VB* functions also have implicit LCID parameters that are not normally exposed, which can make it hard to access the i18n features that already exist in the standard library. In the case of the Like operator, this could be addressed by introducing a Collate keyword, similar to the example from the SQL Server page linked above:

If "abc" Like "def" Collate 1033 Then
ElseIf "abc" Like "def" Collate 1035 Then
ElseIf "abc" Like "def" Collate Invariant Then
End If

Note: I'm not a fan of using magical numbers for locale. Would be better as an enum or something like that.

Describe alternatives you've considered While we can work around this by using StrConv, this is not available with all operators -- Like operator is an example of this. Furthermore, Option Compare Database is special cased for Access VBA project and is not available to other project types and still doesn't cover the case where there are multiple locales involved or whether there's a need to treat it invariantly.

Kr00l commented 3 years ago

Note that UCase$ and LCase$ are invariant too. The only way to upper/lower case with a locale is via StrConv (vbUpperCase, vbLowerCase). But as you pointed out, there is no alternative for Like. So, like the Like operator, the UCase$ and LCase$ functions should also be affected by this potential new module-wide syntax.

Edit: and of course all operators, like =, <> etc.
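
A minimal VBx sketch of the difference described above: UCase$ takes no locale argument, while StrConv accepts an optional LCID as its third parameter. The Turkish LCID (1055) and the sample word are illustrative choices, not taken from the thread, and no output is shown because the result of the StrConv call depends on how the runtime applies that locale's casing rules.

```vb
Option Explicit

Sub CompareUpperCasing()
    Const LCID_TURKISH As Long = 1055   ' tr-TR; illustrative choice only
    Dim s As String
    s = "istanbul"

    ' UCase$ has no LCID parameter, so the caller cannot pick a locale.
    Debug.Print UCase$(s)

    ' StrConv takes an optional LCID as its third argument.
    Debug.Print StrConv(s, vbUpperCase, LCID_TURKISH)
End Sub
```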

Kr00l commented 3 years ago

I don't want to open a new issue as I think it fits well here.

All the C* conversion functions in VBA are LOCALE_VARIANT. (e.g. CDec())

The only LOCALE_INVARIANT functions are Str() and Val(), which can be useful sometimes. However, Val() returns a Double and is therefore not good for long strings.

Maybe it would be useful to include ValDec() function which converts a String to Decimal, but as LOCALE_INVARIANT.
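
A short sketch of the contrast described above, using only standard VBx functions. The assertions hold on any machine because Val and Str always use "." as the decimal separator; the printed lines are left without expected output because they follow the regional settings.

```vb
Option Explicit

Sub InvariantVersusLocaleAware()
    ' Val and Str ignore the regional settings.
    Debug.Assert Val("1.5") = 1.5
    Debug.Assert Val("1,5") = 1        ' parsing stops at the comma
    Debug.Assert Str(1.5) = " 1.5"     ' leading space is reserved for the sign

    ' CStr and CDbl follow the regional settings, so these two lines can
    ' give different results (or a run-time error) on different machines.
    Debug.Print CStr(1.5)
    On Error Resume Next
    Debug.Print CDbl("1,5")
    On Error GoTo 0
End Sub
```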

WaynePhillipsEA commented 3 years ago

However, Val() returns a Double and is therefore not good for long strings. Maybe it would be useful to include ValDec() function which converts a String to Decimal, but as LOCALE_INVARIANT.

That's a good point and a good suggestion.

Kr00l commented 3 years ago

However, Val() returns a Double and is therefore not good for long strings. Maybe it would be useful to include ValDec() function which converts a String to Decimal, but as LOCALE_INVARIANT.

That's a good point and a good suggestion.

The "nicest" way would be a Val() function with the decimal data type hint symbol. In VB.Net it is @ for Decimal. So, Val() would be Double and Val@() would be Decimal. Thus not wasting/shadowing another name. (like Str() and Str$()) The issue is here that @ is reserved in VBx for Currency. I pointed this out in twinbasic/lang-design#8

Edit: It's not possible to have different data types for same function name. Oops. So, ValDec() would be then a solution.

wqweto commented 3 years ago

Wouldn’t VarDec be a simple wrapper around VariantChangeType?

I mean it must be a one-liner to already implement outside built-in functionality.

WaynePhillipsEA commented 3 years ago

Wouldn’t VarDec be a simple wrapper around VariantChangeType?

I mean it must be a one-liner to already implement outside built-in functionality.

Did you mean the proposed ValDec (not VarDec)?

The OLE / variant conversion functions are not suitable for Val / ValDec as they allow for things like CDec("(5)") giving -5, whereas Val() does not allow that. The correct implementation for Val (and potentially ValDec) is VarParseNumFromStr and VarNumFromParseNum, in order to match VBx exactly, which is what tB does.

Actually I've just noticed that the tB implementation of Val is throwing an error in some cases where it should be returning zero - I'll get that fixed.
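
A small sketch of the difference in accepted input, assuming stock VBx behaviour: Val reads only a leading numeric prefix, while the OLE-based CDec also accepts accounting-style parentheses, as noted above.

```vb
Option Explicit

Sub ValVersusCDec()
    ' Val returns 0 when the string does not start with a number.
    Debug.Assert Val("(5)") = 0
    Debug.Assert Val("12abc") = 12

    ' CDec goes through the OLE variant conversion and treats "(5)" as a
    ' negative number, which is the behaviour described in the comment above.
    Debug.Print CDec("(5)")
End Sub
```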

WaynePhillipsEA commented 3 years ago

ValDec is now available in v0.10.5755. Val is also fixed in v0.10.5755 to not throw errors in cases where the input cannot be parsed as a number.

wqweto commented 3 years ago

Btw, Val bombs in VBx with overflow on Val("1e1000")

WaynePhillipsEA commented 3 years ago

@wqweto yes, (still) same in tB.

Kr00l commented 2 years ago

To improve locales I think the StrConv can be enhanced by supporting more conversions.

For instance the VB.Net StrConv supports in addition to VBx the following 3 conversions:

VbStrConv.LinguisticCasing (Can be used only with UpperCase and LowerCase)
VbStrConv.SimplifiedChinese
VbStrConv.TraditionalChinese

Kr00l commented 2 years ago

Few more suggestions

This site has many additional enums for StrConv(). Examples:

vbUTF8          | 4352 | Convert a Unicode string to a UTF-8 string.
vbUTF8Bytes     |  256 | Convert a Unicode string to a UTF-8 byte array.
vbFromUTF8      | 4608 | Convert a UTF-8 string to a Unicode string.
vbFromUTF8Bytes |  512 | Convert a UTF-8 byte array to a Unicode string.

mansellan commented 2 years ago

Odd... it seems to be mostly powers of two, like for a flags enum, except... for the ones that aren't. I wonder what happened.

Regardless, adding these (and potentially more) StrConv types is a good idea IMO.

bclothier commented 2 years ago

Odd... it seems to be mostly powers of two, like for a flags enum, except... for the ones that aren't. I wonder what happened.

The mistake is to try and read it as decimal. Read it as hex and it'll make sense.
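
To make that concrete, here is a quick sketch that prints the four values from the list quoted earlier in hex. In hex, each string variant is the matching byte-array variant plus &H1000, which suggests (the thread does not confirm it) that this bit marks the string-returning form.

```vb
Option Explicit

Sub ShowStrConvFlagsAsHex()
    ' Values copied from the list quoted earlier in the thread.
    Debug.Print "vbUTF8Bytes", Hex$(256)        ' 100
    Debug.Print "vbFromUTF8Bytes", Hex$(512)    ' 200
    Debug.Print "vbUTF8", Hex$(4352)            ' 1100
    Debug.Print "vbFromUTF8", Hex$(4608)        ' 1200
End Sub
```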

Back to the subject at hand.... I want to ask -- can we do better than that?

Right now, our options are pretty much restricted to either String or Byte(), and doing a lot of string conversions. I'd love to come up with something to minimize the need for conversions.

For example, we cannot do this:

Const MyAnsiString As String = "Hello, world!" 'This is a Unicode string

Instead, we must:

Dim MyAnsiString As String = StrConv("Hello, world!", vbFromUnicode)

Why can't we:

Const MyAnsiString As String Collate Ansi = "Hello, world!" 

If MyAnsiString = "Hello, world!" Collate Ansi Then
  Debug.Print "It's an ANSI string!"
Else
  Debug.Print "nope, not an ANSI string...."
End If

Or, alternatively:

Const MyAnsiString As AnsiString = A"Hello, world!" 

If MyAnsiString = A"Hello, world!" Then
  Debug.Print "It's an ANSI string!"
Else
  Debug.Print "nope, not an ANSI string...."
End If

However, that dangerously takes us into C/C++ & Delphi territory, where we have to care a bit too much about encoding and track the distinctions between an ANSI string, a Unicode string, a UTF8 string, a ZTS, a PascalString, a BSTR, oh my! That is not very friendly for those who are unfamiliar with, or don't want to think too hard about, those minutiae, and it can open the door to mistakes like:

If MyUnicodeString = MyAnsiString Then
  'Why won't it ever be true?!?
End If

To keep it user-friendly, tB would need to support implicit conversions so that an ANSI string "Hello, world!" will be equal to a Unicode string "Hello, world!".
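
The mismatch is easy to reproduce in plain VBx today. This is a minimal sketch, assuming the sample text is pure ASCII so that each character converts to a single byte:

```vb
Option Explicit

Sub AnsiInsideAString()
    Dim uni As String, ansi As String
    uni = "Hello, world!"
    ansi = StrConv(uni, vbFromUnicode)   ' ANSI bytes stored in a String

    Debug.Assert LenB(uni) = 26          ' two bytes per character
    Debug.Assert LenB(ansi) = 13         ' one byte per character here

    ' Same text, different bytes, so a plain comparison fails...
    Debug.Assert uni <> ansi
    ' ...and only an explicit conversion makes them equal again.
    Debug.Assert StrConv(ansi, vbUnicode) = uni
End Sub
```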

So to ask again --- can we do better here?

mansellan commented 2 years ago

The mistake is to try and read it as decimal. Read it as hex and it'll make sense.

Yep thanks, it makes a bit more sense. But it's still kinda twisted.

WaynePhillipsEA commented 2 years ago

Just to give you a little insight into the compiler here, for normal Declares (i.e. not DeclareWide), String datatypes are actually internally redirected to an AnsiString datatype that you never get to see. The internal AnsiString type then conveniently handles all the runtime conversions from BSTR<->ANSI via coercion rules set within AnsiString. So technically, we already have an AnsiString datatype... but we just don't ever expose it.

I think StrConv offering UTF8 conversions would be quite simple and cover quite a lot of use-cases without over-complicating.

I do agree with @bclothier though that handling Ansi/UTF8 data is a common problem that perhaps could benefit from some other solution...
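
For context, without UTF-8 flags in StrConv the conversion is usually done through the Win32 API. The sketch below is one common way to do it, not the tB implementation: Utf8BytesToString is a hypothetical helper name, and the Declare is written for VBA7 or twinBASIC (in VB6, drop PtrSafe and use Long instead of LongPtr).

```vb
Option Explicit

Private Const CP_UTF8 As Long = 65001

Private Declare PtrSafe Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As LongPtr, ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As LongPtr, ByVal cchWideChar As Long) As Long

' Hypothetical helper: decode a UTF-8 byte array into an ordinary String.
' Assumes the array has been dimensioned by the caller.
Public Function Utf8BytesToString(ByRef bytes() As Byte) As String
    Dim cb As Long, cch As Long, result As String

    cb = UBound(bytes) - LBound(bytes) + 1
    If cb <= 0 Then Exit Function

    ' First call: ask how many UTF-16 code units are needed.
    cch = MultiByteToWideChar(CP_UTF8, 0, VarPtr(bytes(LBound(bytes))), cb, 0, 0)
    If cch = 0 Then Exit Function

    ' Second call: convert into a pre-sized buffer.
    result = String$(cch, vbNullChar)
    MultiByteToWideChar CP_UTF8, 0, VarPtr(bytes(LBound(bytes))), cb, _
                        StrPtr(result), cch
    Utf8BytesToString = result
End Function
```

Every such call allocates a new String, which is the per-conversion cost the rest of this discussion is trying to weigh.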

WaynePhillipsEA commented 2 years ago

I wonder if we could perhaps harness generics here. Generics are actually more like templates in tB, so we can technically pass literal values as generic specifiers, e.g. MyGeneric(Of 123). So I was thinking we could potentially use generics to pass a codepage literal as part of the string definition e.g. Dim MyUtf8String As String(Of Codepage.Utf8). This would internally create an AnsiString type (or more accurately, a multi-byte string type) with a fixed codepage. When passed to an ordinary String type, it would get uplifted to a standard BSTR of course. You could also have Codepage.SystemDefault etc.

Kr00l commented 2 years ago

It's difficult.. An AnsiString points directly to the first char and must be null terminated. So Len(BSTR) works differently from Len(AnsiString) internally. So it's prone to confusion in the BASIC world.

But we agree that StrConv has its valid "first/second approach". That said, enriching StrConv would be a good start, IMO.

Kr00l commented 2 years ago

It's difficult.. An AnsiString points directly to the first char and must be null terminated. So Len(BSTR) works differently from Len(AnsiString) internally. So it's prone to confusion in the BASIC world.

But we agree that StrConv has its valid "first/second approach". That said, enriching StrConv would be a good start, IMO.

Oops. Ok an AnsiString seems to have a similar memory layout. So forget the confusion point.

Another point is the linking of an internal datatype to a codepage. There are so many codepages, and it would be difficult to guess whether it's now an AnsiString or a BSTR. So in the end it might be (?) easier to declare it directly:

Dim MyUtf8String As AnsiString = "Hello World" ' Auto coerce from BSTR literal to AnsiString.

Kr00l commented 2 years ago

And the next "problem" would be how to declare an AnsiString literal ? And then mixing the codepages. I can imagine that's a can of worms to open.. :>

Kr00l commented 2 years ago

Sorry to pollute with many comments.

At the end IMO we shall just have BSTR and that's it. That is our string container. Which encoding or what we have in a BSTR is up to the developer.

It would keep the world easy. Also in regards to DeclareWide. We simply do a codepage conversion with StrConv (e.g. UTF8) to a BSTR and pass it in a DeclareWide. Easy as that.

So after thinking a while that's my personal conclusion.

bclothier commented 2 years ago

I think we should establish the use cases and see how it can be better.

If the end goal is to display a string on a user's screen via some kind of control, I suspect that string conversion is inevitable & unavoidable because a control will likely expect a certain kind of string and won't support just anything. Furthermore, I think we definitely want to avoid having separate UnicodeTextbox and AnsiTextbox controls. If it did matter, then that is arguably a case for specialization.

Now, if the end goal is to read or write into a file, we already can control the encoding when we deal with the file. Whether the conversion is inevitable depends on where we got the string from. If it came from controls as a part of user's input, then yes we can't avoid it. But if it's coming from some other API, then we are probably better off treating it as a byte array and just not worrying about the contents. Things get more complicated if we need to transform the results from API before writing to the file. That is where we'd potentially see a lot of StrConv() calls.

As a matter of fact, my earlier remarks were inspired by the fact that the sqlite C API defaults to working with UTF-8, which is not compatible with the UCS-2 encoding we use in VBx (and tB, too?). I can envision that if we tied the output from sqlite into a control like WayneGrid, for example, we would potentially be doing a lot of string conversions. I said earlier that conversions for use in controls are unavoidable, but in this situation we are going to be doing a lot of slicing'n'dicing with the strings between sqlite's output and the inputs into the WayneGrid control.

One way to avoid this is to always store the UCS-2 encoding in the sqlite database --- that would then restrict UTF-8 to dealing with the sqlite API itself (e.g. passing in a filename, a column name, etc.). But if the database is not ours and came from someone else, we may not have much of a choice.

Writing this out, I think the problem is usually in the form of:

MyCustomControl.Value = "Totals: " & Left$(SomeApi.SomeOutput, 10)

If the SomeApi.SomeOutput is UCS-2, then everything's peachy. But if it's not, then we must:

MyCustomControl.Value = _
  "Totals: " & Left$(StrConv(SomeApi.SomeOutput, vbUnicode), 10)

Now if we customize the MyCustomControl to use UTF-8 by default to maximize the performance, we then end up:

MyCustomControl.Value = _
  StrConv("Totals: " & Left$(StrConv(SomeApi.SomeOutput, vbUnicode), 10), vbFromUnicode)

^{Aside: Can we agree that vbFromUnicode is a lousy name?} 😄

We can simplify a bit:

MyCustomControl.Value = _
  StrConv("Totals: ", vbFromUnicode) & LeftB$(SomeApi.SomeOutput, 10)

^{Wait a minute, why are we doing binary stuff on a string?!?}

That's a lot of thinking required just because we are no longer dealing with a String, which implicitly assumes a UCS-2 encoding, and we must also change our code (e.g. use LeftB instead of Left), when all I wanted to do was concatenate a literal with a substring!

I think the ideal outcome that keeps to the spirit of BASIC would be just what we started with originally:

MyCustomControl.Value = "Totals: " & Left$(SomeApi.SomeOutput, 10)

The only way to do that is to somehow keep track of the encoding so that the compiler can handle the implicit conversions where required and do the necessary substitution (e.g. internally use LeftB instead of Left).

That implies that we would be able to define the Value property on the custom control:

Public Property Let Value(NewValue As String(Of Utf8))

and on the SomeOutput:

Public Property Get SomeOutput() As String(Of Utf8)

And the compiler can do the rest of the optimizations.

And thus we have abstracted out all the hard-thinking exercises around strings. But that's also bad news, because an unwary user will look at the code:

MyCustomControl.Value = "Totals: " & Left$(SomeApi.SomeOutput, 10)

and have no clue that there's something magical going on. See: Law of Leaky Abstractions. We can make it a bit less magic by not handling the conversion automatically:

MyCustomControl.Value = "Totals: "(Of Utf8) & Left$(SomeApi.SomeOutput, 10)

Now the user is at least alerted to the fact that there's something afoot. The compiler can furthermore warn when two incompatible Strings are used together in some way while optimizing the string manipulation functions.

At that point, have we improved things? Is that easy to reason about? What do you think?

bclothier commented 2 years ago

One more followup. I looked again at the linked WinWrap's constants. I noticed they have those constants:

vbUnicode          |   64 | Convert an ANSI (locale dependent) byte array to a Unicode string.
or vbFromANSIBytes |      |
-------------------+------+-------------------------------------------------------------------
vbFromANSI         | 4160 | Convert an ANSI (locale dependent) string to a Unicode string.
-------------------+------+-------------------------------------------------------------------
vbFromUnicode      |  128 | Convert from Unicode to an ANSI (locale dependent) byte array.
or vbANSIBytes     |      |
-------------------+------+-------------------------------------------------------------------
vbANSI             | 4224 | Convert from Unicode to an ANSI (locale dependent) string.

It looked strange to me that they have both vbANSIBytes and vbANSI, so I did some more digging and I think their solution is to make use of SetLocale, in which the strings will be in that locale. Thus, we can have a true ANSI string rather than just a BSTR holding binary contents with WinWrap but only one type of encoding active at a time.

If I've understood correctly, I think that is too magical because we could be miles away from SetLocale instruction and not know what's going wrong with the expression that's mashing together 2 different types of strings. While I think it's more useful to alias the stupid vbFromUnicode flag and hide the original flag, I don't think we should have the Bytes variant. That seems to add more thinking, not less. After all, we already can:

  Dim b() As Byte
  b = StrConv("Make me an ANSI string!", vbFromUnicode)

^{I'd have used vbANSI, but this needs to work in VBx, too}

So I'm definitely not in favor of having bytes flag borrowed.

Kr00l commented 2 years ago

What wayne proposed was that a UTF8 string literal could look as "Hello World"(Of cpUTF8)

Would this be something like typedef AnsiStringT<65001> UTF8String; ?

So the compiler would know how to convert automatically on assigning to another type/codepage etc. ?

Do we want to change BASIC so deeply? The VBx community has always been used to assigning data in any codepage to a BSTR, and of course you need to remember yourself what's actually in it.

That would be an overall change.

Maybe a compromise would be to first enrich StrConv and postpone this discussion until post-v1, in case use cases then arise that desperately need such a thing. Just an idea..

bclothier commented 2 years ago

For the record, I'm in favor of postponing the changes, if any, to the String data type. I don't think it's a high priority item. However, we should at least have some idea of where we want it to go in future, or else we may end up in a situation where the ideal solution is no longer possible because of other design decisions that prevent it. Enhancing StrConv and providing truly invariant functions, so that programs do not behave differently on different computers, is hopefully much easier to implement in the short term.

You are correct that we are already used to handling strings that aren't UCS-2 encoded either as a byte array or as a String that cannot be read. However, the real question is whether we'd benefit from being able to eliminate the conversions that are required, especially if we need to do some string manipulation with non-UCS-2 strings.

Having thought about it some more, I think that I'd rather do the conversion explicitly than have the compiler do it -- too much magic can be a bad thing. Maybe TypeDef ( twinbasic/lang-design#32 ) support is what we really need, so we can communicate clearly that this particular byte array should contain a UTF8 string. That does not really address string manipulation & literals, though. Does anyone know if the cost of the forced string conversions (e.g. the WayneGrid example I cited earlier) is actually significant?

uwbwsvd commented 2 years ago

I have to handle the German character set. Recently I had an encoding adventure, at least for my level of knowledge. I had (have) to export data (from/with Access/VBA) to CSV. After putting the right data together, I thought I would use DoCmd.TransferText to export it. After solving some problems with the export definition, one thing was left: myfile.csv was converted to myfile#csv. After 3 hours of searching I tried, on a hunch, changing the character set. UTF-8 solved the problem. But now, when I open the file in Excel and close it, there is a type mismatch in the name of the first column. For me it was so much "magic".

bclothier commented 2 years ago

Just to point out that because you are dealing with Access' DoCmd.TransferText and with Excel's .... quirky CSV handling, it might not be just the encoding problem you are dealing with but also the applications' handling. For instance, Excel is known to automatically convert text to numbers, which may be why you get a type mismatch after you have opened the file. But yes, I totally can relate to the frustration because it's not always obvious what's really happening underneath all those layers.

WaynePhillipsEA commented 2 years ago

What wayne proposed was that a UTF8 string literal could look as "Hello World"(Of cpUTF8)

Nope, I didn't say that. Not for string literals, I don't think that would be the right syntax at all.

On reflection I feel that my generic-solution is probably a step too far on this one. I think perhaps we can instead offer good extension types on String and byte-arrays, such as MyString.ToAnsi(codepage) and MyString.ToUtf8() and that sort of thing to help make code clearer.

But yes, I'll look at adding the StrConv options for now, as they should be simple to implement.

bclothier commented 2 years ago

I find the extension method proposal much more attractive, FWIW. That would also make it easy to allow the community to build extensions for their weird & funky encoding should they need it, too. 😄

This also can be extended to "Hello, world".Left(10) which then enables this:

Dim MyAnsiString As AnsiString 'Custom type
MyAnsiString.Convert("Hello, world!")
Debug.Print MyAnsiString.Left(10)

I think that is far cleaner & clearer, and the user doesn't have to second-guess what to do with a funky AnsiString. Just look up the available methods associated with that data type.

Ref: twinbasic/lang-design#7

Kr00l commented 2 years ago

Side info: The enums mentioned on winwrap do not include the Chinese enums mentioned in https://github.com/WaynePhillipsEA/twinbasic/issues/201#issuecomment-995855297 IMO it's definitely also worth adding Chinese. Not for me personally but for international use cases.

twinbasic / lang-design

Improved handling of locales & invariant #18