Properly encode exported forms as UTF-8 when using the UTF-8 Beta language option

joyfullservice / msaccess-vcs-addin

Synchronize your Access Forms, Macros, Modules, Queries, Reports, and more with a version control system.

Other

215 stars 40 forks source link

Properly encode exported forms as UTF-8 when using the UTF-8 Beta language option #378

Open joyfullservice opened 1 year ago

joyfullservice commented 1 year ago

Windows 10 provides a "Beta" option to attempt UTF-8 support for programs that may not natively support it. You can read more about this option here.

We have seen this issue in the past (#60, #186, #180, #246, #377) but I am wondering if we might be able to resolve this by properly converting the encoding of the exported file to a standard UTF-8 BOM encoding so that it can be universally used by computers with or without the Beta option.

The goal here is to identify the native encoding of the raw export file, and convert it to the correct format before sanitizing and saving the content as UTF-8 in the final source file.

@ollypsilon - Would you be able to attach a sample form, exported with the UTF-8 Beta option active? Please use the following command directly in the immediate window to export the form in the default, unsanitized format, just exactly as it comes from Access.

SaveAsText acForm, "Form1", "Form1.bas"

This will allow me to do some testing on various encoding conversions to see what we need to use for this file format.

It would also be helpful to know what happens if in the same database, you delete the Form1 object, then run the following command to attempt to import it back into the database. Does it import successfully and correctly? This will give us some clues on the format that Access is expecting for the import when using the UTF-8 Beta option in Windows.

LoadFromText acForm, "Form1", "Form1.bas"

ollypsilon commented 1 year ago

Hi @joyfullservice , sorry for being late on replies! I've attached the exported form. When exporting from VBA directly the file looks good, no weird characters or such. Have a look yourself. Also import was fine with this file, even the form is "complex" :thumbsup: Form1.zip

joyfullservice commented 1 year ago

@ollypsilon - Thanks for attaching the sample file! Opening this file in Notepad++ confirms that it is encoded as UTF-16 LE BOM. This is what I was expecting when observing the double-wide character allocations in the corrupted files.

I also confirmed that the exported file has the expected BOM, which is critical to reading the file with the correct encoding. If you run the following command in the immediate window on your sample file, does it return true? (It does on my system.)

?HasUcs2Bom("C:\...\Form1.bas")

I then tried running the sanitizing function directly on this file. If you run this from the immediate Window, is the resulting file correctly encoded in UTF-8? (You should see normal text content, not spaces between every letter.)

SanitizeFile "C:\...\Form1.bas", False

Both of these functions appear to be running fine on my system, so I might need to you to do some additional testing/debugging on your system to see if we can pinpoint where the encoding is getting messed up. The really neat thing about having this add-in as a Microsoft Access library database is that you can actually debug and step through the add-in code, just as easily as you can in your own VBA project. You can even make changes in the add-in code and test them. (Just keep in mind that changes will not be saved in the add-in when it is loaded as a library. It has to be loaded as the current database to save changes.)

If you would like a snippet to export a single form without all the other GUI overhead, you can use the following code. (Paste into any module in the add-in project when it is loaded as a library.) This would allow you to set some break points and step through the export code to observe how the export and sanitize processes work when processing a form object.

To load the add-in as a library, open your sample database. (This makes your sample database the current database.) Then use the button on the toolbar to open the VCS add-in. All you need to do is load the form once, this will automatically load the add-in database as a library. At this point you will see the VCS project in the VBE project explorer, alongside your current project. From there you can paste in the following code snippet into one of the modules on the VCS project (such as modFunctions) and then run it by pressing F5, or by running TestExport from the immediate window. (In the context of the add-in project)

Again, just keep in mind that any changes you make to the add-in project while it is loaded as a library will be lost when the library is unloaded (typically when closing Microsoft Access).

Public Sub TestExport()
    With New clsDbForm
        With .Parent
            Set .DbObject = CurrentProject.AllForms("Form1")
            .Export "C:\...\Form1"
        End With
    End With
End Sub

Let me know what you learn!

ollypsilon commented 1 year ago

I also confirmed that the exported file has the expected BOM, which is critical to reading the file with the correct encoding. If you run the following command in the immediate window on your sample file, does it return true? (It does on my system.)

?HasUcs2Bom("C:\...\Form1.bas")

I then tried running the sanitizing function directly on this file. If you run this from the immediate Window, is the resulting file correctly encoded in UTF-8? (You should see normal text content, not spaces between every letter.)

SanitizeFile "C:\...\Form1.bas", False

Hi @joyfullservice, thx for all your efforts. Well, HasUcs2Bom gives FALSE on my UTF-8 beta enabled system.

Unfortunately, sanitizing still is encoded incorrect, I guess related to HasUcs2Bom returns not expected result. I have attached the sanitized file. Form1.zip I'll check further on my side on Friday hopefully, currently other higher priorities 😔

joyfullservice commented 1 year ago

Hi @joyfullservice, thx for all your efforts. Well, HasUcs2Bom gives FALSE on my UTF-8 beta enabled system.

I think this is the issue!! The HasUcs2Bom function is not correctly identifying the BOM (byte order mark) in the exported files, even though these files do have the Ucs2 BOM. All we need to do is update the underlying function to work correctly on both traditional and UTF-8 systems. This will then help the sanitize function to work correctly when reading the file.

Notice in the first section of the Sanitize function, it checks the BOM to see how to read the source file:

In the case of an exported form object, it needs to read the file as "Unicode" (UTF-16 LE). Because the HasUcs2Bcom function was returning the wrong value, it attempted to read the file as UTF-8, which explains why we were seeing a null character between every character. (double-byte vs. single-byte encoding)

Fixing the HasUcs2Bcom function is definitely a critical part in resolving this issue.

In the modEncoding module, you will find a few BOM related functions. These evidently need to be adjusted to work correctly on a UTF-8 (beta) system.

'---------------------------------------------------------------------------------------
' Procedure : HasUtf8Bom
' Author    : Adam Waller
' Date      : 7/30/2020
' Purpose   : Returns true if the file begins with a UTF-8 BOM
'---------------------------------------------------------------------------------------
'
Public Function HasUtf8Bom(strFilePath As String) As Boolean
    HasUtf8Bom = FileHasBom(strFilePath, UTF8_BOM)
End Function

'---------------------------------------------------------------------------------------
' Procedure : HasUcs2Bom
' Author    : Adam Waller
' Date      : 8/1/2020
' Purpose   : Returns true if the file begins with
'---------------------------------------------------------------------------------------
'
Public Function HasUcs2Bom(strFilePath As String) As Boolean
    HasUcs2Bom = FileHasBom(strFilePath, UCS2_BOM)
End Function

'---------------------------------------------------------------------------------------
' Procedure : FileHasBom
' Author    : Adam Waller
' Date      : 8/1/2020
' Purpose   : Check for the specified BOM by reading the first few bytes in the file.
'---------------------------------------------------------------------------------------
'
Private Function FileHasBom(strFilePath As String, strBom As String) As Boolean
    FileHasBom = (strBom = StrConv(GetFileBytes(strFilePath, Len(strBom)), vbUnicode))
End Function

Also of note are the constants defined in modConstants

I used simple constants to represent the byte order marks, thinking that this would be a more readable way to represent these in the source code. The first thing I would check is to make sure that the constants look like what I have pictured in the screenshot above. If they look like something else, then we would need to address that first.

Here is an example of what the Ucs2 BOM looks like in a file:

If you could do some testing with the FileHasBom function, that would probably help us identify how we need to read the BOM on your system. It could be as simple as needing to read twice as many bytes, and/or doing a different type of conversion to convert the read bytes to a string that we can compare to our BOM constant.

I will look forward to hearing what you are able to learn about reading the BOM from a file when running under UTF-8 (beta) mode.

ollypsilon commented 1 year ago

First finding: constant UCS2_BOM are "interpreted" differently on systems.

UTF-8 beta option not enabled

UTF-8 beta option enabled

I'll get the latest dev version now and create the add-in from it, then dig deeper.

ollypsilon commented 1 year ago

I had a look how this unicode thing needs to be understand and how it's working, I was puzzled when reading your UTF-8, UTF-16 LE terms 🤷‍♂️ So I had a look here and found also this one. When I had a look at your code I found the GetFileBytes function which I found useful. Instead of using your constants I would use the hex format as defined for unicode rather than comparing strings. I found a useful function in this article. It has some caveats on large byte arrays but this doesn't matter for this special purpose. I added it to module modFunctions:

Public Function ByteArrayToHex(ByRef ByteArray() As Byte) As String
Dim l As Long, strRet As String

For l = LBound(ByteArray) To UBound(ByteArray)
    strRet = strRet & Hex$(ByteArray(l)) & " "
Next l

'Remove last space at end.
ByteArrayToHex = Left$(strRet, Len(strRet) - 1)
End Function

When using on non UTF-8 beta enabled system it doesn't make a difference to compare string or check hex code:

But on UTF-8 beta enabled system it makes a difference:

I guess detection of file encoding could be tweaked. I found this article which describes the challenge:

Only ASCII, UTF-8 and encodings using a BOM (UTF-7 with BOM, UTF-8 with BOM, UTF-16, and UTF-32) have reliable algorithms to get the encoding of a document. For all other encodings, you have to trust heuristics based on statistics.

Based on the sample codes there I could create a function to get/estimated file encoding. Or you can develop your own 😉

joyfullservice commented 1 year ago

I did some additional testing on my machine using the UTF-8 beta option, and I think this is going to take some more testing/development work to resolve. For example, at present, the export of the code module is not working correctly for modConstants. I need to work through that part, then once the export is working correctly, I plan to move on to addressing the reading of the BOM. Thanks again for your testing efforts!

joyfullservice commented 1 year ago

Great news! It looks like there were only a couple things I had to resolve in order for the add-in to function in UTF-8 (beta) mode. I have updated the BOM detection functions to read the bytes directly rather than comparing them to string conversions. The string conversions do not work reliably for these characters in UTF-8 mode, probably because they involve invalid code points under UTF-8.

I am now able to export and build the add-in in UTF-8 (beta) mode, as well as export and build the sample database provided by @ollypsilon. I will be uploading a new beta release shortly. Version 4.07 should be able to successfully export and build in both Unicode and non-Unicode environments. Let me know how this works!

ollypsilon commented 1 year ago

@joyfullservice , I was able to export and build my database on both systems, great! Unfortunately there's a small bug on export: when exporting VBA code in forms on UTF-8 beta option enabled system chars get mangled: VBA code in database: In exported file: As reported by GitHub Desktop:

I spotted this only on forms where I use some special characters like currency symbols and language specific characters. I have added some test code in my "With-UTF-8-beta-option" branch, see there for details.

joyfullservice commented 1 year ago

I did some more testing on this last night, and I have bad news and good news. The bad news is that as far as I can tell from testing, the SaveAsText/LoadFromText functions simply don't support extended characters in VBA code modules. (In UTF-8 beta mode) I think the reason is because they are combining two different types of content that use two different character encodings, and combining them into a single file. The Ucs2 (UTF-16) format is able to support the extended/Unicode characters in the form content, such as label captions. But the VBA code is stored using the local (non-Unicode) codepage. When those extended characters from the VBA side get added to the Ucs2 content during the SaveAsText operation, they become invalid characters. They do not display or convert correctly in the source file, and do not import back in to the original characters. (Again, this is only the case when the UTF-8 beta mode is active.)

The good news is that I believe I have a way we can work around this. And it might also push us over the edge to make an important (although significant) change to the way form exports work. The built in SaveAsText/LoadFromText functions are great in most cases and environments, but specifically in the UTF-8 beta mode, they simply don't support extended characters on the VBA side. What I am considering/proposing here is that we go ahead and make the change to split the form source files into two files. One to handle the VBA code, and another to handle the form definition.

This gives us some helpful advantages in the long run. First, it separates two distinctly different types of content into different files. To me it makes sense that Form1.cls file would contain only VBA code, just like the native VBE form export. This makes it easier to manage and track changes, and even edit the source file directly. Form1.frm would contain the form definition, layout, controls, etc... When the form is imported or merged, it would import the form definition file first, then add the VBA code through the VBE object model. This would allow us to properly read and convert the encodings on these two areas of content so that the form is reconstructed to match the original form, supporting extended and Unicode characters in the form, and extended characters in the VBA code.

While we could use an option to give the user a choice between the combined or split file approach, I am kind of leading towards going with the split files for everyone. I think it brings advantages to everyone, even on systems that don't use UTF-8 beta, and makes the source files more intuitive to use and cleaner on the change tracking. (You will be able to see just from the file names whether a change involved the form definition, the VBA code, or both.)

@hecon5, @A9G-Data-Droid, @Tanarri, @Indigo744, feel free to chime in if you have any thoughts regarding the proposal to split the form export into two source files, similar to how VBE works with native VBE forms.

mwolfe02 commented 1 year ago

While we could use an option to give the user a choice between the combined or split file approach, I am kind of leading towards going with the split files for everyone.

For full transparency, I'm not using this project for my version control, but I've got a fairly advanced custom process that I use. One thing that I find myself frequently doing is using LoadFromText to import a single form or report at a time. Splitting the form into two exported files would break that functionality.

That said, I agree that the benefits outweigh the costs. I also agree that forcing everyone to use the same approach is the better decision for the project in the long run. If I were to switch to this project, I would like to see the add-in include support for importing one form or report at a time. With the split files approach, this would need to be something built-in to the add-in.

Great job with the project, by the way. I've been watching from the sidelines and I'm eager to see what the next big release looks like. Keep up the great work 👍 .

joyfullservice commented 1 year ago

Thanks, Mike! I appreciate your perspective and thoughts on this. Great suggestion on the ability to export/import a single object. That would sure come in handy for testing and tweaking. On an issue like this, I often write a few lines of code to export and import a particular test object, but it would be even easier to just select the item in the navigation pane and click a button on the ribbon. 😄 I just might add that in the near future...

Thanks for all the tips and content you share on your blog! Your highlighting of twinBASIC and the ability to easily create a COM add-in is what inspired and empowered me to add the ribbon UI to this project. - Probably the single greatest usability enhancement in the history of the project. 😁 👍

hecon5 commented 1 year ago

I'm late to the party here, but I am going to disagree: I think this splitting should be an "Opt-in" feature (even at the likely cost of then forcing a bifurcated development)

Reasoning:

It has far-reaching consequences for long-term version tracking for the many users who don't need split files (though, perhaps with the pervasive nature of emojis, we'll see them in UX more often..).
After discussing with some devs on our team, and reading @mwolfe02's comment, the ability for objects to remain as a contiguous file is a really nice version control feature.
We end up using load/save internal functions often enough that there'd need to be a fairly good (read: easy/seamless) addon development to handle the new flow.

In terms of file extensions:

.cls is already a class file, I'd be cautious against using it for anything except a VBA class. Perhaps .VBA or .VB would be more indicative of the file contents.
.Frm is already used as a form, again, I'd be cautious to usurp another already in use extension (one that would not contain all of the contents required to create the form as it needs the code behind the form). But perhaps it would be clear enough in use this second one doesn't matter.

Indigo744 commented 1 year ago

Thanks for pinging me @joyfullservice

First time I hear about this UTF8 beta-mode 🐱

I do agree with @hecon5 that splitting files have a lot a consequences, especially on established code base where it would lead to a huge commit and "losing" some history (meaning existing code will be in a new file).

On the technical side, if you split files (form and VBA), how will this new export/import function look like? You didn't actually explained so I'm making some educated guess:

For exporting, using SaveAsText to export as a single file, then splitting manually between the form and the VBA module (on CodeBehindForm)?
For importing , using LoadFromText to import the form (without VBA module) then using the VBE API to import the VBA module?

If that's case, have you considered to keep a single file, but when importing on a machine with UTF8 beta-mode enabled, split the .bas file temporarily in two (Form / VBA module), apply a proper encoding on the VBA module part (UTF16 instead of local CP) and then import the module using the VBE API?

Maybe you've already tried it. I didn't read all linked posts so sorry if that's the case 😸

joyfullservice commented 1 year ago

@hecon5 - Thanks for the feedback. I will definitely take those points into consideration.

@Indigo744 - You are right, I didn't go into a lot of the technical detail on how I plan to do this. For the export, I would export using the SaveAsText function to get the form definition. The VBA code would be stripped out, and the form definition saved as *.frm. I would then export the VBA code using the VBE object export, just like I am currently doing for modules and classes. This would allow any extended characters to be properly preserved before converting the file to UTF-8. The VBA code would be saved as *.cls. It technically is a class module, so this seems consistent with how (non-Access) VBA exports and imports (MSForms 2.0) forms in the VBE.

The import process involves first importing the form definition file using LoadFromText, then using the VBE object model to load the VBA code. While it would be ideal to load the entire thing in one step, I have not found an encoding format for the extended characters in VBA code that will import correctly using the LoadFromText on a system that is operating in UTF-8 (beta) mode. (My hunch is that it is simply not possible because the function isn't designed to work with that combination of encodings.)

We do have the option of stitching the two parts back together after export to produce a single (UTF-8) source file, identical to what Access produces through the SaveAsText function, but keep in mind that this file will probably not be natively importable in this format. (The add-in converts the source file back to Ucs2 just before import, because that's the encoding Microsoft Access is expecting.) If we take the approach of stitching the file back together, we would still have to split it again before import because any extended characters in the VBA code are now in UTF-8, and won't convert properly to Ucs2 for direct import.

I recognize that changing the names of the source files for forms (.bas to .frm and *.cls) is a major change, and not one that I take lightly. That's one of the reasons that I am reaching out for feedback from the community. At the same time, I also recognize that sometime the right choice is the difficult one, and while change is inconvenient, it yields the best long-term benefit to the project and those it serves. I think this is what @mwolfe02 was alluding to in some of his comments. Version 4 of the add-in seems like an appropriate place for this kind of breaking change.

An example of a difficult change in the past is the decision to use UTF-8-BOM for ALL source files. I was not excited about this initially because it meant a change in nearly every source file, and broke the native compatibility of some of the source files. (For example, you could no longer drag and drop a module file into the IDE and have it import without needing to clean up the BOM characters afterwards. You couldn't run LoadFromText on a form source file because it was in UTF-8 not UTF-16.) But the change also had some key benefits that ultimately won me over. The universal consistency and compatibility with virtually ALL source management software was a big win. Anyone in the world can build the add-in from source and get the same end product.

At the end of the day, the primary purpose of the add-in is to be able to generate source files that accurately and intuitively define the individual components of the database at a source file level, and to consistently build the database entirely from source files. If you put yourself into the shoes of a database developer that has never seen the source files before, he will likely open up the combined *.bas file of his form, and wonder where the VBA code is. On the other hand, if he sees two files, a *.frm and a *.cls, my guess is that he would intuitively expect his code to be in the *.cls file, and find the *.frm a logical location for the form controls and layout. (Just like you would see if you exported a VBA form from Microsoft Excel.)

All of that being said, as @Indigo744 pointed out, this is a really big deal to lose the commit history in git when you rename the files. I don't like that, but it simply how git works. I suppose another options would be to keep the *.bas extension for the form object, and just move the VBA portion to the *.cls. This would do a better job of preserving the change history. I am also open to making the split file vs. combined file an option, if people feel pretty strongly about keeping the code combined with the form definition. (I suppose that whatever we decide for forms would also apply to reports, since they work the same way.)

Thanks again for the feedback, and please feel free to add any additional factors that would be helpful to consider in this decision on whether to split out the VBA code in the source files.

A9G-Data-Droid commented 1 year ago

There are many older programming languages that should be written in English. I'm afraid VBA is one of them. Historically, programmers around the world write code and comments in English to avoid these sorts of language support issues. I would recommend avoiding extended characters in VBA\MSAccess and if you need full internationalization support then you should use a more modern programming language and database system. The support MSAccess gets from Microsoft is minimal. It may never be fully UTF8 compliant. The patchwork support that currently exists will lead to a long list of edge cases. What you think is working fine today may be plagued by bugs tomorrow. It's fragile.

Even if you upgrade your back-end to SQL to get internationalization compliant storage you will still be limited by the strange encoding issues from VBA on your front-end.

mwolfe02 commented 1 year ago

All of that being said, as @Indigo744 pointed out, this is a really big deal to lose the commit history in git when you rename the files. I don't like that, but it simply how git works.

I've been feeling like a dinosaur for clinging to Mercurial, but this appears to be an area where Mercurial is simply better than git.

I'm considering moving from my own custom Export to source script to this project, but I use a completely different naming convention. However, Mercurial supports renaming the files even after they've been created. For example, I can use the following commands to convert from my Form naming convention to match this project's:

hg copy --after MyForm.form forms/MyForm.bas
hg remove MyForm.form

Doing this preserves all history. So, for me, renaming files is not such a big deal. But I can see where it would be a major issue for git users (i.e., most of the civilized software development world). Glad I'm a dinosaur :wink:

A9G-Data-Droid commented 1 year ago

However, Mercurial supports renaming the files even after they've been created.

So does git. It automatically detects renames as long as the file hasn't changed radically. The issue here is splitting a file into two parts. Which one is the "renamed" one? Obviously, we would want to track the one with the source code.

To make this as seemless as possible I would keep the extension so that the old filename Form1.bas remains the same and a new Form1.frm is added. You would then see the form code being removed from Form1.bas and being added to Form1.frm. I don't think you would lose your history this way.

mwolfe02 commented 1 year ago

The way I understand it based on the linked article, git uses automatic rename detection. That's convenient when it works, but it will never be as reliable as an explicit rename command. Perhaps git's --follow flag provides such explicit renaming capability. If not, though, I am much more sympathetic to those concerned about losing code history.

Another thing that Mercurial supports (I believe) is having two new files both point to the same source file for history purposes. In other words, even splitting forms into two different files Mercurial would be able to maintain the history in both new files. The initial commit for the form layout properties file would simply show that the code had been deleted. The initial commit for the code-behind file would show that the form layout properties had been deleted. Going forward, you could look at either the form layout file or the code-behind file and see the full history back to the original creation of the form in Access.

I only go into this in such depth to help explain why I wasn't more concerned about the loss of file history by splitting the form and report objects into separate files.

I think it might be worth exploring exactly what the options and workarounds would be for people using git to maintain file history following such a move. Ideally, there would be a script that could be run to maintain the file history similar to what I showed for Mercurial. Unfortunately, it sounds like that may not be possible.

A9G-Data-Droid commented 1 year ago

Note for the future: https://stackoverflow.com/questions/3887736/keep-git-history-when-splitting-a-file

joyfullservice commented 1 year ago

Good news! It looks like with some minor modifications, Raymond Chen's approach to splitting files in git works with batches of files so we only have to use a few commits to pull off the upgrade process while preserving the file history in git. Now I just need to finish fleshing this out in the code...