Closed by hecon5 2 years ago
We have two main approaches we could take for encoding the VBA code modules. They both have pros and cons, and might be better suited as an option to allow the user to decide what is best for their environment.
Option 1: (Current) Modules are exported as UTF-8 with a BOM, consistent with every other export file in the project. This fully supports localized encodings and correctly represents the content in virtually all version control systems with the universal UTF-8 encoding. The BOM header ensures that VCS programs "guess correctly" when it comes to detecting the content type, especially when using localized codepages. The drawback is that if you drag and drop the file into a VBA Project, the UTF-8 BOM comes through as a syntax error in the imported module.
Option 2: Modules are exported without the UTF-8 BOM if they do not contain any extended characters. This would fully support drag-and-drop from the file system to a VBA Project without any subsequent compile errors. (This more closely mirrors the format used when natively exporting the VBE component.) However, if Unicode content were later added to the file, it would not import correctly in a drag-and-drop operation, since the file would not be converted from UTF-8 to the local codepage. This approach would probably work well in English environments where extended characters are rarely used, and drag-and-drop import is a preferred approach for building a project from template files.
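To make the Option 2 rule concrete, here is a rough sketch in Python (the add-in itself is VBA, and `encode_module` is a made-up name for illustration, not a function in the project). It assumes "extended characters" means anything outside 7-bit ASCII:

```python
import codecs


def encode_module(source: str) -> bytes:
    """Encode module text, adding a UTF-8 BOM only when it is needed.

    Hypothetical helper: pure-ASCII modules are written without a BOM so
    they can be dropped straight into the VBE; anything else gets the BOM.
    """
    if source.isascii():
        # ASCII bytes are identical in UTF-8 and in the local ANSI codepage.
        return source.encode("ascii")
    return codecs.BOM_UTF8 + source.encode("utf-8")
```

With this rule a module can flip between the two formats as soon as someone adds a single non-ASCII character to it.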
Note that with Option 1 you can still drag-and-drop the source file to the source folder of the target project, and it will import correctly on the next build/merge, but this is probably not quite as seamless as dropping it right into the VBA project and having the module immediately available for use.
After extended discussions on the UTF-8 encoding topic with a number of international participants (see #154 and related issues), I think Option 1 is still the generally preferred default, but I am open to an enhancement to add Option 2 as an optional setting if there is sufficient interest.
Is there an Option 3: hybridized version of each?
Add (ugh) two options?
Not even sure if that's viable with extended characters. Don't want drag/drop to be corrupt, and don't want extended characters to be lost when they exist, either. Can one read UTF-8 files without BOMs that have extended characters (VS Code, etc.)? My limited testing (on my locale) indicates you can.
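On the question of reading BOM-less UTF-8: a common approach is simply to try a strict UTF-8 decode and fall back to a local codepage if that fails. This is a sketch of that idea, not a description of what VS Code or any particular editor actually does; `read_module_text` and the `cp1252` fallback are assumptions for illustration:

```python
def read_module_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode module bytes as UTF-8 (with or without BOM), else fall back.

    Hypothetical helper. The "utf-8-sig" codec strips a leading BOM if one
    is present and otherwise behaves like plain UTF-8.
    """
    try:
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the (assumed) local ANSI codepage.
        return data.decode(fallback, errors="replace")
```

The guess can still be wrong for short files, which is the case the BOM is there to rule out.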
The downside of having an option is that users (myself included) will invariably add one inadvertently and break things.
Feel like we've discussed this, but then I go ahead and find another edge case...
That said, given the compile speed, once merge works I think Option 1 (as-is) would be the way to go. The only reason I am even encountering this is that doing it that way currently means a fairly long wait for the table linking to complete, versus removing the BOM and dragging the file in.
Do we know how drag-and-drop works with International environments? What file types are they encoded in? Perhaps knowing this will help figure out if there's a way we're not seeing.
If the IDE can handle international file import without issue, then perhaps we could copy that version.
The issue there is that drag and drop uses the current system encoding. That works great for importing and exporting, but doesn't play so well with version control systems that are used to handling everything in UTF-8. You have cases where an autodetect guesses wrong and the content comes out unreadable, or breaks the internal diffing tools. When files are shared across systems that have different locale settings, you have another level of things that break. 😄 UTF-8 is, hands down, the best universal format that works just about anywhere, and is a great fit for the source code at rest.
The VBA IDE is the exception, obviously, because it was established long before UTF-8 was the standard and it just hasn't made the transition yet. Option 2 described above works because UTF-8 is backwards compatible with 7-bit ASCII, meaning the two encode those low-range characters with the same bytes. That's why you can drop a UTF-8 file into the VBE and it comes in at all. It just doesn't work with extended or Unicode characters. With the typical VBA code module that doesn't include anything special on the character side, you can get away with just skipping the BOM and then you get the best of both worlds: UTF-8 compatibility and drag-and-drop support. That perfect world starts coming apart, though, as soon as you introduce extended characters. (Which are very definitely used by many Access developers, especially on the international side.)
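A quick way to see the ASCII overlap, sketched in Python. This only simulates the VBE reading UTF-8 bytes as if they were in the system codepage; `survives_ansi_import` is a hypothetical name and `cp1252` is an assumed Western European locale:

```python
def survives_ansi_import(text: str, codepage: str = "cp1252") -> bool:
    """True if UTF-8 bytes, misread as the given codepage, give the same text.

    Hypothetical check that mimics a drag-and-drop import, where the file
    bytes are interpreted in the system codepage rather than as UTF-8.
    """
    raw = text.encode("utf-8")
    return raw.decode(codepage, errors="replace") == text


plain = 'MsgBox "Hello"'      # ASCII only
extended = 'MsgBox "Grüße"'   # contains extended characters

print(survives_ansi_import(plain))     # True
print(survives_ansi_import(extended))  # False
```

The ASCII-only line round-trips untouched, while each extended character becomes two bytes in UTF-8 and is read back as two unrelated codepage characters.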
After looking into writing the feature for this, the complexity to properly handle the various components outweighs the occasional use, IMHO. Since you can simply open the file and save as UTF-8 W/O BOM or other encoding (a literal button click when using VSCode or similar) and drop it in, I don't think there's a need for the feature, so I'm closing this.
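For anyone who would rather script the manual "save without BOM" step than click through an editor, a minimal sketch in Python follows. The function names are my own, and it assumes the module contains only ASCII, since extended characters would still not survive the drop into the VBE:

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_place(path: str) -> None:
    """Rewrite a file without its UTF-8 BOM (hypothetical convenience wrapper)."""
    file = Path(path)
    file.write_bytes(strip_utf8_bom(file.read_bytes()))
```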
I'm noticing that when I plop in a module (drag-drop from Explorer into the IDE) or use Import > File, a file with UTF-8 w/BOM encoding doesn't import correctly and doesn't pick up the correct module name. However, when I re-encode it as plain ole UTF-8, it works fine. I don't even need to remove the BOM at the start. I don't know how best to address this, because the encoding is required for many of the modules, especially where we need to import non-standard symbols.
I also know this issue has gone back and forth repeatedly.
I don't suppose there's a way to detect if the file needs to be re-encoded into UTF-8 w/BOM, à la RubberDuck @folder notation? We could add a header bit in a comment preamble to flag the encoding, so drag-drop and Add-In import/export work identically?
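To show what I mean by a header flag, here is a rough sketch in Python. The `'@Encoding("...")` comment is purely hypothetical: it is modeled on the look of Rubberduck annotations but is not an existing Rubberduck or add-in feature, and `declared_encoding` is a made-up name:

```python
import re

# Hypothetical marker, e.g.:  '@Encoding("utf-8")
_MARKER = re.compile(r"""^\s*'\s*@Encoding\("([^"]+)"\)""", re.IGNORECASE)


def declared_encoding(source: str, max_lines: int = 5):
    """Return the encoding named in a leading comment marker, or None.

    Only the first few lines are scanned, so the flag has to sit in the
    module preamble.
    """
    for line in source.splitlines()[:max_lines]:
        match = _MARKER.match(line)
        if match:
            return match.group(1).lower()
    return None
```

An importer could then call something like `declared_encoding(text)` before deciding whether to re-encode the file.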