Umlaut not properly exported to .bas, End keyword truncated. Possibly UTF-8 / ANSI encoding problem

effjot commented 4 years ago

I just started to test this fork, because the original version was very slow. Now exports are quite fast. Thanks! But there’s a strange error in the exported files (it happened in queries and macros with my database).

When there’s an umlaut, it is not properly reproduced and the End keyword in the exported .bas files is truncated.

This is the original of an example in my database (some text masked by x):

Operation =1
Option =0
Database ="X:\\xxx\\GCI-WRRL-Trend.mdb"
Begin InputTables
    Name ="_xxx_Qualitätsnorm"
End
Begin OutputColumns
    Expression ="[_xxx_Qualitätsnorm].*"
End
dbBoolean "ReturnsRecords" ="-1"
…

The resulting file is:

Operation =1
Option =0
Database ="X:\\xxx\\GCI-WRRL-Trend.mdb"
Begin InputTables
    Name ="_xxx_Qualit�tsnorm"
d
Begin OutputColumns
    Expression ="[_xxx_Qualit�tsnorm].*"
d
dbBoolean "ReturnsRecords" ="-1"
…

I’ve opened the resulting file in Notepad++, it says the encoding is UTF-8-BOM.

(Tim Abell’s version exports to ANSI encoded files.)

I guess it is some problem with the encoding in my database, which causes the export to stumble.

Does anybody have an idea how I can fix this, or some pointers on where to look? Thanks!

joyfullservice commented 4 years ago

I tried reproducing this error on my system, creating a table with an Umlaut in the table name, a field name, and then in a query field. Everything seemed to export correctly in my test database.

Would you be able to create a minimalistic test database that you could upload here that would demonstrate the problem? Also, there are a few more details that would be helpful to see.

The version of the add-in that you are using (v3.1.56).
Any language specific settings you are using with the database and/or operating system.

If you can help me reproduce the error on my side (US-English system), that would give a good start for troubleshooting the issue.

effjot commented 4 years ago

Thanks for your suggestions!

Yes, I’m using the add-in version v3.1.56.

I’m working on a German Windows 10 Pro Version 1909

Access is from “Microsoft 365 Apps for Business”; Acess is “Version 2006 (Build 13001.20384 Klick-und-los)” according to the product informations in Datei→Konto (I guess it is File→Account in English). However, the „Datenbankdokumentierer“ (“database analysis / documentation”?) tool says: AccessVersion: 09.50 Build: 1001

I’m not aware on any additional language specific settings besides working on a German system.

I created a blank MDB (“Access 2002–2003”) database file and created a table with umlaut in the name and a query using it. The table def was exported correctly, the table data however wasn’t. The query .sql turned out alright, the .bas had the “End” on the next line truncated. I copied the table and renamed without umlaut. For this table, I could export the data, too. Umlauts in the table data are ok, too.

For further testing I created a further query with 2 umlauts. You can see that the second umlaut eats another 2 characters on the next line (the d of End and the CR):

export-1-and-2-umlauts

Interestingly, in Version Control→Options→Table Data I can see the table with umlaut in the name gets a typical encoding-confusion gibberish replacement. Obviously, the export could not find the table with that name. And after every invocation, the gibberish grows (presumably because the international chars in there get replaced again).

options-table-data-umlaut-1 options-table-data-umlaut-2

Here’s a zip archive of database and exported files: fj-test.zip

joyfullservice commented 4 years ago

Thanks for the additional details! Let me take a closer look at your examples...

joyfullservice commented 4 years ago

Thanks again for providing the sample database. I was able to reproduce the issue in your sample database, and used that to work through several code changes to better support these extended characters.

After several hours of research and testing, it seems like the issue here is that characters like the Umlaut are above the normal 7-bit ASCII range, but are still single-byte encoded. (See this link on ISO-8859)

If we save the non-Unicode text files using the iso-8859-1 encoding, it supports the full range of 255 characters (Latin-1) and the Umlaut and other characters come through correctly. (And accurately represent the actual characters used in the database file.) That's the approach I am taking at present, and it appears to work successfully in the sample database you provided.

One caveat is that there is an open bug in GitHub Desktop that affects the display of these characters in the changes and history panes.

The display on the GitHub website rendering of the change seems to come through fine on my system.

It is possible that the display in GitHub desktop would also be correct if I was using a different codepage as default on my system. Dropping the file into Notepad++ on my system attempts to use UTF-8, which doesn't display correctly, but if I switch to ASCII, it shows the characters just fine.

Attached is an update that includes these code changes to save non-Unicode content as Latin-1 extended ASCII (iso-8859-1) as well as some other changes to better handle Unicode file paths. Would you be able to test this with some databases on your system and see if it effectively resolves the issue?

@cenx1 and @A9G-Data-Droid - If you have a chance to review this change as well, I would appreciate your input on the direction I am heading with this. 😄

Version_Control_v3.1.60.zip

A9G-Data-Droid commented 4 years ago

I'm not an expert at internationalization but I can see this going poorly. It starts with the Umlaut and then next we will see that Eastern languages don't work at all. We need to use UTF8 whenever possible. The biggest problem I see with iso-8859-1 is that The characters above 127 will display differently on different computers. There are plenty of people in that SO link you shared saying not to use this encoding. If you want to support internationalization there is a can of worms here that Microsoft hasn't made any easier for us.

effjot commented 4 years ago

Thanks! Export works well, the files exported from my application have the correct umlauts and no truncations in the lines after umlauts. Notepad++ says the files are “ANSI” encoded, but displaying them as ISO-8859-1, -15, or Windows-1250 works, too. So they should in fact be in ISO-8859-1. The funny table name in Option→Table Data disappeared after setting it to “No data”. All table data was successfully exported as ANSI/ISO-8859-1.

I agree with @A9G-Data-Droid that 8bit encodings will surely cause problems in the not-so-long run. UTF-8 would be great. I’m still confused by encodings in Access/VBA and am just happy that it somehow mostly works…

When I start “Build from source”, I get this error dialog after a short time: Runtime error '438': Object doesn’t support this property or method (my translation).

The debugger puts me in Sub Build, line 237:

' Loop through all categories
For Each cCategory In GetAllContainers

    ' Get collection of source files
    Set colFiles = cCategory.GetFileList  ' <=Debugger arrow on this line

    ' Only show category details when source files are found
    If colFiles.Count = 0 Then

The Immediate Area says “Killed build timer 21484” (numbers are different each time).

To find out where it’s getting stuck, I’ve added a simple Debug.Print cCategory.Category at the top of the foreach, resulting in

themes
VB project
vbe references
vbe forms
proj properties
saved specs
shared images
imex specs
db properties
tables

joyfullservice commented 4 years ago

Based on the likely need to fully support Unicode characters in object names and file paths, I have gone through the code base and replaced the built-in VBA file access functions (i.e. Dir(), Open) with replacement functions that fully support UTF-8 encoding and paths. We are back to using UTF-8 encoding for text files and sanitized source files. This also supports extended Latin-1 characters like the Umlaut, as well as the double-byte encoded Unicode equivalents.

@effjot - The other issue with your sample database is that it was using another (legacy?) encoding that I didn't have on my system. When I created a blank accdb database with the same name, imported all the objects from the legacy database, I was then able to export all source and build the database from source with no problems. The Umlaut and other special characters came through just fine. While an export might work on the legacy .mdb format, you will want to use the accdb format for any projects that you wish to build from source.

Attached is version v3.1.62 of the add-in. Could you test this on your database after importing the files into a new database? (Make sure you are using at least 2007 format and General collation. Let me know how it goes!

Version_Control_v3.1.62.zip

A9G-Data-Droid commented 4 years ago

That's awesome. This should put us on track to support full internationalization if anyone ever comes along and wants to translate\support their language.

effjot commented 4 years ago

Thanks for all the work you put into i18n! Sadly, I can’t contribute much, because VBA isn’t my strong point.

I have set the default collating order to „Allgemein“ (pretty sure that corresponds to “general”; however, there’s an option „Allgemein – Vorgänger“ = “General – predecessor”?) and created a new AccDB file. The exported files look good.

Unfortunately, both after copying tables+queries from my MDB sample file and after creating a completely new table in the fresh database, I encounter a problem during Build:

Dialog ”Runtime error '3078': The Microsoft Access database module can’t find the input table or query 'MSysResources'. Ensure that it exists and that the name has been entered correctly.” [my translation]

The debugger drops me into line 253 of modImportExport:

' Loop through each file in this category.
For Each varFile In colFiles
    ' Import the file
    Log.Increment
    Log.Add "  " & FSO.GetFileName(varFile), Options.ShowDebug
    cCategory.Import CStr(varFile)  ' <= Debugger arrow on this line
Next varFile

Immediate area says Killed build timer 16797

varfile is D:\src\msaccess-vcs-integration_joyfullservice\fj-test v62\test-umlaut-new-dbformat.accdb.src\themes\Office.thmx . The file exists, is readable, but binary.

I haven’t used any themes in my databases.

(In case this information helps for debugging: export from MDB results in .bas files with UTF-8-BOM encoding, like in v3.1.56)

I’ve attached my test databases: fj-test v62.zip

effjot commented 4 years ago

@joyfullservice How can I find out what encoding my DB uses?

I’m working with legacy code here, so it probably is ISO-8850-1 or -15, the old “German standard encoding”.

BTW, do you know about any pitfalls converting Access+VBA applications from MDB to AccDB? For example, did libraries/APIs change?

Thanks! Florian

A9G-Data-Droid commented 4 years ago

@effjot It should be safe to convert all DB from MDB to ACCDB. You can absolutely use old code, libraries, and APIs from a modern ACCDB file. The only way to be sure is to do your own testing.

It is possible that system tables and other objects will be localized. Meaning that there are things that we call by name which might be displayed in the native language of the computer it was made on.

Follow this guide to show your system tables: https://www.techonthenet.com/access/database/view_systables2007.php

If the tables appear in your native language then that will be the cause of the errors you are seeing. (This is not an easy problem to solve)

joyfullservice commented 4 years ago

The MSysResources table is what holds the theme file used in Microsoft Access. It is also used to store embedded shared images. In my testing, the theme file is created automatically when you add your first form to the database. In the clsDbTheme class module, you will find the following function:

'---------------------------------------------------------------------------------------
' Procedure : VerifyResourceTable
' Author    : Adam Waller
' Date      : 6/3/2020
' Purpose   : Make sure the resources table exists, creating it if needed.
'---------------------------------------------------------------------------------------
'
Public Sub VerifyResourcesTable()

    Dim strName As String

    If Not TableExists("MSysResources") Then
        ' It would be nice to find a magical system command for this, but for now
        ' we can create it by creating a temporary form object.
        strName = CreateForm().Name
        ' Close without saving
        DoCmd.Close acForm, strName, acSaveNo
        ' Remove any potential default theme
        CurrentDb.Execute "DELETE * FROM MSysResources WHERE [Type]='thmx'", dbFailOnError
    End If

End Sub

In Access 2010, when you create a form in design view, it automatically creates the MSysResources table. This table does not exist in a newly created blank database, and is required to import an existing theme.

I am able to export and build both of your sample databases with no issues on my end. (And the Umlaut characters come through correctly in the database built from source files.) It seems that the issue here is related to the creation of the MSysResources table during the rebuild.

There is an additional setting that may help in the debugging process here... In the VBA options, select the option to Break in Class Module. That should allow you to see a bit more of the code line where the error is thrown.

Based on the error message you kindly translated for me above, I have a pretty good guess as to where it is occurring, but at this point I am not able to reproduce it on my system.

As to your question on potential issues from converting a database to the accdb format, I have never personally encountered any adverse affects. The only reason I could see to keep a database in the mdb format would be for compatibility with older software that is also accessing the database. The accdb format has been around for 13 years now, so most integrations out there should be able to support it. Obviously you should test as much of the functionality as you can, but in most cases I think you will find that Microsoft Access does a pretty good job upgrading the database format.

Also, just in case you haven't used it before, Access allows you to import objects from another database (including older database formats) directly into a new database. I typically right-click in a folder, create a New Microsoft Access Database, then open the new file, and use External Data -> Access to import the database objects from my old database. This saves me from having to create them from scratch.

If you (or others) are able to pinpoint the issue on the MSysResources table creation in your version of Microsoft Access, that would help in establishing a good method to create this table.

Another test you could try would be to export all source, then manually delete the theme folder, then build the database from source. That should sidestep the theme issue, and let us know if there are any other issues specific to your environment.

2020-08-04 10_31_06-test-umlaut-in-blank-db accdb src

effjot commented 4 years ago

Thanks for looking into the MSysResources / themes issue! Unfortunately, I will probably only be able to try your suggestions the day after tommorow.

Regarding importing objects into a blank DB: that’s how I “upgraded” my test database. Once Build is working, I’ll try to convert the whole app to AccDB.

effjot commented 4 years ago

Sorry for the long delay! I have turned on the “break in class” option. The debugger drops me into the sub VerifyResourcesTable you have shown above. The error happens on the line CurrentDb.Execute "DELETE …. strName has the value Formular1.

If I remove the themes folder, I can build from source.

However, restoring table data is incomplete: data for any columns with an umlaut in the name are skipped. (Umlauts in the name of the table don’t matter though.) The exported table data is complete. All tabledefs are in UTF according to Notepad++, although the tabledef of a completely umlaut-free table is in UTF-8 and tabledefs with umlauts are in UTF-8-BOM.

joyfullservice commented 4 years ago

Would there be a possibility of doing a brief remote screen-sharing session with you? That would help me to confirm what is happening with the system table. At this point I am not sure if it has a different name, or if it is not being created at all. If this might be a possibility, please reach out to me using: web at cadlinx dot com.

On the data restoration, you might try testing different formats for the XML import file. (Try converting the file to UTF-8, or UCS-2 before importing to see if that makes a difference.)

effjot commented 4 years ago

I have exported table data as XML now, and import of text works. (The files are still with or without BOM, but this doesn’t affect import of the umlauts in table names, column names, or data). However, numbers get imported as text. Not only floats (which are of course imported with a decimal point instead of a comma) but also integers, which should not have locale-specific issues. This looks like the problem in issue #61

Following @A9G-Data-Droid’s suggestion, I have displayed the system tables in my test database. There are only 5, none of them MSysResources:

MSysACEs
MSysComplexColumns
MSysObjects
MSysQueries
MSysRelationships

At least, their names haven’t been translated…

I created a form; and both after closing without saving (as mentioned in your code) and after adding some element and saving, the MSysResources table didn’t appear. Same with a report. Even if I switched th „Design“ (I guess thats what “themes” are in Geman Access) and colour palettes of the forms and reports, I couldn‘t see a MSysResources table. Maybe you can find something in a screen-sharing session.

joyfullservice commented 4 years ago

Thanks for posting the additional clues. I am glad to see that the system tables are not being translated, so that brings us to the question of why the MSysResources table is not being created.

The other common use for the MSysResources table is for shared images. Are you able to add a shared image to the database? (See here for a screenshot)

Let me know if adding a shared image creates the table for you. If so, I might use that method to force the creation of the table if it doesn't happen when creating a new form.

effjot commented 4 years ago

Adding an image didn’t help either. I’ve also tried to save and load the thema, and to fiddle around with the theme-y settings in the properties.

effjot commented 4 years ago

BTW, I have sent an email regarding the screen-sharing session, but your server denied it as “probably spam””. I’ll try again later.

joyfullservice commented 4 years ago

It sounds like if the database is in an older format, (such as converted from a MDB file) it may not contain (or support) the MSysResources table. I have added some code in 4c2b5ed to check for the existence of the table before attempting a query against it.

joyfullservice commented 4 years ago

@effjot - Are you able to confirm whether this issue has been resolved in the latest version of the add-in? I am thinking that we have resolved the Unicode/encoding issues that were the original focus of this issue, but let me know if you are still having issues with the Umlauts and similar characters.

effjot commented 4 years ago

I created a blank database, exported it (resulting in a themes folder with Office Theme.thmx) and built from source. Building was successful, with the expected warning: Importing themes... WARNING: Unable to create MSysResources table. [0]

Using my umlaut test DB, both import and export seems to work alright with umlauts in XML and Tab formats.

However, I’m not sure why the build process needs to drop the MSysResoucres table instead of just deleting its contents. When I start a blank database, it contains the table already, so there should be no need to drop it and try magic with creating temporary Form objects. ;-)

blank-database-MSysResources

But I think this issue is solved.

joyfullservice commented 4 years ago

@effjot - Thanks for the additional feedback on this! I went back and reviewed the code, and I don't find anywhere that the MSysResources table is deleted. In clsDbTheme it does clear any existing records from the table, but doesn't delete the table itself. If I recall correctly, the cases where the MSysResources table could not be created were related to an older file format that didn't support themes and shared images.

In any case, I am glad to hear that the Umlauts are now working as intended. 😄 Thanks again for your help and contributions to make that happen!

effjot commented 4 years ago

Sorry, my bad! I came back to the tool after a longer break and in my memory confused the (IMHO unnecessary table drop in XML table data import, see issue #61) and the delete in the MSysResources issue here :-( Also, a manually created blank database has MSysResources, but the blank database in Build from Source hasn’t, so I thought there might be a drop table / DeleteObject in there somewhere.

By the way, I just noticed that the MSysResources table reappears when:

Closing and opening the database. This did not yet happen with version 3.1.62.
Creating a new form and switching to form view or layout view. Back then I only tried creating a form, which puts you in draft view / design view (not sure about the correct translation – the ruler&pen icon); this is not enough.

It is not a big issue anymore, because building from source works without stopping, and that’s what’s important. ;-)

joyfullservice / msaccess-vcs-addin

Umlaut not properly exported to .bas, End keyword truncated. Possibly UTF-8 / ANSI encoding problem #60