joyfullservice / msaccess-vcs-addin

Synchronize your Access Forms, Macros, Modules, Queries, Reports, and more with a version control system.

REGRESSION: ASCII is mangled in Version 3.3.1 #186

Closed A9G-Data-Droid closed 3 years ago

A9G-Data-Droid commented 3 years ago

I have a horizontal line comment I use: '—————————————————————————————————————————————————————————

When exported it becomes this UNICODE character: '覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧

This makes me think that other extended ASCII characters might have a similar problem.

A9G-Data-Droid commented 3 years ago

I took a peek at the code that saves a module... It has changed quite a bit since I last looked at it. There is too much going on. I know these features all exist for a reason, but it's getting quite bloated.

I think the solution is to first make unit tests like I did with TestUCS2toUTF8RoundTrip so that you can prove your export can make a round trip and come back just as it went out. This kind of testing is critical when adding layers to WriteFile. It gets more difficult when trying to build a test for SanitizeFile because you are intentionally breaking the output so it won't match the original.
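
For illustration, here is a rough sketch of the kind of round-trip test being suggested. The WriteFile argument order and the temp path are assumptions, not the add-in's actual API:

    ' Sketch only: assumes WriteFile strText, strPath and that
    ' ReadFile(strPath, strCharset) returns the file contents as a string.
    Public Function TestExtendedAsciiRoundTrip() As Boolean
        Dim strOriginal As String
        Dim strRoundTrip As String
        Dim strPath As String

        ' A long em-dash divider line like the one reported in this issue
        strOriginal = "'" & String(90, ChrW(&H2014))
        strPath = Environ$("TEMP") & "\RoundTripTest.bas"

        WriteFile strOriginal, strPath
        strRoundTrip = ReadFile(strPath, "_autodetect_all")

        ' True only if the text survives the export/read cycle unchanged
        TestExtendedAsciiRoundTrip = (strRoundTrip = strOriginal)
    End Function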

joyfullservice commented 3 years ago

Where are you seeing this as mangled? I just added a test to the Testing database and I can export and rebuild the database with no issues. It shows up correctly in VBA and in GitHub Desktop, and on GitHub...

(screenshots: the divider comment line rendered correctly in the VBA IDE, in GitHub Desktop, and on GitHub)

joyfullservice commented 3 years ago

I think the solution is to first make unit tests like I did with TestUCS2toUTF8RoundTrip so that you can prove your export can make a round trip and come back just as it went out. This kind of testing is critical when adding layers to WriteFile. It gets more difficult when trying to build a test for SanitizeFile because you are intentionally breaking the output so it won't match the original.

Fundamentally, the ultimate goal of this project is to facilitate the recreation of a database from source files that matches the original. For the purpose of testing the functionality, the Testing.accdb database can be built from source and exported again to ensure that the source files have not changed. This is more comprehensive than writing hundreds or even thousands of unit tests to try to cover every function and line of code. Additionally, the testing database has a set of tests that it runs internally and displays the output on startup. This gives you an additional visual check to make sure that the add-in is working as expected.

To test the add-in, you can do the following:

  1. Build the Testing.accdb database from source.
  2. Export all source using the add-in.
  3. Build the database from source.
  4. Close and reopen the testing database for all settings to take effect.
  5. Rename the source folder directory.
  6. Export all source again.
  7. Diff the two source folders and review any differences.

As of the current version, the only differences I find are (non-essential) changes with the frmForm20 Forms 2.0 VBE export files, which are probably not practical to fully remediate.

I recognize that this testing database does not cover every possible scenario and option that could be used, but more database components can be easily added as desired to test the reproduction of the database component from source. Using this approach, every component in the database goes through round-trip testing. For my own purposes, I felt this was adequate to meet my needs for testing to ensure that changes to the add-in were not causing undesired side effects.

If you encounter persistent reproducible issues, feel free to add them to the testing database and this will help us maintain the desired functionality as the add-in evolves.

I know this is kind of a long drawn-out explanation, but hopefully it communicates some of the rationale behind the approach I have been taking on testing. To be honest, if RubberDuck wasn't so painfully slow, I would probably use it more for unit testing in my projects, but hopefully someday it will improve. 😄

A9G-Data-Droid commented 3 years ago

Thanks for explaining. That's the difference between Integration Testing and Unit Testing. I agree that Integration Testing like this is a great way to show that the plugin works as advertised. Unit testing is still useful for critical components.

I tried to follow your process with the test DB and I am getting errors on Step 1.

ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\tbldefs\tblLinkedAccess.json Source: modImportExport.Build Error 3170: Could not find installable ISAM.
ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\tbldefs\tblLinkedCSV.json Source: modImportExport.Build Error 3170: Could not find installable ISAM.
ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\reports\rptDefaultPrinter.bas Source: modImportExport.Build Error 2128: Microsoft Access encountered errors while importing rptDefaultPrinter.
ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\reports\rptNavigationPaneGroups.bas Source: modImportExport.Build Error 2128: Microsoft Access encountered errors while importing rptNavigationPaneGroups.
ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\documents.json Source: modImportExport.Build Error 3265: Item not found in this collection.

I know that the linked CSV is hard coded to a path on your dev system. Same with the default printer being specific to your environment. I'm not sure about the other errors.

It made some error files:

errors.txt

Microsoft Access encountered an error while importing the object 'rptDefaultPrinter'.

Error encountered at line 22.
Expected: 'End'.  Found: 0x6801000068010000680100006801000000000000201c0000e010000001000000.

errors1.txt

Microsoft Access encountered an error while importing the object 'rptNavigationPaneGroups'.

Error encountered at line 24.
Expected: 'End'.  Found: 0x6801000068010000680100006801000000000000b42d00008601000001000000.

joyfullservice commented 3 years ago

You are doing this on the dev branch, right? That would have the latest version of the Testing database...

A9G-Data-Droid commented 3 years ago

No, I was using master because I thought it would match the released version of the add-in.

joyfullservice commented 3 years ago

No, I was using master because I thought it would match the released version of the add-in.

That would be a very logical conclusion. 😄 Setting the connection strings to relative paths is a very new change that hasn't rolled into master yet. If you build the add-in from source, and run the testing from the dev branch, you should have the very latest development version which includes the test I pushed today on the ASCII string in VBA.

Sorry about the confusion on that. After the next release on the master branch, you should be able to build and run the test database using the released add-in. I will probably be pushing another release to master soon, pending some potential fixes related to #180.

Also, that was a helpful distinction regarding Integration Testing and Unit Testing. I agree that both are very valuable in a project like this. 👍

A9G-Data-Droid commented 3 years ago

Ok, so the dev branch works wonderfully. I added a full ASCII table to that test module. However, my problem didn't repeat. Now I'm left wondering why my main project got a bunch of wrong characters in its export. The new test only proves that the add-in works...

A9G-Data-Droid commented 3 years ago

Spoke too soon! I reproduced it by making that line longer. So it's not the character itself that is getting mangled; rather, if that character is repeated a certain number of times, the whole line gets changed into a different character. Take a look at my PR. It's got the bad characters in it now. Due to the extended character set being present, you can see that a bunch of other characters were affected. It's like the code page changed.

A9G-Data-Droid commented 3 years ago

I believe I have isolated the problem:

? StringHasUnicode("'——————————————————————————————————————————————————————————————————————————————————————————")
True

It's getting converted to Unicode when it's a normal ASCII source.

A9G-Data-Droid commented 3 years ago

Ok, my PR now contains a proposed fix. It reduces StringHasUnicode to a simple one-liner. It just converts the string to Unicode and compares it to the original. If it's the same, then it was Unicode. If it wasn't Unicode, then the string should change into junk. I made a simple unit test for it too.
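
One way such a round-trip comparison could be written (a sketch using StrConv, not necessarily the exact code in the PR):

    ' Sketch only: round-trip the string through the ANSI code page.
    ' Characters that cannot survive the trip (true Unicode, rather than
    ' extended ASCII on the current code page) make the result differ.
    Public Function StringHasUnicode(strText As String) As Boolean
        StringHasUnicode = _
            (StrConv(StrConv(strText, vbFromUnicode), vbUnicode) <> strText)
    End Function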

While I was in there I noticed that the unit test for TestUCS2toUTF8RoundTrip is broken. That should be looked at but I'm out of time for today.

joyfullservice commented 3 years ago

I believe I have isolated the problem:

? StringHasUnicode("'——————————————————————————————————————————————————————————————————————————————————————————")
True

It's getting converted to Unicode when it's a normal ASCII source.

The name of this function is a bit misleading... (Due to going back and forth a few times on how to handle extended ASCII characters.) The description indicates its intended purpose: to return True if the string contains either Unicode or extended ASCII characters. See #154 for a more extended discussion on this topic, and why it was important to include the UTF-8 BOM with files that contained extended characters. (Basically, treat them as containing Unicode.)

A9G-Data-Droid commented 3 years ago

Thanks for the backstory. I agree with @Ik-12 that we should always export using the same format regardless of content.

I tested forcing the write without BOM and it did not fix the problem. So the issue is somewhere in the conversion to UTF8. The extended characters need to be converted to multibyte but they are coming out wrong.

First thing to note: when I dump the string coming into WriteFile like so:

With FSO.CreateTextFile(strPath, True, True)    ' overwrite:=True, unicode:=True (writes UTF-16)
    .WriteLine strText
    .Close
End With

The output appears identical to the output of the fancy code that converts to a byte array and writes that. This brings me to two conclusions:

  1. The code converting to UTF8 is correct
  2. The code converting to UTF8 is unnecessary (Why not just use FSO?)

So the problem is before WriteFile. First, the file is written using the native method .SaveAsText. We know this works. This narrows it down to ReadFile(strSourceFile, "_autodetect_all") being the source of the corruption.

I proved this by changing "_autodetect_all" to "Windows-1252" and it fixed the problem for me. I'm not sure if this would break it for people who have a different system codepage. We would need to test this on a foreign language system to find out.

What this says to me is that MS-Access .SaveAsText does not output in the system codepage.

A9G-Data-Droid commented 3 years ago

Ok, I have written a new Encoder that is far simpler and more straightforward. It uses two streams, reading from the old file while writing the new file at the same time. This might even be faster.

It is faithfully exporting my extended ASCII characters. Take a look and see if you like it. You might find that it could replace a bunch of other methods. What's odd is that I'm using "_autodetect_all" and now it's working.

joyfullservice commented 3 years ago

Thank you! This may end up helping with #180 as well...

I will also want to check the performance logs when running this against large projects. We might get some significant performance gains by reading and writing the data in chunks rather than line-by-line. (See the ReadFile function for an example of how this is used.)

A9G-Data-Droid commented 3 years ago

Eureka! A clue! So for my encoder to work, it needs to be in text mode, because we are telling it what text format to write. So the byte-mode chunking you were doing for performance can't be done here. I said to myself, "1 character should = 1 byte, right?"

So I changed my code from doing line at a time, to character mode:

        FROM: objOutputStream.WriteText .ReadText(adReadLine), adWriteLine
        TO:   objOutputStream.WriteText .ReadText(131072), adWriteChar

AND the identical corruption returned!

So I think this means the problem has something to do with the number of characters falling on the boundary of the chunks! Maybe this is why it only happens when the line is a certain length.

This is bad news for your 128k chunk method but it does isolate the problem even further.

joyfullservice commented 3 years ago

Hmm... I wonder whether, if we read the contents as binary and then converted to a text string after reading the file, we could solve both the performance and the corruption issues...

A9G-Data-Droid commented 3 years ago

That's what the new method does. It loads the file as binary:

    .Type = adTypeBinary
    .LoadFromFile strInputFile

then it reads it into a text stream in the defined input charset while writing that to the output file in any defined output charset:

    .Type = adTypeText
    .Charset = strInputCharset
    objOutputStream.Open
    objOutputStream.Charset = strOutputCharset
    Do While .EOS <> True
        objOutputStream.WriteText .ReadText(adReadLine), adWriteLine

This allows you to go from any encoding to any encoding supported by the OS using native OS encoding methods.

That is why I said you could use this method for anything, including replacing the UCS2 methods.
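
Pulling those excerpts together, a self-contained version of the two-stream approach might look like this (the procedure name is illustrative, and it assumes a reference to the Microsoft ActiveX Data Objects library):

    ' Convert a file from one charset to another using two ADODB streams.
    Public Sub ConvertFileEncoding(strInputFile As String, strOutputFile As String, _
        strInputCharset As String, strOutputCharset As String)

        Dim objInputStream As ADODB.Stream
        Dim objOutputStream As ADODB.Stream

        Set objInputStream = New ADODB.Stream
        Set objOutputStream = New ADODB.Stream

        With objInputStream
            ' Load the source file as raw bytes, then reinterpret it as text
            ' in the declared input charset.
            .Open
            .Type = adTypeBinary
            .LoadFromFile strInputFile
            .Position = 0
            .Type = adTypeText
            .Charset = strInputCharset

            ' Copy line by line, writing in the target charset.
            objOutputStream.Open
            objOutputStream.Charset = strOutputCharset
            Do While Not .EOS
                objOutputStream.WriteText .ReadText(adReadLine), adWriteLine
            Loop
            .Close
        End With

        ' Note: saving with the "utf-8" charset through ADODB.Stream also writes a UTF-8 BOM.
        objOutputStream.SaveToFile strOutputFile, adSaveCreateOverWrite
        objOutputStream.Close
    End Sub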

A9G-Data-Droid commented 3 years ago

We have come full circle. The issue has returned when I use the new 3.3.7 in dev branch. It must be the chunking.

joyfullservice commented 3 years ago

We have come full circle. The issue has returned when I use the new 3.3.7 in dev branch. It must be the chunking.

Would you be able to add something to the test database that would allow me to reproduce the issue?

joyfullservice commented 3 years ago

If the issue is with converting files, fa2ba83 may resolve it. If it is with reading files, you might do some similar testing with the ReadFile function. I did not find any performance penalty to copying the stream in a single operation in the aforementioned commit.
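
For reference, the single-operation copy amounts to something like the following, using the stream variables from the earlier snippet (a sketch; the actual change in fa2ba83 may differ):

    ' Read the entire input stream and write it out in one call,
    ' instead of looping over fixed-size chunks.
    objOutputStream.WriteText objInputStream.ReadText(adReadAll)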

joyfullservice commented 3 years ago

Okay, just pushed another update that should fix the reading of files as well, with no apparent performance penalty. The only thing left that uses chunks to read a file is something specific to ADP projects. I would guess that this should fix the encoding issue here, at least if it was caused by the chunking. 😄

A9G-Data-Droid commented 3 years ago

And that shows that it is NOT the chunking. This is a tricky problem. Now I'm seeing it happen because of _autodetect_all, which was my first hypothesis.

When the file first comes out of .SaveAsText it is immediately problematic. If you open the raw file in Notepad++ or Visual Studio Code, they both mistakenly assume that the file is Japanese and display the funky characters: the very same funky characters that end up encoded into the UTF-8 file by the conversion. Before that conversion occurs, I can manually select the windows-1251 encoding and the file displays properly in NP++ and VS Code. Just the same, if I swap _autodetect_all for windows-1251, the final UTF-8 encoded file shows all the characters correctly.

ARG! Have a nice weekend. I'm done for the day.

hecon5 commented 3 years ago

Is there a way to detect the code page/encoding, and have that be the chosen one? Or is autodetect supposed to do that?

joyfullservice commented 3 years ago

I think the GetACP API might be just what we are needing here... Wrap that in a select case to map to the correct encoding string (i.e. windows-1251) and anything else can use _autodetect_all. I tried this the other day, and it seemed to work great in terms of reporting the active code page used by Windows. You can also take a look at LanguageSettings.LanguageID for a possible built-in approach that doesn't require the API call.
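
A minimal sketch of that idea (the function name and the mapping below are illustrative, not the add-in's actual code):

    ' Map the active Windows code page to an ADODB.Stream charset string.
    #If VBA7 Then
        Private Declare PtrSafe Function GetACP Lib "kernel32" () As Long
    #Else
        Private Declare Function GetACP Lib "kernel32" () As Long
    #End If

    Public Function GetSystemEncoding() As String
        Select Case GetACP()
            Case 1252: GetSystemEncoding = "windows-1252"     ' Western European
            Case 1251: GetSystemEncoding = "windows-1251"     ' Cyrillic
            Case 932: GetSystemEncoding = "shift_jis"         ' Japanese
            Case Else: GetSystemEncoding = "_autodetect_all"  ' fall back to auto-detect
        End Select
    End Function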

joyfullservice commented 3 years ago

I think that I now understand what was happening with _autodetect_all. My hunch is that this parameter was telling the encoder to attempt to automatically detect the language based on the content. If this were actual written text, such as blog posts, this would probably work just fine. But VBA source code is more like a bunch of English keywords and some comments. While this might still work in many cases, if you use a lot of extended characters, such as the special lines described in this issue, it may cause the auto-detection to come to a different conclusion and map the file to an entirely different language encoding. Since (1) we have no control over the automatic detection, (2) source code reads differently than natural language, and (3) it may contain non-language content used as formatting and layout aids, using auto-detect is a very risky approach.

Instead of trying to auto-detect the language encoding for the non-Unicode VBA module content, I am now using a simple API call to return the active code page that is already being used by Windows, and encoding the file based on that. This removes all the guesswork and inconsistencies of _autodetect_all and gives us a solid approach for handling the extended character content used in code modules. Since the active code page is what is being used to edit and display what the user sees in the VBA IDE, this should provide a consistent editing experience where the code looks the same in VBA as it does in version control (even though the files are technically encoded differently). Building from source should correctly convert the UTF-8 files to the code page used by the VBA IDE, and provide a fully consistent round-trip experience.
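
In terms of the sketches above, the round trip would look roughly like this (hypothetical helper names; strTempFile and strSourceFile are placeholders):

    ' Export: text written by .SaveAsText in the active code page -> UTF-8 source file
    ConvertFileEncoding strTempFile, strSourceFile, GetSystemEncoding(), "utf-8"
    ' Build: UTF-8 source file -> active code page, ready to import back into Access
    ConvertFileEncoding strSourceFile, strTempFile, "utf-8", GetSystemEncoding()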

The one known exception to this is if the user selects the beta UTF-8 option in Windows 10 regional administrative settings. (See #180) Since the VBA IDE does not internally support UTF-8, it falls back to the code page mapping of the extended characters and may encounter conversion issues if certain extended characters are used in the code modules. For example, the BOM constants used in this project will not export/import correctly if you use this beta feature. I did some testing, and did not really find a simple way to resolve this, so at this point I plan to just leave that as a known limitation. (I really don't think it will affect many users.)

joyfullservice commented 3 years ago

@A9G-Data-Droid - Could you confirm that this is working correctly now as of the latest 3.3.8 dev version? Even though this has proved to be quite a complex issue, I think we have gained some very helpful benefits through it with the simplification of the UTF-8 encoding and eliminating the (sometimes inaccurate) language auto detection when converting files.

A9G-Data-Droid commented 3 years ago

Thanks! This is the way I would have handled it. This is a powerful feature that should allow people with different system charsets to share source files, since they are exchanged in a safe format and then imported in the proper format for each system.

I tested this latest version and it handles my new test file just fine.

My test results can be seen in PR #190