Closed A9G-Data-Droid closed 3 years ago
I took a peek at the code that saves a module... It has changed quite a bit since I last looked at it. There is too much going on. I know all of these features exist for a reason, but it's getting quite bloated.
I think the solution is to first make unit tests like I did with `TestUCS2toUTF8RoundTrip`, so that you can prove your export can make a round trip and come back just as it went out. This kind of testing is critical when adding layers to `WriteFile`. It gets more difficult when trying to build a test for `SanitizeFile`, because you are intentionally breaking the output so it won't match the original.
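A round-trip check of this kind can be sketched in Python. The helper below is a hypothetical illustration (not the add-in's actual `TestUCS2toUTF8RoundTrip` code), and it assumes BMP-only text, where UCS-2 and UTF-16-LE coincide:

```python
def ucs2_utf8_round_trip_ok(text: str) -> bool:
    """Return True if text survives UCS-2 -> UTF-8 -> UCS-2 unchanged."""
    # Access exports modules in UCS-2 (UTF-16-LE for BMP characters),
    # while the add-in stores source files as UTF-8.
    ucs2 = text.encode("utf-16-le")
    utf8 = ucs2.decode("utf-16-le").encode("utf-8")
    back = utf8.decode("utf-8").encode("utf-16-le")
    return back == ucs2
```

Any export path that breaks this property (e.g. by mangling extended characters) would be caught immediately by such a test.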
Where are you seeing this as mangled? I just added a test to the Testing database and I can export and rebuild the database with no issues. It shows up correctly in VBA and in GitHub Desktop, and on GitHub...
Fundamentally, the ultimate goal of this project is to facilitate the recreation of a database from source files that matches the original. For the purpose of testing the functionality, the `Testing.accdb` database can be built from source and exported again to ensure that the source files have not changed. This is more comprehensive than writing hundreds or even thousands of unit tests to try to cover every function and line of code. Additionally, the testing database has a set of tests that it runs internally, displaying the output on startup. This gives you an additional visual check to make sure that the add-in is working as expected.
To test the add-in, you can build the `Testing.accdb` database from source.

As of the current version, the only differences I find are (non-essential) changes with the `frmForm20` Forms 2.0 VBE export files, which are probably not practical to fully remediate.
I recognize that this testing database does not cover every possible scenario and option that could be used, but more database components can easily be added as desired to test their reproduction from source. Using this approach, every component in the database goes through round-trip testing. For my own purposes, I felt this was adequate to ensure that changes to the add-in were not causing undesired side effects.
If you encounter persistent reproducible issues, feel free to add them to the testing database and this will help us maintain the desired functionality as the add-in evolves.
I know this is kind of a long drawn-out explanation, but hopefully it communicates some of the rationale behind the approach I have been taking on testing. To be honest, if RubberDuck wasn't so painfully slow, I would probably use it more for unit testing in my projects, but hopefully someday it will improve. 😄
Thanks for explaining. That's the difference between Integration Testing and Unit Testing. I agree that Integration Testing like this is a great way to show that the plugin works as advertised. Unit testing is still useful for critical components.
I tried to follow your process with the test DB and I am getting errors on Step 1.
ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\tbldefs\tblLinkedAccess.json Source: modImportExport.Build Error 3170: Could not find installable ISAM.
ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\tbldefs\tblLinkedCSV.json Source: modImportExport.Build Error 3170: Could not find installable ISAM.
ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\reports\rptDefaultPrinter.bas Source: modImportExport.Build Error 2128: Microsoft Access encountered errors while importing rptDefaultPrinter.
ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\reports\rptNavigationPaneGroups.bas Source: modImportExport.Build Error 2128: Microsoft Access encountered errors while importing rptNavigationPaneGroups.
ERROR: Build error in: \msaccess-vcs-integration\Testing\Testing.accdb.src\documents.json Source: modImportExport.Build Error 3265: Item not found in this collection.
I know that the linked CSV is hard-coded to a path on your dev system, and the default printer is likewise specific to your environment. I'm not sure about the other errors.
It made some error files:
errors.txt
Microsoft Access encountered an error while importing the object 'rptDefaultPrinter'.
Error encountered at line 22.
Expected: 'End'. Found: 0x6801000068010000680100006801000000000000201c0000e010000001000000.
errors1.txt
Microsoft Access encountered an error while importing the object 'rptNavigationPaneGroups'.
Error encountered at line 24.
Expected: 'End'. Found: 0x6801000068010000680100006801000000000000b42d00008601000001000000.
You are doing this on the `dev` branch, right? That would have the latest version of the Testing database...
No, I was using `master` because I thought it would match the released version of the add-in.
That would be a very logical conclusion. 😄 Setting the connection strings to relative paths is a very new change that hasn't rolled into `master` yet. If you build the add-in from source and run the testing from the `dev` branch, you should have the very latest development version, which includes the test I pushed today on the ASCII string in VBA.
Sorry about the confusion on that. After the next release on the `master` branch, you should be able to build and run the test database using the released add-in. I will probably be pushing another release to `master` soon, pending some potential fixes related to #180.
Also, that was a helpful distinction regarding Integration Testing and Unit Testing. I agree that both are very valuable in a project like this. 👍
Ok, so the `dev` branch works wonderfully. I filled a full ASCII table into that test module. However, my problem didn't repeat. Now I'm left wondering why my main project got a bunch of wrong characters in its export. The new test only proves that the add-in works...
Spoke too soon! I reproduced it by making that line longer. So it's not the character itself that is getting mangled; rather, if that character is repeated a certain number of times, the whole line gets changed into different characters. Take a look at my PR; it's got the bad characters in it now. Because the extended character set is present, you can see that a bunch of other characters were affected. It's like the code page changed.
I believe I have isolated the problem:
? StringHasUnicode("'——————————————————————————————————————————————————————————————————————————————————————————")
True
It's getting converted to Unicode when it's a normal ASCII source.
Ok, my PR now contains a proposed fix. It makes `StringHasUnicode` into a simple one-liner: it converts the string to Unicode and compares it to the original. If it's the same, then it was Unicode; if it wasn't Unicode, the conversion should turn it into junk. I made a simple unit test for it too.
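The idea behind the one-liner can be illustrated in Python (a hedged sketch with a hypothetical function name, assuming a Windows-1252 ANSI code page): round-trip the string through the ANSI encoding, and flag it as Unicode only if something changes or fails to encode.

```python
def string_has_unicode(text: str, ansi: str = "cp1252") -> bool:
    # If every character survives a round trip through the ANSI code page,
    # the string needs no Unicode handling; otherwise it does.
    try:
        return text.encode(ansi).decode(ansi) != text
    except UnicodeEncodeError:
        return True
```

Notably, a line of em dashes passes cleanly under this definition: the em dash is an extended ANSI character (0x97 in Windows-1252), not Unicode-only, which is exactly the case that was being misdetected.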
While I was in there, I noticed that the unit test for `TestUCS2toUTF8RoundTrip` is broken. That should be looked at, but I'm out of time for today.
The name of this function is a bit misleading... (Due to going back and forth a few times on how to handle extended ASCII characters.) The description indicates its intended purpose: to return true if the string contains either Unicode or extended ASCII characters. See #154 for a more extended discussion on this topic, and why it was important to include the UTF-8 BOM with files that contained extended characters. (Basically, treat them as containing Unicode.)
Thanks for the backstory. I agree with @Ik-12 that we should always export using the same format regardless of content.
I tested forcing the write without a BOM and it did not fix the problem. So the issue is somewhere in the conversion to UTF-8. The extended characters need to be converted to multibyte, but they are coming out wrong.
First thing to note: when I dump the string coming into `WriteFile` like so:

```vba
With FSO.CreateTextFile(strPath, True, True)
    .WriteLine strText
    .Close
End With
```
The output appears identical to the fancy code that converts to a byte array and writes that. This brings me to 2 conclusions:
So the problem is before `WriteFile`. First, the file is written using the native method `.SaveAsText`; we know this works. This narrows it down to `ReadFile(strSourceFile, "_autodetect_all")` being the source of the corruption.
I proved this by changing `"_autodetect_all"` to `"Windows-1252"`, and it fixed the problem for me. I'm not sure if this would break it for people who have a different system code page. We would need to test this on a foreign-language system to find out.
What this says to me is that MS Access `.SaveAsText` does not output in the system code page.
Ok, I have written a new encoder that is far simpler and more straightforward. It uses two streams: it reads from the old file while writing the new file at the same time. This might even be faster.
It is faithfully exporting my extended ASCII characters. Take a look and see if you like it. You might find that it could replace a bunch of other methods. What's odd is that I'm using `_autodetect_all` and now it's working.
Thank you! This may end up helping with #180 as well...
I will also want to check the performance logs when running this against large projects. We might get some significant performance gains by reading and writing the data in chunks rather than line-by-line. (See the `ReadFile` function for an example of how this is used.)
Eureka! A clue! For my encoder to work, it needs to be in text mode, because we are telling it what text format to write. So the byte-mode chunking you were doing for performance can't be done here. I said to myself, "1 character should = 1 byte, right?"
So I changed my code from one line at a time to character mode:

```vba
' FROM:
objOutputStream.WriteText .ReadText(adReadLine), adWriteLine
' TO:
objOutputStream.WriteText .ReadText(131072), adWriteChar
```
AND the identical corruption returned!
So I think this means the problem has something to do with the number of characters falling on the boundary of the chunks! Maybe this is why it only happens when the line is a certain length.
This is bad news for your 128 KB chunk method, but it does isolate the problem even further.
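The boundary effect is easy to reproduce in Python: decoding fixed-size byte chunks independently splits multibyte characters, while a stateful decoder carries the partial bytes across chunk boundaries. (This is an illustration of the failure mode, not the add-in's code.)

```python
import codecs

text = "\u2014" * 10                    # each em dash is 3 bytes in UTF-8
data = text.encode("utf-8")             # 30 bytes total
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

# Decoding each chunk on its own splits characters at chunk boundaries:
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)
assert naive != text                    # replacement characters appear

# An incremental decoder buffers partial characters between chunks:
decoder = codecs.getincrementaldecoder("utf-8")()
stateful = "".join(decoder.decode(c) for c in chunks)
stateful += decoder.decode(b"", final=True)
assert stateful == text
```

This is why the corruption only shows up once a line of repeated extended characters grows long enough to straddle a chunk boundary.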
Hmm... I wonder: if we read the contents as binary, then converted to a text string after reading the file, could we solve both the performance and the corruption issues?
That's what the new method does. It loads the file as binary:

```vba
.Type = adTypeBinary
.LoadFromFile strInputFile
```

Then it reads it into a text stream of the defined type while writing that to the output file in any defined type:

```vba
.Type = adTypeText
.Charset = strInputCharset
objOutputStream.Open
objOutputStream.Charset = strOutputCharset
Do While .EOS <> True
    objOutputStream.WriteText .ReadText(adReadLine), adWriteLine
Loop
```
This allows you to go from any encoding to any encoding supported by the OS using native OS encoding methods.
That is why I said you could use this method for anything. Including replacing the UCS2 methods.
We have come full circle. The issue has returned when I use the new 3.3.7 in the `dev` branch. It must be the chunking.
Would you be able to add something to the test database that would allow me to reproduce the issue?
If the issue is with converting files, fa2ba83 may resolve it. If it is with reading files, you might do some similar testing with the `ReadFile` function. I did not find any performance penalty to copying the stream in a single operation in the aforementioned commit.
Okay, just pushed another update that should fix the reading of files as well, with no apparent performance penalty. The only thing left that uses chunks to read a file is something specific to ADP projects. I would guess that this should fix the encoding issue here, at least if it was caused by the chunking. 😄
And that shows that it is NOT the chunking. This is a tricky problem. Now I'm seeing it happen because of the `_autodetect_all`, which was my first hypothesis.
When the file first comes out of `.SaveAsText`, it is immediately problematic. If you open the raw file in Notepad++ or Visual Studio Code, they both mistakenly assume that the file is Japanese and display the funky characters, the very same funky characters that become encoded into the UTF-8 file by the conversion. Before that conversion occurs, I can manually select the encoding type of `windows-1251` and the file displays properly in NP++ and VSCode. Just the same, if I swap `_autodetect_all` for `windows-1251`, the final encoded UTF-8 file correctly shows all the characters.
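This misdetection is reproducible outside Access. In the Windows ANSI code pages, the em dash is the single byte 0x97; a run of them is also a valid Shift-JIS byte sequence, where each 0x97 0x97 pair decodes to the kanji 覧, matching the corruption shown earlier in this thread. A quick Python check (illustrative only, using Windows-1252 for the ANSI reading):

```python
raw = b"\x97" * 8                  # eight em dashes as Windows ANSI bytes

as_cp1252 = raw.decode("cp1252")   # the intended reading
as_sjis = raw.decode("shift_jis")  # what a "Japanese" auto-detect guess yields

assert as_cp1252 == "\u2014" * 8   # '—' * 8
assert as_sjis == "\u89a7" * 4     # '覧' * 4: each byte pair becomes one kanji
```

So a long enough run of 0x97 bytes looks statistically like Japanese text, and the auto-detector picks the wrong code page for the whole file.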
ARG! Have a nice weekend. I'm done for the day.
Is there a way to detect the code page/encoding, and have that be the chosen one? Or is autodetect supposed to do that?
I think the GetACP API might be just what we need here... Wrap that in a Select Case to map to the correct encoding string (i.e. `windows-1251`), and anything else can use `_autodetect_all`. I tried this the other day, and it seemed to work great in terms of reporting the active code page used by Windows. You can also take a look at `LanguageSettings.LanguageID` for a possible built-in approach that doesn't require the API call.
I think that I now understand what was happening with `_autodetect_all`. My hunch is that this parameter was telling the encoder to attempt to automatically detect the language based on the content. If this were actual written text, such as blog posts, this would probably work just fine. But VBA source code is more like a bunch of English keywords and some comments. While this might still work in many cases, if you use a lot of extended characters, such as the special lines described in this issue, the auto detection may come to a different conclusion and map the file to an entirely different language encoding. Since (1) we have no control over the automatic detection, (2) source code reads differently than natural language, and (3) it may contain non-text content used as formatting and layout aids, using auto-detect is a very risky approach.
Instead of trying to auto-detect the language encoding for the non-Unicode VBA module content, I am now using a simple API call to return the active code page that is already being used by Windows, and encoding the file based on that. This removes all the guesswork and inconsistencies of `_autodetect_all` and gives us a solid approach for handling the extended character content used in code modules. Since the active code page is what is used to edit and display what the user sees in the VBA IDE, this should provide a consistent editing experience where the code looks the same in VBA as it does in version control (even though the files are technically encoded differently). Building from source should correctly convert the UTF-8 files to the code page used by the VBA IDE, and provide a fully consistent round-trip experience.
The one known exception to this is if the user selects the beta UTF-8 option in Windows 10 regional administrative settings. (See #180) Since the VBA IDE does not internally support UTF-8, it falls back to the code page mapping of the extended characters and may encounter conversion issues if certain extended characters are used in the code modules. For example, the BOM constants used in this project will not export/import correctly if you use this beta feature. I did some testing, and did not really find a simple way to resolve this, so at this point I plan to just leave that as a known limitation. (I really don't think it will affect many users.)
@A9G-Data-Droid - Could you confirm that this is working correctly now as of the latest 3.3.8 `dev` version? Even though this has proved to be quite a complex issue, I think we have gained some very helpful benefits through it, with the simplification of the UTF-8 encoding and the elimination of the (sometimes inaccurate) language auto detection when converting files.
Thanks! This is the way I would have handled it. This is a powerful feature that should allow people with different system charsets to share source files: they are traded in a safe format and then imported in the proper format for each system.
I tested this latest version and it handles my new test file just fine.
My test results can be seen in PR #190
I have a horizontal-line comment I use:
'—————————————————————————————————————————————————————————
When exported, it becomes this Unicode character:
'覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧覧
This makes me think that other extended ASCII characters might have a similar problem.