Random encoding issues in "Source file occurrence"

Ikalou commented 5 years ago

Hi,

We are randomly getting non-printable characters when using the very useful "Source file occurrences" feature.

[screenshot: poedit]

I don't think there is an encoding issue with the source file itself because the bug appears and disappears without us editing the file.

Any ideas?

Thanks!

vslavik commented 5 years ago

Can you please provide full reproduction instructions? There's certainly nothing "random" about this, but it's hard to investigate anything when all I have are some pictures. Please provide details about your OS, its version, Poedit version, attach relevant files and provide step-by-step instructions on what to do to see it myself. (See here for why I need this.)

At a glance, it's evident that charset detection changes between the two images, and without further information the most likely explanation is that it really changes, e.g. because you're saving the file somewhere in the meantime, differently. Part of the need for detailed step-by-step reproduction is to rule that out.

Ikalou commented 5 years ago

I am using version 2.2.3 on Windows 10.

I have attached a simple po file that exhibits the issue:

en.zip

And here is the source file:

s1.zip

Jumping into the source file repeatedly using the "References" context menu option sometimes displays the accented characters incorrectly for me. I was able to confirm using ProcessMonitor that no process other than Poedit reads or writes to the source file during that time.

vslavik commented 5 years ago

Can you please provide the information I asked for, i.e. precise step-by-step, keystroke-by-keystroke instructions on what to do with the files you attached? Specific actions, specific settings, specific strings, instead of high-level overviews like "use these files" or "jump around"; also where to put the files on disk (relative to each other and absolute, i.e. what kind of drive etc.). This may seem obvious to you, but isn't immediately clear to other people.

TIA!

vslavik commented 5 years ago

The files you attached are not the files in the screenshot. The PO file contains only a single nonsense string àâéèêîïôùûçàâéèêîïôùûçàâéèêîïôùûçàâéèêîïôùûçàâéèêîïôùûçàâéèêîïôùûçàâéèêîïôùûçàâéèêîïôùûç and the source file is just that line repeated over and over.

It looks like something is seriously wrong outside of Poedit there.

Ikalou commented 5 years ago

Thank you for looking into this.

The specific content of the file doesn't matter. Any file with accented characters will exhibit the behavior shown in the screenshot.

I'm sorry I can't provide you with an easily reproducible setup; the encoding issue does unfortunately seem to appear randomly.

Could there possibly be some sort of a race condition with wxConvAuto() not always getting the encoding right in fileviewer.cpp?

I'll try to build poedit and investigate the issue sometime next week.

vslavik commented 5 years ago

This bug was one of the more interesting wild goose chases.

Upon investigation, none of the conclusions jumped to above were correct: it's deterministic (affected by OS memory management, but not random), there's no race condition, and it affects only very particular files. It is not "any file with accented characters", and the presence of accents or diacritics isn't even required to trigger it.

Specifically, it only affected UTF-8 files with DOS line endings, no BOM (this combination is culturally rare: both CRLF and UTF-8 BOMs are a thing on Windows, but atypical anywhere else), just the right size and some non-ASCII characters to observe the difference.
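To check whether a given source file fits the profile described above, something like the following sketch could be used. It is written in Python for brevity and is not Poedit or wxWidgets code. The function name `fits_affected_profile` is made up for illustration, and the "just the right size" condition is not checked because the exact sizes aren't stated here.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def fits_affected_profile(data: bytes) -> bool:
    """Rough check: UTF-8, no BOM, CRLF line endings, some non-ASCII bytes.

    Hypothetical helper. The file-size condition is deliberately omitted.
    """
    if data.startswith(UTF8_BOM):
        return False  # files with a BOM were not affected
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8 at all
    has_crlf = b"\r\n" in data
    # Non-ASCII bytes are only needed to *see* the problem, not to trigger it.
    has_non_ascii = any(byte > 0x7F for byte in data)
    return has_crlf and has_non_ascii
```

For example, `fits_affected_profile("é\r\n".encode("utf-8"))` is true, while the same text with LF-only line endings or with a leading BOM is not.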

The culprit is a bug in wxWidgets' conversion of buffers read from FILE*, due to CRT's behavior of implicitly converting DOS newlines to \n and thus returning fewer bytes than the file's size. While this was addressed in the code, the mitigation stopped being effective at some point in the past when support for embedded NULs in strings was added across the codebase… and nobody noticed this because strings tend to be converted into NUL-terminated C strings awfully often, hiding any corrupted tails.
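The mechanism can be imitated with a small simulation. This is only a sketch in Python, not the wxWidgets code: `simulated_text_mode_read` stands in for the CRT's text-mode translation, `read_into_buffer` and `guess_charset` are hypothetical helpers, and the Latin-1 fallback is an assumption chosen just to make a wrong guess visible. That a stale buffer tail is what flips the detected charset is my reading of the description above, not something the commit is quoted as saying.

```python
def simulated_text_mode_read(raw: bytes) -> bytes:
    """Imitate a text-mode read: every CRLF pair comes back as a single LF."""
    return raw.replace(b"\r\n", b"\n")


def read_into_buffer(raw: bytes, fill: int) -> tuple:
    """Allocate a buffer sized to the on-disk length, then 'read' into it.

    `fill` models whatever bytes happened to be in that memory beforehand.
    Returns the buffer and the number of bytes actually read.
    """
    buf = bytearray([fill]) * len(raw)
    data = simulated_text_mode_read(raw)
    buf[: len(data)] = data
    return buf, len(data)


def guess_charset(data: bytes) -> str:
    """Toy charset guess: UTF-8 if it decodes cleanly, else a Latin-1 fallback."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # assumed fallback, for illustration only


raw = "é\r\nà\r\n".encode("utf-8")        # 8 bytes on disk
buf, n_read = read_into_buffer(raw, 0xFF)  # only 6 bytes are actually read

wrong = guess_charset(bytes(buf))            # uses the file size: sees the stale tail
right = guess_charset(bytes(buf[:n_read]))   # uses the byte count actually read
```

With `fill=0xFF` the full-length buffer no longer decodes as UTF-8, while the first `n_read` bytes do. With `fill=0x00` both look fine, and cutting the buffer at the first NUL hides the tail entirely, which matches the point about C strings masking the problem.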

https://github.com/wxWidgets/wxWidgets/commit/17e2f8c477e4064a042c8deea1adb2efde8109e8

vslavik / poedit

Random encoding issues in "Source file occurrence" #605