Escaped unicode characters not preserved when writing out unrelated changes

timtebeek commented 1 year ago

Noticed this when I ran StringFormatted on Apache Wicket wicket-util/src/main/java/org/apache/wicket/util/string/Strings.java line 435:

git clone git@github.com:apache/wicket.git
cd wicket;
# Get some coffee
mvn org.openrewrite.maven:rewrite-maven-plugin:4.36.0:run
  -Drewrite.recipeArtifactCoordinates=org.openrewrite.recipe:rewrite-migrate-java:1.13.0
  -DactiveRecipes=org.openrewrite.java.migrate.lang.StringFormatted

Unsure if it's only for JavaDoc, or for regular code as well. Thought to report it here, as it's troublesome when enforcing rewrite:dryRun does not make any changes.

traceyyoshima commented 1 year ago

@timtebeek thanks for reporting the issue - taking a look!

traceyyoshima commented 1 year ago

Test to reproduce the issue:

    @Test
    fun preserveUnicodeEncoding(jp: JavaParser) = assertParsePrintAndProcess(
        jp,
        CompilationUnit,
        """
            class A {
                /**
                 * '\u4F60\u597D'
                 */
                void method(String escapedUnicodeString) {}
            }
        """
    )

traceyyoshima commented 1 year ago

Hi @timtebeek, I looked into the parsing issue and would appreciate more context on how the problem affects your work.

I'm curious if the issue prevents you from making any important upgrades; if the issue only happens a few times and may be fixed by hand, and/or if the issue happens often.

For context: We can fix the issue, but it's not a quick fix. The parsing issue occurs in the Java compiler when the JavaDoc is scanned by the DocTreeScanner. The issue is not a quick fix because the scanner is interested in the String value ('你好') and does not preserve the capitalization of the original source. So, '\u4F60\u597D' is changed to '\u4f60\u597d' in the Node of the DocTree. (It would have been a simple fix if the Node coming out of the java compiler preserved the original encoding.)

Also, I'd be happy to hop on zoom and provide an overview of how the parsing works, and how to preserve the source code if you'd like to fix the issue.

TDLR on a fix:

Store the position of the original Unicode encoding during init() using the cursor (link) in the JavadocVisitor(s).
- The preservation of source code would be a similar concept to preserving line breaks in JavaDocs.
Detect and replace the Unicode characters when the DocTree is parsed in convertMultiline (link).

timtebeek commented 1 year ago

(quick answer at it's close to midnight here) Appreciate the thorough analysis and outline of what would be required for a proper fix.

In this case it was something I stumbled upon while upgrading Wicket, and only then in very few files, easily reverted by hand. It's in no way blocking anything important at the moment. Mostly just a curiosity that I thought to report for completeness sake.

From what it sounds like this might be one to have on the backlog as a known limitation for now, until someone comes along with a more pressing need for proper handling.

traceyyoshima commented 1 year ago

Sounds good - I'll move the issue to the backlog, for now, thanks for the report!

tkvangorder commented 1 year ago

Seems related to #2374

knutwannheden commented 1 year ago

Also just stumbled across this while parsing the JDK sources. The java/util/Arrays.java file contains the following piece in the Javadoc for copyOfRange(char[], int, int): {@code '\u005cu0000'} which ends up as a DCLiteral with body text value \u0000. I was hoping it would be possible to apply the same trick as in org.openrewrite.java.isolated.ReloadableJava17ParserVisitor#visitLiteral(): https://github.com/openrewrite/rewrite/blob/994273d515a472f81fff1f675ee8f02e5ee86e34/rewrite-java-17/src/main/java/org/openrewrite/java/isolated/ReloadableJava17ParserVisitor.java#L814-L817

I.e. extract the corresponding original source text using the node's start and end position. But if I understand you correctly, @traceyyoshima , it isn't quite that simple here?

Please also note that the fix would also need to correct the visitText(String) method. The fix will indeed end up being rather ugly...

openrewrite / rewrite

Escaped unicode characters not preserved when writing out unrelated changes #2340