pygame / pygame

🐍🎮 pygame (the library) is a Free and Open Source python programming language library for making multimedia applications like games built on top of the excellent SDL library. C, Python, Native, OpenGL.
https://www.pygame.org

pygame.scrap.get("charset=utf-8") returns utf-16 encoded text on Windows #3790

Open quswadress opened 1 year ago

quswadress commented 1 year ago

Environment:

Current behavior:

The bytes returned by pygame.scrap.get("text/plain;charset=utf-8")[:-1] cannot be decoded as UTF-8, but they can be decoded as UTF-16. This is not user friendly: if the requested type says utf-8, it should be possible to decode the result as UTF-8.
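The behavior described above can be reproduced without pygame or a clipboard. The sketch below is only an illustration: `raw` is a simulated payload, built on the assumption that the Windows clipboard data is little-endian UTF-16 followed by a two-byte null terminator, and `try_decode` is a hypothetical helper.

```python
# Simulated clipboard payload: an assumption standing in for the bytes
# that pygame.scrap.get() returned on Windows in this report.
sample = "( \u0361\u00b0 \u035c\u0296 \u0361\u00b0)"
raw = sample.encode("utf-16-le") + b"\x00\x00"  # UTF-16 text plus 2-byte terminator


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")       # None: 0xB0 is not a valid UTF-8 start byte
as_utf16 = try_decode(raw, "utf-16-le")  # decodes; the terminator becomes "\x00"

# ASCII-only text is a special case: every second byte of its UTF-16 form
# is 0x00, so stripping the null bytes happens to leave valid UTF-8.
ascii_raw = "hello".encode("utf-16-le")
ascii_stripped = ascii_raw.replace(b"\x00", b"").decode("utf-8")

print(as_utf8 is None)
print(as_utf16.rstrip("\x00") == sample)
print(ascii_stripped == "hello")
```

The last part suggests why English text appears to decode correctly once null bytes are removed, while text containing non-ASCII characters does not.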

Expected behavior:

pygame.scrap.get("text/plain;charset=utf-8").decode("utf-8") should return a plain Python string containing the text I copied.

Screenshots

English text is decoded as it should be:

But non-ASCII Unicode characters are not decoded:

Steps to reproduce:

  1. pip install pygame
  2. Copy and paste the test code below into the main.py file.
  3. Copy any non-English text (in the screenshots was the text "( ͡° ͜ʖ ͡°)").
  4. Run python main.py.

Test code

import pygame

pygame.init()
screen = pygame.display.set_mode((512, 512))
pygame.scrap.init()
font = pygame.font.SysFont("arial", 36)
clock = pygame.time.Clock()
running = True
while running:
    clock.tick(30)
    screen.fill("black")
    events = pygame.event.get()
    for event in events:
        if event.type == pygame.QUIT or (event.type == pygame.KEYDOWN and event.key == pygame.K_ESCAPE):
            running = False
    # [:-1] removes the null character at the end of the data;
    # "or b''" covers scrap.get() returning None when no such data is available
    text = (pygame.scrap.get("text/plain;charset=utf-8") or b"")[:-1]
    rendered_text = font.render(text.replace(b"\x00", b"").decode("utf-8", errors="ignore"), False, "white")
    screen.blit(rendered_text, (screen.get_width()//2-rendered_text.get_width()//2, screen.get_height() // 3))

    rendered_text = font.render(text.decode("utf-16", errors="ignore"), False, "white")
    screen.blit(rendered_text, (screen.get_width()//2-rendered_text.get_width()//2, screen.get_height() // 3 * 2))
    pygame.display.flip()

MarcellPerger1 commented 1 month ago

Here is what I think is happening:

Windows uses UTF-16 encoding for Unicode.

So in src_c/scrap_win.c: https://github.com/pygame/pygame/blob/8d929587ec246e214408763e778ec2a1813c9ff7/src_c/scrap_win.c#L274

GetClipboardData will give utf-16 (not utf-8) data.

It seems that this data is never actually converted: https://github.com/pygame/pygame/blob/8d929587ec246e214408763e778ec2a1813c9ff7/src_c/scrap_win.c#L297-L307

To get this to return utf-8, you would need to convert it from utf-16 to utf-8 here (using WideCharToMultiByte from the Windows API or using one of the Python C API functions or something else).
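Until such a conversion exists in the C code, the same step can be approximated on the Python side. This is a minimal sketch, not pygame's API: `clipboard_utf16_to_utf8` and `get_clipboard_text` are hypothetical helpers, and the sketch assumes the Windows data is little-endian UTF-16, optionally followed by a null terminator.

```python
import sys


def clipboard_utf16_to_utf8(raw: bytes) -> bytes:
    """Re-encode UTF-16-LE clipboard bytes as UTF-8, dropping trailing NULs.

    Hypothetical helper: assumes `raw` is little-endian UTF-16, optionally
    followed by a null terminator.
    """
    if len(raw) % 2:
        # An odd length cannot be whole UTF-16 code units (for example after
        # a [:-1] slice cut the terminator in half), so drop the stray byte.
        raw = raw[:-1]
    text = raw.decode("utf-16-le", errors="replace")
    return text.rstrip("\x00").encode("utf-8")


def get_clipboard_text(raw, platform=sys.platform) -> str:
    """Turn bytes from the clipboard into a str (hypothetical workaround).

    `raw` is whatever pygame.scrap.get("text/plain;charset=utf-8") returned,
    which may be None when no text is available.
    """
    if raw is None:
        return ""
    if platform == "win32":
        # Assumption: on Windows the bytes are UTF-16, as reported in this issue.
        raw = clipboard_utf16_to_utf8(raw)
    return raw.rstrip(b"\x00").decode("utf-8", errors="replace")


# Intended use inside a pygame program (not executed here):
#   text = get_clipboard_text(pygame.scrap.get("text/plain;charset=utf-8"))
```

The platform check is a stopgap. If pygame starts returning real UTF-8 on Windows, the `win32` branch would corrupt the text and should be removed.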