martinling / libserialport

Unofficial personal repository for libserialport - see git://sigrok.org/libserialport for official repo
http://sigrok.org/wiki/Libserialport
GNU Lesser General Public License v3.0
65 stars 34 forks source link

Problem with CH340 on Windows #29

Closed glebm closed 3 years ago

glebm commented 5 years ago

Devices with CH340 work just fine with libserialport on Linux, but not on Windows.

This issue has been encountered by others, for example: https://github.com/iforce2d/thrustTester#arduino.

Environment:

  1. Arduino Nano clone that has a CH340 USB-Serial chip.
  2. Windows 10 with the latest CH340 drivers from the manufacturer's website.
  3. msys2 mingw64 built libserialport dll (https://github.com/Alexpux/MINGW-packages/pull/4841).

Reproduction:

Open port in read-write mode.
Write 1 byte
Read 1 byte
Write 1 byte
Read 1 byte
Write 2 bytes
Read 1 byte

All reads and writes are blocking, with large timeout (3 seconds). The last read returns 0 (timeout) on Windows even though it shouldn't.

The exact same code on Linux works. pyserial implementation on Windows also works.

I didn't get very far with investigating this:

  1. There is nothing interesting in LIBSERIALPORT_DEBUG output.
  2. Running with gdb produces a segfault in a Windows syscall when opening the port but it doesn't result in a crash (not sure why). Nothing else that I could find but I know next to nothing about these things.

/cc @iforce2d @jamesljlster @facchinm

martinling commented 5 years ago

Hi @glebm,

It's not 100% clear yet whether this is a libserialport bug or a CH340 driver bug. Obviously the fact it works with pyserial is interesting and raises the possibility that libserialport is doing something wrong.

On the other hand, the problem only shows up with this particular device & driver. And libserialport doesn't do anything chipset specific - it uses the exact same calls into the standard Windows serial API regardless of the attached device. The driver is responsible for everything that happens beyond that. So any device-specific difference in behaviour would normally point quite strongly towards a driver bug.

The difference between pyserial and libserialport here may just involve the two making slightly different sequences of API calls - both of them valid - where one set triggers the driver bug whilst the other doesn't.

Either way, I think what would be needed to investigate further is to somehow capture the sequence of system API calls and results that are occuring in each test case:

  1. libserialport on CH340
  2. pyserial on CH340
  3. libserialport on ATMega16U2
  4. pyserial on ATMega16U2

I'm not sure how best that can be done these days - some brief reading suggests the ProcMon tool doesn't capture serial port calls, and PortMon is obsolete. I guess one would need to use a general purpose system/dll call tracing tool - there seem to be a few available but I'm not familiar with any of them.

glebm commented 5 years ago

I tried using DrMemory's dstrace (latest GitHub release) but the output is huge (perhaps you'll have better luck following it, attached).

dstrace results on CH340:

Interestingly, running DrMemory on my very simple program reports some bugs, such as UNADDRESSABLE ACCESS:

Dr. Memory version 1.11.17856 build 1 built on Nov 27 2018 15:50:14
Dr. Memory results for pid 12476: "serial_send.exe"
Application cmdline: ".\serial_send.exe .\serial_constants.h COM3"
Recorded 115 suppression(s) from default C:\Users\Gleb\Apps\DrMemory\bin64\suppress-default.txt

Error #1: UNADDRESSABLE ACCESS beyond heap bounds: reading 0x0000000001b11c0c-0x0000000001b11c10 4 byte(s)
# 0 replace_memmove                   [c:\projects\drmemory\drmemory\replace.c:763]
# 1 libserialport-0.dll!?            +0x0      (0x000000006d68896d <libserialport-0.dll+0x896d>)
# 2 KERNELBASE.dll!DeviceIoControl   +0x65     (0x00007ffdbcd92766 <KERNELBASE.dll+0x32766>)
# 3 libserialport-0.dll!?            +0x0      (0x000000006d6814df <libserialport-0.dll+0x14df>)
# 4 atexit                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:439]
# 5 __tmainCRTStartup                 [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:339]
# 6 .l_start                          [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:223]
# 7 KERNEL32.dll!BaseThreadInitThunk +0x13     (0x00007ffdbfbd3034 <KERNEL32.dll+0x13034>)
Note: @0:00:00.516 in thread 5712
Note: refers to 0 byte(s) beyond last valid byte in prior malloc
Note: prev lower malloc:  0x0000000001b11b80-0x0000000001b11c0c
Note: allocated here:
Note: # 0 replace_malloc                    [c:\projects\drmemory\common\alloc_replace.c:2577]
Note: # 1 KERNELBASE.dll!DeviceIoControl   +0x65     (0x00007ffdbcd92766 <KERNELBASE.dll+0x32766>)
Note: # 2 KERNEL32.dll!DeviceIoControl     +0x7f     (0x00007ffdbfbd3d30 <KERNEL32.dll+0x13d30>)
Note: # 3 libserialport-0.dll!?            +0x0      (0x000000006d68979b <libserialport-0.dll+0x979b>)
Note: # 4 libserialport-0.dll!?            +0x0      (0x000000006d6814df <libserialport-0.dll+0x14df>)
Note: # 5 atexit                            [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:439]
Note: # 6 __tmainCRTStartup                 [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:339]
Note: # 7 .l_start                          [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:223]
Note: # 8 KERNEL32.dll!BaseThreadInitThunk +0x13     (0x00007ffdbfbd3034 <KERNEL32.dll+0x13034>)
Note: instruction: mov    (%rcx) -> %ecx

Error #2: POSSIBLE LEAK 79 direct bytes 0x0000000001b101e0-0x0000000001b1022f + 0 indirect bytes
# 0 replace_malloc                         [c:\projects\drmemory\common\alloc_replace.c:2577]
# 1 ntdll.dll!RtlRestoreLastWin32Error    +0x3f     (0x00007ffdc06783d0 <ntdll.dll+0x83d0>)
# 2 msvcrt.dll!getptd_noexit
# 3 ntdll.dll!RtlRestoreLastWin32Error    +0x3f     (0x00007ffdc06783d0 <ntdll.dll+0x83d0>)
# 4 pre_cpp_init                           [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:173]
# 5 msvcrt.dll!initterm  
# 6 __tmainCRTStartup                      [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:275]
# 7 .l_start                               [C:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:223]
# 8 KERNEL32.dll!BaseThreadInitThunk      +0x13     (0x00007ffdbfbd3034 <KERNEL32.dll+0x13034>)

===========================================================================
FINAL SUMMARY:

DUPLICATE ERROR COUNTS:

SUPPRESSIONS USED:

ERRORS FOUND:
      1 unique,     1 total unaddressable access(es)
      0 unique,     0 total invalid heap argument(s)
      0 unique,     0 total GDI usage error(s)
      0 unique,     0 total handle leak(s)
      0 unique,     0 total warning(s)
      0 unique,     0 total,      0 byte(s) of leak(s)
      1 unique,     1 total,     79 byte(s) of possible leak(s)
ERRORS IGNORED:
     11 potential error(s) (suspected false positives)
         (details: C:\Users\Gleb\Apps\DrMemory\drmemory\logs\DrMemory-serial_send.exe.12476.000\potential_errors.txt)
      3 potential leak(s) (suspected false positives)
         (details: C:\Users\Gleb\Apps\DrMemory\drmemory\logs\DrMemory-serial_send.exe.12476.000\potential_errors.txt)
     41 unique,    46 total,   7987 byte(s) of still-reachable allocation(s)
         (re-run with "-show_reachable" for details)
Details: C:\Users\Gleb\Apps\DrMemory\drmemory\logs\DrMemory-serial_send.exe.12476.000\results.txt
glebm commented 5 years ago

This is pyserial's blocking read/write implementation:

https://github.com/pyserial/pyserial/blob/a27715f322bb08b1fccffebab776c94df50057e9/serial/serialwin32.py#L262-L338

Is it substantially different from libserialport's one?

glebm commented 5 years ago

The unadressable access is likely unrelated, because the arduino/ fork does not have it, yet still doesn't work

martinling commented 5 years ago

There is a known bug in wc_to_utf8 on Windows which was patched in the arduino fork, see https://sigrok.org/bugzilla/show_bug.cgi?id=1031. But there were some issues with their PR for those fixes (#23) that never got resolved, so the fixes haven't gone in. I expect that's what you're seeing. It would be really good to get that fixed properly!

I can't really see much useful from the call logs since there's no context or expansion of the arguments for each call.

Looking at the pyserial vs libserialport code however, the most obvious difference is that pyserial is calling ClearCommError before every single read. That could well be hiding problems.

glebm commented 5 years ago

@martinling Note that the outstanding PR (https://github.com/sigrokproject/libserialport/pull/1) for wc_to_utf8 does not fix this particular unaddressable access, so it's probably from some other arduino change, or perhaps https://github.com/sigrokproject/libserialport/pull/1 doesn't really fix the issue.

Looking at the pyserial vs libserialport code however, the most obvious difference is that pyserial is calling ClearCommError before every single read. That could well be hiding problems.

It does work though (the transmitted data is correct). I wonder why it's calling ClearCommError. There is also win32.ResetEvent(self._overlapped_read.hEvent) in there that's not in libserialport, could it be relevant?

johnbeard commented 5 years ago

Adding this below the timeout block around line 1014 of serialport.c didn't work for me:

ResetEvent(port->read_ovl.hEvent);

DWORD errors;
COMSTAT comstat;

if (ClearCommError(port->hdl, &errors, &comstat) == 0)
    RETURN_FAIL("ClearCommError() failed");

Also, simply opening something like Termie or Termite is enough to put the device into a state where it can be used with libserialport.

johnbeard commented 5 years ago

On further inspection, you can also "kick" the port into working simply with this script:

#! /usr/bin/env python3

import serial

PORT = "COM3" # your port here
print("Messing with port " + PORT)

with serial.Serial(PORT, 9600) as ser:
    print("Opened")

print("closed")

So whatever pyserial does in open and close is enough.

Digging in a bit more, in pyserial's open(), if you modify it in serialwin32.py (line 78):

return // Return here - the port isn't "fixed"
self._reconfigure_port()
return // But, if you return here, libserialport will subsequently work

So pyserial's _reconfigure_port() does something important that libserialport does not in sp_open()/set_config(). You can also add a return immediately from close() to prove it is the open(), not the close(), that is important.

johnbeard commented 5 years ago

Turn out I was just missing the bits setting after sp_open:

    error = sp_open(port, SP_MODE_READ_WRITE);
    if (error == SP_OK)
    {
        // Parity, baud, flowcontrol
        sp_set_bits(port, 8); // Added this
    }

On Linux, in set_config() called during sp_open(), when the device hasn't been opened yet, config->bits is 8, on Windows it's 7 (this value comes from the get_config() call)

Setting it formally, after sp_open() returns, but before using the port allowed it to work.

glebm commented 5 years ago

@johnbeard Is this with CH-340?

I already set the flags explicitly like this (RETURN_IF_SP_ERROR is a macro from my code, not the library):

#define SERIAL_PROTOCOL_BAUD_RATE 115200
#define SERIAL_PROTOCOL_BITS 8
  RETURN_IF_SP_ERROR(sp_open(port, SP_MODE_READ_WRITE));
  RETURN_IF_SP_ERROR(sp_set_baudrate(port, SERIAL_PROTOCOL_BAUD_RATE));
  RETURN_IF_SP_ERROR(sp_set_bits(port, SERIAL_PROTOCOL_BITS));
  RETURN_IF_SP_ERROR(sp_set_parity(port, SP_PARITY_EVEN));
  RETURN_IF_SP_ERROR(sp_set_flowcontrol(port, SP_FLOWCONTROL_NONE));
johnbeard commented 5 years ago

@glebm It is a CH-340 (in the form of a cheap and nasty USB-serial cable). I suspect this might not be the same issue, but just has a similar symptom (works on Linux, but reads 0 on Windows sometimes).

uwehermann commented 4 years ago

Hi, can you please retry the current git HEAD of git://sigrok.org/libserialport and see if this is still reproducible? We've fixed a bunch of issue which may or may not have caused this.

Independently of that I've also tried to reproduce a problem on CH430/CH341, but so far everything seems to work as expected for me both on Linux and Windows, but certainly more in-depth testing is needed.

If possible, can you provide a minimal test-case Arduino project + libserialport C program (which doesn't depend on any other hardware than a CH340/CH341-based Arduino board or USB-to-serial board)? Then we could try to reproduce using that exact code on different OSes and libserialport versions/patches.

martinling commented 4 years ago

Just looking back over this issue and I noticed that @glebm's code sets other necessary parameters but doesn't call sp_set_stopbits. Could the issue be that the Windows CH340 driver has something other than the common 1 stop bit as default? We already know it has 7 data bits as default from @johnbeard's comment above, so it wouldn't be too surprising. This would explain why using the port with another program first avoids the issue.

The API documentation has been extended a bit to clarify which settings are essential to set when opening a port: https://www.sigrok.org/api/libserialport/unstable/a00022.html#details

martinling commented 4 years ago

It looks like unset configuration parameters may also be the issue in https://github.com/iforce2d/thrustTester/issues/2 - see my comment there.

martinling commented 3 years ago

I'm closing this issue since I think it's been resolved - @glebm, I think your original problem may have been with you not calling sp_set_stopbits(). If you can still reproduce this with all configuration parameters set, please reopen.

glebm commented 3 years ago

@martinling I don't have the hardware anymore (it's been ~2 years), thanks for looking into it