python / cpython

The Python programming language
https://www.python.org

Simplify linking of shared libraries on the AIX OS #81871

Closed d399e807-80a0-406d-8006-e79791f279ae closed 5 years ago

d399e807-80a0-406d-8006-e79791f279ae commented 5 years ago
BPO 37690
Nosy @ericvw, @aixtools, @pablogsal
PRs
  • python/cpython#14965
Files
  • aix-extension-simplify.patch: Patch from 3.7 branch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:
    ```python
    assignee = None
    closed_at =
    created_at =
    labels = ['build', '3.8', '3.9', 'extension-modules', 'type-feature', '3.7']
    title = 'Simplify linking of shared libraries on the AIX OS'
    updated_at =
    user = 'https://github.com/ericvw'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'ericvw'
    assignee = 'none'
    closed = True
    closed_date =
    closer = 'ericvw'
    components = ['Build', 'Extension Modules']
    creation =
    creator = 'ericvw'
    dependencies = []
    files = ['48508']
    hgrepos = []
    issue_num = 37690
    keywords = ['patch']
    message_count = 6.0
    messages = ['348511', '348519', '348523', '348535', '348605', '348675']
    nosy_count = 4.0
    nosy_names = ['ericvw', 'David.Edelsohn', 'Michael.Felt', 'pablogsal']
    pr_nums = ['14965']
    priority = 'normal'
    resolution = None
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue37690'
    versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9']
    ```

    d399e807-80a0-406d-8006-e79791f279ae commented 5 years ago

    Make the approach to building shared libraries on the AIX operating system similar to that of a System V system. The primary benefits of this change are eliminating the custom AIX code paths and reducing the changes in ./configure to just the LDSHARED environment variable.
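
For readers who want to see what an existing interpreter was built with, the link settings discussed here can be read back through the standard `sysconfig` module. This is only a minimal sketch: the helper name `link_settings` is made up for illustration, and which variables are populated depends on the platform and on how that interpreter was configured.

```python
import sysconfig


def link_settings():
    """Return the build-time link settings recorded for this interpreter."""
    names = ("CC", "LDSHARED", "BLDSHARED", "Py_ENABLE_SHARED")
    # get_config_var returns None for names the platform's build did not record
    return {name: sysconfig.get_config_var(name) for name in names}


if __name__ == "__main__":
    for name, value in link_settings().items():
        print(f"{name} = {value!r}")
```

Comparing the `LDSHARED` value before and after a patch like this one is a quick way to confirm what the build actually picked up.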

    For background context, AIX sees shared libraries as fully linked and resolved: symbol references are resolved at link time and cannot be rebound at load time. On System V, the run-time linker resolves all global symbols. Thus, conventional shared libraries on AIX cannot have undefined symbols, while those on System V can.

    However, AIX does support run-time linking, which allows symbols to remain undefined until load time.

    Therefore, this change makes the linking of shared libraries on AIX behave similarly to that of System V.

    Given that symbols are now going to be allowed to be undefined for AIX, all the code paths for generating exported symbols and the related wrapper scripts go away.

    The real magic is in the -G flag for LDSHARED. Effectively, -G is equivalent to specifying the following:

    I have been using this patch for Python 3.7, 3.6, and 2.7 (with appropriate backporting adaptations) without issue for being able to build and use Python C/C++ extensions on AIX for about 6 months now. Given that we haven't had any issues, I felt it was appropriate to see if this would be accepted upstream.

    b9c72c32-828e-41b6-8a7e-aac13397383a commented 5 years ago

    Absolutely, positively no. This is horrible and completely wrong.

    Applications on AIX should not be compiled to allow dynamic linking to make them operate more like SVR4/Linux. Python does not require dynamic linking. This simply is masking a symptom in a naive and incorrect manner. Use of runtime linking causes many internal changes to the behavior of AIX applications, severely affecting performance and potentially causing overflow of data structures.

    We currently are going through the process of removing this brain damage from CMake. I absolutely will not allow Python to go down this path and introduce this type of mistake.

    d399e807-80a0-406d-8006-e79791f279ae commented 5 years ago

    > This is horrible and completely wrong.

    I'm not an expert in AIX and xlc, by any means. I would greatly appreciate your help in understanding this better, so I can see the problem the way you do and figure out the best approach to take.

    My primary motivation was to simplify/homogenize the mechanism by which Python C/C++ extensions are built. For background, I have Python applications and libraries that need to run on Linux, Solaris, and AIX. One of the challenges we ran into was how and when symbol resolution occurs, which is fundamentally different in AIX.

    > Python does not require dynamic linking.

    I understand Python does not require dynamic linking. However, the problem I am running into is how this should work/behave for Python C/C++ extensions, which are imported (loaded) at runtime of a Python application. Maybe this is where I have a fundamental misunderstanding, but it led me to believe that in AIX this should behave similarly to SVR4/Linux. When scouring how Python interplays with AIX for building Python C/C++ extensions, this problem piqued my interest.

    While researching this on my own, I came across http://download.boulder.ibm.com/ibmdl/pub/software/dw/aix/es-aix_ll.pdf, which helped me understand the differences between dynamic loading and run-time linking. Thus, I went down the path of run-time linking with the '-G' flag, which appeared similar to what was done in Python for other operating systems.

    > This simply is masking a symptom in a naive and incorrect manner.

    This goes back to my misunderstanding of what I was observing during my initial investigation.

    I'll need to revisit the symptom being observed, but I vaguely recall missing symbols when building Python C/C++ extensions when the interpreter is configured with '--enable-shared'. Let me go back, undo the patch I have, and recreate the symptom/issue that was observed.

    > Use of runtime linking causes many internal changes to the behavior of AIX applications, severely affecting performance and potentially causing overflow of data structures.

    I'm really curious about this one. What internal changes, performance concerns, and overflow of data structures could occur? Luckily, I have neither observed nor experienced anything egregiously negative thus far. Understanding these concerns will help bolster my understanding.

    > I absolutely will not allow Python to go down this path and introduce this type of mistake.

    No worries. I'm trying to solve a problem and appear to have gone down an incorrect path. Better understanding what is expected of Python and its associated C/C++ extensions will help me find where my misunderstanding lies and refocus on the problem that actually needs to be addressed.

    b9c72c32-828e-41b6-8a7e-aac13397383a commented 5 years ago

    Runtime linking allows a dynamically loaded library to interpose symbols. The classic example is allowing a program or dynamic library to overload C++ operator new. A library or program overrides the symbol by name.

    Python does not require this. Python does not need to allow an extension module to override a function in Python.
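
The idea of overriding a symbol by name can be pictured with a toy lookup, sketched below. It assumes a simplified rule in which the first definition found in the search order wins; real loaders apply more rules than this, and the module and symbol names (`app`, `libcxx`, `operator_new`) are purely illustrative.

```python
def resolve(symbol, search_order):
    """Return the name of the first module in search_order that defines symbol."""
    for module_name, definitions in search_order:
        if symbol in definitions:
            return module_name
    return None


if __name__ == "__main__":
    library = ("libcxx", {"operator_new", "operator_delete"})
    plain_app = ("app", {"main"})
    interposing_app = ("app", {"main", "operator_new"})

    # Without an override, the library's own definition is used
    print(resolve("operator_new", [plain_app, library]))
    # With an override earlier in the search order, the program's definition wins
    print(resolve("operator_new", [interposing_app, library]))
```

An extension module loaded into Python has no need to sit ahead of the interpreter in such a search order, which is the point being made above.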

    If one needs to add AIX ld -G and runtime linking, 99% of the time one is covering up a problem.

    The downside of -G is that it forces all global functions to be called through the AIX glink code (equivalent to SVR4 PLT) and not inlined. This allows every global function call to be overridden, but forces every call to go through a function pointer. This is expensive.

    Calling functions through the "PLT" requires that the function pointers for each global function be placed in the AIX TOC (equivalent to SVR4 GOT). If the program or shared library is large enough, this can overflow the "GOT", which then requires even more expensive fixup code.
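
A back-of-the-envelope estimate shows how quickly such a table can fill up. The sketch below rests on assumptions that are not stated in this thread: that the table is limited to 64 KB unless extra fixup code is generated, and that each global function costs exactly one pointer-sized entry. The function names are hypothetical.

```python
# Assumption: the table is addressed with a 16-bit displacement, so 64 KB at most
TOC_LIMIT_BYTES = 64 * 1024


def toc_entries_available(pointer_bytes):
    """Number of pointer-sized entries that fit under the assumed limit."""
    return TOC_LIMIT_BYTES // pointer_bytes


def overflows_toc(entry_count, pointer_bytes):
    """True if entry_count pointer-sized entries exceed the assumed limit."""
    return entry_count * pointer_bytes > TOC_LIMIT_BYTES


if __name__ == "__main__":
    for bits, size in ((32, 4), (64, 8)):
        print(f"{bits}-bit: {toc_entries_available(size)} entries fit")
    print("10000 entries overflow in 64-bit mode:", overflows_toc(10000, 8))
```

Under these assumptions, a 64-bit build has half the headroom of a 32-bit one, so forcing every global call through the table matters more for large libraries.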

    The mistaken use of this option leads down a path with bad performance and potentially requiring more and more effort to recover from problems introduced by the choice.

    I don't know exactly the symptoms that you observed, but one possibility is that the shared object you are building is not being linked against all of the dependent libraries.

    Separate from runtime linking, SVR4 allows unresolved symbols when a shared library is created and used to export all global symbols by default (before the efforts on symbol visibility). A simplistic way of describing this is that a process into which an executable and shared libraries are loaded sort of has this soup of all global symbols floating around and available to the runtime loader. When a new shared library is loaded, the dynamic linker can resolve the symbols from any definitions available in the process. Allowing the unresolved symbols at shared library link time is a promise that the symbols will be provided by someone at runtime. At runtime, all of the symbol needs and definitions are thrown in the air and hopefully match up correctly when first referenced at runtime.

    AIX requires that all shared objects be fully resolved at link-edit time. It requires that the shared object refer to all dependent libraries at link time, even if those libraries also will be present and provided by other shared libraries or executable at runtime.

    In other words, on AIX, one must link all C++ shared objects against the C++ standard library, even if the main executable is linked against the library.

    So, again, one possible explanation for the error of missing symbols is that one or more dependent libraries are missing from the link command building the shared object and that omission coincidentally happens to work on SVR4/Linux because of its semantics, but it doesn't work in the more strict environment of AIX.

    This type of error should not be solved through runtime linking to borrow the missing symbols from the running process, which is a very expensive solution.
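
The difference between the two link-time policies described above can be sketched as a toy model. This is not how either linker is implemented; it only mirrors the rule that a strict link must find every reference on the link line, while a deferred link leaves the remainder to be found in the process at load time. All function, library, and symbol names are hypothetical.

```python
class UnresolvedSymbols(Exception):
    """Raised when references remain without a definition."""


def _provided(libraries):
    # Union of every symbol exported by the given libraries
    symbols = set()
    for exports in libraries.values():
        symbols |= set(exports)
    return symbols


def link_strict(references, link_line_libs):
    """Strict model: fail unless the link line satisfies every reference."""
    missing = set(references) - _provided(link_line_libs)
    if missing:
        raise UnresolvedSymbols(sorted(missing))


def link_deferred(references, link_line_libs):
    """Deferred model: return the references left over for load time."""
    return set(references) - _provided(link_line_libs)


def load_deferred(pending, process_symbols):
    """Resolve leftovers against whatever the running process exports."""
    missing = set(pending) - set(process_symbols)
    if missing:
        raise UnresolvedSymbols(sorted(missing))


if __name__ == "__main__":
    refs = {"PyLong_FromLong", "cxx_operator_new"}
    incomplete = {"libpython": {"PyLong_FromLong"}}
    complete = dict(incomplete, libcxx={"cxx_operator_new"})

    # Deferred model: the incomplete link line works only if the process helps out
    load_deferred(link_deferred(refs, incomplete), {"cxx_operator_new"})

    # Strict model: the same link line is rejected until the library is added
    link_strict(refs, complete)
    try:
        link_strict(refs, incomplete)
    except UnresolvedSymbols as exc:
        print("strict link failed, missing:", exc.args[0])
```

In this model the fix for the strict failure is to add the missing library to the link line, not to switch to the deferred policy.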

    aixtools commented 5 years ago

    David gives several reasons why this PR should not be used.

    And, in reading them - while I follow them at face value, there may be things I miss due to ignorance or being naive (more the system admin than tool developer).

    Isn't there a configure option, --enable-shared, that (sadly!) gives an SVR4-like shared library? From a sys-admin view, it is a .so file (libpython3.7m.so) rather than "the same file" as a member of an archive (e.g., libpython3.a[libpython3.7m.so]).

    While it may be common on other operating systems to have two "lib" directories, e.g., /usr/lib and /usr/lib64, on AIX one directory (/usr/lib) is expected, and the archives (.a files) may have multiple members, e.g., a 32-bit and a 64-bit member.

    Not using .a files makes it very hard to keep a "tight-ship" on an AIX server - and I feel it is incorrect for a tool to dictate system administration policy.

    As I do not know how Python looks on other systems - here is a short view of Python and ldd when --enable-shared is used:

    /opt/bin/python3 needs:
    /usr/lib/libc.a(shr.o)
    /usr/lib/libpthreads.a(shr_xpg5.o)
    /opt/lib/libpython3.7m.so
    /unix
    /usr/lib/libcrypt.a(shr.o)
    /usr/lib/libpthreads.a(shr_comm.o)
    /usr/lib/libdl.a(shr.o)

    Here is an example not using --enable-shared:

    /opt/bin/python3 needs:
    /usr/lib/libc.a(shr.o)
    /usr/lib/libpthreads.a(shr_xpg5.o)
    /usr/lib/libpthreads.a(shr_comm.o)
    /usr/lib/libdl.a(shr.o)
    /usr/lib/libintl.a(libintl.so.8)
    /unix
    /usr/lib/libcrypt.a(shr.o)
    /usr/lib/libpthread.a(shr_xpg5.o)
    /usr/lib/libiconv.a(libiconv.so.2)
    /usr/lib/libc.a(shr_64.o)
    /usr/lib/libcrypt.a(shr_64.o)
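
The dependency entries in these listings follow an `archive(member)` pattern, with plain paths for dependencies that are not archive members. A small parser, sketched below, makes that split explicit. The regular expression is an assumption based only on the entries shown in this comment, and `Dependency` and `parse_dependencies` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Dependency(NamedTuple):
    path: str
    member: Optional[str]  # None when the dependency is not an archive member


# A path, optionally followed by a parenthesised archive member name
_ENTRY = re.compile(r"^(?P<path>[^()\s]+)(?:\((?P<member>[^()]+)\))?$")


def parse_dependencies(text):
    """Split whitespace-separated entries into (path, member) pairs."""
    deps = []
    for token in text.split():
        match = _ENTRY.match(token)
        if match:
            deps.append(Dependency(match.group("path"), match.group("member")))
    return deps


if __name__ == "__main__":
    sample = "/usr/lib/libc.a(shr.o) /opt/lib/libpython3.7m.so /unix"
    for dep in parse_dependencies(sample):
        kind = "archive member" if dep.member else "standalone file"
        print(f"{dep.path}: {kind}")
```

Run against the first listing, this would flag /opt/lib/libpython3.7m.so as the one library that is not packaged as an archive member.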

    Both versions build ".so" files, that are accessed using dlopen()

    root@x066:[/home/root]find /opt/lib/python3.7 -name \*.so | head
    /opt/lib/python3.7/lib-dynload/_asyncio.so
    /opt/lib/python3.7/lib-dynload/_bisect.so
    /opt/lib/python3.7/lib-dynload/_blake2.so
    /opt/lib/python3.7/lib-dynload/_bz2.so
    /opt/lib/python3.7/lib-dynload/_codecs_cn.so
    /opt/lib/python3.7/lib-dynload/_codecs_hk.so
    /opt/lib/python3.7/lib-dynload/_codecs_iso2022.so
    /opt/lib/python3.7/lib-dynload/_codecs_jp.so
    /opt/lib/python3.7/lib-dynload/_codecs_kr.so
    /opt/lib/python3.7/lib-dynload/_codecs_tw.so

    and

    root@x064:[/opt/lib/python3.7]find /opt/lib/python3.7 -name \*.so | head
    /opt/lib/python3.7/lib-dynload/_asyncio.so
    /opt/lib/python3.7/lib-dynload/_bisect.so
    /opt/lib/python3.7/lib-dynload/_blake2.so
    /opt/lib/python3.7/lib-dynload/_bz2.so
    /opt/lib/python3.7/lib-dynload/_codecs_cn.so
    /opt/lib/python3.7/lib-dynload/_codecs_hk.so
    /opt/lib/python3.7/lib-dynload/_codecs_iso2022.so
    /opt/lib/python3.7/lib-dynload/_codecs_jp.so
    /opt/lib/python3.7/lib-dynload/_codecs_kr.so
    /opt/lib/python3.7/lib-dynload/_codecs_tw.so
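
The same check can be made from inside the interpreter, which also shows which file suffixes it treats as loadable extension modules. This is a rough equivalent of the `find ... | head` commands above, not a replacement for them; the helper names are hypothetical, and the `lib-dynload` location is assumed from the paths shown in the listings.

```python
import sysconfig
from importlib.machinery import EXTENSION_SUFFIXES
from pathlib import Path


def is_extension_module(filename):
    """True if filename ends with a suffix this interpreter loads as an extension."""
    return any(filename.endswith(suffix) for suffix in EXTENSION_SUFFIXES)


def find_extension_modules(root, limit=10):
    """Return up to `limit` extension-module paths under root, sorted by path."""
    root = Path(root)
    if not root.is_dir():
        return []
    hits = sorted(p for p in root.rglob("*") if p.is_file() and is_extension_module(p.name))
    return hits[:limit]


if __name__ == "__main__":
    # Assumption: extension modules live in lib-dynload under the platform stdlib
    dynload = Path(sysconfig.get_path("platstdlib")) / "lib-dynload"
    print("suffixes:", EXTENSION_SUFFIXES)
    for path in find_extension_modules(dynload):
        print(path)
```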

    Lastly, the PR, as is, appears to be broken.

    make: [Makefile:613: sharedmods] Illegal instruction (core dumped)
    /opt/bin/make returned an error
    root@x066:[/data/prj/python/python3-3.9]make V=1
    CC='xlc_r' LDSHARED='xlc_r -G ' OPT='-DNDEBUG -O' _TCLTK_INCLUDES='' _TCLTK_LIBS='' ./python -E ../git/python3-3.9/setup.py build
    make: [Makefile:613: sharedmods] Illegal instruction (core dumped)

    Note also: LDSHARED has added xlc_r to its flags - that does not seem right either.

    -1

    d399e807-80a0-406d-8006-e79791f279ae commented 5 years ago

    Thanks for the in-depth responses and feedback.

    When reinvestigating in more detail what led me to create this patch, I discovered that the premise I was operating on did not reflect the default (desired) compiler and linker flags. It turns out the environment I am working in builds all of the software using -bsvr4 and -brtl on AIX.
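
A quick way to catch this kind of environment-wide setting is to scan the flag strings a build uses for the options named in this thread. The sketch below does plain token matching on `-G`, `-brtl`, and `-bsvr4`, and also unpacks `-Wl,` prefixed groups. It is illustrative only: the function name is hypothetical, and token matching cannot tell whether a given compiler driver interprets `-G` the same way.

```python
import shlex

# Options mentioned in this thread as enabling SVR4-style or run-time linking
RUNTIME_LINKING_FLAGS = {"-G", "-brtl", "-bsvr4"}


def runtime_linking_flags(flag_string):
    """Return the run-time linking options found in a compiler/linker flag string."""
    found = []
    for token in shlex.split(flag_string or ""):
        # Compiler drivers may pass linker options as -Wl,opt1,opt2
        parts = token[4:].split(",") if token.startswith("-Wl,") else [token]
        found.extend(part for part in parts if part in RUNTIME_LINKING_FLAGS)
    return found


if __name__ == "__main__":
    print(runtime_linking_flags("xlc_r -G "))
    print(runtime_linking_flags("-Wl,-brtl,-bsvr4 -O2"))
```

Feeding it the values of `LDFLAGS` and `LDSHARED` from the build environment would surface such options before they shape conclusions about default behavior.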

    I have a lot more to unravel now. I already closed the PR and will abandon this issue since it has been clearly illustrated that this is masking an underlying problem.

    Thanks for taking the time to provide feedback and detail of what is problematic with this change.