google / re2

RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.
BSD 3-Clause "New" or "Revised" License
8.86k stars 1.12k forks source link

Python typestubs #496

Open ddn0 opened 2 months ago

ddn0 commented 2 months ago

I'm wondering if this project already maintains Python typestubs (pyi) for the google-re2 package.

If not, I've written typestubs for my own personal use, would this project be interested in accepting them? Otherwise, I can maintain a separate typestub library.

junyer commented 2 months ago

Thanks for asking! RE2 doesn't have a stub file, but it really should. Unfortunately, I have no experience with Python type checkers apart from https://github.com/google/pytype, so I will need some time to read about (and tinker with) https://github.com/python/mypy before it would be safe to say that I know what I'm doing. (stubgen and stubtest already sound like they will be very useful.) I also need to give some thought to plumbing everything into our Bazel CI workflow...

ddn0 commented 2 months ago

I have a stub file written, which you can use as a starting point.

My plan this week was to package it up to be pip installable. This might be a useful basis for you too? I can share the repo when it is ready (and I'm happy to sign any CLA required for you to take the code).

junyer commented 2 months ago

Many thanks for offering! I did spend some time looking into this on Saturday and, as I had feared, the situation with Bazel with regard to Python type checkers is... truly dire. Distributing the stub file as part of google-re2 remains something that I want to do... eventually. In the short term, however, you will need to DIY. :(

ddn0 commented 2 months ago

For others, here's the stub package I will be maintaining: https://github.com/ddn0/google-re2-stubs

junyer commented 2 months ago

Thanks, @ddn0!

@cosmicexplorer, I'm thinking to give up trying to work entirely within Bazel and, instead, to factor out generate_python_toolchains() in order to make the Bazel CI workflow use the local Python environment. That way, Bazel-built artifacts will be compatible with the local Python environment.

cosmicexplorer commented 2 months ago

Have been working a pip-only solution for now, and realized after much trial and error that:

  1. Namespace packages (declared with packages=[...]) are the only allowed way to bundle type stubs with a wheel (see https://peps.python.org/pep-0561/ and https://typing.readthedocs.io/en/latest/spec/distributing.html#packaging-typed-libraries). Our current py_modules=['re2'] won't work, because package_data and data_files all stick things into a subdirectory (this seems intentional, and very frustrating). So we won't be able to stick with a single-file package. This alone would be fine, except...
  2. Namespace packages are not able to import setuptools.Extension modules for some reason I cannot understand. I have tried to place our _re2 python module in e.g. _re2._re2 instead, to mimic my other project https://github.com/cosmicexplorer/medusa-zip, but this fails for some reason. I noticed that unlike medusa-zip, setuptools.Extension will always place _re2 into top_level.txt in the generated wheel, whereas e.g. setuptools_rust() for medusa-zip does not do that. So for whatever reason, the c/c++ setuptools.Extension interface does not have feature parity with setuptools_rust(). It's possible this may be related to multi-phased initialization, and maybe pyo3 from medusa-zip does that but our pybind11 doesn't?
  3. However, I was able to get a terrible prototype to initialize our poor _re2 module by hand, using the changes from this branch (see https://github.com/google/re2/commit/5235d0f1acfcc7bc1178063f73eda8b4770404fd or https://github.com/google/re2/compare/main...cosmicexplorer:internal-type-stubs?expand=1):
; cd python
; python setup.py bdist_wheel
; ls build/lib.linux-x86_64-cpython-310/re2/_re2.cpython-310-x86_64-linux-gnu.so 
-rwxr-xr-x 1 cosmicexplorer wheel 5.5M Jun  8 16:37 build/lib.linux-x86_64-cpython-310/re2/_re2.cpython-310-x86_64-linux-gnu.so*
(bbb) # cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/re2/python 16:43:51 
; python
Python 3.10.13 (main, Jan 10 2024, 22:22:23) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from importlib.machinery import ExtensionFileLoader as EFL
>>> EFL('_re2', 'build/lib.linux-x86_64-cpython-310/re2/_re2.cpython-310-x86_64-linux-gnu.so')
<_frozen_importlib_external.ExtensionFileLoader object at 0x70ae72e98250>
>>> zzz = EFL('_re2', 'build/lib.linux-x86_64-cpython-310/re2/_re2.cpython-310-x86_64-linux-gnu.so')
>>> zzz
<_frozen_importlib_external.ExtensionFileLoader object at 0x70ae72e98be0>
>>> dir(zzz)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'create_module', 'exec_module', 'get_code', 'get_data', 'get_filename', 'get_resource_reader', 'get_source', 'is_package', 'load_module', 'name', 'path']
>>> zzz.load_module()
<module '_re2' from 'build/lib.linux-x86_64-cpython-310/re2/_re2.cpython-310-x86_64-linux-gnu.so'>
>>> _re2 = zzz.load_module()
>>> dir(_re2)
['BytesToCharLen', 'CharLenToBytes', 'Error', 'Filter', 'RE2', 'Set', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__']

However, trying to use ExtensionFileLoader seems to fail when providing it any input that's not an unpacked physical file path, even though this syntax for indexing into a zip file works with other importers:

(bbb) # cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/re2/python 16:58:54 
; ls dist
total 1.7M
-rw-r--r-- 1 cosmicexplorer wheel 1.7M Jun  8 16:37 google_re2-1.1.20240601-cp310-cp310-linux_x86_64.whl
(bbb) # cosmicexplorer@terrestrial-gamma-ray-flash: ~/tools/re2/python 16:58:55 
; python 
Python 3.10.13 (main, Jan 10 2024, 22:22:23) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> dir(_re2)
KeyboardInterrupt
>>> from importlib.machinery import ExtensionFileLoader as EFL
>>> zzz = EFL('_re2', 'dist/google_re2-1.1.20240601-cp310-cp310-linux_x86_64.whl/re2/_re2.cpython-310-x86_64-linux-gnu.so')
>>> zzz.load_module()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap_external>", line 548, in _check_name_wrapper
  File "<frozen importlib._bootstrap_external>", line 1063, in load_module
  File "<frozen importlib._bootstrap_external>", line 888, in load_module
  File "<frozen importlib._bootstrap>", line 290, in _load_module_shim
  File "<frozen importlib._bootstrap>", line 719, in _load
  File "<frozen importlib._bootstrap>", line 674, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1176, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: dist/google_re2-1.1.20240601-cp310-cp310-linux_x86_64.whl/re2/_re2.cpython-310-x86_64-linux-gnu.so: cannot open shared object file: Not a directory
>>> 

I also tried to get this working by executing the module straight from its bytes, but it appears loading a native extension requires the special handling of the ExtensionFileLoader, which cannot receive anything other than a physical file path.

However, the branch I linked (https://github.com/google/re2/compare/main...cosmicexplorer:internal-type-stubs?expand=1) does at least demonstrate that we can keep using setuptools.Extension to build our _re2 module, and then keep type stubs in a new re2 "package", which shouldn't break any APIs and should also provide typing info. So if we wanted to take this approach (instead of completely reworking the build system, which sounds even less promising), we would need to:

There were a whole host of difficulties I think we really shouldn't have had to deal with here, but if we can figure out a way to unzip the native _re2 module to a checksummed location so it only ever occurs once, I think we will be able to achieve this restructuring of the package contents to enable our users to get access to our type stubs (the generics I've used are actually quite useful to avoid str/bytes mixups). I am relatively confident that our current approach with single-file modules via py_modules=['re2'] will not allow us to distribute typing information, since type stubs seem to only be supported with the packages=[...] argument, but if someone else figures out how to avoid that drastic restructuring, I would love to hear about it.

junyer commented 2 months ago

I spent a lot of tonight glaring at the setuptools documentation until I finally happened to notice ext_package, which does the needful. I now have a change out for review that will munge the module into a package.

cosmicexplorer commented 2 months ago

I have some type stubs available at https://github.com/google/re2/compare/main...cosmicexplorer:type-stubs?expand=1 along with a change to setup.py which inserts them into the wheel, but will need to figure out how to change the github workflow to run mypy EDIT: done!. @junyer how should I test github CI? Does that get tested on gerrit? I have just disabled github pro so actions no longer run on my own account. I was also going to try making bazel run mypy but would prefer to do that in a followup change \@_\@.

EDIT: on gerrit as https://code-review.googlesource.com/c/re2/+/63310!

junyer commented 2 months ago

@junyer how should I test github CI? Does that get tested on gerrit? I have just disabled github pro so actions no longer run on my own account.

All of the CI workflows are "post-submit" only: they trigger when Copybara mirrors https://code.googlesource.com/re2 to GitHub. No need to concern yourself with anything more than local testing. :)

I was also going to try making bazel run mypy but would prefer to do that in a followup change @_@.

If the Bazel community ever decides to make Python type checkers Just Work™, we can revisit this then. In the meantime, I have no desire to (co-)author and subsequently maintain whatever Starlark would be needed, so please don't spend any (more!) time worrying about it. ;)