open-contracting / ocdskit

A suite of command-line tools for working with OCDS data
https://ocdskit.readthedocs.io
BSD 3-Clause "New" or "Revised" License
17 stars 6 forks source link

PyPy support #178

Closed jpmckinney closed 2 years ago

jpmckinney commented 2 years ago

I encountered various forms of core dumps on GitHub Actions: "Segmentation fault", "Abort trap: 6", etc.

Some things I addressed:

However, just running pytest also causes a "Segmentation fault". The traceback in https://github.com/open-contracting/ocdskit/runs/3082882014?check_suite_focus=true starts with jsonref, whose tests are currently failing on all versions, and whose CI is configured to allow failures on PyPy: https://github.com/gazpachoking/jsonref (jsonref seems unmaintained, which is concerning.)

Anyway, I don't think the tests will pass on PyPy, but if we restore it, see 21548ea

jpmckinney commented 2 years ago

I'm able to reproduce locally. The message starts with "Job 1, 'pytest' terminated by signal " followed by:

It occurs inconsistently. When testing only pytest -v tests/commands/test_compile.py, I have gotten it to fail on:

It likely fails on any *_package* test. This case is special, in that it sets package['records'] to a generator, which has special handling in util.SerializableGenerator to exhaust the generator.

If I disable that behavior by changing streaming=False, I now get:

It might have to do with garbage collection and yield: https://www.pypy.org/compat.html and https://doc.pypy.org/en/latest/cpython_differences.html#differences-related-to-garbage-collection-strategies

jpmckinney commented 2 years ago

The particular points at which it fails might be red herrings.

This patch replaces ijson with json, and avoids the package['records'] generator. Some tests fail (because json can't read concatenated JSON), but there are no segfaults, etc.

So, the issue is probably some interaction with ijson. Either ijson has an issue, or our use of ijson is not safe (maybe more likely than the former).

diff --git a/ocdskit/cli/commands/base.py b/ocdskit/cli/commands/base.py
index 3c58ba1..7f81d88 100644
--- a/ocdskit/cli/commands/base.py
+++ b/ocdskit/cli/commands/base.py
@@ -2,6 +2,7 @@ import os
 import sys
 from abc import ABC, abstractmethod

+import json
 import ijson

 from ocdskit.util import iterencode, json_dumps
@@ -62,8 +63,10 @@ class BaseCommand(ABC):
         """
         Yields the items in the input.
         """
-        file = StandardInputReader(self.args.encoding)
-        yield from ijson.items(file, self.prefix(), multiple_values=True, **kwargs)
+        for line in sys.stdin:
+            yield json.loads(line, **kwargs)
+        # file = StandardInputReader(self.args.encoding)
+        # yield from ijson.items(file, self.prefix(), multiple_values=True, **kwargs)

     def print(self, data, streaming=False):
         """
diff --git a/ocdskit/cli/commands/compile.py b/ocdskit/cli/commands/compile.py
index d08eedf..b6d73dc 100644
--- a/ocdskit/cli/commands/compile.py
+++ b/ocdskit/cli/commands/compile.py
@@ -1,4 +1,5 @@
 import logging
+import platform
 import sys

 import ocdskit.packager
@@ -36,9 +37,16 @@ class Command(OCDSCommand):
             logger.warning('sqlite3 is unavailable, so the command will run in memory. If input files are too large, '
                            'the command might exceed available memory.')

+        if platform.python_implementation() == 'PyPy':
+            merge_streaming = False
+            print_streaming = False
+        else:
+            merge_streaming = True
+            print_streaming = self.args.package
+
         try:
-            for output in merge(self.items(), streaming=True, **kwargs):
-                self.print(output, streaming=self.args.package)
+            for output in merge(self.items(), streaming=merge_streaming, **kwargs):
+                self.print(output, streaming=print_streaming)
         except MissingOcidKeyError as e:
             raise CommandError('The `ocid` field of at least one release is missing.') from e
         except InconsistentVersionError as e:
jpmckinney commented 2 years ago

I think I need a much better understanding of generators and PyPy to resolve this. Given that OCDS Kit uses a lot of generators, I'm also not sure how much speed-up PyPy can offer (it didn't in older versions at least).

Since most OCDS tools depend on OCDS Kit, it would be nice to support PyPy, so that the other tools can run on PyPy. However, for now, I'll abandon the project.

jpmckinney commented 2 years ago

I took the strategy of disabling different parts of the code: print(), json_dump(), iterencode(), etc. The errors persisted. It then occurred to me to use a different ijson backend. Avoiding the yajl2_c backend seems to be working in #179.