Yikes, looks like this goes back at least 10 months.
Interesting that it was showing the exceptions but not failing.
Tests from `tests/mm_cataloguer/test_associate_ensemble.py` that trigger the exception (but do not cause a core dump) are the following:

- `test_associate_ensemble_to_data_file_variable__`
- `test_associate_ensemble_to_data_file__`
- `test_associate_ensemble_to_filepath__`
- `test_associate_ensemble_to_filepaths__` (e)
And from `tests/mm_cataloguer/test_index.py`:
- `test_get_grid_info`
- `test_get_level_set_info`
- `test_insert_level_set`
- `test_find_level_set`
- `test_find_or_insert_level_set`
- `test_insert_grid`
- `test_find_grid`
- `test_find_or_insert_grid`
- `test_insert_data_file_variable_gridded`
- `test_find_data_file_variable_gridded`
- `test_find_or_insert_data_file_variable_gridded`
- `test_find_update_or_insert_cf_file__dup`
A common thread across all (but one) of these tests is that they have `tiny_gridded_dataset` or `tiny_any_dataset` as a parameter. These are created in `tests/conftest.py` and return a `CFDataset`. Since we know our Jenkins output spits out an exception pointing us at `nchelpers`, we may be on the right track. That being said, I can't say for sure that this is the cause of the problem, since other tests use the same parameters and do not cause any issue (e.g. `test_find_model`). It may be how the parameter is being used, rather than the parameter itself.

@rod-glover, you were the last person to touch this code (even though it was 3 years ago): is there anything about how it is used that might lead to endless recursion?
The only exception to this is `test_associate_ensemble_to_filepaths__` (labelled with the (e) above). This method does not use `nchelpers` at all and still causes the exception. I'm not quite sure what to make of this yet.
The problem appears to be the `@memoize` decorator, which caches function results for faster repeat calls. If you comment out the `@memoize` before `get_grid_info()`, for example, the tests involving `get_grid_info()` will not dump core. Something about the function data cache is confusing pytest's garbage collection.
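For context, a minimal sketch of the pattern `@memoize` uses (not the actual `nchelpers` implementation, which may do more) looks like this:

```python
import functools


def memoize(func):
    """Minimal memoizer: results are keyed by positional args and kept
    in a dict that lives as long as the module does."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper
```

Anything a cached result references (open file handles, datasets, database sessions) stays alive until interpreter shutdown, which is exactly when teardown order gets unpredictable.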
Ran the tests with the decorators commented out; everything works fine now. How much do we gain from these decorators? Can we just remove them outright, or can we find a way around it?
Possible options might include:
1. upgrading to different caching code. Comments on `@memoize` indicate it was designed to work with Python 2.7. I don't think we're supporting Python 2.7 any more in this repository, so I think we could try switching to one of the built-in Python 3 memoizing tools (e.g. `functools.lru_cache`) and see if they work better
2. not using the cache if we detect that `pytest` is running (see the sketch after this list)
3. not caching at all
4. trying to debug the code as it exists. This seems like it could be a giant headache, since whatever the interaction is, the result is buried deep in automatic garbage collection
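For option 2, a rough sketch of the detection (neither check is an official pytest API, but both are commonly used):

```python
import os
import sys


def running_under_pytest():
    # PYTEST_CURRENT_TEST is set by pytest while each test runs;
    # the sys.modules check also catches import/collection time.
    return "PYTEST_CURRENT_TEST" in os.environ or "pytest" in sys.modules


def memoize(func):
    if running_under_pytest():
        return func  # no caching at all under pytest

    cache = {}

    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper
```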
Option 1 sounds like a good start if it simplifies things (since we're not supporting py27 in this repo anymore).
I had lost track of this issue and it has come back to bite me. I'm working through issue #97 and the test suite will not work until this issue is solved.
Looking at the docs, it seems Python 3.2+ has the `@lru_cache` decorator available. Do we want to define a `maxsize` or `typed` in our case?
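For reference, the two knobs look like this (`get_grid_info` used as an illustrative target; the signature here is made up):

```python
from functools import lru_cache


# maxsize=None gives an unbounded cache, equivalent to @memoize;
# a bound like 128 would cap the number of stored results instead.
# typed=True would cache f(1) and f(1.0) separately, which probably
# doesn't matter for our arguments.
@lru_cache(maxsize=None)
def get_grid_info(dataset):  # illustrative signature
    ...
```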
There's an `lru_cache` decorator?! Awesome! We should use that in `climate-explorer-backend` too and get rid of a whole bunch of custom code.
We aren't using it in the ce backend because `lru_cache` defines cache size in terms of a set number of stored items, and we wanted to define it as a maximum number of megabytes instead. Though maybe `lru_cache` has added more options since we last looked into it; that was a while ago.
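Roughly the kind of thing we wanted there, sketched (not the actual ce-backend code; `sys.getsizeof` is shallow, so a real version has to work harder):

```python
import sys
from collections import OrderedDict


def memoize_max_bytes(max_bytes):
    """LRU memoizer bounded by (shallow) result size instead of item count."""
    def decorator(func):
        cache = OrderedDict()  # keys kept in recency order
        sizes = {}
        total = 0

        def wrapper(*args):
            nonlocal total
            if args in cache:
                cache.move_to_end(args)  # mark as most recently used
                return cache[args]
            result = func(*args)
            cache[args] = result
            sizes[args] = sys.getsizeof(result)  # shallow size only
            total += sizes[args]
            while total > max_bytes and len(cache) > 1:
                oldest, _ = cache.popitem(last=False)  # evict least recent
                total -= sizes.pop(oldest)
            return result

        return wrapper
    return decorator
```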
> We aren't using it in the ce backend because `lru_cache` defines cache size in terms of a set number of stored items, and we wanted to define it as a maximum number of megabytes instead.

Is this a problem for `modelmeta` as well? Or can we use it in this case?
I ran

```
pytest tests/mm_cataloguer/test_index.py::test_get_grid_info
```

from the branch you made with the `lru_cache`, and got a bunch of

```
Error in sys.excepthook:
Original exception was:
```

though I didn't get the segfault, for some reason?
So I added `get_grid_info.cache_clear()` to the end of `test_get_grid_info()` to clear the cache after each test, and the error went away. Perhaps something about how pytest or our complicated test config do teardown between tests is an issue for the cache? I'm not sure what that would be.
Perhaps a solution would be to write a cache cleanup function and have pytest run it after all tests are completed, or maybe after each individual test? It could use `cache_info()` on each cache to see if anything was in it, and `cache_clear()` to clear any cache with stuff in it.
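That idea as an autouse fixture in `tests/conftest.py` might look like this (the import path and list of cached functions are illustrative; whatever actually carries `@lru_cache` would go there):

```python
import pytest

# Illustrative: wherever the @lru_cache-decorated functions live
from mm_cataloguer.index import get_grid_info, get_level_set_info

CACHED_FUNCTIONS = (get_grid_info, get_level_set_info)


@pytest.fixture(autouse=True)
def clear_memo_caches():
    yield  # let the test run first
    for func in CACHED_FUNCTIONS:
        if func.cache_info().currsize:  # anything cached?
            func.cache_clear()
```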
That assumes the error only comes up in the context of `pytest`. I said previously that I've never seen the error "in the wild", but on further thought, I never use the most recent version of this package (it is incompatible with our database), so it's possible the error would happen in the wild but nobody has ever seen it. It may be worth spinning up a test database and making sure this actually is a pytest-only issue.
@corviday Your experiment was with using the builtin `@lru_cache`?
Yes, @nikola-rados made a branch with it, I used that branch except modified as described.
I'll be moving it to a dedicated issue branch. I was just experimenting with it on the actions branch I was working on.
Yeah, I've been looking over this and have a few (without evidence, unfortunately) suspicions and questions:

- Are the `Error in sys.excepthook:` errors related to the segfault? I suspect not, since we can make the former happen without the latter.
- I'm suspicious of the `testing.postgresql` setup/teardown, but I have no particular reason to be (aside from the fact that it's bringing up subprocesses, I think).
- I'm suspicious of `nchelpers.decorators.prevent_infinite_recursion`, since it gets used and also tries to be clever with thread-local data caching. But again, I haven't been able to make it specifically break... so that may not actually be problematic.

> Are the `Error in sys.excepthook:` errors related to the segfault? I suspect not, since we can make the former happen without the latter.
I was assuming so, but with no particularly strong evidence. You may well be right.
Where is the core dump?!
I have not been able to find it either.
UGH! The NetCDF datasets were not being closed properly in the testing setup! I've pushed a branch that should fix it with a few minor changes. Stand by for the CI run.
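For the record, the shape of the fix is to make the dataset fixtures in `tests/conftest.py` close what they open, something like this (a sketch, not the exact diff; the real fixtures are parametrized and the filename here is a placeholder):

```python
import pytest
from nchelpers import CFDataset


@pytest.fixture
def tiny_gridded_dataset():
    # "tiny_gridded.nc" stands in for the real parametrized test file
    dataset = CFDataset("tiny_gridded.nc")
    yield dataset
    dataset.close()  # previously missing, leaving NetCDF handles open
```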
Is the cache fix not relevant then? Or is that a separate problem from the one you have solved?
It appears to not be a problem at all :shrug: I'll make a PR and you can decide.
And just to be clear, the only code affected was in the testing code. So we don't necessarily need to do a new release now or anything, because client code is completely unaffected.
While all of the tests pass, the test suite causes a core dump that seems to be caused by infinite recursion. The output below is from a run on a local machine:
This issue is also seen in our Jenkins pipeline, where the error is slightly different but ends in the same result:
The particular error seen in Jenkins points to a method in `nchelpers` meant to prevent infinite recursion. After some investigation, it looks like the error is coming from both test files in `tests/mm_cataloguer/`.