Yikes, looks like this goes back at least 10 months.
Interesting that it was showing the exceptions but not failing.
Tests from `tests/mm_cataloguer/test_associate_ensemble.py` that trigger the exception (but do not cause a core dump) are the following:

- `test_associate_ensemble_to_data_file_variable__`
- `test_associate_ensemble_to_data_file__`
- `test_associate_ensemble_to_filepath__`
- `test_associate_ensemble_to_filepaths__` (e)
And from `tests/mm_cataloguer/test_index.py`:
- `test_get_grid_info`
- `test_get_level_set_info`
- `test_insert_level_set`
- `test_find_level_set`
- `test_find_or_insert_level_set`
- `test_insert_grid`
- `test_find_grid`
- `test_find_or_insert_grid`
- `test_insert_data_file_variable_gridded`
- `test_find_data_file_variable_gridded`
- `test_find_or_insert_data_file_variable_gridded`
- `test_find_update_or_insert_cf_file__dup`
A common thread across all (but one) of these tests is that they have `tiny_gridded_dataset` or `tiny_any_dataset` as a parameter. These are created in `tests/conftest.py` and return a `CFDataset`. Since we know our Jenkins output spits out an exception pointing us at `nchelpers`, we may be on the right track. That being said, I can't say for sure that this is the cause of the problem, since other tests use the same parameters and do not cause any issue (e.g. `test_find_model`). It may be how the parameter is being used, rather than the parameter itself.

@rod-glover, you were the last person to touch this code (even though it was 3 years ago): is there anything about how it is used that might lead to endless recursion?
The only exception to this is `test_associate_ensemble_to_filepaths__` (labelled with the (e) above). This method does not use `nchelpers` at all and still causes the exception. I'm not quite sure what to make of this yet.
The problem appears to be the `@memoize` decorator, which caches function results for faster repeat calls. If you comment out the `@memoize` before `get_grid_info()`, for example, the tests involving `get_grid_info()` will not dump core. Something about the function data cache is confusing pytest's garbage collection.
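For context, a minimal sketch of the pattern `@memoize` uses (not the actual `nchelpers` implementation, which may do more) looks like this:

```python
import functools


def memoize(func):
    """Minimal memoizer: results are keyed by positional args and kept
    in a dict that lives as long as the module does."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper
```

Anything a cached result references (open file handles, datasets, database sessions) stays alive until interpreter shutdown, which is exactly when teardown order gets unpredictable.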
Ran the tests with the decorators commented out; everything works fine now. How much do we gain from these decorators? Can we just remove them outright, or can we find a way around it?
Possible options might include:
1. upgrading to different caching code. Comments on `@memoize` indicate it was designed to work with Python 2.7. I don't think we're supporting Python 2.7 any more in this repository, so I think we could try switching to one of the built-in Python 3 memoizing tools (e.g. `functools.lru_cache`) and see if they work better
2. not using the cache if we detect that `pytest` is running (see the sketch after this list)
3. not caching at all
4. trying to debug the code as it exists. This seems like it could be a giant headache, since whatever the interaction is, the result is buried deep in automatic garbage collection
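For option 2, a rough sketch of the detection (neither check is an official pytest API, but both are commonly used):

```python
import os
import sys


def running_under_pytest():
    # PYTEST_CURRENT_TEST is set by pytest while each test runs;
    # the sys.modules check also catches import/collection time.
    return "PYTEST_CURRENT_TEST" in os.environ or "pytest" in sys.modules


def memoize(func):
    if running_under_pytest():
        return func  # no caching at all under pytest

    cache = {}

    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper
```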
Option 1 sounds like a good start if it simplifies things (since we're not supporting py27 in this repo anymore).
I had lost track of this issue and it has come back to bite me. I'm working through issue #97 and the test suite will not work until this issue is solved.
Looking at the docs, it seems Python 3.2+ has the `@lru_cache` decorator available. Do we want to define a `maxsize` or `typed` in our case?
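For reference, the two knobs look like this (`get_grid_info` used as an illustrative target; the signature here is made up):

```python
from functools import lru_cache


# maxsize=None gives an unbounded cache, equivalent to @memoize;
# a bound like 128 would cap the number of stored results instead.
# typed=True would cache f(1) and f(1.0) separately, which probably
# doesn't matter for our arguments.
@lru_cache(maxsize=None)
def get_grid_info(dataset):  # illustrative signature
    ...
```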
There's an `lru_cache` decorator?! Awesome! We should use that in `climate-explorer-backend` too and get rid of a whole bunch of custom code.
We aren't using it in the ce backend because `lru_cache` defines cache size in terms of a set number of stored items, and we wanted to define it as a maximum number of megabytes instead. Though maybe `lru_cache` has added more options since we last looked into it; that was a while ago.
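Roughly the kind of thing we wanted there, sketched (not the actual ce-backend code; `sys.getsizeof` is shallow, so a real version has to work harder):

```python
import sys
from collections import OrderedDict


def memoize_max_bytes(max_bytes):
    """LRU memoizer bounded by (shallow) result size instead of item count."""
    def decorator(func):
        cache = OrderedDict()  # keys kept in recency order
        sizes = {}
        total = 0

        def wrapper(*args):
            nonlocal total
            if args in cache:
                cache.move_to_end(args)  # mark as most recently used
                return cache[args]
            result = func(*args)
            cache[args] = result
            sizes[args] = sys.getsizeof(result)  # shallow size only
            total += sizes[args]
            while total > max_bytes and len(cache) > 1:
                oldest, _ = cache.popitem(last=False)  # evict least recent
                total -= sizes.pop(oldest)
            return result

        return wrapper
    return decorator
```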
> We aren't using it in the ce backend because `lru_cache` defines cache size in terms of a set number of stored items, and we wanted to define it as a maximum number of megabytes instead.

Is this a problem for `modelmeta` as well? Or can we use it in this case?
I ran

```
pytest tests/mm_cataloguer/test_index.py::test_get_grid_info
```

from the branch you made with the `lru_cache`, and got a bunch of

```
Error in sys.excepthook:
Original exception was:
```

though I didn't get the segfault, for some reason?
So I added `get_grid_info.cache_clear()` to the end of `test_get_grid_info()` to clear the cache after each test, and the error went away. Perhaps something about how pytest or our complicated test config do teardown between tests is an issue for the cache? I'm not sure what that would be.
Perhaps a solution would be to write a cache cleanup function and have pytest run it after all tests are completed, or maybe after each individual test? It could use `cache_info()` on each cache to see if anything was in it, and `cache_clear()` to clear any cache with stuff in it.
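That idea as an autouse fixture in `tests/conftest.py` might look like this (the import path and list of cached functions are illustrative; whatever actually carries `@lru_cache` would go there):

```python
import pytest

# Illustrative: wherever the @lru_cache-decorated functions live
from mm_cataloguer.index import get_grid_info, get_level_set_info

CACHED_FUNCTIONS = (get_grid_info, get_level_set_info)


@pytest.fixture(autouse=True)
def clear_memo_caches():
    yield  # let the test run first
    for func in CACHED_FUNCTIONS:
        if func.cache_info().currsize:  # anything cached?
            func.cache_clear()
```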
That assumes the error only comes up in the context of `pytest`. I said previously that I've never seen the error "in the wild", but on further thought, I never use the most recent version of this package (it is incompatible with our database), so it's possible the error would happen in the wild but nobody has ever seen it. It may be worth spinning up a test database and making sure this actually is a pytest-only issue.
@corviday Your experiment was with using the builtin `@lru_cache`?
Yes, @nikola-rados made a branch with it, I used that branch except modified as described.
I'll be moving it to a dedicated issue branch. I was just experimenting with it on the actions branch I was working on.
Yeah, I've been looking over this and have a few (without evidence, unfortunately) suspicions and questions:

- Are the `Error in sys.excepthook:` errors related to the segfault? I suspect not, since we can make the former happen without the latter.
- I'm suspicious of the `testing.postgresql` setup/teardown, but I have no particular reason to be (aside from the fact that it's bringing up subprocesses, I think).
- I'm suspicious of `nchelpers.decorators.prevent_infinite_recursion`, since it gets used and also tries to be clever with thread-local data caching. But again, I haven't been able to make it specifically break... so that may not actually be problematic.

> Are the `Error in sys.excepthook:` errors related to the segfault? I suspect not, since we can make the former happen without the latter.
I was assuming so, but with no particularly strong evidence. You may well be right.
Where is the core dump?!
I have not been able to find it either.
UGH! The NetCDF datasets were not being closed properly in the testing setup! I've pushed a branch that should fix it with a few minor changes. Stand by for the CI run.
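For the record, the shape of the fix is to make the dataset fixtures in `tests/conftest.py` close what they open, something like this (a sketch, not the exact diff; the real fixtures are parametrized and the filename here is a placeholder):

```python
import pytest
from nchelpers import CFDataset


@pytest.fixture
def tiny_gridded_dataset():
    # "tiny_gridded.nc" stands in for the real parametrized test file
    dataset = CFDataset("tiny_gridded.nc")
    yield dataset
    dataset.close()  # previously missing, leaving NetCDF handles open
```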
Is the cache fix not relevant then? Or is that a separate problem from the one you have solved?
It appears to not be a problem at all :shrug: I'll make a PR and you can decide.
And just to be clear, the only code affected was in the testing code. So we don't necessarily need to do a new release now or anything, because client code is completely unaffected.
While all of the tests pass, the test suite causes a core dump that seems to be caused by infinite recursion. The output below is from a run on a local machine:
This issue is also seen in our Jenkins pipeline, where the error is slightly different but ends in the same result:
The particular error seen in Jenkins points to a method in `nchelpers` meant to prevent infinite recursion. After some investigation, it looks like the error is coming from both test files in `tests/mm_cataloguer/`.