Open pavlis opened 10 months ago
Back to look at this after traveling and see I have had no response here although for the record @wangyinz and I exchanged a couple of emails on this topic.
With that preface, I was reading pytest documentation this morning and think I see a cleaner solution that meshes with pytest features. That is, I think this problem may be better addressed with pytest fixtures. With fixtures I think we can combine what is now two test files into one, with one run using dask/spark and the other running neither. The trick is to change the test class to add an argument for enabling dask or spark and implement each alternative as a fixture. Before I go down into this potentially dark alley, any comments on the idea would be appreciated. I'm very naive about pytest and have a lot to learn, making it difficult to see what is in that dark alley.
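To make the idea concrete, here is a sketch of the kind of thing I have in mind. All names here are invented placeholders (the mode list, the fixture name, and the test body are not the real MsPASS tests): a parametrized fixture makes pytest run every test in the class once per scheduler mode.

```python
import pytest

# Sketch only: SCHEDULER_MODES and all names below are placeholders,
# not the real MsPASS test code.
SCHEDULER_MODES = ["serial", "dask", "spark"]


@pytest.fixture(params=SCHEDULER_MODES)
def scheduler(request):
    # pytest re-runs each test that uses this fixture once per mode
    return request.param


class TestDatabase:
    def test_save_dataframe(self, scheduler):
        parallel = scheduler != "serial"
        # the real test would pass parallel=parallel into the save call
        assert isinstance(parallel, bool)
```

With this layout pytest would report three variants of each test (e.g. `test_save_dataframe[serial]`, `[dask]`, `[spark]`), so the duplicated files collapse into one class.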
This sounds good! I thought there might be a solution using fixtures but never got the chance to explore it. If you figure out what it is, we should definitely implement it.
I do think there may be a solution with fixtures to this problem, but I had yet another idea I'd like us to explore. Call this the "include" file approach.
Besides the differences in these two files noted above, `test_db_no_spark_dask.py` uses Python's mock feature to, I guess, create an environment simulating a run with no dask or spark installed. At least, I think that is what the following construct does:
```python
with mock.patch.dict(
    sys.modules,
    {"pyspark": None, "dask": None, "dask.dataframe": None, "dask.bag": None},
):
```
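For reference, a minimal standalone experiment (independent of these test files) shows what this construct does: a `None` entry in `sys.modules` causes any `import` of that name inside the block to raise `ImportError`, and `mock.patch.dict` restores the original entries when the block exits.

```python
import sys
from unittest import mock

# Minimal experiment, not MsPASS code: a None entry in sys.modules
# makes "import dask" fail for the duration of the with block only.
with mock.patch.dict(sys.modules, {"dask": None}):
    try:
        import dask
        blocked = False
    except ImportError:
        blocked = True

print(blocked)  # True: the import was halted by the None entry
```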
The common code between these two files all lies inside the `with` block defined by the lines above. I have two fundamental issues I don't fully understand that make this challenging to fix.

The first is how the `mock.patch` statement works, so please confirm what I think is going on here. That is, `sys.modules` is the current list of implicit imports for the run environment. This call to `mock.patch` makes sure the spark and dask modules that might be implicitly loaded are cleared. Then the tests inside the `with` block are run in an environment where we can be sure dask and spark are both disabled and wouldn't corrupt the namespace. Is that right?

The second involves the `setup_class` method. What is not clear is whether it is possible to put common test code in a separate file and do the equivalent of an "include" in C. Specifically, if `import` acted like include I could put all the redundant code in a separate file. Call it "db_base_code.py" for this discussion. Then I wonder if the following construct would work:
```python
with mock.patch.dict(
    sys.modules,
    {"pyspark": None, "dask": None, "dask.dataframe": None, "dask.bag": None},
):
    import db_base_code  # might need a path qualifier but this is the idea
```
This would be an easier solution than the fixtures or inheritance discussed above, but from the pytest rule description above I am skeptical that it would work. Maybe I need a simpler test file to see if this might work. Note it is particularly complicated by the fact that the common code for these tests is inside a class wrapper to allow a `setup_class` method to define required data. Before I go down that road perhaps one of you could tell me if this is a dead end.
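In case it helps decide whether this is a dead end, here is a small self-contained experiment one could try first (the module name and contents are invented purely to probe the import behavior, not taken from the test suite): write a tiny module to disk, then import it inside the `mock.patch.dict` block. The module body does execute under the patched `sys.modules`, but only on its first import; Python caches modules, so a second `import` in another file would not re-run the body, which is the main reason import is not a true "include".

```python
import os
import sys
import tempfile
from unittest import mock

# Write a throwaway stand-in for "db_base_code.py" (contents invented
# purely to probe the import behavior).
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "db_base_demo.py"), "w") as f:
    f.write(
        "try:\n"
        "    import dask\n"
        "    HAVE_DASK = True\n"
        "except ImportError:\n"
        "    HAVE_DASK = False\n"
    )
sys.path.insert(0, tmpdir)

with mock.patch.dict(sys.modules, {"dask": None}):
    # The module body runs here, under the patched sys.modules, but
    # only because this is its first import; later imports are cached.
    import db_base_demo

print(db_base_demo.HAVE_DASK)  # False: dask was blocked during import
```

So the construct "works" for one importer, but a second test file repeating the pattern would silently get the cached module from the first environment unless it forces re-execution with `importlib.reload`.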
Overall this is a very complex, multifaceted problem that is challenging to fix. I am learning the hard way why just copying the text was an easy way out. I reiterate, however, that the current pair of test files is not sustainable and we need to fix this problem.
We have created a horrible maintenance nightmare in our test suite for testing the Database class and parallel readers/writers. The issue is that more than 90% of the code in the following two test files is pure duplication - well, almost pure duplication, but that is a detail I'll clarify at the end. The two files in the python/tests directory are `test_db_no_spark_dask.py` and `test_db_spark_dask.py`.
The two files differ in only the following ways:

- One has a `test_set_schema` method that is missing from the other.
- The spark/dask version adds the methods `test_save_ensemble_data_binary_file`, `test_read_ensemble_data_group`, and `test_index_mseed_file_parallel`. It also adds function tests (not in the class definition) for `test_read_distributed_data` and `test_read_distributed_data_dask`.
- The `test_save_dataframe` method in the test_db_spark_dask version has a "parallel" argument set True while the "no" version has it False.
- The `test_save_and_read_data` method has some minor differences I think are harmless.

I think we need to fix this problem now to avoid a much nastier maintenance problem down the road. I hit this problem when resolving tests for v2 of Database. pytest ran the "no" version first so I resolved all those. Then I realized I was having to fix the same problems in the other file when it was run later.
The solution I think is fairly obvious here is to use inheritance. Since test_db_spark_dask is (mostly) a superset of the "no" version, we should be able to make it just a subclass of the "no" version. There are complications to doing this, however.
Can others on the development team clarify why we have this duplication and suggest alternatives to my inheritance proposal? I will work the inheritance line if that is the best option.
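For concreteness, the inheritance proposal might look roughly like the following sketch. The class names, the `parallel` attribute, and the test bodies are invented for illustration; the real classes would carry the actual Database tests. The "no" file holds the base class, and the spark/dask file subclasses it, flips a flag, and adds the parallel-only tests, so pytest collects the shared tests twice without any copied code.

```python
# Sketch with invented names, not the real test code.


class TestDatabaseSerial:
    """Would live in test_db_no_spark_dask.py."""

    parallel = False  # shared tests consult this class attribute

    @classmethod
    def setup_class(cls):
        cls.data = {"ready": True}  # stand-in for the real setup

    def test_save_dataframe(self):
        # shared test whose behavior switches on the class attribute
        assert self.data["ready"]
        assert isinstance(self.parallel, bool)


class TestDatabaseParallel(TestDatabaseSerial):
    """Would live in test_db_spark_dask.py; inherits all shared tests."""

    parallel = True

    def test_index_mseed_file_parallel(self):
        # parallel-only test added by the subclass
        assert self.parallel
```

The main design point is that pytest runs every inherited test method on the subclass too, so the spark/dask file only needs to define what is genuinely different.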