mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Database test redundancy #457

Open pavlis opened 10 months ago

pavlis commented 10 months ago

We have created a horrible maintenance nightmare in our test suite for testing the Database class and the parallel readers/writers. The issue is that more than 90% of the code in the following two test files is duplicated - well, almost pure duplicates, but that is a detail I'll clarify at the end. The two files in the python/tests directory are:

  1. db/test_db_spark_dask.py
  2. db/test_db_no_spark_dask.py

The two files differ in only the following ways:

  1. The imports are different
  2. The "no" file has a method, test_set_schema, that is missing from the other.
  3. The test_db_spark_dask file adds three methods to the "no" version: test_save_ensemble_data_binary_file, test_read_ensemble_data_group, and test_index_mseed_file_parallel. It also adds function tests (defined outside the class) for test_read_distributed_data and test_read_distributed_data_dask.
  4. The test_save_dataframe method in the test_db_spark_dask version has a "parallel" argument set to True, while the "no" version has it set to False.
  5. The test_save_and_read_data method has some minor differences that I think are harmless.
  6. Because of the mock.patch construct shown below, the body of the "no" version is indented one level deeper than test_db_spark_dask. That is important only because it made cross-checking the two files an even harder problem. I will spare you my solution. @wangyinz I am sending you the text file of the raw record I made to clarify this problem.

I think we need to fix this problem now to avoid a much nastier maintenance problem down the road. I hit it when resolving tests for v2 of Database: pytest ran the "no" version first, so I resolved all of those, and then realized I was having to fix the same problems in the other file when it ran later.

The solution I think is fairly obvious here is to use inheritance. Since test_db_spark_dask is (mostly) a superset of the "no" version, we should be able to make it just a subclass of the "no" version (see the sketch after this list). Complications to doing this are:

  1. I am confused by the documentation about how pytest handles test object constructors and destructors. That is, it isn't clear from my cursory reading so far why setup_class is used instead of __init__. I can see more easily why teardown_class is needed, as python doesn't really have a destructor - or does it? That is only an anticipated problem, though.
  2. The bigger issue will be getting the imports right. This will require a deep understanding of the way python handles symbols loaded during imports (confusing, in my opinion, but a necessary evil for an interpreted language).
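
To make the idea concrete, here is a minimal sketch of the structure I have in mind. All names are placeholders, not our actual test code, and the test bodies are stand-ins:

class TestDatabaseSerial:
    @classmethod
    def setup_class(cls):
        # pytest calls this once before any test in the class runs;
        # shared state goes on the class object, which is why it is
        # used instead of __init__
        cls.parallel = False

    @classmethod
    def teardown_class(cls):
        # called once after every test in the class has finished
        pass

    def test_save_dataframe(self):
        # the real body would pass parallel=self.parallel to Database
        assert isinstance(self.parallel, bool)


class TestDatabaseSparkDask(TestDatabaseSerial):
    # pytest collects the inherited tests again under this subclass,
    # so every shared test runs twice with different setup
    @classmethod
    def setup_class(cls):
        cls.parallel = True

    def test_index_mseed_file_parallel(self):
        # methods unique to the spark/dask file would be added here
        pass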

Can others on the development team clarify why we have this duplication and suggest alternatives to my inheritance proposal? I will work the inheritance angle if that is the best option.

pavlis commented 10 months ago

Back to look at this after traveling, and I see I have had no response here, although for the record @wangyinz and I exchanged a couple of emails on this topic.

With that preface, I was reading the pytest documentation this morning and think I see a cleaner solution that meshes with pytest features. That is, I think this problem may be better addressed with pytest fixtures. With fixtures we should be able to combine what are now two test files into one, with one configuration running dask/spark and the other running neither. The trick is to parametrize the tests with an argument that enables or disables dask/spark, implementing each alternative as a fixture; a rough sketch follows. Before I go down this potentially dark alley, any comments on the idea would be appreciated. I'm very naive about pytest and have a lot to learn, which makes it difficult to see what is in that dark alley.
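
Something like this is what I mean - untested, with made-up names, so treat it as a sketch of the mechanism rather than working code:

import pytest

# A parametrized fixture makes pytest run every test that requests it
# once per parameter: here once in serial mode and once with dask.
@pytest.fixture(params=["serial", "dask"])
def scheduler(request):
    if request.param == "dask":
        # skip the dask variant gracefully if dask is not installed
        pytest.importorskip("dask")
    return request.param


def test_save_dataframe(scheduler):
    parallel = scheduler == "dask"
    # the real body would pass parallel to the Database call;
    # this placeholder only demonstrates the two-way parametrization
    assert scheduler in ("serial", "dask")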

wangyinz commented 10 months ago

This sounds good! I thought there might be a solution using fixtures but never got the chance to explore it. If you figure out what it is, we should definitely implement it.

pavlis commented 10 months ago

I do think there may be a solution to this problem with fixtures, but I had yet another idea I'd like us to explore. Call this the "include" file approach.

Besides the differences between these two files noted above, test_db_no_spark_dask.py uses python's mock feature to, I guess, create an environment simulating a run with no dask or spark installed. At least, I think that is what the following construct does:

with mock.patch.dict(
    sys.modules,
    {"pyspark": None, "dask": None, "dask.dataframe": None, "dask.bag": None},
):
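
As a check on my understanding, this small self-contained experiment shows the behavior I believe we are relying on:

import sys
from unittest import mock

# While the patch is active, a sys.modules entry of None makes any
# import of that module raise ImportError, simulating "not installed".
with mock.patch.dict(sys.modules, {"dask": None}):
    try:
        import dask
    except ImportError:
        print("dask import blocked inside the with block")
# mock.patch.dict restores the original sys.modules dictionary on exit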

The code common to the two files all lies inside the with block defined by that construct. There are two fundamental things I don't fully understand here that make this challenging to fix:

  1. I am not sure I understand how this mock.patch statement works, so please confirm what I think is going on. That is, sys.modules is the dictionary of modules already imported into the run environment. This call to mock.patch temporarily replaces the spark and dask entries so those modules cannot be loaded. The tests inside the with block are then run in an environment where we can be sure dask and spark are both disabled and won't corrupt the namespace. Is that right?
  2. The rules pytest has for deciding whether a function is run as a test are simple if the file is just a sequence of functions whose names begin with "test". The complete set of rules is simple and described in the pytest documentation. Similarly, if, as in this case, a class is defined with various "test" methods, the class is treated as a group of tests run sequentially after initializing with the setup_class method. What is not clear is whether it is possible to put the common test code in a separate file and do the equivalent of an "include" in C. Specifically, if import acted like include, I could put all the redundant code in a separate file - call it "db_base_code.py" for this discussion. Then I wonder if the following construct would work:
    with mock.patch.dict(
        sys.modules,
        {"pyspark": None, "dask": None, "dask.dataframe": None, "dask.bag": None},
    ):
        import db_base_code  # might need a path qualifier but this is the idea

    This would be an easier solution than the fixtures or inheritance discussed above, but from the pytest collection rules I am skeptical that it would work. Maybe I need a simpler test file to see if this might work; the sketch after this paragraph is what I would try first. Note it is particularly complicated by the fact that the common code for these tests is inside a class wrapper to allow a setup_class method to define required data. Before I go down that road, perhaps one of you could tell me if this is a dead end.
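
This untested variant is my starting point - db_base_code is still an imagined file, and TestDatabase an imagined class inside it:

import sys
from unittest import mock

with mock.patch.dict(
    sys.modules,
    {"pyspark": None, "dask": None, "dask.dataframe": None, "dask.bag": None},
):
    # a bare "import db_base_code" leaves the test class in that module's
    # own namespace, where pytest will not collect it from this file;
    # binding the name here should make it visible to the collector
    from db_base_code import TestDatabase  # noqa: F401

# caveat: mock.patch.dict is only active while the import above runs, so
# it affects import-time decisions; code that consults sys.modules later,
# at test run time, sees the original dictionary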

Overall this is a very complex, multifaceted problem that is challenging to fix. I am learning the hard way why just copying the text was an easy way out. I reiterate, however, that the current pair of test files is not sustainable and we need to fix this problem.