ucldc / rikolti

calisphere harvester 2.0

Mapper unit testing: print output about each mapper even if no failures #707

Open barbarahui opened 8 months ago

barbarahui commented 8 months ago

We looked at Tim's mapper unit testing PR #530, and it still doesn't print output per mapper when run. Otherwise, the PR looks fine.

We're waiting to see how many CiC hours we have left once Lucas finishes the Tind mapper work. If there is still time on the books, we'll have Tim do the work to add output per mapper. Otherwise, CDL will do the work at some point.

We decided that mapper unit testing isn't critical for MVP, so it can potentially wait until after cutover. However, it will definitely be useful in the future when doing mapper development.

See Barbara's notes below on how to approach this work if CDL ends up being the one to do it.

barbarahui commented 8 months ago

The framework uses pytest. From pytest's perspective, there is only one test right now, called test_mappers (defined in metadata_mapper/test/test_mapper.py). This single test comprises:

  1. traversing the codebase and finding mappers to test
  2. for each mapper found:
    • skip if mapper doesn't match --mapper or --mappers arg
    • import mapper module (note: why not wait to import until we know we have a helper?)
    • look for a corresponding helper class
    • instantiate helper class (this creates fixture data)
    • instantiate mapper's Record class
    • run the default test method, which simply runs to_UCLDC() on each mapper's Record class with the fixture data as input; if any exceptions are raised, pytest reports them as failures. The code uses pytest.assume to override pytest's default halting behavior and keep iterating over the remaining mappers (see the sketch after this list).
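
For reference, here is a rough sketch of how that single test works conceptually. This is not the actual code from PR #530: find_mapper_modules() and get_helper_class() are hypothetical stand-ins for the discovery and helper-lookup logic, the import path and Record constructor signature are guesses, and the --mapper option is assumed to be registered in conftest.py.

import importlib
import pytest

def test_mappers(request):
    # --mapper takes a comma-separated list, e.g. --mapper=contentdm,csudh
    selected = [m for m in (request.config.getoption("--mapper") or "").split(",") if m]
    for mapper_name in find_mapper_modules():  # hypothetical discovery helper
        if selected and mapper_name not in selected:
            continue  # skip mappers that don't match the --mapper arg
        # import path is illustrative, not necessarily the repo's actual layout
        module = importlib.import_module(f"metadata_mapper.mappers.{mapper_name}")
        helper_class = get_helper_class(mapper_name)  # hypothetical lookup
        if helper_class is None:
            continue  # no helper class means no fixture data to test with
        fixture_data = helper_class().fixture_data  # instantiating the helper builds fixture data
        record = module.Record(fixture_data)  # exact constructor signature is a guess
        try:
            record.to_UCLDC()
        except Exception as exc:
            # pytest.assume records the failure but lets the loop continue
            # over the remaining mappers instead of halting the whole test
            pytest.assume(False, f"{mapper_name}: {exc!r}")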

Problem 1: we don't get output per mapper when running pytest. This is because the default behavior for pytest is to list the tests that it "collects" (in this case, a single test) and then suppress any output unless the test fails. So for example, this is what we get when we run pytest for contentdm and csudh mappers:

$ pytest metadata_mapper/test/ --mapper=contentdm,csudh
================================================= test session starts ==================================================
platform darwin -- Python 3.9.0, pytest-7.4.4, pluggy-1.3.0
rootdir: /Users/bhui/dev/rikolti
plugins: Faker-22.5.0, assume-2.4.3, requests-mock-1.11.0
collected 1 item

metadata_mapper/test/test_mapper.py .                                                                            [100%]

================================================== 1 passed in 1.26s ===================================================

Pytest simply reports that one test passed. What we need is some output on which mappers were tested. I think this can be accomplished by tweaking pytest's output capture behavior and/or logging behavior.
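
One way this might be done (an assumption on my part, not something already in the PR) is to log each mapper name inside the loop and turn on pytest's live-log output, which prints log records as they happen instead of capturing them:

import logging

logger = logging.getLogger(__name__)

# inside the per-mapper loop in test_mappers, before running the mapper:
logger.info("testing mapper: %s", mapper_name)

Running with --log-cli-level=INFO (or setting log_cli = true and log_cli_level = INFO in the pytest ini file) makes pytest print those log records live, so each tested mapper shows up even when everything passes. Plain -s / --capture=no would also surface print output, but the logging route keeps the normal test output tidy.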

Problem 2: we don't get any error output when running pytest for a non-existent mapper, e.g.:

$ pytest metadata_mapper/test/ --mapper=foobar
================================================= test session starts ==================================================
platform darwin -- Python 3.9.0, pytest-7.4.4, pluggy-1.3.0
rootdir: /Users/bhui/dev/rikolti
plugins: Faker-22.5.0, assume-2.4.3, requests-mock-1.11.0
collected 1 item

metadata_mapper/test/test_mapper.py .                                                                            [100%]

================================================== 1 passed in 0.81s ==================================================

Again, I think we can tweak the capture and/or logging behavior to fix this.
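
Rather than relying on captured output, one option (again an assumption, not existing code) is to track which mappers were actually exercised and fail the test explicitly when the --mapper argument matches nothing:

# sketch: collect the mappers that actually got exercised
tested_mappers = []

# ...appending inside the loop once a mapper's record has been run:
#     tested_mappers.append(mapper_name)

# then, after the loop, fail loudly if --mapper matched nothing:
if selected and not tested_mappers:
    pytest.fail(f"no mappers found matching --mapper={','.join(selected)}")

That way pytest metadata_mapper/test/ --mapper=foobar would exit with a failure and a clear message instead of reporting "1 passed".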

Nice to have: it would also be useful to get output on which mappers are missing a helper class.
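
If the logging approach sketched above is used, this could come out of the same helper-lookup branch (again assuming the hypothetical get_helper_class()):

if helper_class is None:
    logger.warning("no test helper found for mapper: %s", mapper_name)
    continue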

barbarahui commented 8 months ago

Some things in Tim's PR that would be good to tidy up: