spotify / luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Apache License 2.0

Best practices for testing luigi pipelines #3104

Open plazmakeks opened 3 years ago

plazmakeks commented 3 years ago

Hi,

I recently took over a project that a former coworker started at my current company. He implemented an ETL pipeline using luigi. Every task outputs a LocalTarget (i.e. a file) that is fed into the next tasks. Until I took over, automated tests were neglected, but we now want to have them. I managed to write a test that runs the complete pipeline and makes its final assertion on the number of documents inserted into a database. That part is not an issue.

However, as noted, most tasks in the pipeline create files, and here comes the pain: I would like to mock the filesystem, because the CI runs the test in a Docker container, and I haven't managed to get this working so far. The pyfakefs library generally works well, but it cannot mock Python's multiprocessing module, which the luigi worker requires. For luigi's own MockFileSystem there are almost no examples available, and the few I found are not very detailed. Since there are no tests yet, I would rather not change the implementation, so mocking the filesystem is my preferred way to go.

Could anyone point me to a good example of how to do that?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If closed, you may revisit when your time allows and reopen! Thank you for your contributions.