Closed edublancas closed 2 years ago
@edublancas please add short context.
done
Hi @edublancas, I have a question about this issue. Don't we already have a file size check here: https://github.com/ploomber/soopervisor/blob/b230a5bb415dd4e9b05e1448b21b17c2150dc75c/src/soopervisor/commons/source.py#L174
Is it really necessary to add another check? I assume any big file will push the compressed archive over 5 MB, which then triggers this warning?
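For context, the existing check only measures the compressed archive's total size. Listing the size of each member inside a `.tar.gz` with the stdlib `tarfile` module would look something like this (a sketch, not the project's actual code; the function name is made up):

```python
import tarfile

def member_sizes(path):
    """Return (name, uncompressed size in MB) for each file in a .tar.gz.

    Illustrates why a total-size check isn't enough: the per-member
    sizes tell you *which* files are big.
    """
    with tarfile.open(path, "r:gz") as tar:
        return [(m.name, m.size / 1024 ** 2)
                for m in tar.getmembers() if m.isfile()]
```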
Great catch. I remember opening this issue and then implementing the basic check. What's already implemented doesn't provide much detail: it measures the size of the `.tar.gz` file, but the user won't know which specific files are too big.
So we want to extend this to check on a file-by-file basis. For example, a user may inadvertently include a few big `.csv` files as part of their project's source code, and we could show something like this:
The following files are too big. This will increase the Docker image size, so ensure they are required to run the pipeline:
- data.csv (X MB)
- another.csv (Y MB)
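The proposed per-file check could be sketched like this (a minimal sketch; the function names, the threshold constant, and the exact message wording are assumptions, not the project's actual implementation):

```python
from pathlib import Path

# Hypothetical threshold; the thread later settles on 10 MB
THRESHOLD_MB = 10

def find_big_files(root, threshold_mb=THRESHOLD_MB):
    """Return (path, size in MB) pairs for files above threshold_mb."""
    big = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            size_mb = path.stat().st_size / 1024 ** 2
            if size_mb > threshold_mb:
                big.append((str(path), size_mb))
    return big

def warn_if_big_files(root, threshold_mb=THRESHOLD_MB):
    """Print the warning message proposed in the thread, if needed."""
    big = find_big_files(root, threshold_mb)
    if big:
        print("The following files are too big. This will increase the "
              "Docker image size, so ensure they are required to run "
              "the pipeline:")
        for path, size_mb in big:
            print(f"- {path} ({size_mb:.1f} MB)")
    return big
```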
Ok, I see. Working on it.
When building the Docker image, we compress the user's code (a `.tar.gz` file) and copy it into the image; here's the implementation. The function already ignores things that shouldn't be copied into the image, like `__pycache__` files or the git repository. These are clear examples of things that shouldn't go into the image, but there are ambiguous cases.

For example, a user might be storing raw data in some `raw/` folder, and the function will copy it, making the Docker image unnecessarily big. In many cases this would be a mistake and we should warn the user, but in other cases the user might indeed want to include those files (so we shouldn't throw an error just because of this).

So we should warn the user if big files end up in the folder that will get copied: we should modify the copying function and print a warning message in yellow if we detect files above a certain threshold. 10 MB is a reasonable threshold.