ploomber / soopervisor

☁️ Export Ploomber pipelines to Kubernetes (Argo), Airflow, AWS Batch, SLURM, and Kubeflow.
https://soopervisor.readthedocs.io
Apache License 2.0
45 stars 18 forks source link

warn if source distribution has big files #58

Closed edublancas closed 2 years ago

edublancas commented 2 years ago

When building the docker image, we compress the user's code (.tar.gz file) and copy it into the image, here's the implementation.

The function already ignores things that shouldn't be copied into the image, like __pycache__ files or the git repository. These are clear examples of things that shouldn't go into the image but there are ambiguous cases.

For example, a user might be storing raw data in some raw/ folder and the function will copy them, making the docker image unnecessarily big. In many cases, this would be a mistake and we should warn the user, but in other cases, the user indeed might want to include those files (so we shouldn't throw an error just because of this)

So we should warn the user if big files end up in the folder they will get copied. we should modify the copying function and print a warning message in yellow color if we detect files above a certain threshold. 10 MB is a reasonable threshold

idomic commented 2 years ago

@edublancas please add short context.

edublancas commented 2 years ago

done

Wxl19980214 commented 2 years ago

Hi @edublancas I have a question for this issue. Don't we already have a file size checker at here https://github.com/ploomber/soopervisor/blob/b230a5bb415dd4e9b05e1448b21b17c2150dc75c/src/soopervisor/commons/source.py#L174

Is it really necessary to add another check again? I assume any big files after compression will exceed 5MB which then triggers this warning?

edublancas commented 2 years ago

great catch. I remember opening this issue and then implementing the basic check. What it's already implemented does not provide much detail, it measures the size of the .tar.gz file but the user won't know what specific files are too big.

So we want to extend this to check on a file-by-file basis, for example, a user may inadvertently include a few big .csv files as part of their project's source code and could show something like this:

the following files are too big. this will increase the docker image size so ensure this is required to run the pipeline:

- data.csv (X MB)
- another.csv (Y MB)
Wxl19980214 commented 2 years ago

Ok I see. Working it.