lemma-osu / naip-cnn

Modeling forest attributes from aerial imagery using CNNs

Dev Container slow to rebuild #7

Closed by aazuspan 1 year ago

aazuspan commented 1 year ago

When the Docker image is rebuilt, all data in the project directory is included in the "build context". For me, this currently means that the build context is about 31 GB and takes half an hour to rebuild!

We need to make sure all data in the directory is available to the container so that we can train on it, but don't want to have to copy it all into the container like we're currently doing. This obviously isn't a unique problem, so there must be a solution.

This StackOverflow response looks like it's probably the answer:

> You can create a .dockerignore file that names files that need to be omitted from the context... files you include in .dockerignore can't be COPYed into the image. This only affects the image build process and has no effect on volumes:; you could bind-mount the omitted directory into the container at runtime, even if it shouldn't be included in the image.
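To check how much a .dockerignore entry would actually save, something like the following rough sketch could help. It sums file sizes per top-level entry of the project directory and compares the total with and without a set of excluded names. The function names and the `ignored` set are hypothetical, and the matching is deliberately simplified: it only excludes whole top-level entries by exact name, whereas Docker's real .dockerignore handling supports patterns.

```python
import os
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Total size of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def context_size_by_entry(project_dir: Path) -> dict[str, int]:
    """Map each top-level entry of project_dir to its size in bytes."""
    sizes = {}
    for entry in sorted(project_dir.iterdir()):
        if entry.is_symlink():
            continue
        if entry.is_dir():
            sizes[entry.name] = dir_size_bytes(entry)
        else:
            sizes[entry.name] = entry.stat().st_size
    return sizes


def estimate_context(project_dir: Path, ignored: set[str]) -> tuple[int, int]:
    """Return (total bytes, total bytes with `ignored` top-level entries excluded).

    Assumption: `ignored` holds exact top-level names only; this is NOT
    a full implementation of .dockerignore pattern matching.
    """
    sizes = context_size_by_entry(project_dir)
    full = sum(sizes.values())
    trimmed = sum(size for name, size in sizes.items() if name not in ignored)
    return full, trimmed
```

For example, `estimate_context(Path("."), ignored={"data"})` run from the project root would give an approximate before/after size, assuming the bulk of the 31 GB lives in the data directory.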


aazuspan commented 1 year ago

@grovduck I think I fixed this in the refactor branch, but I'm still figuring out how best to set up Docker, so please let me know if you run into any issues like slow builds, missing/out of sync data, etc.

Also, have you noticed that Git changes aren't automatically refreshed in your Docker container, e.g. you make a change to a file and it's not marked as modified in VS Code? I'm running into that, but not sure if it's something in my local config or the Docker settings.

grovduck commented 1 year ago

@aazuspan, sorry I've been so slow to review. Going through the notebooks now.

> Also, have you noticed that Git changes aren't automatically refreshed in your Docker container, e.g. you make a change to a file and it's not marked as modified in VS Code? I'm running into that, but not sure if it's something in my local config or the Docker settings.

Yes, I'm experiencing the same behavior. For example, I created the .h5 file from the first notebook, but it didn't automatically refresh when created. When I did hit the refresh button up top, it did show up.

Also, as part of running the notebooks, I create modifications (random sampling, etc.), but those modifications are not being shown in VSCode. I can run git status at command line and it shows as changed, but even a refresh of VSCode doesn't make it show up in the Source Control panel.

> I think I fixed this in the refactor branch, but I'm still figuring out how best to set up Docker, so please let me know if you run into any issues like slow builds, missing/out of sync data, etc.

I'm pretty sure you haven't committed the Malheur_lidar_cancov.tif file (in the data directory) - that's the only file I've had to manually add. But perhaps I'm missing data that I should have? My rebuild of the container was very fast.

aazuspan commented 1 year ago

> Yes, I'm experiencing the same behavior. For example, I created the .h5 file from the first notebook, but it didn't automatically refresh when created. When I did hit the refresh button up top, it did show up. Also, as part of running the notebooks, I create modifications (random sampling, etc.), but those modifications are not being shown in VSCode. I can run git status at command line and it shows as changed, but even a refresh of VSCode doesn't make it show up in the Source Control panel.

Good to know! I'm also having to manually refresh the Source Control tab to get changes to show up, so it sounds like that's probably a Docker problem. I opened #8 to track that.

> Also, as part of running the notebooks, I create modifications (random sampling, etc.), but those modifications are not being shown in VSCode. I can run git status at command line and it shows as changed, but even a refresh of VSCode doesn't make it show up in the Source Control panel.

That's interesting, I'm not sure what could cause that... Does that only affect notebooks?

> I'm pretty sure you haven't committed the Malheur_lidar_cancov.tif file (in the data directory) - that's the only file I've had to manually add.

Yes, that's on me! I was hesitating to start committing data or models because we probably won't end up being able to store everything on GitHub. Actually, as part of #5 I'm considering migrating all the LiDAR to EE to simplify the footprint sampling, at which point I think all of the data would be either remote or generated.

grovduck commented 1 year ago

>> Also, as part of running the notebooks, I create modifications (random sampling, etc.), but those modifications are not being shown in VSCode. I can run git status at command line and it shows as changed, but even a refresh of VSCode doesn't make it show up in the Source Control panel.

> That's interesting, I'm not sure what could cause that... Does that only affect notebooks?

Sorry, that wasn't clear at all. I just meant that it didn't initially show as a change in the Source Control tab (as you note), i.e. incrementing the number of files that had changed. The actual text is changing.