polywrap / wrap-cli

Used to create, build, and integrate wraps.
https://polywrap.io
MIT License

Put Docker Image Inside Build Folder #424

Open dOrgJelli opened 3 years ago

dOrgJelli commented 3 years ago

Keep the built Docker image inside a "build" folder (or some other folder). This way we're "hosting" Docker images ourselves instead of relying on Docker Hub.
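A minimal sketch of what that export step could look like (the image name `polywrap-build-env` is just a placeholder, not the CLI's actual behavior):

```python
import subprocess
from pathlib import Path

# Export the image produced by `docker build` into the project's build
# folder, so the package "hosts" its own image instead of relying on
# Docker Hub. The image name "polywrap-build-env" is a placeholder.
build_dir = Path("build")
build_dir.mkdir(exist_ok=True)
subprocess.run(
    ["docker", "save", "-o", str(build_dir / "image.tar"), "polywrap-build-env"],
    check=True,
)
```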

dOrgJelli commented 2 years ago

Hey @namesty, thank you for implementing this.

After further consideration, I think the approach I've suggested of "putting the docker image inside the build folder" is naive and will lead to problems in the future. Here are some additional details:

Size of Docker Image:
Given a simple test polywrapper (test-cases/cases/apis/asyncify/), the docker image.tar file is ~450 MB. If you pipe the image through gzip, it becomes ~150 MB.

Diving deeper into what's causing the bloat inside the image.tar file: ~80% of it is due to the base Linux image added in the first couple of layers. If you open up the .tar file, you can see this for yourself.
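If you want to verify this yourself, here is a small sketch that reads an archive produced with `docker save -o image.tar <image>` and prints each layer's share of the total:

```python
import json
import tarfile

# Inspect an archive created with `docker save -o image.tar <image>`:
# manifest.json lists the layer tarballs, so we can size each one.
with tarfile.open("image.tar") as tar:
    manifest = json.load(tar.extractfile("manifest.json"))
    layers = manifest[0]["Layers"]
    sizes = [(tar.getmember(layer).size, layer) for layer in layers]
    total = sum(size for size, _ in sizes)
    for size, layer in sorted(sizes, reverse=True):
        print(f"{size / 1e6:8.1f} MB  {size / total:6.1%}  {layer}")
```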

IPFS Storage Optimization:
In IPFS, folders & files are stored as DAGs, where each node in the DAG is addressed by the hash of its contents (a file, folder, or chunk). This helps optimize storage and reduce redundancy.

My fear is that, if we store the large image.tar as a single file, IPFS won't properly optimize its storage. Ideally IPFS would be able to store the individual layers, which would help reduce redundancy, because common layers (such as a base Linux image) would be stored once and referenced, not duplicated. More research is needed to better understand IPFS's default behavior, and whether the image.tar file "plays nice" with it or makes this de-duplication harder.
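To make the concern concrete, here is a toy sketch of content-addressed chunking (a fixed-size chunker like IPFS's 256 KiB default, not IPFS's real DAG builder). A layer stored as its own file deduplicates; the same layer packed into two slightly different tars does not, because every chunk boundary shifts:

```python
import hashlib
import os

CHUNK = 256 * 1024  # IPFS's default chunker uses fixed 256 KiB blocks

def chunk_hashes(data: bytes) -> list[str]:
    # Hash each fixed-size chunk, roughly mimicking content addressing.
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()[:12]
            for i in range(0, len(data), CHUNK)]

base_layer = os.urandom(CHUNK * 4)  # stand-in for a shared base-OS layer
app_a, app_b = os.urandom(CHUNK), os.urandom(CHUNK)

# Layers stored as separate files: the shared base layer yields identical
# chunk hashes in both packages, so it deduplicates fully.
files_a = chunk_hashes(base_layer) + chunk_hashes(app_a)
files_b = chunk_hashes(base_layer) + chunk_hashes(app_b)
print(len(set(files_a) & set(files_b)))  # 4 shared chunks

# Layers packed into one image.tar each: differing tar headers shift every
# chunk boundary, so the common base layer no longer lines up at all.
tar_a = b"header-A" + base_layer + app_a
tar_b = b"header-BB" + base_layer + app_b
print(len(set(chunk_hashes(tar_a)) & set(chunk_hashes(tar_b))))  # 0
```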

Additional Research Needed:
We need to research the following two properties of IPFS:

  1. How IPFS chunks and deduplicates a single large file like image.tar.
  2. Whether storing layers as individual files in a folder lets IPFS deduplicate layers that are common across packages.

Ideal Future State:
IMO, in the future, the ideal state would be:

  1. Store Polywrap "packages" in their own folders, minimizing the amount of space needed to keep them "online" within an archive node that pins Polywrap packages for quick retrieval.
  2. Store Polywrap "source images" in their own folders, and have a separate set of archive nodes keep that information online. This data would only be needed for source-code verification & auditing. Instead of storing the image.tar as a single file, we would extract its contents and store those in a folder. This helps IPFS optimize the storage of common docker image layers. Additionally, in the future, this would let us flag a build image's layers as "audited" or "known good" to speed up audits. For example, a "known good" layer hash of 0x58... representing a known-good Linux image would be marked green, and auditors could largely ignore it and move on to the other "unknown" layers that are specific to the API at hand.

NOTE: The "package" would link to the "source image" within its web3api.build.yaml file.
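For illustration only, that link could look something like this (the field names below are hypothetical, not the actual web3api.build.yaml schema):

```yaml
# Hypothetical sketch, not the real web3api.build.yaml schema.
docker:
  # CID of the extracted "source image" folder pinned on IPFS,
  # fetched only for source-code verification & auditing.
  sourceImage: "ipfs://Qm..."
```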

Conclusion

All of this said, I think we should discuss what the ideal future state should be, and create an execution plan around that. That would probably lead to us closing this issue and opening up new ones.

Please let me know your thoughts.

Niraj-Kamdar commented 2 years ago

Check out https://github.com/jvassev/image2ipfs/. It may help.

namesty commented 2 years ago

I'll pick this back up.

namesty commented 2 years ago

After a bit of research, I had an idea:

  1. Standardize the Dockerfile layering.
  2. Build the image, but instead of saving the whole .tar (or the whole image) as we were doing:
  3. Deconstruct the image into layers with https://github.com/larsks/undocker/, which is very simple to use, and output the layers to a "build-env image" folder, excluding the OS layer. (Doing this programmatically is something I have to look into, as there seems to be no ready-made solution; see the sketch after this list.)
  4. We could then pin this folder, which contains the actual essential data, to IPFS.
  5. And for auditing purposes in the future, an auditor could rebuild the image locally, OS layer and all, and then compare that locally built image with the layers retrieved from IPFS (using a tool like dive).
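A minimal sketch of that extraction step, using only the standard library instead of undocker, and assuming the archive comes from `docker save` with the OS layer listed first in the manifest:

```python
import json
import tarfile
from pathlib import Path

# Split a `docker save` archive into one folder per layer, skipping the
# first (assumed OS) layer. undocker does this more robustly; this only
# shows the shape of the idea.
OUT = Path("build-env-image")

with tarfile.open("image.tar") as image:
    layers = json.load(image.extractfile("manifest.json"))[0]["Layers"]
    for n, layer in enumerate(layers):
        if n == 0:
            continue  # exclude the base-OS layer
        dest = OUT / f"layer-{n}"
        dest.mkdir(parents=True, exist_ok=True)
        with tarfile.open(fileobj=image.extractfile(layer)) as layer_tar:
            layer_tar.extractall(dest)  # note: trusts the archive's paths
```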

I thought that this would be compliant with the second point of @dOrgJelli's ideal scenario, but I may be missing a point regarding how the audits will be done.

Does this make sense?

dOrgJelli commented 2 years ago

Yes, that makes perfect sense. I came to the same conclusion after unzipping the docker .tar file and seeing that all the layers are right there!

We could keep the OS layer in there, and ideally IPFS hashing would de-duplicate it. It would require more investigation to make sure this is the case.

I worry that if we don't have the OS layer, we would have a hard time reproducing the build again.

dOrgJelli commented 2 years ago

@namesty after speaking with a developer who's spent the past 3 years researching & building with IPFS & IPLD, he confirmed to me that "if multiple folders have the same files, IPFS will automatically deduplicate these files due to the preservation of the content hash in the IPLD database."

This means that we can store the OS layer, no problem, because it will not be duplicated. Additionally, uploads SHOULD be optimized as well, as I do not think a node would ask a sender to upload a file it already has.