turnkeylinux / tracker

TurnKey Linux Tracker
https://www.turnkeylinux.org

[TKLDev] Feature Request: git clone cache/proxy #466

Closed. DocCyblade closed this issue 9 years ago

DocCyblade commented 9 years ago

Following on from a conversation elsewhere (https://github.com/l-arnold/tkl-nomadic-odoo/issues/11#issuecomment-142125653): there should be a wrapper that would "cache" or mirror GIT repos. The current system uses the built-in proxy, but due to the way GIT uses HTTP as a transfer protocol it does not behave as one would expect and does not cache as we would expect.

So it would be nice to have a script to proxy/cache the git clone command so that it downloads a repo only once, pulls locally thereafter, and keeps the downloaded repo up to date via some kind of timestamped file so it's not fetching every time.

I have started an attempt (not even working at the moment, just a skeleton/template bash script right now) but will hack away at this on the side when I have nothing to do! :open_mouth: Check it out in my gists: https://gist.github.com/DocCyblade
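
Roughly what I have in mind for that script (untested sketch; the cache path, refresh interval and function name are all placeholders):

#!/bin/bash
# sketch only - cache dir, refresh age and helper name are made up
GIT_CACHE=/var/cache/git-cache
MAX_AGE_MIN=60    # re-fetch a cached repo at most once an hour

cached-git-clone() {
    local url=$1 dest=$2
    local name=$(basename "$url" .git)
    local cache="$GIT_CACHE/$name"
    local stamp="$cache/.last-fetch"

    if [ ! -d "$cache" ]; then
        git clone "$url" "$cache"                     # first time: full download
        touch "$stamp"
    elif [ ! -e "$stamp" ] || [ -n "$(find "$stamp" -mmin +$MAX_AGE_MIN)" ]; then
        git -C "$cache" pull                          # stamp file is stale: refresh the cache
        touch "$stamp"
    fi
    git clone "$cache" "$dest"                        # always clone from the local cache
}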

JedMeister commented 9 years ago

Thanks for testing this and discovering the issue, Ken. I marked it as both a 'bug' and a 'feature request' as I think it's sort of both! :smile:

l-arnold commented 9 years ago

I referenced this just now at https://github.com/l-arnold/tkl-nomadic-odoo/issues/11, but since we have moved into the Tracker I thought I'd put the comment here as well.

It seems we can go to /turnkey/fab, mkdir git-cache, and run "git clone" there to bring a full git repo in so it's available to TKLDEV. Then the app build could call "git-cache/folder" just as "common" is called in TKLDEV processes.

Would save a lot of time for sure. Would just need to "git pull" once in a while on each of the folders that get pulled (could be a lot) to stay current. Would also need some logic in how TKLDEV addresses that folder set.
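
For example, the manual version might look like this (Odoo used purely as an example repo):

# one-off setup
mkdir -p /turnkey/fab/git-cache
cd /turnkey/fab/git-cache
git clone https://github.com/odoo/odoo.git

# every so often, to stay current
cd /turnkey/fab/git-cache/odoo && git pull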

DocCyblade commented 9 years ago

@l-arnold This is more about fixing the way the current TKLDev tool chain handles GIT in its builds. The current proxy does not seem to cache git the way it does downloads of zips or packages. This idea would allow the TKLDev tool chain to do the same with GIT repos. It really has nothing to do with Odoo specifically; it just happened that I stumbled upon it while working on the Odoo build.

JedMeister commented 9 years ago

Thinking about this a little more (and I noticed that your gist is still a bare-bones template - nice template BTW :smile:)... So did you mean that Polipo (the caching proxy used in TKLDev), even when configured right, won't cache git clone operations?

Or did you mean that setting the system/env vars just wasn't working? IIRC you said that git will honour the http/https proxy settings?

Anyway just had a thought this morning (which might be blown out of the water depending on your answer to above):

git clone accepts the -c switch which allows you to specify config (e.g. git clone -c <key>=<value>) for an individual git clone operation (see docs). So perhaps an easy way could be to just set up a bash function (like many apps already do with dl to download using the proxy) within the conf script. Something like:

proxy-clone() {
    # pass the repo URL (and any other git clone args) straight through to git
    /usr/bin/git clone -c http.proxy=$FAB_PROXY -c https.proxy=$FAB_PROXY --depth 1 "$@"
}
# clone repo using proxy
proxy-clone https://github.com/user/repo.git

:question:

Also testing/troubleshooting this might be easier if we turn up the verbosity of git with these env vars?:

export GIT_TRACE=1
export GIT_CURL_VERBOSE=1
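
Or just for a single command, e.g.:

GIT_TRACE=1 GIT_CURL_VERBOSE=1 git clone https://github.com/user/repo.git
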
DocCyblade commented 9 years ago

Good call, I am going to run a few tests.

  1. Clear the proxy cache
  2. Perform two downloads (one GitHub zip file, one from another open source site)
  3. Check the contents of the proxy cache (see if it is caching the files)
  4. Clear the cache again
  5. Try git clone using the proxy server, set manually
    • 5a. Full clone of a small repo (30-40MB)
    • 5b. Full clone of the large Odoo repo
    • 5c. Shallow clone of Odoo
  6. For each of the above tests, show the disk usage of the proxy cache, then clear the cache

These should shed some light on whether it is caching anything at all.
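
For the "check the cache" steps I'm thinking of something like this (assuming Polipo keeps its disk cache under /var/cache/polipo - I still need to confirm that's where TKLDev's proxy actually puts it):

du -sh /var/cache/polipo                 # disk usage of the proxy cache
find /var/cache/polipo -type f | head    # peek at what actually got cached
rm -rf /var/cache/polipo/*               # clear the cache between tests
service polipo restart                   # probably wise, so its in-memory index matches the now-empty cache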

I will re-run these tests with the proxy script (hoping to work on that tomorrow night).

JedMeister commented 9 years ago

Sounds awesome! :+1:

DocCyblade commented 9 years ago

@JedMeister Ok, so very strange, I can't seem to get anything from GitHub, even proxied requests via curl, to cache. I started looking into the manual pages of the proxy server TKLDev uses and learned way more than I wanted to...

Anyway, googling away, I stumbled upon this: http://comments.gmane.org/gmane.comp.web.polipo.user/3375

Then it hit me... Duh! You can't cache https. You can proxy it, but since the data is encrypted end to end, there is no way it can be cached. Even if it were, it would be like a man-in-the-middle attack.

So with many, many downloads switching to HTTPS, this scriptable caching idea might also work for any downloads over https. It would work the same way: first check the cache for the file; if it does not exist, download it and then copy it from the cache directory to the target location; if the requested URL already exists in the cache, just do the copy. Could come up with some timestamp file to "refresh" the cache.

So I think my idea is still a good one, and maybe an https download wrapper to cache https content is worth adding too.
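
Something along these lines maybe (just a rough sketch; the cache dir and details are made up, and it still needs a timestamp-based refresh bolted on):

#!/bin/bash
# sketch of an https download cache (name and paths are made up)
CACHE_DIR=/var/cache/fab-https

https-dl() {
    local url=$1 dest=$2
    # hash the URL to get a cache key, so odd characters in URLs don't matter
    local key=$(echo -n "$url" | md5sum | cut -d' ' -f1)
    local cached="$CACHE_DIR/$key"

    mkdir -p "$CACHE_DIR"
    if [ ! -f "$cached" ]; then
        # not cached yet - fetch it over https
        curl -fsSL -o "$cached" "$url" || { rm -f "$cached"; return 1; }
    fi
    cp "$cached" "$dest"
}

# example usage:
# https-dl https://github.com/user/repo/archive/master.zip /tmp/master.zip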

DocCyblade commented 9 years ago

FWIW I added a skeleton fab-https-dl.sh as well. Maybe one day....

JedMeister commented 9 years ago

Ah-ha! Great point! That didn't even occur to me!

Https caching is still quite do-able; however the proxy would need to do the https connection (to the remote server) and the connection between fab and the proxy would probably need to be http only...

Another way to go would be to essentially create a MITM proxy... Have a look at this for an idea...

Using squid instead of polipo may be another option? (A vaguely relevant looking squid tutorial here)

l-arnold commented 9 years ago

I return to the "manual" approach. Couldn't there be a /turnkey/fab/git-cache/ folder where git clones are pulled via normal "git clone" commands?

The build would start with a check to see if there is such a folder with data; if so, it is used; if not, the repo is git-cloned and then used.

One can do that via the shell. The trick then is for TKLDEV to use that data in the build as it does with common.

Just sayin'. I am not thinking about proxies etc., only about the data pull and its availability.

This seems as scriptable as what we are doing, except that it goes outside of the "app - FAB" architecture. It is, however, in the same architecture as the common files, which are also used.

JedMeister commented 9 years ago

@l-arnold that's actually not a bad idea. A bash function that overrides git, so that git pull https://github.com/user/repo.git would do something like this instead:

CACHE=/var/cache/tkldev/$repo
if [ -d "$CACHE" ]; then
    cd "$CACHE"
    git pull origin master
else
    git clone https://github.com/user/repo.git "$CACHE"
fi
cp -a "$CACHE" "$new_location"

(Note it's not a proper script; just the essence of a script to show my thinking).

DocCyblade commented 9 years ago

That's not a bad idea. I like it. Simple, and just that little bit would work for now!

A transparent proxy cache would still be the goal, but that could be a good stopgap.

l-arnold commented 9 years ago

Happy to test this if it speeds up the current process. My goal is to have an acceptable system for Odoo (I personally think it is generally working right now - or at least was last week).
If we can speed up the builds with an implementation like this, let's try it. Like I said, I can TEST!

DocCyblade commented 9 years ago

I just added some code to the Odoo conf script to cache. Works like a treat. The first build takes a long time, as it is downloading the whole repo, but after that it will do a remote update and clone from local.

https://github.com/DocCyblade/tkl-odoo/blob/a4861036b2b29ab935cd277502a547b0d8aeb545/conf.d/main#L65
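
In essence it does something like this (just the gist, not the exact code from the link above; the paths are examples):

CACHE=/usr/src/git-cache/odoo.git            # example cache location

if [ -d "$CACHE" ]; then
    # cache already there - just refresh it from GitHub
    git --git-dir="$CACHE" remote update
else
    # first build - mirror the whole repo (this is the slow part)
    git clone --mirror https://github.com/odoo/odoo.git "$CACHE"
fi

# then clone from the local mirror, which is fast
git clone "$CACHE" /opt/odoo                 # example destination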

l-arnold commented 9 years ago

Nice. Instructions may be needed.

DocCyblade commented 9 years ago

Well, after a few builds... I realized that the build scripts are running in a CHROOT! So this will not work. You need a proxy, something outside of the chroot script. Well, back to the drawing board.

l-arnold commented 9 years ago

There is the TKL GitLab appliance that could conceivably work. It might even be married to TKLDev. But I have not used it.

How about writing to a folder next to common with a git clone command, then figuring out how the common parts are likewise brought in?

I'd say lets get the script working first.

l-arnold commented 9 years ago

GitHub should also let you edit a post from mobile; i.e. many things are not perfect.

DocCyblade commented 9 years ago

@l-arnold the conf scripts run in a CHROOT, so you can't access anything outside the CHROOT to cache the GIT repo.

If only there was some sort of service or proxy on another port, like the web proxy, but specific to GIT repos.
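
For example (purely hypothetical, nothing like this exists in TKLDev today): the host could run git's own daemon over a cache dir, and the conf script inside the chroot could clone from it over loopback, the same way it already reaches the web proxy:

# on the TKLDev host (outside the chroot)
mkdir -p /var/cache/git-cache
git clone --mirror https://github.com/odoo/odoo.git /var/cache/git-cache/odoo.git
git daemon --reuseaddr --export-all --base-path=/var/cache/git-cache --detach

# inside the chroot, the conf script then does a fast local clone
# (assumes the chroot shares the host's network, as it does for the web proxy)
git clone git://127.0.0.1/odoo.git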

Using part of the GitLab code for delivery of the GIT repos could work; the issue with large repos like Odoo is that it could time out before the whole repo is downloaded.

A proxy that supports SSL may be the best option. Just thinking.

JedMeister commented 9 years ago

Bugger. TBH that never occurred to me, but now that you mention it, it's obvious... So we'll need to go in a different direction. I think that MITM https proxy might be the go...

DocCyblade commented 9 years ago

@JedMeister - I was trying to figure out a way to cache git clones specifically, however caching all https traffic via a proxy port would be ideal. Is there a way to have a proxy server fetch content over https, then deliver the content via http internally, basically translating it from one protocol to the other? You could then double proxy, so that the first proxy requests the URL from the second proxy; the second proxy fetches the content over HTTPS, decrypts it and serves it back over HTTP, which the first proxy can then cache.

The other idea, really just for git repos, would be to have a fab-gitclone that would create a cache (like the code I created before), since we can't do it via the conf script as that is run in a CHROOT. This would need to be put in another file, config.git.d. This would require retooling the fab process/tools and looks like a large undertaking. However, since a lot of software these days seems to use GitHub for installation, it may be of value.

Another idea: add an additional Makefile that would use code to cache the repo. I'm not sure how to make this Makefile get executed in the correct order so that it runs before the conf scripts. I did some poking around the fab/make files and it looks like this would be as hard as the fab-gitclone idea.

Bottom line, it looks like whatever is implemented will need to be baked into the tool chain at some point, and would need the blessing of the core devs. I'd be happy to submit some code, I just want to know what direction this would be implemented in. Maybe your next chat with lirazsiri and/or alonswartz could produce some direction and I could start creating/testing some code.

JedMeister commented 9 years ago

IIRC I posted a link somewhere on using squid as a MITM https proxy. I didn't read very far, but the general gist of it was that traffic is https both between the program and the proxy, and between the proxy and the website.

So for a slightly more detailed answer: all DNS queries redirect to localhost where the proxy is waiting. The proxy (falsely) identifies itself as the requested website using a(n invalid) self-signed local cert. It then forwards the request to the remote site (using https) and downloads (and caches) the content. As it can decrypt the data both ways (it has the keys), caching is no longer a problem...

Something somewhat in line with your other suggestion would be doable for git. To work around the chroot factor, you could mount a cache dir within the container and always use that dir. You would also want a function (or temporary alias for git) which would do a git pull if the dir already exists (rather than failing as git does now). You would then need to copy (rsync?) the git repo to where you want it afterwards...
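
i.e. something roughly like this (untested sketch; $CHROOT_PATH, the cache path and the function name are all made up):

# on the host, before the build: bind-mount a persistent cache dir into the chroot
mkdir -p /var/cache/tkldev-git $CHROOT_PATH/var/cache/tkldev-git
mount --bind /var/cache/tkldev-git $CHROOT_PATH/var/cache/tkldev-git

# inside the chroot (e.g. in the conf script):
git-cached() {
    local url=$1 dest=$2
    local cache=/var/cache/tkldev-git/$(basename "$url" .git)

    if [ -d "$cache" ]; then
        git -C "$cache" pull              # cache hit - just update it
    else
        git clone "$url" "$cache"         # first time - full clone into the cache
    fi
    rsync -a "$cache/" "$dest/"           # copy the repo to where the build wants it
}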

DocCyblade commented 9 years ago

If you think that's the way we should go, I'll dive in and see if I can tweak the TKLDev build to use that. I expect this would replace the current one.

JedMeister commented 9 years ago

Despite the fact that part of me doesn't like it (because it could so easily be abused), I think that the MITM proxy would be the best way and most similar to how it's currently done...

DocCyblade commented 9 years ago

Understandable. It solves https downloads while at the same time not breaking anything already written. I'll give it a spin. It will be a branch off my fork of the TKLDev build.

DocCyblade commented 9 years ago

This issue should probably be closed as a duplicate, as #467 really would solve this and https traffic as well.

JedMeister commented 9 years ago

Yep; let's close.