volojs / volo

Create front end projects from templates, add dependencies, and automate the resulting projects
https://volojs.github.io/
Other
1.41k stars 99 forks source link

Fragment locations via # in private repos have 1MB limit #156

Open plumlee opened 11 years ago

plumlee commented 11 years ago

Our team has several private Github repos that follow this pattern:

dist/
    someZipFile.zip

It appears that volo and/or the GitHub API don't allow a deeplink into the project to get that zip file, even when it's listed as the volo url in package.json. No matter what syntax I try on both the consuming app (where I run volo add) and the linked app (where I set the volo url property), it always pulls in the entire repo contents.

However, I did find your comments on using a volofile with onAdd to run some actions after install. I was able to use that to clean up all extra files that I didn't want, effectively mimicing the direct download.

If this is something that is a known issue and you are ok with documenting this as workaround, I'll do a fork and pull request to update the docs.

jrburke commented 11 years ago

Did the volo add owner/repo#dist/someZipFile.zip work for you, or did that fail in some way?

I did a fix related to this for the latest 0.2.10 release, so I also suggest making sure you have the latest volo to try that command (type volo to get the installed version).

plumlee commented 11 years ago

When I run the add command, it pulls in the entire contents of the repo - so the app.js, the package.json, the lib files, everything. I'll check that I have the latest version and report back,

plumlee commented 11 years ago

Yes, I'm running 0.2.10. This only appears to be an issue with our private repos - with a public repo, it does exactly as you'd expect. Is there a simple check to verify that the download URL for the zip file is resolving to what I'd expect, or would a network trace be the best way? Or a debug flag that I could turn on to verify what URL is being returned from github for the download?

jrburke commented 11 years ago

volo should print out what zip file will be used. I just tried this, and I see volo saying:

Downloading: https://codeload.github.com/jrburke/test-private-create/legacy.zip/master?sso=....

with a long strings of characters after that sso part. I was able to successfully install something I put in that private repo as dist/test.zip. So I did volo add jrburke/test-private-create#dist/test.zip It fetched the zip file. After volo ran, I then did a manual unzip and it contained the files I expected.

In your case, is it just a private repo on github.com or is this something served from a github enterprise installation? Mine was just on regular github.com.

plumlee commented 11 years ago

Looks like I've had a fundamental misunderstanding of what volo is doing when I give it a repo name. I expected that by setting the volo.url property in package.json, that URL would be returned and used when doing a volo search or volo add. Now that I see that is not used at all, I'm unclear as to what that property is there for, unless it's just metadata?

So there is no bug here, it's just been my misunderstanding of what is happening. I can point directly to github:Org/repo#path/to/file.zip and have that pulled correctly. I had assumed that by setting the volo.url property in package.json, I was avoiding the need to tell consumers of our repo they needed that direct url. Is it possible to have search check for this property, so that we can make use of this metadata in this way?

(edit - these are on github.com).

plumlee commented 11 years ago

Now further confused - we maintain a public github repo (fuelux) that does exactly what I'm trying to do - specifies volo.url in the package.json that points directly to a specific zip file inside the repo. When I run volo add fuelux, it grabs that specific file. So it appears this is being read somewhere?

The package.json volo.url directive that works: https://github.com/ExactTarget/fuelux/blob/master/package.json

When I run volo add fuelux, I get exactly what I expect, the zip file in dist/.

But trying the private repo we have, it can't grab the file specified in package.json unless I give it the exact file to get (volo add private-repo#dist/file.zip)

I'm assuming it's because of it being a private repo, but is there something I'm missing?

plumlee commented 11 years ago

Dug around some more and I believe I now know why private repos are being treated differently.

See here: https://github.com/volojs/volo/blob/master/lib/resolve/github.js#L83 and https://github.com/volojs/volo/blob/master/lib/config.js#L35

The check for a package.json file is being done against the raw github URL, which will not properly resolve for a private repository. This drops the fetchBrowserPackageJson function into the error handler, where it's silently ignored. For a public repository, this will resolve, and the package.json file is fetched, parsed, and the specific URL is correctly used for downloading the dependency.

I'm not as familiar with the code base as I would like to be (or with the Github API) but it seems like it should be possible to fallback to a second check that would attempt a direct download the package.json file if the raw attempt failed. I can't tell if the raw.github urls are only allowed for private repos if you pass the username and pw in the query parameters - a quick test with the token volo was using still failed.

Not sure where something like this falls on the priority list, but will do a pull request to give an error message indicating that the volo.url property cannot be used in this case. Would be happy to contribute some more code along the lines of a fallback attempt if you have any info about Github API and authentication that might apply here.

jrburke commented 11 years ago

Thanks for digging into this! I am going on a trip and will not have good turn-around on communication for about a week or so, but I am hoping to do some volo work at some point on the trip, and will be looking into this more, and your pull request.

plumlee commented 11 years ago

I did some preliminary work on this, and I'm going to do a pull request so you can see it. However, the only part that works is getting the contents of package.json using the API. Where things will fail is on getting the actual specified file, since the API only allows files up to 1 MB. What may be preferred is to use the first part of what I did but then download the entire repo, and use volo commands to delete all but the file specified in the volo.url property.

Thanks for your work creating volo, and appreciate your time on this issue. Doing pull request now.

jrburke commented 11 years ago

So following up with from our in person conversation:

Private repos have a 1MB direct file access limit. So, what we might need instead:

Do the normal work we do now. If the path fragment is for a file that is bigger than 1MB, then download the full zip of the repo, then extract the file that way.

If I mis-remembered the conversation, please let me know. Since I think we still have some discussion on this to do, and I believe you wanted to try the pull request, I am going to push this to a 0.3.1 release, as there are some fixes I want to get out more immediately for 0.3.0. We can do a follow-on 0.3.1 release fairly quickly if we work out the details in the near future.

plumlee commented 11 years ago

Not a problem at all. I didn't get to do any coding on this yet, and was going to reply today. I would stil like to do the pull request.

I think there was a slight difference in the flow that we talked about:

How does that sound?

jrburke commented 11 years ago
  • If the attempt to download the package.json file of the repo fails fails, then try again using the authenticated API (doubtful that the package.json file would be greater than 1 MB).

It should be doing this already: https://github.com/volojs/volo/blob/master/lib/resolve/github.js#L28

  • If that works, and if there is a volo.url present, then set a flag that that we need to clean up to only leave that single file. If it doesn't work, log that via console to inform the user and move on with no workflow changes.

First, we should confirm we need to do this special path by checking the size of the file. Right now the code does a HEAD request to confirm the file really exists at the location around here: https://github.com/volojs/volo/blob/master/lib/resolve/github.js#L224

I am hopeful that we just need to also check a Content-Length to know if we need the special path. Although, looking at that code segment right now, it assumes that there is either a fragment (a #something.zip) or a volo.url, but not the possibility that the volo.url is for a fragment in the repo.

So perhaps do some normalization around here: https://github.com/volojs/volo/blob/master/lib/resolve/github.js#L215

that inspects the volo.url to see if it really is just a fragment inside the repo? Hmm, I am also assuming the use case we are trying to fix is "volo.url for a fragment in the current repo". If I missed something with that definition of the use case, give a holler.

So after determining if the file will be too big, I think it may be enough then to modify the results of the resolve operation to url: 'zip of repo', isArchive: true and set the fragment.

Then hopefully the rest sorts itself out? I'm going on incomplete memory of the use case though, and unfortunately will be relying more on you to remind me than to do the research myself.

  • If after the full zip download we also find a volofile ignore the volofile and just log that as well. Proceed as if we did not find the volofile in the repo.

I wonder if it should just keep doing the volofile work even in this case. In particular, I think the only thing it will do is look for an "onAdd" command in the volofile. That still may be useful in this case too?

  • Finally use volo commands to remove all files and directories except for the one specified in the volo.url property we read earlier.

Hopefully if we just readjust the results from the resolve module, this will fall out without any further changes.

plumlee commented 11 years ago

For the first point about making the authenticated call, the code at https://github.com/volojs/volo/blob/master/lib/resolve/github.js#L86 does use authentication, but it's going against the github raw URL, not the API url. That's why it's always failing for private repos. So a change to the API URL should make this work in most cases. Locally I got up to 40K lines in a JSON test file before hitting 1 MB file size. So a documentation note with a link to the Github API spec would be enough for that edge case if you agree.

I think you are correct in identifying the case here - it's really that we want a fragment, but the volo.url isn't in that form. Also think you're correct in saying that we should normalize to make that the case - this give more flexibility and allows the code to follow the correct path vs the hackish one I was having to do.

Since from experience I know that we can download the entire repo already with no issues, it should only be the initial call to get package.json that needs to change to be against the API. Later calls should function correctly as they do now.

And it means less changes if we do as you suggest - just change to a fragment if that's what we really have and the rest should sort itself out. I do see where you already check for the onAdd command and run that if needed

So I think the steps would be as below:

This is much simpler than what I had planned. Thanks for putting the time into the discussion. If this sounds good, I'll work getting it done.

plumlee commented 11 years ago

Did some work on this, but going to walk back on the plan. What this really boils down to is that Github doesn't expose a raw URL for private repos in the same manner it does for public repos. So I can't put a volo.url in place for a private repo that actually works, because there isn't any such thing - the only thing that appears to work is using the zipball/tarball API url to get the entire repo, and then cleaning up afterwards with a volofile.

I'm digging around more to try and find out if there is any sort of direct URL for downloading individual files from a private repo. A few tests show that cookie based authentication works using a url like: https://github.com/org/repo/blob/master/dist/file.zip?raw=true but not oAuth tokens. So GH effectively limits the download to web-based UAs only.

Best short term plan might be to update docs to indicate volo won't work correctly for private URLs because of the way GH treats them, but note that using a volofile will work because volo falls back to pulling the zipball/tarball, and from there you can clean up.

I'm looking at whether the GH API for downloads might be a workaround here - if you can create a download for a repo, then perhaps that file can be specified for the volo URL. That would let me point to a specific file in the exact same way a public URL works.

plumlee commented 11 years ago

Just found this out - you can get a direct download for any file using this syntax: http(s)://[token]@[hostname]/[owner]/[repo]/raw/[branch]/[path_to_file]

Tested via curl with the token from volo and it works. So private repos can specify the normal format and volo should be able to work successfully. Only change needed would be the authenticated call to get package.json as well as the URL for any of the HEAD requests. Since the authenticated call to package.json returns a field indicating private, this can be the flag to change subsequent URLs to this format.

Will take another crack at this soon.

plumlee commented 11 years ago

Now it appears that I misunderstood the use of this sytax. It may only work for actual git operations, not pure http downloads. If that is verified, then I'm going to close this entire issue and just work up a grunt plugin that will create the volofile for me, so that I can automate removing any extra directories.

plumlee commented 11 years ago

I got back confirmation from GH support that only git operations over https with this format using oauth tokens are supported. The last support person did recommend using the blobs API - http://developer.github.com/v3/git/blobs/. For the moment, I'm going to pursue a workaround route of a grunt task to update a volofile which will remove all contents but the directory I want. I'll revisit this again as time permits.

jrburke commented 11 years ago

Thanks for diving into this more. I wonder, to get around the issue of not being able to construct a private URL, what if volo.url allowed relative URLs? So, if volo.url = 'dist/something.zip', then it would resolve it relative to whatever volo calculates as the repo URL, and if a private repo, then just download the whole repo zip and extract that file (treat it like a fragment).

Sounds like you have a workaround for now, but when I get back to a volo dev cycle, I would like to be able to sort it out.

plumlee commented 11 years ago

That might be the best method. Since there is no real URL that is a valid URL for private repos. Would still need the change to the initial call to get package.json so that it uses the API to pull that specific file, but then you'd also be able to check the repo for being private at that point. So you could support relative URLs either way, but if the private flag is on, you'd probably be able to skip some steps since the only path is a full zip download.

I will probably do a grunt workaround for now to generate a volofile and keep my temporary workaround in place, but will try to get back to this again as well.