It turns out it's because I'm using eCryptfs, which has a limit of 143 characters: http://stackoverflow.com/questions/6571435/limit-on-file-name-length-in-bash
The only solution I can think of is to shorten the file names. Maybe don't store the entire URL in the filename?
Or add exception handling for this error and just skip caching for this case.
Let me know which you prefer, and I will write a patch for either.
I thought of a better solution.
Instead of storing the full url as the filename, hash it.
I'm thinking of a structure similar to how git stores its object files:
<first 2 characters of the SHA>/<rest of the SHA>/(<filename>.tar.gz|content-type|url)
The hash is of the file's full URL. I forget why git uses separate directories for the beginning of the SHAs, but I'm sure there is a good reason.
Finding matching files for a particular host would be easy because then you can just do:
find ~/.pip/cache/ -name url | grep pypi.python.org
This also has the advantage that you don't have to encode the url.
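A minimal sketch of how a cache path could be derived under this scheme (assuming sha1 as proposed; `cache_dir` and `cache_root` are illustrative names, not from an actual patch):

```python
import hashlib
import os

def cache_dir(cache_root, url):
    # Hash the full download URL; the URL itself needs no encoding because
    # it only ever appears inside the `url` file, never as a path component.
    sha = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # git-style fan-out: first two hex characters, then the remainder.
    return os.path.join(cache_root, sha[:2], sha[2:])
```

The resulting directory would then hold `<filename>.tar.gz`, `content-type`, and `url`.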
For backwards compatibility, you can just check for files in the old format and then convert them to the new format without having to redownload any files.
I'd be happy to build this, but I would like some approval of the design from a maintainer before starting. It's terrible spending a ton of time writing and testing a patch just to have it ignored or rejected.
Let me know.
> This also has the advantage that you don't have to encode the url.
Because of?
The reason for the prefix is to limit the number of directories/files in a single directory.
Some bikeshedding here: I'd prefer the path to be more broken up, something like...
a/b/d/3/7/abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5/<filename>
That is using sha224 (these URLs might be coming from different locations, so a collision attack may be plausible; using sha224 should make that harder). It also uses a five-directory-deep prefix instead of a single one. I don't have a particular reason for that except that I prefer it. It also includes the full hash in the final directory instead of just the rest of it.
Once you locate the filename it should verify that the url file associated with it matches the url we are looking for, and should treat a failure as a cache miss.
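A minimal sketch of the lookup under these assumptions (sha224, five one-character prefix directories, the full hash as the leaf directory; names are illustrative):

```python
import hashlib
import os

def cached_dir(cache_root, url):
    # Five one-character prefix directories, then the full hash as the leaf.
    sha = hashlib.sha224(url.encode("utf-8")).hexdigest()
    d = os.path.join(cache_root, sha[0], sha[1], sha[2], sha[3], sha[4], sha)
    try:
        with open(os.path.join(d, "url")) as f:
            stored_url = f.read().strip()
    except IOError:  # nothing cached under this hash
        return None
    # A mismatched url file (collision or corrupt entry) is a cache miss.
    return d if stored_url == url else None
```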
This should also solve the same issue for #1287
Because the url is kept in a file named `url`. Here is a specific example:
The URL http://pypi.internal/packages/backports.ssl_match_hostname/download/1855/backports.ssl_match_hostname-3.4.0.2.tar.gz hashes to c5a89d648fb312c4988d3cd7600434e1895cfc48.
This creates the following directories and files:
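```
c5/a89d648fb312c4988d3cd7600434e1895cfc48/backports.ssl_match_hostname-3.4.0.2.tar.gz
c5/a89d648fb312c4988d3cd7600434e1895cfc48/content-type
c5/a89d648fb312c4988d3cd7600434e1895cfc48/url
```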
@dstufft I see no reason to use a stronger hash function, as there are no known attacks on sha1 (only a theoretical attack that is unproven), and even the attack on MD5 requires a large piece of data to be effective. URLs are short compared to the size requirements for an MD5 attack. Also, the target controls the URLs, so the attack space isn't very large.
I think git made the right choice in balancing number of directories and hash length, I would prefer to defer to their expertise.
Sorry, but you need to argue a reason why using a weaker hash is more appropriate. The default in any software I'm willing to accept should be the strongest available. In this case sha1, sha224, sha256, sha384, and sha512 have 40-, 56-, 64-, 96-, and 128-character hex digests respectively. There are two filesystems where the difference will matter, FATX and MINIX V3 FS, and MINIX V3 FS will function perfectly fine with sha224 too.
So there's no technical reason, afaict, to prefer the weaker hash beyond cargo-culting what git has done.
As far as two letter prefix vs my scheme, that's just a style thing, I find the multiple nested and a full hash at the end to be nicer to work with in general.
sha1 is Python native. I shouldn't need more reason than that and the reasons I said earlier.
You're not building a crypto library here. Keep your requirements simple.
> just cargo-culting what git has done
Please don't insult me.
sha-2 is native to Python as well, via the hashlib module. There's literally no more complexity from using something in the sha-2 family over using sha1.
Ah, I thought hashlib wasn't in the standard library. Fine, sha224 it is.
Also your reasons only address why using sha1 isn't inherently broken, they don't provide any reasoning as to why this should use sha1 over something in the sha-2 family.
Ah cool :) That explains it then :)
I wonder if we ought to use a hardcoded name for the filename inside of the package directory too... as of right now someone could have a filename longer than 143 (there's no limit in PyPI etc). Although I've never seen anyone ever have a problem with that so it probably doesn't matter.
Oh, in case it wasn't obvious: bikeshedding aside, I would absolutely accept this, and I doubt any of the other maintainers would object.
I thought about that, but decided to leave the full file name to make it easier to search and navigate. I've never seen a 143-character library name, and that would be perverse and silly.
Yea I'm happy punting on that.
As for putting the full sha as the subdirectory, that's just duplicating information. Once the hash is generated there is only one place it could possibly be, and the Python code for it is:
os.path.join(cache_path, hash[:2], hash[2:])
If you want a compromise between path length and hash family, imo
hash = hashlib.sha256(url).hexdigest()[:32]
would work fine.
MD5 would work fine too. I'm really not worried about collisions or security here as they're not really within the scope of this use case.
That being said, it's taboo to crop a hashed result, so I will just use the full hash. 64 characters should be fine for all known systems.
> it's taboo to crop a hashed result
Would love to know who says so.
I'll leave that as an exercise for you to learn more about cryptographic hashing.
@joehillen as a matter of fact, that's absolute bollocks. Truncation coincidentally stops length-extension attacks on plain Merkle–Damgård constructions like sha1 and sha256, and truncated variants (with different initial values) are exactly how sha224 and sha384 are computed.
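For what it's worth, because the truncated variants use different initial values, a cropped sha256 digest does not literally reproduce sha224; a quick check (the message is illustrative):

```python
import hashlib

msg = b"http://example.invalid/some-package.tar.gz"
# sha224 is a truncated sha256 variant with different initial values, so the
# two 56-character strings below differ even though their lengths match.
print(hashlib.sha224(msg).hexdigest())       # 56 hex characters
print(hashlib.sha256(msg).hexdigest()[:56])  # also 56, but a different value
```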
k
I recently hit this on OS X, which has a 255-character filename limit. I'm also using an internal PyPI proxy, which results in pip adding a long '?remote=...target_url...' string to the filename.
Hashing the cache directory structure sounds like an ideal fix.
Yeah, I want to work on this, but I've been far too busy the last month. I'm hoping I will get some downtime soon to work on this.
What is the status of this Issue?
I want to add that if you install Ubuntu or a Linux distribution that encrypts your home directory, the filename restriction comes up.
@mariocesar see @dstufft's PR #1748 above; it should solve these when pulled.
Got this error while using pip caching. I had to turn off caching for the install to finish successfully. This is on 1.5.4, Ubuntu 12.04 64-bit.
What's weird is the string it's failing on is only 166 characters.