pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

IOError: File name too long #1634

Closed joehillen closed 10 years ago

joehillen commented 10 years ago

Got this error while using pip caching; I had to turn off caching for the install to finish successfully. This is pip 1.5.4 on Ubuntu 12.04 64-bit.

Downloading/unpacking backports.ssl-match-hostname (from tornado->-r requirements/standard.txt (line 15))
  Downloading backports.ssl_match_hostname-3.4.0.2.tar.gz
  Storing download in cache at /home/joe/.pip/cache/http%3A%2F%2Fpypi.internal%2Fpackages%2Fbackports.ssl_match_hostname%2Fdownload%2F1855%2Fbackports.ssl_match_hostname-3.4.0.2.tar.gz
Cleaning up...
Exception:
Traceback (most recent call last):
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/req.py", line 1197, in prepare_files
    do_download,
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/req.py", line 1375, in unpack_url
    self.session,
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/download.py", line 586, in unpack_http_url
    cache_download(cache_file, temp_location, content_type)
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/util.py", line 609, in cache_download
    fp = open(target_file+'.content-type', 'w')
IOError: [Errno 36] File name too long: '/home/joe/.pip/cache/http%3A%2F%2Fpypi.internal%2Fpackages%2Fbackports.ssl_match_hostname%2Fdownload%2F1855%2Fbackports.ssl_match_hostname-3.4.0.2.tar.gz.content-type'

What's weird is that the string it's failing on is only 166 characters.

joehillen commented 10 years ago

It turns out it's because I'm using eCryptfs, which has a 143-character filename limit: http://stackoverflow.com/questions/6571435/limit-on-file-name-length-in-bash

The only solution I can think of is to shorten the file names. Maybe don't store the entire URL in the filename?

Or add exception handling for this error and just skip caching in that case.
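
A minimal sketch of that second option, assuming a wrapper around a caching call like pip's cache_download from the traceback above (the wrapper name and return convention are illustrative, not pip's actual API):

    import errno

    def call_skipping_long_names(cache_fn, *args):
        # Wrap a caching call (e.g. pip's cache_download from the traceback)
        # so that an over-long filename -- eCryptfs caps names at 143
        # characters -- skips caching instead of aborting the whole install.
        try:
            cache_fn(*args)
        except (IOError, OSError) as exc:
            if exc.errno == errno.ENAMETOOLONG:
                return False  # filename too long; caching skipped
            raise
        return True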

Let me know which you prefer, and I will write a patch for either.

joehillen commented 10 years ago

I thought of a better solution.

Instead of storing the full url as the filename, hash it.

I'm thinking of doing a structure similar to how git stores its object files:

<first 2 characters of the SHA>/<rest of the SHA>/(<filename>.tar.gz|content-type|url)

The hash is of the file's full URL. I forget why git uses separate directories for the beginning of the SHAs, but I'm sure there is a good reason.

Finding matching files for a particular host would be easy because then you can just do:

find ~/.pip/cache/ -name url | grep pypi.python.org

This also has the advantage that you don't have to encode the url.
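
A rough sketch of this layout (cache_root and the helper name are illustrative, not actual pip code):

    import hashlib
    import os

    def cache_paths(cache_root, url):
        # Hash the full download URL and split it git-style: the first two
        # hex characters become a directory, the remainder a subdirectory.
        digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
        bucket = os.path.join(cache_root, digest[:2], digest[2:])
        filename = url.rsplit('/', 1)[-1]
        return {
            'archive': os.path.join(bucket, filename),  # the download itself
            'content-type': os.path.join(bucket, 'content-type'),
            'url': os.path.join(bucket, 'url'),  # original, unencoded URL
        }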

For backwards compatibility, you can just check for files in the old format and then convert them to the new format without having to redownload any files.

I'd be happy to build this, but I would like some approval of the design from a maintainer before starting. It's terrible spending a ton of time writing and testing a patch just to have it ignored or rejected.

Let me know.

piotr-dobrogost commented 10 years ago

This also has the advantage that you don't have to encode the url.

Because of?

dstufft commented 10 years ago

The reason for the prefix is to limit the number of directories/files in a single directory.

Some bikeshedding here: I'd prefer the path to be more broken up, something like...

a/b/d/3/7/abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5/<filename>

That is using sha224 (these URLs might be coming from different locations, so a collision attack may be plausible; using sha224 should make that harder). It also uses a five-directory-deep prefix instead of a single one. I don't have a particular reason for that except that I prefer it. It also includes the full hash in the final directory instead of just the rest of it.

Once you locate the filename, it should be verified that the url file associated with it matches the URL we are looking for, treating a mismatch as a cache miss.
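
A sketch of this variant, again with illustrative names; the url-file check implements the cache-miss behaviour described above:

    import hashlib
    import os

    def cache_dir(cache_root, url):
        # sha224 of the full URL; the first five hex characters become five
        # nested directories, with the complete digest as the final directory.
        digest = hashlib.sha224(url.encode('utf-8')).hexdigest()
        return os.path.join(cache_root, *(list(digest[:5]) + [digest]))

    def find_cached(cache_root, url, filename):
        # Verify the stored url file matches the URL we are looking for;
        # any mismatch or missing file is treated as a cache miss (None).
        directory = cache_dir(cache_root, url)
        try:
            with open(os.path.join(directory, 'url')) as fp:
                if fp.read().strip() != url:
                    return None
        except IOError:
            return None
        return os.path.join(directory, filename)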

This should also solve the same issue for #1287

joehillen commented 10 years ago

Because the url is kept in a file named url. Here is a specific example:

The url http://pypi.internal/packages/backports.ssl_match_hostname/download/1855/backports.ssl_match_hostname-3.4.0.2.tar.gz hashes to c5a89d648fb312c4988d3cd7600434e1895cfc48.

This creates the following directories and files (per the <first 2>/<rest> scheme above):

~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48/backports.ssl_match_hostname-3.4.0.2.tar.gz
~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48/content-type
~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48/url

joehillen commented 10 years ago

@dstufft I see no reason to use a stronger hash function, as there are no known attacks on sha1 (only a theoretical, unproven attack), and even the attack on MD5 requires a large piece of data to be effective. URLs are short compared to the size requirements for an MD5 attack. Also, the target controls the URLs, so the attack space isn't very large.

I think git made the right choice in balancing number of directories and hash length, I would prefer to defer to their expertise.

dstufft commented 10 years ago

Sorry, but you need to argue a reason why using a weaker hash is more appropriate. The default in any software I'm willing to accept should be the strongest available. In this case sha1, sha224, sha256, sha384, and sha512 have 40-, 56-, 64-, 96-, and 128-character hex digests respectively. There are only two filesystems where the difference would matter, FATX and MINIX V3 FS, and MINIX V3 FS will function perfectly fine with sha224 too.

So there's no technical reason, afaict, to prefer the weaker hash beyond cargo-culting what git has done.

As for the two-letter prefix vs my scheme, that's just a style thing; I find the multiple nested directories and the full hash at the end nicer to work with in general.

joehillen commented 10 years ago

sha1 is Python-native. I shouldn't need more reason than that, plus the reasons I gave earlier.

You're not building a crypto library here. Keep your requirements simple.

just cargo culting what git has done

Please don't insult me.

dstufft commented 10 years ago

sha-2 is native to Python as well, via the hashlib module. There's literally no more complexity in using something from the sha-2 family over sha1.
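
For example, both families ship in the same stdlib module with the same interface:

    import hashlib

    url = 'https://pypi.python.org/simple/'  # any URL string
    print(hashlib.sha1(url.encode('utf-8')).hexdigest())    # 40 hex characters
    print(hashlib.sha224(url.encode('utf-8')).hexdigest())  # 56 hex characters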

joehillen commented 10 years ago

Ah, I thought hashlib wasn't in the standard library. Fine, sha224 it is.

dstufft commented 10 years ago

Also, your reasons only address why using sha1 isn't inherently broken; they don't provide any reasoning as to why this should use sha1 over something in the sha-2 family.

dstufft commented 10 years ago

Ah cool :) That explains it then :)

dstufft commented 10 years ago

I wonder if we ought to use a hardcoded name for the file inside of the package directory too... as of right now someone could have a filename longer than 143 characters (there's no limit in PyPI etc.). Although I've never seen anyone ever have a problem with that, so it probably doesn't matter.

dstufft commented 10 years ago

Oh, in case it wasn't obvious: bikeshedding aside, I would absolutely accept this, and I doubt any of the other maintainers would object.

joehillen commented 10 years ago

I thought about that, but decided to keep the full file name to make it easier to search and navigate. I've never seen a 143-character library name; that would be perverse and silly.

dstufft commented 10 years ago

Yeah, I'm happy punting on that.

joehillen commented 10 years ago

As for putting the full sha as the subdirectory, that's just duplicating information. Once the hash is generated there is only one place it could possibly be, and the Python code for it is:

os.path.join(cache_path, hash[:2], hash[2:])

Ivoz commented 10 years ago

If you want a compromise between path length and hash family, imo hash = hashlib.sha256(url).hexdigest()[:32] would work fine.

joehillen commented 10 years ago

MD5 would work fine too. I'm really not worried about collisions or security here as they're not really within the scope of this use case.

That being said, it's taboo to crop a hashed result, so I will just use the full hash. 64 characters should be fine for all known systems.

Ivoz commented 10 years ago

it's taboo to crop a hashed result

Would love to know who says so.

joehillen commented 10 years ago

I'll leave that as an exercise for you to learn more about cryptographic hashing.

Ivoz commented 10 years ago

@joehillen as a matter of fact, that's absolute bollocks. Truncation coincidentally stops length-extension attacks on plain Merkle-Damgård constructions like sha1 and sha256, and is exactly how sha224 and sha384 are computed.

joehillen commented 10 years ago

k

cmclaughlin commented 10 years ago

I recently hit this on OS X, which has a 255-character filename limit. I'm also using an internal PyPI proxy, which results in pip adding a long '?remote=...target_url...' string to the filename.

Hashing the cache directory structure sounds like an ideal fix.

joehillen commented 10 years ago

Yeah, I want to work on this, but I've been far too busy the last month. I'm hoping I will get some downtime soon to work on this.

mariocesar commented 10 years ago

What is the status of this Issue?

I want to add that if you install Ubuntu or another Linux distribution that encrypts your home directory, this filename restriction applies.

Ivoz commented 10 years ago

@mariocesar see dstufft's PR #1748 above; it should solve these issues when merged.