ndmitchell / hoogle

Haskell API search engine
http://hoogle.haskell.org/

Offline bootstrap support #76

Open iustin opened 10 years ago

iustin commented 10 years ago

Hi from the Debian Haskell Packaging group!

In Debian, we'd like to ship Hoogle in such a way that it doesn't use online resources: the package should ship with enough data such that it's possible to generate the local databases "offline" (from bootstrap + local packages).

This used to work via some local hacks (see http://anonscm.debian.org/cgi-bin/darcsweb.cgi?r=pkg-haskell/haskell-hoogle;a=headblob;f=/files_hoogle/update-hoogle) until recently, but (if I read the changes correctly) issue #47 changed the way it looks for local data.

Fixing our script is doable, but I'm worried that things might break again, and silently (like this time; we only found out by accident that newer hoogle downloads data from the internet), so I'm wondering what your thoughts are on improving hoogle's bootstrap mode.

A couple of potential improvements that come to mind:

Thanks in advance for your feedback!

ndmitchell commented 10 years ago

Hmm, sorry I inadvertently made that harder. A --no-download flag seems worthwhile. Is it reasonable to have some flag to get hoogle to do the bootstrap, that way I can maintain the "thing that downloads everything" script? Note that I usually develop Hoogle on a train with no internet access, which means I locally hack my copy so I download everything in advance and then use it. It sounds like you just want that feature done properly.

nomeata commented 10 years ago

I’m a bit out of the loop, but what exactly do you need to download?

Assume you have a directory of hoogle.txt files of all relevant packages. Is that sufficient information?

ndmitchell commented 10 years ago

The list of files is here: https://github.com/ndmitchell/hoogle/blob/master/src/Recipe/All.hs#L178 - so basically all the .txt files, all the .cabal files for them, plus the keywords file and Platform cabal description.

nomeata commented 10 years ago

Ok, let’s see

Depending on what you need them for, we either skip them in Debian, or we have to start adding them to the -doc packages, alongside the .txt files.

ndmitchell commented 10 years ago

I use the platform Cabal file to make a package named "platform" which is searched by default. I suspect the Debian people might want to search things the user has installed by default. I use the .cabal files to find what depends on what, so I can link up alias information properly. Having the .cabal file in the -doc package seems reasonable, since it does contain lots of documentation about the package.

iustin commented 10 years ago

Aside from the actual list of files needed (Joachim knows the situation better there), I want to say that having the --no-download flag would be very useful, and if you do clean up the local feature, that is also good.

@nomeata: we not only can include the keywords file, we already do - so yes, I do hope the license is good! I don't know what's the refresh policy for that though, something to keep in mind (and add to instructions).

nomeata commented 10 years ago

> @nomeata: we not only can include the keywords file, we already do - so yes, I do hope the license is good! I don't know what's the refresh policy for that though, something to keep in mind (and add to instructions).

The simplest policy would be for upstream to include the file in the tarball as well (with the possibility for the user to update it, as is the case now), and we just update it along with new upstream versions of hoogle :-)

ndmitchell commented 10 years ago

@nomeata I'm happy to include the keywords file - that seems reasonable.

nomeata commented 10 years ago

The automatic downloading of untrusted files was just reported as a critical security bug in the Debian bugtracker: http://bugs.debian.org/756334

What is the timeline for a version of hoogle with --no-download? I’d like to avoid having to patch the package in Debian in the meantime.

Also, the worries (unpacking untrusted data) apply to your users as well. Maybe you want to take precautions (signatures or such, or at least warnings to the user)?

ndmitchell commented 10 years ago

The worst that could happen is someone maliciously replacing the Hoogle search results, and that seems unlikely, so I am not too worried. The --no-download is on my radar, but not fantastically close. Maybe a month?

iustin commented 10 years ago

Unless I misunderstand what the option should do (just prevent downloads if a file is missing) I'll send a patch later this week.
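For illustration, a minimal Python sketch of that reading of the option: a missing file triggers a download only when downloads are allowed. This is not Hoogle's code (Hoogle is written in Haskell); `ensure_file`, `allow_download` and the injected `fetch` callback are hypothetical names.

```python
import os


def ensure_file(path, url, allow_download, fetch):
    """Return path if it exists; otherwise download it only when allowed."""
    if os.path.exists(path):
        return path  # already present locally: never touch the network
    if not allow_download:
        # Assumed --no-download behaviour: a missing file is an error, not a fetch.
        raise FileNotFoundError(f"{path} is missing and downloads are disabled ({url})")
    fetch(url, path)  # the caller supplies the actual download function
    return path
```

Passing the downloader in as `fetch` keeps the sketch free of network code and makes the "no download happened" case easy to check.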

ndmitchell commented 10 years ago

As long as the option doesn't do anything harmful if not passed, I'll happily accept anything that works for you guys (but might then revisit it in the future).

abacabadabacaba commented 10 years ago

Current behavior leads to privilege elevation in Debian, as follows:

See here for some other potential attacks.

ndmitchell commented 10 years ago

@abacabadabacaba Is there some flag I can pass to strip the setuid files? I can't see the link at the moment since it gives a 500 server error.

abacabadabacaba commented 10 years ago

I don't know any such flags. Actually, tar is pretty hard to use securely.

One possible solution is to extract into a directory inaccessible to other users, which means that its parent directory should have permission bits set to 0700. An archive can modify permission bits on the directory it is extracted into, that's why it is important to make the parent directory inaccessible to other users.
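A small Python sketch of that layout, assuming a POSIX system: the parent directory is created with mode 0700 and the archive would be unpacked into a child of it. The function name and the `unpack` directory name are made up for the example.

```python
import os
import tempfile


def make_private_extract_dir():
    """Create a parent directory (mode 0700) and a child to extract into."""
    # mkdtemp creates the directory accessible only to the creating user.
    parent = tempfile.mkdtemp(prefix="hoogle-")
    target = os.path.join(parent, "unpack")
    os.mkdir(target)
    return parent, target
```

Even if the archive changes the permission bits on `unpack`, other users still cannot traverse `parent` to reach it.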

Then, it is necessary to sanitize the resulting directory by removing everything except regular files and directories. That's just find -type f -o -type d -o -delete. Otherwise, an attacker can place a symlink to cause your script to read arbitrary files. An attacker can also place device nodes and named pipes.
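The same sanitizing step could be sketched in Python roughly as follows. It is only an illustration of the idea behind the find command; `sanitize_tree` is a hypothetical helper, and it assumes nobody else can modify the tree while it runs (hence the private parent directory).

```python
import os
import stat


def sanitize_tree(root):
    """Delete everything under root that is not a regular file or a directory."""
    removed = []
    # followlinks defaults to False, so symlinked directories are never entered.
    for dirpath, dirnames, filenames in os.walk(root):
        real_dirs = []
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            mode = os.lstat(full).st_mode  # lstat inspects the link itself
            if stat.S_ISDIR(mode):
                real_dirs.append(name)
            elif not stat.S_ISREG(mode):
                os.unlink(full)  # symlinks, FIFOs, device nodes, sockets
                removed.append(full)
        dirnames[:] = real_dirs  # only descend into genuine directories
    return removed
```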

Running tar as a non-root user would prevent it from creating setuid root files and device nodes. However, it can still create symlinks, which can be used to access some secret data. For example, an archive may contain a symlink to /etc/ppp/chap-secrets, which could be used to leak contents of that file.

An alternative approach is to use some library to access the contents of the archive without extracting it at all, e.g. libtar.
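As a sketch of that approach, here is the idea using Python's standard `tarfile` module as a stand-in for libtar. Only regular-file members are read, and nothing is written to disk; `read_regular_members` is a hypothetical helper name.

```python
import io
import tarfile


def read_regular_members(fileobj):
    """Map member name -> bytes for regular files only, without touching disk."""
    contents = {}
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            # isreg() is false for symlinks, hard links, devices, FIFOs and directories.
            if member.isreg():
                contents[member.name] = archive.extractfile(member).read()
    return contents
```

Because nothing is extracted, symlinks and device nodes inside the archive are simply skipped. Reading everything into memory still leaves the size problem open.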

After this, some denial of service attacks still remain. For example, an attacker can send a "gzip bomb", which will fill the filesystem when unpacked, or just a very large file.

iustin commented 10 years ago

While I appreciate @abacabadabacaba's concern about security, I feel it is a bit out of place here, in the context of this bug.

Feel free to clone it and make it a general bug about the safety of downloaded data, but I'd prefer if this bug remains purely about using hoogle without internet access.

I also feel it's a bit misplaced to attribute all of the above attacks to hoogle. Normally hoogle is not run as root; it's the fault of the Debian packaging that it is run that way. Without root, a lot of the attacks described above are no longer valid (only root can create device nodes, only files to which hoogle has access would be leaked, if at all, etc.).

abacabadabacaba commented 10 years ago

Most of the attacks are still valid, they would just compromise the account hoogle is running as, which is still a vulnerability. Well, it is really a different bug than that hoogle requires Internet access, so I created issue #78 for it.

iustin commented 10 years ago

With the --no-download flag in, I think the only issue remaining on this bug is how to declare/discover needed files.

It can stay as it is now (looking at the source code), but a nicer way would be to have the data command show the missing files and their URLs either in the error message of missing files or via a separate command. Thoughts?
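To make the suggestion concrete, a rough Python sketch of such a report is below. The function names, the `(name, url)` pair format and the message layout are assumptions for illustration, not anything hoogle currently provides.

```python
import os


def missing_files(required, data_dir):
    """Return the (name, url) pairs whose file is absent from data_dir."""
    return [(name, url) for name, url in required
            if not os.path.exists(os.path.join(data_dir, name))]


def format_missing(missing):
    """Render an error message listing each missing file and where to get it."""
    lines = ["Missing data files:"]
    lines += [f"  {name}  <-  {url}" for name, url in missing]
    return "\n".join(lines)
```

A packager could feed that list to whatever fetch tool they trust and then build the databases offline.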

ndmitchell commented 10 years ago

Could hoogle download just download the necessary files? Then you could do that to build the package?

iustin commented 10 years ago

Yes, if you provide that it would be fine.

nomeata commented 10 years ago

@iustin, @ndmitchell: What is the status of this? Who is blocked on whom?

iustin commented 10 years ago

Nobody is blocked, but I'm traveling right now and don't have access to my GPG keys to upload a new version. This issue remains open for explicit --download, but we're fine with the current code for the open bug.