Implement searching in zip-files

zbstof commented 9 years ago

Right now it just says "ERR: Zip files not yet supported" and "Cannot decompress zipped file". It would be awesome to be able to look into jar files too.

frederikschubert commented 9 years ago

For now you could use this command to search zip/jar archives:

# To search the contents of the files
unzip -c app.jar | ag regex

# To search for file names
unzip -l app.jar | ag regex

zbstof commented 9 years ago

What about a folder with jars? I usually have some server deployed, and I want to search for some class inside all of the jars in it.

frederikschubert commented 9 years ago

To search multiple jars in a folder you could use this command. I recommend creating an alias for it because it is quite long.

# To recursively search the current directory for files containing the regex
find . -name "*.jar" | while IFS= read -r filename; do unzip -cq "$filename" | ag regex; done

# To recursively search the current directory for files named like the regex
find . -name "*.jar" | while IFS= read -r filename; do unzip -lq "$filename" | ag regex; done

zbstof commented 9 years ago

This is better, but it will still leave a lot of unzipped files behind, potentially corrupting my deployment folder. Can this script be modified to clean up after itself?

frederikschubert commented 9 years ago

From the Linux man page of unzip:

-c

extract files to stdout/screen ("CRT"). This option is similar to the -p option except that the name of each file is printed as it is extracted, the -a option is allowed, and ASCII-EBCDIC conversion is automatically performed if appropriate. This option is not listed in the unzip usage screen.

and

-l

list archive files (short format). The names, uncompressed file sizes and modification dates and times of the specified files are printed, along with totals for all files specified. If UnZip was compiled with OS2_EAS defined, the -l option also lists columns for the sizes of stored OS/2 extended attributes (EAs) and OS/2 access control lists (ACLs). In addition, the zipfile comment and individual file comments (if any) are displayed. If a file was archived from a single-case file system (for example, the old MS-DOS FAT file system) and the -L option was given, the filename is converted to lowercase and is prefixed with a caret (^).

So the zip files should not be unzipped to the file system when using either option (at least on Linux).
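
The same idea, sketched in Python rather than shell so that each hit can name the jar and the member it came from. This is only an illustration under assumptions: the function name `search_archives` and the `.jar`/`.zip` suffix filter are made up for this example, and it has nothing to do with how ag itself works. Members are decompressed into memory only, so nothing is written next to the archives.

```python
import re
import zipfile
from pathlib import Path


def search_archives(root, pattern, suffixes=(".jar", ".zip")):
    """Yield (archive, member, line_number, line) for every matching line.

    Hypothetical helper for illustration; not part of ag or unzip.
    """
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        try:
            archive = zipfile.ZipFile(path)
        except zipfile.BadZipFile:
            continue  # not a readable zip archive; skip it
        with archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # read() decompresses the member into memory, not onto disk
                text = archive.read(info).decode("utf-8", errors="replace")
                for number, line in enumerate(text.splitlines(), start=1):
                    if regex.search(line):
                        yield str(path), info.filename, number, line
```

For example, `for hit in search_archives(".", r"MyClass"): print(*hit, sep=":")` would walk the current directory recursively. Replacing `regex.search(line)` with a check on `info.filename` would give the file-name variant.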

jschpp commented 8 years ago

Out of curiosity: how would you want to implement zip search? Do you want to traverse the zip folder structure and search every file within, or do you want to implement it only for zip files containing a single file?

pierrejoye commented 8 years ago

Hello,

First thanks for this awesome tool! (not the place but have to say it :)

About this feature: I implemented similar things for PHP (extensions), so allow me to throw in some suggestions here.

Zip is one of many archive formats. I suppose others would like this for tar (gz|bz|..) too, along with numerous other ways of using various IOs.

My first suggestion would be to begin with an IO layer that is either thin enough or used exclusively for anything other than the filesystem (fopen and related system APIs), so that filesystem-only operations take no performance hit (if there is any). This layer could provide the classic open, read, scandir, stat, or glob (not sure whether it is used) handlers for a given IO method (zip, tar, etc.).

Once implemented it can be relatively straightforward to implement any archive (or custom IO) support.
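
A rough sketch of what such a layer could look like, in Python for brevity (ag itself is written in C, so this only illustrates the shape). All names here (`FilesystemIO`, `ZipIO`, `open_source`, `search`) are hypothetical, and only `scandir` and `read` handlers are shown; `stat` and `glob` are left out.

```python
import os
import zipfile


class FilesystemIO:
    """Plain directory tree, served by the ordinary filesystem calls."""

    def __init__(self, root):
        self.root = root

    def scandir(self):
        for dirpath, _dirs, files in os.walk(self.root):
            for name in sorted(files):
                yield os.path.relpath(os.path.join(dirpath, name), self.root)

    def read(self, name):
        with open(os.path.join(self.root, name), "rb") as handle:
            return handle.read()


class ZipIO:
    """Members of one zip archive, read without extracting to disk."""

    def __init__(self, path):
        self.path = path

    def scandir(self):
        with zipfile.ZipFile(self.path) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    yield info.filename

    def read(self, name):
        # Reopening per call keeps the sketch simple; a real layer
        # would probably keep the archive handle open.
        with zipfile.ZipFile(self.path) as archive:
            return archive.read(name)


def open_source(path):
    """Pick a handler; only zip archives leave the plain filesystem path."""
    if os.path.isfile(path) and zipfile.is_zipfile(path):
        return ZipIO(path)
    return FilesystemIO(path)


def search(path, needle):
    """Return the names of entries whose bytes contain `needle`."""
    source = open_source(path)
    return [name for name in source.scandir() if needle in source.read(name)]
```

The search code only ever talks to `scandir` and `read`, so adding a tar handler would mean adding one more class with the same two methods and one more branch in `open_source`.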

For zip, there is the nice, well-implemented libzip library, the one I use for the zip extension in PHP. It supports all the APIs needed for use with streams and related areas, as well as in-memory operations.

If you need a hand on that, let me know, I can try to provide some PRs to start.

jschpp commented 8 years ago

I took a look at the code and noticed that the interface of the decompress function needs to be changed in order to search zip or tar files. At the moment *buf is a pointer to a buffered file, but when extracting zip or tar archives one would have an array of files to be searched by search_buf.

I'm with @pierrejoye about using libzip instead of minizip from zlib. My main reason is that libzip has a nice API for extracting the contents of a zip file to a buffer.

jschpp commented 8 years ago

Okay. While trying to implement a search in zip files I stumbled across a problem. First of all: it should be no problem to handle zip files containing only one text file. But as we all know, that is usually not the case. Instead we have zip files containing folder structures that, in the worst case, contain more zip files or files of other types. That can't be handled at the moment. To make it work we need to change the way zip files are handled.

At the moment main calls search_dir, which then calls search_file. The file is then opened, tested, and loaded into a buffer. At this point is_zipped is called, and after that decompress is called to decompress the file in the buffer. This approach is a problem for zip files because they may contain binary files or other content that should be ignored. Furthermore, zip files may be bigger than INT_MAX, which by itself isn't a problem because they could contain a myriad of small files. Yet the current call stack prevents those files from being searched.
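
To make the problem concrete, here is a small Python sketch (not ag's C code) of the traversal a zip search would need: members are handled one at a time instead of loading the whole archive into one buffer, nested archives are descended into, and binary members are skipped. The names `search_zip` and `looks_binary`, the NUL-byte heuristic, the `!/` separator, and the depth limit are all assumptions made for this example.

```python
import io
import zipfile


def looks_binary(data, sample_size=512):
    """Crude heuristic (an assumption): a NUL byte near the start means binary."""
    return b"\0" in data[:sample_size]


def search_zip(fileobj, needle, prefix="", max_depth=5):
    """Yield the path of every text member containing `needle` (bytes).

    `fileobj` may be a path or a seekable binary file object.
    """
    with zipfile.ZipFile(fileobj) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = prefix + info.filename
            # Only this one member is held in memory, so the size of the
            # archive as a whole does not matter.
            data = archive.read(info)
            nested = io.BytesIO(data)
            if zipfile.is_zipfile(nested):
                if max_depth > 0:
                    nested.seek(0)
                    yield from search_zip(
                        nested, needle, name + "!/", max_depth - 1
                    )
                continue
            if looks_binary(data):
                continue  # skip content that should be ignored
            if needle in data:
                yield name
```

A single very large member would still be read in one piece here; handling that would need chunked reads via `archive.open(info)`, which this sketch leaves out.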

I would like to hear from @ggreer whether he has an opinion on how to handle this without breaking interfaces left and right.

jschpp commented 8 years ago

I've added a pull request. Could you please test it and tell me your opinion?

greg-House42 commented 1 month ago

It is possible, even without preprocessing the data while compressing it. You just have to turn your system around. If you are looking for an unencrypted string in an encrypted one, for example, that can't work.

ggreer / the_silver_searcher

Implement searching in zip-files #743