oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

Contents of files within Zip files? #3034

Open Shooter3k opened 4 years ago

Shooter3k commented 4 years ago

Does OpenGrok index the contents of files inside a zip file? If it should, mine doesn't seem to be working, and I'm not sure where to start figuring out why.

That being said, if I search for "test123" and "test.zip" contains a file called "test123.txt", OpenGrok DOES find it. But if "test123.txt" (inside "test.zip") contains the text "test456", a search for "test456" does not find it.

Thanks for any help anyone can offer!

vladak commented 4 years ago

Not that I know of. Similar to #606.

tarzanek commented 4 years ago

OpenGrok should be able to look into zip files (it can certainly analyze .jar files), so I think this should be fixed. It seems like a problem with nested analyzers that came in during either the Lucene upgrade or the Apache (un)zip libraries update. @jvaneck, care to debug and fix it? I'd be happy to consult. Start with one zip file and put breakpoints in the zip analyzer: https://github.com/oracle/opengrok/tree/master/opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive

tarzanek commented 4 years ago

AH, blasted, the nesting logic isn't even there ... so this should be an enhancement and not a bug actually

tarzanek commented 4 years ago

OK, @jvaneck look at https://github.com/oracle/opengrok/blob/master/opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/BZip2Analyzer.java and same with Gzip

basically same logic needs to be taken to ZipAnalyzer (and tar analyzer ev.) care to copy paste and fix? ;-)

Shooter3k commented 4 years ago

I would if I knew Java! :)

idodeclare commented 4 years ago

> OK, @jvaneck look at https://github.com/oracle/opengrok/blob/master/opengrok-indexer/src/main/java/org/opengrok/indexer/analysis/archive/BZip2Analyzer.java and same with Gzip
>
> basically same logic needs to be taken to ZipAnalyzer (and tar analyzer ev.) care to copy paste and fix? ;-)

Alas it's more difficult. Gzip and Bzip2 are compression formats. Tar is an archive format. Zip and Jar are archive formats with compression.

As archive formats, Tar, Zip, and Jar are inherently multi-file, so the requisite logic to index more comprehensively would be substantially different from how OpenGrok is able to decompress and (possibly) index the single-file contents of Gzip and Bzip2.

For Jar, OpenGrok only recognizes nested Java class files for full indexing (via BCEL); it ignores any other file types in the jar (e.g. XML or txt files).
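
To illustrate the compression-versus-archive distinction above, here is a minimal sketch using only the JDK's `java.util.zip` classes (Java 9+ for `readAllBytes`). It is not OpenGrok code: the class `ArchiveVsCompression` and its helper methods are hypothetical names made up for this example. A gzip stream decompresses to exactly one document, whereas a zip stream has to be walked entry by entry, and each entry would need its own file name and its own choice of analyzer.

```java
import java.io.ByteArrayOutputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; not part of OpenGrok.
public class ArchiveVsCompression {

    // Gzip wraps a single stream: decompressing yields one document.
    static String readGzip(InputStream in) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(in)) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Zip holds many entries: each has its own name and its own content.
    static Map<String, String> readZip(InputStream in) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // readAllBytes() stops at the end of the current entry.
                String text = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                contents.put(entry.getName(), text);
            }
        }
        return contents;
    }

    // Test helper: gzip a string in memory.
    static byte[] gzipOf(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // Test helper: build a zip in memory from name -> content pairs.
    static byte[] zipOf(Map<String, String> files) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readGzip(new ByteArrayInputStream(gzipOf("test456"))));

        Map<String, String> files = new LinkedHashMap<>();
        files.put("test123.txt", "test456");
        files.put("notes.txt", "hello");
        System.out.println(readZip(new ByteArrayInputStream(zipOf(files))));
    }
}
```

In this sketch, finding "test456" inside "test123.txt" requires reading each entry's bytes, not just its name, which matches the behavior reported at the top of the thread: entry names are searchable, entry contents are not.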