oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

improve analyzers to include " or < to string literals (Bugzilla #11825) #414

Open vladak opened 11 years ago

vladak commented 11 years ago

Status: NEW. Severity: enhancement, in component analyzer for ---. Reported in version unspecified on platform ANY/Generic. Assigned to: Trond Norbye

On 2009-10-08 14:35:56 +0000, Lubos Kosco wrote:

I don't know about < or >, but I've had users approach me about finding double quotes. The use case is: I want to find all occurrences of "example" as a string, but I don't want to find classes, namespaces, variables, etc - just string literals.

The angle brackets request may be the same - I want to find all occurrences of a particular tag, but I don't want anything other than tags.

Just my $0.02

-- Jim R. Wilson (jimbojw | trephine.org)
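To make the use case above concrete, here is a minimal sketch of telling a string literal apart from an identifier with the same text. It is plain Python with a hypothetical `TOKEN` pattern and hypothetical helper names (`classify`, `find_string_literals`); it is not OpenGrok's analyzer code, and it ignores comments, character literals and other real-world cases.

```python
import re

# Hypothetical token pattern: a double-quoted string literal (with backslash
# escapes) or a plain identifier. Real analyzers handle far more cases.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_][A-Za-z0-9_]*')


def classify(source):
    """Return (kind, text) pairs, where kind is 'string' or 'ident'."""
    tokens = []
    for match in TOKEN.finditer(source):
        text = match.group(0)
        if text.startswith('"'):
            # Record the literal's content without the surrounding quotes.
            tokens.append(("string", text[1:-1]))
        else:
            tokens.append(("ident", text))
    return tokens


def find_string_literals(source, needle):
    """Return the string literals whose content equals needle."""
    return [text for kind, text in classify(source)
            if kind == "string" and text == needle]


code = 'class example { String s = "example"; }'
print(classify(code))
print(find_string_literals(code, "example"))
```

Because each token carries its kind, a query for "example" as a string can skip the class named `example` in the same file.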

On Fri, Sep 18, 2009 at 4:04 AM, Lubos Kosco Lubos.Kosco@sun.com wrote:

Knut Anders Hatlen wrote:

Lubos Kosco Lubos.Kosco@Sun.COM writes:

Garcia-Duque, Manuel wrote:

I'm trying to search for "<" in opengrok 0.7 (repository Subversion). However, the search seems to discard that character '<'.

The other thing is that I don't recall that we do index "<" as a searchable token ... or do we ? Trond, kah, got clues ?

I don't think we do currently, except for the Lisp tokenizer, as you said (and perhaps the SQL tokenizer? I haven't checked), where '<' may be part of a symbol. Most languages don't accept '<' as part of a symbol/keyword, but we could of course make the tokenizers for those other languages return '<' and other operators as separate tokens too. I don't see many good reasons not to, except that those tokens probably have very low selectivity in most cases (matches most files and don't narrow the search much).

agree on the same, use of "<" in search query doesn't make much sense, unless Manuel has got a really good use case, hmm ?

L

(seems C/C++ or Java doesn't do it:

http://src.opensolaris.org/source/xref/opengrok/trunk/src/org/opensolaris/opengrok/analysis/c/CSymbolTokenizer.lex , ... but it seems only lisp tokenizer does it

http://src.opensolaris.org/source/xref/opengrok/trunk/src/org/opensolaris/opengrok/analysis/lisp/LispSymbolTokenizer.lex#59 )
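The trade-off discussed in this exchange (returning '<' as its own token versus its low selectivity) can be illustrated with a small sketch. This is plain Python rather than the JFlex tokenizers linked above; the `keep_operators` flag, the `selectivity` helper and the sample files are assumptions made up for illustration.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
OPERATOR = r"[<>]"  # assumption: only angle brackets, to keep the sketch small


def tokenize(source, keep_operators=False):
    """Return identifiers and, optionally, '<' and '>' as separate tokens."""
    pattern = IDENT + "|" + OPERATOR if keep_operators else IDENT
    return re.findall(pattern, source)


def selectivity(files, token, keep_operators=True):
    """Fraction of files containing token; values near 1.0 narrow a search little."""
    hits = sum(1 for text in files.values()
               if token in tokenize(text, keep_operators))
    return hits / len(files)


# Hypothetical sample files.
files = {
    "a.c": "if (a < b) return a;",
    "b.c": "for (i = 0; i < n; i++) total += i;",
    "c.c": "#include <stdio.h>",
    "d.c": "int limit = 10;",
}

print(tokenize(files["a.c"]))
print(tokenize(files["a.c"], keep_operators=True))
print(selectivity(files, "<"), selectivity(files, "limit"))
```

In this toy corpus '<' appears in three of four files while `limit` appears in one, which is the "matches most files" concern raised above.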


On 2009-10-08 18:01:49 +0000, Jim R. Wilson wrote:

What would be nice (IMO) is to have the indexer keep both a punctuation aware and a punctuation agnostic index. As of the time of this writing, the index is completely punctuation agnostic, making it impossible to match things such as XML tags.

That particular approach may be overkill. I'm not enough of a Lucene expert to comment intelligently on whether there may be a better way to achieve the same effect.
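One way to picture the two-index idea is the sketch below. It uses plain Python dictionaries rather than Lucene, and the token patterns, the `build_index` and `search` helpers, and the rule "route queries starting with '<' to the punctuation-aware index" are all assumptions for illustration, not a description of how OpenGrok works.

```python
import re
from collections import defaultdict

# Punctuation-agnostic tokens: bare words only.
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Punctuation-aware tokens (assumption): a leading '<' or '</' stays
# attached to the word that follows it.
TAGGED = re.compile(r"</?[A-Za-z_][A-Za-z0-9_]*|[A-Za-z_][A-Za-z0-9_]*")


def build_index(files, pattern):
    """Map each token to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for token in pattern.findall(text):
            index[token].add(name)
    return index


# Hypothetical sample files.
files = {
    "View.jsx": "return <List items={rows} />;",
    "Util.java": "List names = new ArrayList();",
}
agnostic = build_index(files, WORD)
aware = build_index(files, TAGGED)


def search(query):
    """Use the punctuation-aware index only when the query starts with '<'."""
    index = aware if query.startswith("<") else agnostic
    return sorted(index.get(query, set()))


print(search("List"))
print(search("<List"))
```

The cost of this design is that every file is tokenized and stored twice, which is the "overkill" concern mentioned above.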

On 2009-10-08 23:30:17 +0000, Bruce Furber wrote:

While you are at it, it would be nice to document how special characters are handled. Some, like +, are indexed separately, while _ is treated as a normal character. Others are ignored and treated as white space.

craigkovatch commented 4 years ago

"really good use case" is being able to search HTML/XML/JSX tags. I want to query for "<List" right now but I can't because Opengrok completely ignores the LT sign and returns all results of "List" across our codebase -- which is too many to be a useful search :(