tballison / lucene-addons

Standalone versions of LUCENE_5205 and other patches: SpanQueryParser, Concordance and Co-occurrence stats
Apache License 2.0
18 stars 1 forks source link

Regarding SpanQueryParser constructor. #2

Closed modassar81 closed 8 years ago

modassar81 commented 8 years ago

Hi Tim,

The constructor public SpanQueryParser(String f, Analyzer a) in SpanQueryParser is no more available. Can you please help me understand the change as I am using this constructor? If the constructor is no more required kindly explain how the new constructor with the new multitermAnalyzer parameter can be used instead of the public SpanQueryParser(String f, Analyzer a)?

Thanks, Modassar

tballison commented 8 years ago

I made the design decision to get rid of SQP(String field, Analyzer analyzer) in place of SQP(String field, Analyzer analyzer, Analyzer multitermAnalyzer) because it is dangerous to send in a regular non-multiterm aware analyzer to be used for double duty as a regular Analyzer and as a multiterm Analyzer. It'll work quite a bit of the time, but then it won't. Specifying a single analyzer to do double duty feels to me akin to not having to specify the encoding when creating a Reader or not specifying a Locale when lowercasing.

So, the simplest way to construct a multiterm analyzer is something like:

    Analyzer asciiFoldingMultiTermAnalyzer = new Analyzer() {
      @Override
      public Analyzer.TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new KeywordTokenizer();
        TokenFilter filter = new ASCIIFoldingFilter(tokenizer);
        filter = new LowerCaseFilter(filter);
        return new Analyzer.TokenStreamComponents(tokenizer, filter);
      }
    };

The key is to start with the KeywordTokenizer and then add the multitermaware components that are part of your primary Analyzer. The multiterm analyzer is constructed automatically for you in Solr.

Is this reasonable?

modassar81 commented 8 years ago

I already have an analyzer which has solr.ICUFoldingFilterFactory. So I think solr will automatically do the right thing for multiterm as you have suggested. So in that case can I simply pass my existing analyzer as multiterm analyzer. Please help me understand if solr can create multiterm analyzer from the MultiTermAware analyzer available in schema then why we need to pass it as a constructor parameter unless there is no such analyzer defined with any multiterm aware component. This I am saying in context of details provided in https://wiki.apache.org/solr/MultitermQueryAnalysis. Here in the link it is suggested not to have a own defined mutliterm analyzer chain.

modassar81 commented 8 years ago

I am planning to migrate to Lucene/Solr-5.4.0 in a couple of days. Kindly let me know if any changes you are planning in SpanQueryParser for Lucene/Solr-5.4.0.

Thanks, Modassar

tballison commented 8 years ago

I'm not planning any changes. The current unit tests (and the build) will fail because of a regression in 5.4.0 with handling multiterms within SpanNotQueries (LUCENE-6929).

If you'd like, I can @Ignore those failing unit tests so that you can at least build it. I think I have to make some other unit test changes to handle the new method of equality checking for BooleanQueries.

If you're still going to migrate to 5.4.0 despite LUCENE-6929, let me know, and I'll modify the unit tests so that you can build it.

tballison commented 8 years ago

To your question about Solr...shouldn't that all be taken care of for you with SOLR-5410 (another module within lucene-addons)? Do you have custom Solr code that is relying on LUCENE-5205?

modassar81 commented 8 years ago

Do you have custom Solr code that is relying on LUCENE-5205?

Yes SpanQueryParser is integrated with edismax parser for phrases. So any phrase goes to SpanQueryParser from edismax. Please let me know how I can use the new constructor.

modassar81 commented 8 years ago

Yes we need to migrate to Lucene/Solr-5.4.0.

tballison commented 8 years ago

To the first point, can't you use the SolrSpanQueryParser for phrases?

To the second, ok, I'll make the changes to the unit tests so that you can build it with 5.4.0. Just beware that multiterms in SpanNotQueries fail now!!!

modassar81 commented 8 years ago

To the first point, can't you use the SolrSpanQueryParser for phrases? I don't see this class in the package. Please point me if I am missing something. Also share your plan for the fix of multiterms in SpanNotQueries.

tballison commented 8 years ago

https://github.com/tballison/lucene-addons/blob/master/solr-5410/src/main/java/org/apache/solr/search/SolrSpanQueryParser.java

The fix for SpanNotQueries needs to be made at the Lucene level. I can't do it with the parser. I've submitted a patch with unit tests on LUCENE-6929. The fix is simple, we just need to get the interest of a Lucene/Solr committer. Perhaps you can apply the patch before you build 5.4.0?

modassar81 commented 8 years ago

Thanks Tim. I will look into using SolrSpanQueryParser. This may take time as will need to add it into the edismax and check how the required parameters will be created for its constructor and used. The current integration with SpanQueryParser is very stable so a help around it is really appreciated. This makes the integration easier and quick as we have been using it.

tballison commented 8 years ago

I strongly encourage moving to the SolrSpanQueryParser because that will take your schema info into consideration correctly and it will handle multiterms correctly. If I reverted to the old state of just submitting a single parser to handle both tokenization/normalization and normalization of multiterms, there would likely be subtle bugs.

Alternatively, if you are already in a QParser/QParserPlugin and you have the IndexSchema and you know that you are only parsing a single field, couldn't you do effectively what SolrSpanQueryParser does in getMultiTermAnalyzer but at the initialization time of the SpanQueryParser to get the right MultiTermAnalyzer? This would be a bit hacky because you're effectively limiting the SQP to a single field's configuration for analyzers (but you were doing this already???), but it would be better than initializing with a single analyzer.

  @Override
  public Analyzer getMultiTermAnalyzer(String fieldName) {
    SchemaField field = schema.getFieldOrNull(fieldName);
    if (field == null) {
      return null;
    }
    FieldType type = field.getType();

    if (type instanceof TextField) {
      return ((TextField) type).getMultiTermAnalyzer();
    }
    return null;
  }

The other hacky thing about this is that you aren't getting the benefit of the custom field handling for prefix and range handling for field types that aren't TextFields

modassar81 commented 8 years ago

Thanks Tim for pointing the options. The second option is simply there in SolrQueryParserBase which will get me the multiterm analyzer which can be passed to the SpanQueryParser. Please correct me if I am wrong. Thanks again for the pointers.

tballison commented 8 years ago

Exactly, this is hacky for the reasons I mentioned, but if you can guarantee that you are only parsing one field, and you know that that field is a TextField, then the multiterm analysis and everything else will work correctly.

tballison commented 8 years ago

This seems to have split into some separate issues: dealing with the new constructor and updating to 5.4. I opened a separate ticket to track 5.4. Are you all set with the new constructor...should I close this out?

modassar81 commented 8 years ago

I will test it tomorrow and will be in a position to conclude. Will update this issue after that. I should use https://github.com/tballison/lucene-addons/tree/lucene5.4on-0.1/lucene-5205 to integrate with Lucene/Solr-5.4.0. Right?

Thanks, Modassar

modassar81 commented 8 years ago

Hi Tim,

I have integrated SpanQueryParser with lucene/Solr-5.4.0 and have started indexing. Will do some testing and update this issue.

I have a class which is not part of Solr deployment and uses the SpanQueryParser. This class has its own analyzer chain. Now that the constructor is modified how I can get the multiTerm analyzer in this class. Can I make a class IndexSchema aware outside Solr and if not should I create my own multiterm analyzer and use that as you have suggested above? What are the points one should address while creating own multi term analyzer?

Thanks, Modassar

tballison commented 8 years ago

I have hacked IndexSchema to work at the pure Lucene level, but I think it would be best to build your own multiterm analyzer...if you are already building a custom analyzer chain.

Some important points:

  1. start with the KeywordTokenizer
  2. Make sure you don't use a synonym filter or a stop filter
  3. Generally, confirm that the filters are all multiterm aware.

I realize that this is a hassle, but I believe strongly that it is better to go through this work than to have unseen/rare bugs caused by using one Analyzer for two tasks. Thank you for working through this!

modassar81 commented 8 years ago

Hi Tim,

I am using following analyzer chain.

@Override
protected TokenStreamComponents createComponents(String arg0) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream stopTStrem = new StopFilter(source, new CharArraySet(DEFAULT_STOP_SET, true));
    TokenStream icuFTStream = new ICUFoldingFilter(stopTStrem);
    TokenStream pattern1FStream = new PatternReplaceFilter(icuFTStream, pattern1, "", true);
    TokenStream krfStream = new KeywordRepeatFilter(pattern1FStream);
    TokenStream kStream = new KStemFilter(krfStream);
    TokenStream rdtStream = new RemoveDuplicatesTokenFilter(kStream);

    return new TokenStreamComponents(source, rdtStream);
}

Please let me know if below multi term analyzer should be fine. Is it that always we need multi term aware component only? If yes what about replacing pattern used at indexing time?

@Override
protected TokenStreamComponents createComponents(String arg0) {
    Tokenizer source = new KeywordTokenizer();
    TokenStream icuFTStream = new ICUFoldingFilter(source);
    TokenStream pattern1FStream = new PatternReplaceFilter(icuFTStream, pattern1, "", true);

    return new TokenStreamComponents(source, pattern1FStream);
}

Thanks, Modassar

modassar81 commented 8 years ago

Hi Tim,

Can you kindly help me understand the reason behind following changes in SpanQueryParser?

When wildcard query is parsed using SpanQueryParser with the latest changes it returns PrefixQuery whereas parsing a term returns TermQuery. E.g. fl:sear* (returns PrefixQuery), fl:search (returns TermQuery). With the older Lucene-5205 (till 5.2.1) code I have checked that it used to return SpanMultiTermQueryWrapper and SpanTermQuery for wildcard and term queries respectively.

Basically I am using the Span related attributes for example PositionSpan in my custom code which now is not generated for above two type of queries as they are no more Span based.

Please help.

Thanks, Modassar

tballison commented 8 years ago

To your first question, y, the components should generally be multiterm aware. If your PatternReplaceFilter needs/ operates on the full word, things may go awry. For example, if your pattern is ~s/^cat$/dog/ and you search for "cat", this will result in searching for "dog".

However, if your PRF is just replacing "-" with "_", you should be ok.

See: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201512.mbox/%3CF194523CD3CC924F90950F099354356D010E69CA60@IL-EXM01.Corp.Exlibrisgroup.com%3E

tballison commented 8 years ago

To the second question of the change in behavior...returning regular instead of span queries where possible.

If you don't need the full fledged Boolean/multi-field handling in your parser, you can try using the SpanOnlyParser.

Lots of words follow...

Historical reason: the initial SpanQueryParser was restricted to single fields and didn't support Boolean operators, and the initial goal was exactly what you're trying to do: return only SpanQueries so that I could iterate through spans in a document and do some processing (concordance/co-occurrence). However, after I added the capability to handle multiple fields and Boolean operators, I had somewhat of a hybrid output...mostly SpanQueries but now BooleanQueries too. This meant that I could no longer return SpanQueries, but had to return Queries. As the next step in the evolution, I decided to return regular queries where possible, but SpanQueries where necessary.

In short, I (finally) realized that there was an important difference between a retrieval query and a highlighting query, and I needed to be able to support retrieval and highlighting separately.

So, I now have a class in LUCENE-5317 that will convert a regular query into a SpanQuery for a given field so that you can iterate through the spans of a given document. Take a look at:

org.apache.lucene.search.queries.SpanQueryConverter and org.apache.lucene.search.spans.SimpleSpanQueryConverter

in e.g. : https://github.com/tballison/lucene-addons/tree/lucene5.4on-0.1/lucene-5317/src/main/java/org/apache/lucene/search

modassar81 commented 8 years ago

Thanks a lot for your response Tim. To your first response the pattern only replaces any non word character with a blank from beginning and end of the input. I will look into SpanQueryConverter and SimpleSpanQueryConverter and try to use for our requirement. Is SpanOnlyParser same as SpanQueryParser and differs only in following sense: "returning regular instead of span queries where possible." which means it always returns Spans.

Best, Modassar

modassar81 commented 8 years ago

Any plan for back porting the fix of wildcard in double quotes e.g. "net*" to Lucene/Solr-5.2.1?

Thanks, Modassar

tballison commented 8 years ago

Is SpanOnlyParser same as SpanQueryParser and differs only in following sense: "returning regular instead of span queries where possible." which means it always returns Spans.

SpanOnlyParser always returns SpanQueries. However, it will throw a ParseException, if there's a Boolean operator, a field operator or a MatchAllDocs operator (:) in the query.

tballison commented 8 years ago

Any plan for back porting the fix of wildcard in double quotes e.g. "net*" to Lucene/Solr-5.2.1?

Y, I did in the lexer2 branch. Sorry, I need to rename that branch. That currently works with the new test for the "net*" problem that you found.

modassar81 commented 8 years ago

Thanks for your help and response Tim.

I tried to use org.apache.lucene.search.spans.SimpleSpanQueryConverter and I am able to get the desired results. But when I added SimpleSpanQueryConverter and SpanQueryConverter in Lucene core by creating required package the build failed during compilation. It complains of symbol not found CommonTermsQuery for SpanQueryConverter. Addition of SimpleSpanQueryConverter only in Lucene core's org.apache.lucene.search.spans package did not cause any issue. Do I need to put SpanQueryConverter somewhere else in Lucene packages?

Also there are few deprecated APIs used in these converter. Just thought of understanding.

Best, Modassar

tballison commented 8 years ago

Right. SpanQueryConverter was broken out into a different class because it requires CommonTermsQuery which lives in the queries package. If you don't need CommonTermsQuery, just use SimpleSpanQueryConverter, and you'll be in good shape. Otherwise, SpanQueryConverter should work in the queries package, but I haven't tried building these addons within Lucene in quite a while.

modassar81 commented 8 years ago

Thanks for your help and suggestions Tim. All your inputs have been great help.

Best, Modassar

tballison commented 8 years ago

It has been great to get your feedback. Happy searching...er, finding! :)