Deprecate methods in CustomAnalyzer.Builder which take factory classes [LUCENE-8566]

mikemccand commented 6 years ago

CustomAnalyzer.Builder has methods which take implementation classes as follows.

withTokenizer(Class<? extends TokenizerFactory> factory, String... params)
withTokenizer(Class<? extends TokenizerFactory> factory, Map<String,String> params)
addTokenFilter(Class<? extends TokenFilterFactory> factory, String... params)
addTokenFilter(Class<? extends TokenFilterFactory> factory, Map<String,String> params)
addCharFilter(Class<? extends CharFilterFactory> factory, String... params)
addCharFilter(Class<? extends CharFilterFactory> factory, Map<String,String> params)

Since the builder also has methods which take service names, it seems like that above methods are unnecessary and a little bit misleading. Giving symbolic names is preferable to implementation factory classes, but for now, users can write code depending on implementation classes.

What do you think about deprecating those methods (adding @Deprecated annotations) and deleting them in the future releases? Those are called by only test cases so deleting them should have no impact on current lucene/solr codebase.

If this proposal gains your consent, I will create a patch. (Let me know if I missed some point. I'll close it.)

Legacy Jira details

LUCENE-8566 by Tomoko Uchida (@mocobeta) on Nov 18 2018, resolved Oct 29 2021 Linked issues:

LUCENE-8873

mikemccand commented 6 years ago

Are there any feedbacks? :)

I will try to explain some background. (I'm not a fluent English speaker. Excuse me if I make mistakes.) I have recently learned about the concept of Service Provider Interface and Java9's Module System utilizes this to decouple the interfaces and implementations (though it seems Lucene does not use the JDK's ServiceLoader but its own special loader class.) I did not know well about service providers until quite recently, but I like the essential concept and feel like that it's a good idea to limit the way to look up the factories, those are often provided from third party jars, only via names.

I think Lucene users and also Solr should use service names to instantiate the factories in the future, but first Lucene itself should use only the names and force to its users to use them. So I am considering this is a small step to unify the use of the factories.

Just for information, my studies and thoughts was staretd from a comment in this issue: https://issues.apache.org/jira/browse/LUCENE-8453

We should also document the SPI name of each factory.

But I do not want to mess up the issue board (may be because of my lack of understanding for the details or priorities,) I will close this in a few weeks if there are no comments here.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Nov 20 2018]

mikemccand commented 6 years ago

@thetaphi What do you think about this change?

[Legacy Jira: Adrien Grand (@jpountz) on Nov 20 2018]

mikemccand commented 6 years ago

Hi, originally the whole analyzer was only using the SPI names. That's as Tomoko says "the correct way". The factory methods were added on request by many users who wanted to have compile-time safety. I agree with that, so we added the factory builder methods. Indeed this allows to have at least compile checks on the names, which makes it easier (for example) to write quick tests.

A second problem of the SPI names is currently that they are documented nowhere. To figure them out, you have to know how they are derived from the factory class name. Before doing anything like that here, we should:

Officially document the SPI names
Don't derive the SPI names from class name anymore (we should put them into the classes as static final constants). By that you can add compile time safety again (user can do FooBarFactory.NAME).
Replace the solr examples in javadocs by generic ones that use SPI names
Change Solr to allow "name=...." instead of "class=..." for analyzer configs.

So I am -1 to remove them again.

[Legacy Jira: Uwe Schindler (@uschindler) on Nov 20 2018]

mikemccand commented 6 years ago

Here is more background: LUCENE-6958

[Legacy Jira: Uwe Schindler (@uschindler) on Nov 20 2018]

mikemccand commented 6 years ago

Thank you for your feedback.

I understand that I dug up the old discussion, and it clearly seems that an umbrella issue is needed to cope with the situation.

Now I am not certain about whether we should carry on or not. Is there practical value in continuing and taking our (of course I mean contributors' and reviewers') time for it, or should we simply close this issue as Won't Fix and work for more important things/issues?

[Legacy Jira: Tomoko Uchida (@mocobeta) on Nov 20 2018]

mikemccand commented 6 years ago

I'm not sure if I can complete the whole to-do list, but getting one or two things done should still be good for users and for this project. :) I will create the patch for each task (one by one) and let you know.

Officially document the SPI names

Don't derive the SPI names from class name anymore (we should put them into the classes as static final constants). By that you can add compile time safety again (user can do FooBarFactory.NAME).

Replace the solr examples in javadocs by generic ones that use SPI names

Change Solr to allow "name=...." instead of "class=..." for analyzer configs.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Nov 20 2018]

mikemccand commented 5 years ago

@thetaphi I have encountered a subtle question.

Don't derive the SPI names from class name anymore (we should put them into the classes as static final constants). By that you can add compile time safety again (user can do FooBarFactory.NAME).

I feel like that factory classes' static final fields are somewhat "implementation details." (Java interfaces can have static final fields, but current factories are not based on interfaces.) Is there a major difference between the calls of these two builder methods.

// by name
CustomAnalyzer.builder().withTokenizer(FooBarFactory.NAME)

// by class 
CustomAnalyzer.builder().withTokenizer(FooBarFactory.class)

Could you help me to clarify what we will achieve by this modification?

[Legacy Jira: Tomoko Uchida (@mocobeta) on Dec 07 2018]

mikemccand commented 5 years ago

Hi, in fact there is no difference between the two calls. Yes, the clas sname is an implementation details if you purely see it from the standpoint of somebody using "configuration" files. But those people get an error message on startup of the server.

For people building a custom analyzer from source code, using class names or constants help them when using their IDE's autocompletion. To them it does not matter if they write ".class" or ".NAME" or just use a "string" as is.

About the implementation - My proposal would be:

Add a "NAME" static public final String field to all factories (similar like Elasticsearch is doing this).
In the SPI code, we use reflection to lookup the static field named "NAME" for every class we discovered. We use the found name to register the factory class for lookup in "Factory.forName()".

[Legacy Jira: Uwe Schindler (@uschindler) on Dec 07 2018]

mikemccand commented 5 years ago

Thanks for the comment. I'd like to start from this,

Add a "NAME" static public final String field to all factories

and document the SPI names in all factories' Javadoc.

Also we might need some code validator, which can be called from the precommit build task, to make sure that each factory has the "NAME" static field.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Dec 07 2018]

mikemccand commented 5 years ago

Hi @thetaphi,

I opened a sub-issue: https://issues.apache.org/jira/browse/LUCENE-8778

Could you check it and give some feedback about this change?

[Legacy Jira: Tomoko Uchida (@mocobeta) on Apr 25 2019]

mikemccand commented 5 years ago

I opened LUCENE-8873 to improve documentation, and (hopefully) provide better ways to manage properties of the factories.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 22 2019]

mikemccand commented 5 years ago

I opened LUCENE-8894 to provide unified APIs to look up a SPI name for a tokenizer/charfilter/tokenfilter (inverse operations of lookupClass(String) equipped in the factories).

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jun 29 2019]

mikemccand commented 5 years ago

I opened SOLR-13593 to allow to use SPI names in the Solr schema definition.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 01 2019]

mikemccand commented 5 years ago

Hi would it be a good time to change all CustomAnalyzer Javadoc examples to ones with SPI names to encourage users to use those instead of class names? I mean, now we can change this

Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
   .withTokenizer(StandardTokenizerFactory.class)
   .addTokenFilter(StandardFilterFactory.class)
   .addTokenFilter(LowerCaseFilterFactory.class)
   .addTokenFilter(StopFilterFactory.class, "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
   .build();

to

Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
   .withTokenizer(StandardTokenizerFactory.NAME)
   .addTokenFilter(StandardFilterFactory.NAME)
   .addTokenFilter(LowerCaseFilterFactory.NAME)
   .addTokenFilter(StopFilterFactory.NAME, "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
   .build();

[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 17 2019]

mikemccand commented 5 years ago

I think as a first step we should change the examples in CustomAnalyzer.

I am +/-0 to deprecate those methods!

[Legacy Jira: Uwe Schindler (@uschindler) on Aug 17 2019]

mikemccand commented 5 years ago

I didn't intend to proceed this issue, just noticed that CustomAnalyzer Javadoc is the only place where we can promote the factory's NAME static fields (with their usage). Also the examples still contain StandardFilterFactory that was already removed, so anyway they should be updated. I will open an issue to change the Javadoc.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 17 2019]

mikemccand commented 5 years ago

The Javadoc example was updated in LUCENE-8957.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Aug 28 2019]

mikemccand / stargazers-migration-test

Deprecate methods in CustomAnalyzer.Builder which take factory classes [LUCENE-8566] #565

Legacy Jira details