Backport LUCENE-8778 (improved analysis SPI name handling) to 8.x [LUCENE-8911]

mikemccand commented 5 years ago

In LUCENE-8907 I reverted LUCENE-8778 from the 8x branch. Can we backport it to 8x branch again, with transparent backwards compatibility (by emulating the factory loading method of Lucene 8.1)?

I am not so sure about it would be better or not to backport the changes, however, maybe it is good for Solr to have SOLR-13593 without waiting for release 9.0.

Legacy Jira details

LUCENE-8911 by Tomoko Uchida (@mocobeta) on Jul 11 2019, resolved Jul 21 2019 Attachments: LUCENE-8911-fix-mock.patch Linked issues:

LUCENE-8907

mikemccand commented 5 years ago

I do not have strong feeling about this, but I just start thinking that it could be better if we expose this feature from 8.x as pre-migration steps to 9.0 - especially if it affects to many Lucene and Solr users. Therefore quick-eyed custom factory authors can try their own names from now on. It should be easy to emulate old algorithm in a "lenient way" when the SPI NAME constant is not defined.

@thetaphi and @jpountz: Do you have any thoughts about this?

And of course we need to add some note about it in MIGRATE.txt anyway, I forgot that. I'll update it. Also, by the way, I feel like that it would be helpful if we can show warning messages for such use case though we don't need full-fledged loggers. But it's beyond the scope here.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 12 2019]

mikemccand commented 5 years ago

My opinion/idea would be the following:

Backport everything to 8.x
In addition to extracting the NAME field (which should not fail the SPI lookup) also generate the "old name". Add both names to the map (they may differ, if a user decides to use his own name).

This would bring both worlds:

Old factories still work (using the old algorithm) to generate the name
New factories would have both names (ideally they should not differ, but with the new aproach the user should be able to decide for his own name, not generated from the factory class name).

We should add a test with a fake factory (maybe add a META-INF file to the tests) without NAME to test backwards compatibility.

[Legacy Jira: Uwe Schindler (@uschindler) on Jul 12 2019]

mikemccand commented 5 years ago

And Solr should log a schema warning (Lucene can't do this).

[Legacy Jira: Uwe Schindler (@uschindler) on Jul 12 2019]

mikemccand commented 5 years ago

Hi @thetaphi,

thanks for your comment, I basically agree to your idea.

In addition to extracting the NAME field (which should not fail the SPI lookup) also generate the "old name". Add both names to the map (they may differ, if a user decides to use his own name).

While we can strictly keep consistency of the (old) naming rule by this special treatment, it looks slightly strange to me that a factory temporarily have two names. When upgrading to 9.0.0, the client code that look up custom factories by auto-generated names can be broken. It does not look good to me. The old naming rule is not documented anywhere, as far as I know, so I think it is rather an implementation than interface. Can't we discard the "old name" when the NAME field is provided by the author? I know It is not a big problem at all, I just need a bit more clear interpretation for the "backwards compatibility" here and want to seek the modest way to implement it.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 12 2019]

mikemccand commented 5 years ago

I would not strictly enforce the old naming rules, because the idea behind the NAME field is that the developer of a TokenStream component can decide for a good name that unrelated to class name. E.g., this allows to restructure your classes.

The new design was from the beginning made in a way that it's perfectly possible to assign "alias" names for tokenstream components in the future. E.g., we could deprecate an old name and rename something and temporary keep both names for lookup. The NAME field only contains the new one.

This is basically the same: Let the developer decide for the name that fit's best for his need. For backwards compatibility just keep the old name. For Solr that's not an issue, as old schemas use class names anyways.

For Lucene-Internal components tis is not a problem, as we kept the old names - but I'd like to possibly have the option to rename some of those in the future!

[Legacy Jira: Uwe Schindler (@uschindler) on Jul 12 2019]

mikemccand commented 5 years ago

Hi,

I might miss something... but have a question: if we only keep the name provided by the author ("NAME" field) and discard the "old name", does this cause problems for Lucene components or break any internal behaviour? It would affect to client code?

I have no strong objection but still do not fully understand the reason why we should introduce some complexity into our code to support multiple names.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 12 2019]

mikemccand commented 5 years ago

This would not add any complexity, only at the place where the SPIIterator is consumed you have 2 "adds" to the Map. One for the name field and a second one for the "legacy" name if it's not already there.

Anywhere else no changes.

[Legacy Jira: Uwe Schindler (@uschindler) on Jul 12 2019]

mikemccand commented 5 years ago

I opened a pull request (to the 8x branch) for review.

https://github.com/apache/lucene-solr/pull/782

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 13 2019]

mikemccand commented 5 years ago

The branch is ready to merge, I think.

@thetaphi: Could you please review the changes in AnalysisSPILoader and its tests? All other parts are identical to the master branch.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 13 2019]

mikemccand commented 5 years ago

The changes (and added tests) seem straightforward to me. I will wait for a few days and merge it to 8x branch if there are no comments/objections.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 14 2019]

mikemccand commented 5 years ago

Hi @mocobeta,

sorry for the delay (was in "weekend mode"). The changes look fine, that's exactly how I thought about it. It's good that the original names is only modified if the legacy name is not yet in the map, so we don't get duplicate names with just different spelling. Just to confirm: The generated names are all lower case, right? So if there is a factory with no NAME field, it would produce a lowercase only entry in both (map and set).

+1

Should we get this into 8.2 or not?

[Legacy Jira: Uwe Schindler (@uschindler) on Jul 15 2019]

mikemccand commented 5 years ago

Thanks @thetaphi,

Just to confirm: The generated names are all lower case, right? So if there is a factory with no NAME field, it would produce a lowercase only entry in both (map and set).

Yes, I just copied the code which generates names from 8.1 branch as-is, so they are all lowercased.

name = clazzName.substring(0, clazzName.length() - suffix.length()).toLowerCase(Locale.ROOT);

Should we get this into 8.2 or not?

I didn't think of merging it to the 8.2 branch again since it is now "on feature freeze" (as announced in the dev-list) and I am okay with we will delay this until 8.3..., any thoughts?

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 15 2019]

mikemccand commented 5 years ago

Commit b5e8dc3af4227401233289fdf7433be9d7440ca1 in lucene-solr's branch refs/heads/branch_8x from Tomoko Uchida https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b5e8dc3

LUCENE-8911: Backport LUCENE-8778 (improved analysis SPI name handling) to 8.x (#782)

This also keeps old names for backwards compatibility on 8.x

[Legacy Jira: ASF subversion and git services on Jul 16 2019]

mikemccand commented 5 years ago

I will update lucene/MIGRATE.txt later on.

Seems it would take some time before the first RC will be created so we could merge this into 8_2 branch again, but I hesitate a bit to ask to the RM to do so. Let me know if it would be better getting into this 8.2 rather than waiting 8.3 \:)

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 16 2019]

mikemccand commented 5 years ago

I have my doubts for this change to make 8.2. Elasticsearch CI has reported a failure that I think it is related with. this change:

ant test  -Dtestcase=TestFactories -Dtests.method=test -Dtests.seed=FEA8D71DFC111060 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=zh-TW -Dtests.timezone=Europe/Guernsey -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1

[Legacy Jira: Ignacio Vera (@iverase) on Jul 16 2019]

mikemccand commented 5 years ago

Thank you, it seems that a mock tokenizer class which was added here does not generate proper token stream under certain conditions. I will fix the mock class. Anyway I agree with you that we should wait to see how the changes here behave.

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 16 2019]

mikemccand commented 5 years ago

Here is the patch for fixing tests: LUCENE-8911-fix-mock.patch

[Legacy Jira: Tomoko Uchida (@mocobeta) on Jul 21 2019]

mikemccand commented 5 years ago

Commit 5ff3ebec1249552a6f43e23763d48a492f8ed135 in lucene-solr's branch refs/heads/branch_8x from Tomoko Uchida https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5ff3ebe

LUCENE-8911: Use MockTokenizer for tests.

[Legacy Jira: ASF subversion and git services on Jul 21 2019]

mikemccand / stargazers-migration-test

Backport LUCENE-8778 (improved analysis SPI name handling) to 8.x [LUCENE-8911] #908

Legacy Jira details