mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

Support numeric value in Field class [LUCENE-8870] #867

Open mikemccand opened 5 years ago

mikemccand commented 5 years ago

I checked the following comment in Field class.

// TODO: allow direct construction of int, long, float, double value too..?

We already have some fields like IntPoint and StoredField, but I think it's okay.

The test cases are set in the TestField class.


Legacy Jira details

LUCENE-8870 by Namgyu Kim (@danmuzi) on Jun 19 2019, updated Jun 26 2019 Attachments: LUCENE-8870.patch

mikemccand commented 5 years ago

About the TestField class, I think its class structure needs to be changed slightly.

It is no direct connection with this issue. But I plan to modify it like TestIntPoint, TestDoublePoint, ... Should not the test class name depend on the class name?

What do you think about it?

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 19 2019]

mikemccand commented 5 years ago

Personally I find the Field type-facade kind of annoying; it imposes this artificial type safety: in the end we store the value as an Object and then later cast it. Callers that also handle values generically, as Objects then need an adapter to detect  the type of a value, cast it properly, only to have Lucene throw away all the type info and do that dance all over again internally!

Having said that, it's not really relevant to this change, which seems helpful. Maybe use Objects.requireNonNull for the null checks?

[Legacy Jira: Michael Sokolov (@msokolov) on Jun 22 2019]

mikemccand commented 5 years ago

Thank you for your reply! @msokolov :D

I want to make sure that I understand your opinion well.

in the end we store the value as an Object and then later cast it. Callers that also handle values generically, as Objects then need an adapter to detect the type of a value, cast it properly, only to have Lucene throw away all the type info and do that dance all over again internally!

I think this is a structure for providing various constructors to users. (Reader, CharSequence, BytesRef, ...) Of course we can only provide it in Object form and handle it in the constructor. But isn't it unfriendly to API users? And I'm not sure about it because the IndexableFieldType check logic is different depending on the value type. ex) BytesRef -> there is no check logic. CharSequence -> (!IndexableFieldType#stored() && IndexableFieldType#indexOptions() == IndexOptions.NONE) should be false. Reader -> (IndexableFieldType#indexOptions() == IndexOptions.NONE || !IndexableFieldType#tokenized()) and (IndexableFieldType#stored()) should be false.

In fact, I worried about using the Number class when writing this patch. I think API users may prefer int, float, double, ... rather than the Number class. What do you think about this?

Maybe use Objects.requireNonNull for the null checks?

I made it by referring to the current code structure. That method generates the NullPointerException, and current structure is the IllegalArgumentException.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 23 2019]

mikemccand commented 5 years ago

I think this new constructor would be misleading. For instance it might be tempting to use if you wanted to index doubles. But the only thing you can index with this constructor is the string representation of the double values, which is unlikely to be helpful.

I wonder whether we should make this class abstract instead so that it can't be instantiated directly, and potentially enhance some of its sub classes to address use-cases that were only doable with this Field class until now, such as having a text field with term vectors enabled.

[Legacy Jira: Adrien Grand (@jpountz) on Jun 24 2019]

mikemccand commented 5 years ago

Thank you for your reply! @jpountz :D

When I wrote a patch, the biggest advantage that I think is the FieldType conversion for Numeric types. Of course, it is not recommended way(it is already mentioned Expert in Javadoc), but it can give users FieldType customization.

ex) Currently NumericDocValuesField does not support the option for stored. So users need to add a separate StoredField. If we provide this patch, the user can get the characteristics of NumericDocValuesField and StoredField in a single field.

FieldType type = new FieldType();
type.setStored(true);
type.setDocValuesType(DocValuesType.NUMERIC);
type.freeze();

Document doc = new Document();
Field field = new Field("number", 1234, type);
doc.add(field);
indexWriter.addDocument(doc);

After that, we can use methods such as

Sort sort = new Sort();
sort.setSort(new SortField("number", SortField.Type.INT));

and

doc.get("number");

in the "number" field.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 25 2019]

mikemccand commented 5 years ago

Currently NumericDocValuesField does not support the option for stored. So users need to add a separate StoredField.

I'm viewing this one as a feature. :) Having the same field indexed and stored often makes sense because the analyzer takes care of converting the text into the internal representation that the index cares about. But for doc values, we expect users to do this work of converting the data to what works for Lucene. For instance if you are indexing doubles, you would likely index as a byte[] in points, use the long bits of the double in doc values, and use the double directly for storing, these are 3 different representations for the same data. My gut feeling is that trying to fold everything into a single field would make things more complicated rather than simpler.

[Legacy Jira: Adrien Grand (@jpountz) on Jun 26 2019]

mikemccand commented 5 years ago

Thank you for your reply! @jpountz :D

My gut feeling is that trying to fold everything into a single field would make things more complicated rather than simpler.

Yeah, I agree with some parts. I thought it would be nice to give more options to user. But if you think it can cause some confusions to users, I think remaining the current status is better.

About the TestField class, I think its class structure needs to be changed slightly.

It is no direct connection with this issue. But I plan to modify it like TestIntPoint, TestDoublePoint, ... Should not the test class name depend on the class name?

By the way, what do you think about my first comment in this issue? It looks a little bit ambiguous, but I'm curious about your opinion.

[Legacy Jira: Namgyu Kim (@danmuzi) on Jun 26 2019]