Closed toddfarmer closed 6 years ago
Note: Comment by Wes McKinney (wesm): +1, my outside perspective looking at patches that touch the templates is that it seems a recipe for brittleness that makes refactoring difficult. It would make the codebase easier for newcomers, too
Note: Comment by Li Jin (icexelloss): +1, the complexity of those templates is bothering me as well.
Note: Comment by Bryan Cutler (bryanc):
+1 on fixing this up also, however there are already quite a lot of base classes and interfaces. [~jnadeau]
, following our other discussion from ARROW-1361, would it be possible to make FixedVector
and VariableVector
generic classes that take a ArrowType as type parameter and then specialize for types as needed? NullableValueVectors
could remain a template, possibly simplified, or use a base class as mentioned above. I'm not entirely sure how feasible this is though..
Note: Comment by Jacques Nadeau (jnadeau): It seems like there a milestone here that we should figure out.
I see [~siddteotia]
assigned this to himself. I'd like to also make sure we have others involved in these subtasks and would like to break them out here. Other opinions on subtasks we should have? [~bryanc]
and/or [~icexelloss]
either of you interested in working on pieces of this? Definitely think it is overdue.
Note: Comment by Jacques Nadeau (jnadeau): Some initial thoughts on requirements:
Note: Comment by Wes McKinney (wesm):
I think [~alphalfalfa]
may be interested in helping, since he has dug through the templates a couple of times for adding more logical types to the hierarchy. It'll be worthwhile to harden the object hierarchy before we accumulate a lot more downstream users and risk unpleasant disruption. Certainly this should take place before a 1.0.0 release. Feedback from other active users like [~elahrvivaz]
will be helpful
Note: Comment by Gonzalo Ortiz (gortizja): +1
I have found that the current hierarchy and heavily use of delegation makes implementations like IntVector very slow compared to ByteBuffer. I have some JMH benchmarks where I iterate on a buffer, reading the content as ints to accumulate them. When using ByteBuffer it takes 17us to execute an iteration. With IntVector.Accessor it gets 100 us!
Note: Comment by Emilio Lahr-Vivaz (elahrvivaz): +1 agree with everything said so far. I'd prefer not to have any code generation at all, unless it provides some performance gains.
Note: Comment by Jingyuan Wang (alphalfalfa): I would certainly like to help. We should probably consider to add a benchmark suite to measure both the performance of assembling and retrieval of value vectors.
Note: Comment by Jacques Nadeau (jnadeau): I've created an initial list of subtasks. Who wants to work on what?
thx
Note: Comment by Li Jin (icexelloss): [~jacques@dremio.com] Thank much for putting this together. I will be out next week so I will avoid being on any critical path. But I'd like to check status and see what I can help when I get back.
Note: Comment by Wes McKinney (wesm): This might be a good opportunity to remove non-nullable container types. Nullability in C++ is a property of the metadata only, and computationally non-null arrays and nullable arrays with 0 null count are semantically equal. It would make the code simpler to maintain
Note: Comment by Gonzalo Ortiz (gortizja):
On the one hand, what [~wesmckinn]
is right. They could be the same structure. On the other, they are not exactly semantically equal. Non nullable structures doesn't return nulls, which can be used by some static checkers or languages like Kotlin.
There are also two performance issues, one on each model. They are just a theoretical reasoning (that may be wrong!), we would need to verify that some JMH test (and it can change from one JVM to another):
TL;RD: From the library developer perspective, I would like to have only one implementation. From a user perspective, I would like to have both implementations, ideally without delegation (which means repeated code on nullable-structure)
Note: Comment by Wes McKinney (wesm):
[~gortizja]
this is probably a discussion for a whole separate JIRA. A couple points:
Non nullable structures doesn't return nulls, which can be used by some static checkers or languages like Kotlin.
This is true, hypothetically, but I question whether this benefit can only be reaped by doubling the size of the type hierarchy. In the instances where you can choose an optimized code path for non-null data, you could elect to invoke that same code path when the null count is zero (which you will want to do regardless of the nullability). This is sort of a "composition over inheritance" argument – it's not to say that we can have some helper interfaces that specialize based on metadata, but as a property of the type hierarchy I am not sure if it's worth the increased code size.
So this is a point on having two implementations: one with nulls and one without them.
Since you can branch at the vector level, the performance implications are less obvious (benchmarks could help settle it). Somewhere you must dynamic dispatch to the correct kernel function for some analytical operation, and the dispatch could be based on null count.
[if] the nullable is just a decoration/delegation on the non-nullable, nullable structures may behave quite worst than we expect for the same reason
The performance issues you cite would appear mostly in non-vectorized code. If you're doing something performance sensitive, you're more likely to be working directly with the buffers, and selecting kernel functions based on the dynamic metadata (which can include nullability / null count). So if you implement Sum, you will want to have a sum with nulls and a sum without. Somewhere you have to decide which one to call. You will want to elect to call the SumWithoutNulls when you have a nullable vector but with zero null count anyway, so this actually makes your kernel dispatch more complex.
Note: Comment by Gonzalo Ortiz (gortizja): From a Java developer perspective, I can only implement a SumWithIntVector and prey to the JIT lord, hoping that the JVM discovers it can use vectorization... and JVM 8 it is not very good at it. But you are right. This sounds a little bit offtopic.
Note: Comment by Jacques Nadeau (jnadeau): There have been cases where we have found the existence of non-nullable vectors useful when doing decomposed calculations. For example, it makes things easier to manage if we want to do a tree assuming no nulls and then use a word intersection pattern to implement SQL's null output if any input is null semantic. This could be done without the separate classes but is probably easier with them. On the flipside, if we push to a fractured subclasses model for vectors, I think it is necessary to avoid composite vectors since that would make address management and allocation fairly challenging.
Note: Comment by Bryan Cutler (bryanc): I just have a couple of questions regarding these changes, apologies if these have been answered previously
1) Are there any guidelines as far as doing public API breaking changes? Should things be deprecated or is this something that is a little more open until Arrow 1.0 is released?
2) How exactly does performance play into the class hierarchies or code gen? Is it to make sure the data can be vectorized or are there more things to consider? I just would like to make sure I know what not to do from a performance standpoint. Is this something that would be described in the document in ARROW-1471?
Note: Comment by Wes McKinney (wesm): API stability is not guaranteed at the moment. We have been trying to deprecate things gracefully in C++. Since development efforts in Java have been fairly limited since the Arrow project started I think we need to gather a critical mass of active Java developers before declaring API stability.
Note: Comment by Bryan Cutler (bryanc):
That sounds fine to me in regards to API stability, thanks [~wesmckinn]
.
Note: Comment by Laurent Goujon (laurent): There's a Java effort to come up with a vector API to help with vectorization (as part of the Panama project). May be worth having a look at it.
The latest version (by John Rose) is available at http://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java There are also some slides from Intel folks giving some insights about it might help vectorization inside the JVM: www.oracle.com/technetwork/java/jvmls2016-graves-3125549.pptx
Note: Comment by Gonzalo Ortiz (gortizja):
[~laurentgo]
is great to have that in mind. I'm really interested in how to be sure that operations on Arrow Vectors are, indeed, vectorized at CPU level and once Panama is released, we will be able to control that. But AFAIK they don't have a clear API (yet) that Arrow can try to get close to. They just have a proposal that it is very likely to change.
Note: Comment by Wes McKinney (wesm): Since there are more child JIRAs here I'm moving this to 0.9.0 to close out remaining items in the next release
Note: Comment by Wes McKinney (wesm): Where does this work stand?
Note: Comment by Siddharth Teotia (siddteotia): This work and all the follow-up refactoring is complete. I am marking it as resolved. Setting the fix-version as 0.9.0 although some of the tasks were completed in 0.8.0.
Note: This issue was originally created as ARROW-1463. Please see the migration documentation for further details.
Original Issue Description:
The templates used in the java package are very high mainteance and the if conditions are hard to track. As started in the discussion here: https://github.com/apache/arrow/pull/1012, I'd like to propose that we modify the structure of the internal value vectors and code generation dynamics.
Create new abstract base vectors:
BaseFixedVector BaseVariableVector BaseNullableVector
For each of these, implement all the basic functionality of a vector without using templating.
Evaluate whether to use code generation to generate specific specializations of this functionality for each type where needed for performance purposes (probably constrained to mutator and accessor set/get methods). Giant and complex if conditions in the templates are actually worse from my perspective than a small amount of hand written duplicated code since templates are much harder to work with.