tdunning / t-digest

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means
Apache License 2.0
1.98k stars 226 forks source link

Point release beyond 3.2 #163

Closed tdunning closed 2 years ago

tdunning commented 3 years ago

Chris said this:

I looked through the issues and have them categorized and such. Here are my opinions- In short I think a point release is worth it now to get the fixes for mergetree. I think it is worth a point release as the fixes to the merge algorithm are substantial in both stability and correctness. AVLTree needs time and effort to stabilize - it does have some issues. For serialization, my recommendation is to do as little effort as necessary to come up with a decently forward compatible versioned POD scheme for mergetree alone. Java serialization is often a disaster and a major security risk and once things are in simple datatypes users can serialize however they like including using Java serialization. This should cut the effort required as it sidesteps the issue of how to actually save the data and avoids unnecessary dependencies. I don't agree Avro is a good choice, parquet is a good option for longterm storage due to its default good compression and wide language support and Arrow is a good option and very well supported across languages now. There are lots of other options not considered such as dendrite; virtually an endless number of them I think. So starting with a carefully built POD (or POJO as Java people call them) I think is reasonable and if people want protobufs or flatbuffers they can auto-generate them from the POD description. For reference, the Arrow project uses flatbuffers as its base serialization format and builds their schema and record pathways off of a basic flatbuffer message type. There appears to be a high-quality header-only c++ implementation that @derrickburns was working on that perhaps would be a good idea. A minimal clean C-library based on that implementation would then get you Python, R, Julia, Go, etc. with no further bindings. So that cuts your core tested language matrix to Java and C++/C. I messaged you on linked in and we can continue offline from there or we can continue the discussion here; whichever way works for you.

@cnuernber Let's continue the discussion about the point release here.

I think that your points are really excellent all around. If you don't incorporate them into the serialization document, I will go ahead and do so.

The key here, then, is just the point release. My feeling has been that the changes so far are important and don't make anything worse. That is basically an echo of what you came up with.

I have been wondering for some time whether the hand rolled release through mvn central directly is the best path. It certainly is doable and there are big benefits to being boring and stable, but I hate being the only one in the game.

Do you have any thoughts about modern patterns of releasing jars? I haven't kept up with fashions in quite a while.

cnuernber commented 3 years ago

Thanks, and it is great to see this moving forward!

Hand-rolled to maven I personally think is totally acceptable assuming it is cryptographically signed and compiled with Java-8 or older. I am not certain the normal Java or Scala-accepted pathways here really so I will have to defer to someone more involved in those communities.

tdunning commented 3 years ago

Yes. It would definitely be signed and compiled with Java 8.

Let's make and popularize a punch list of issues blocking the release if any, then. I don't know of any.

On Thu, Mar 25, 2021 at 9:56 AM Chris Nuernberger @.***> wrote:

Thanks, and it is great to see this moving forward!

Hand-rolled to maven I personally think is totally acceptable assuming it is cryptographically signed and compiled with Java-8 or older. I am not certain the normal Java or Scala-accepted pathways here really so I will have to defer to someone more involved in those communities.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/issues/163#issuecomment-807093125, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB5E6QYIGTXUINR3BOKNATTFNTLXANCNFSM4ZZXHYIQ .

cnuernber commented 3 years ago

I think it depends on if we intend to further address the AVL tree pathway which I don't recommend at this point. If there is no intention to do so then there are no blocking issues that I saw and there is clear benefit to a point release as it addresses at least a few issues while not introducing any new known ones.

My point list is empty.

tdunning commented 3 years ago

Other candidates that don't show up include:

https://arxiv.org/abs/2102.09299

On Thu, Mar 25, 2021 at 11:23 AM Chris Nuernberger @.***> wrote:

I think it depends on if we intend to further address the AVL tree pathway which I don't recommend at this point. If there is no intention to do so then there are no blocking issues that I saw and there is clear benefit to a point release as it addresses at least a few issues while not introducing any new known ones.

My point list is empty.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/tdunning/t-digest/issues/163#issuecomment-807231699, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB5E6QIQDSHY7BVJHBI3V3TFN5R7ANCNFSM4ZZXHYIQ .

tdunning commented 3 years ago

So.

The situation is that Travis runs are clean.

I added more quality tests and resolved a few lingering AVLTree issues.

Nobody on twitter spoke up about blockers.

A we ready to pull the trigger?

cnuernber commented 3 years ago

My only key issue (array index out of bounds) was resolved.