mmore500 / hstrat-surface-concept

Algorithm development for lightweight fixed-length recency-proportional hereditary stratigraphy data structure
MIT License
0 stars 1 forks source link

SOSA Review 1 #116

Open mmore500 opened 1 week ago

mmore500 commented 1 week ago

---------------------- REVIEW 1 --------------------- SUBMISSION: 122 TITLE: Structured Downsampling for Fast, Memory-efficient Curation of Online Data Streams AUTHORS: Matthew Andres Moreno, Luis Zaman and Emily Dolson

----------- Overall evaluation ----------- Summary:

This submission tackles the problem of sampling in a streaming setting: Given a stream and limited memory storage, decided whether each element of the stream should be stored (potentially overriding previously stored elements). The goal is to evenly space the stored elements \emph{in time}, that is, the value of the elements do not matter, but rather their position in the stream: for a stream of length 500 with 4 elements to store, the aim is to store the 100th, 200th, 300th, and 400th element.

Authors propose three different algorithms to optimize for three different cost functions:

Their algorithm relies on the sequence A007814 of the OEIS, in which the n-th term a(n) is the 2-adic valuation of n (the number of zeroes at the end of its binary representation). Its first elements are : 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 5... They note that the gap between two zeroes is 2, between two ones is 4, between two twos is 8, and more generally between 2 numbers i is 2^{i+1}. Hence for their first cost function, they aim to store all elements n who satisfy a(n)>v(n) for some threshold function v. They show that this yields a 2-approximation.

For the second and third cost function, they instead store, for each integer k, the smallest (respectively largest) encountered integer n such that a(n) = k. They show an upper bound on the cost function.

They finally give an implementation of their algorithms.

Evaluation:

The simplicity of the algorithms is enjoyable and definitely falls under the scope of SOSA. It also shows that the authors put in a lot of effort into writing it and making the figures. The way the paper is written however, is far from the standard in theoretical computer science. While the paper falls short on many points that could be considered cosmetic, some other ones are more serious.

I strongly recommend authors to read a few more SOSA accepted papers from previous years. Each paper is roughly 10 pages long, and their emphasis is on clarity, making them an enjoyable read, so it shouldn't take a lot of time. Moreover, it will give authors an example of what's usually expected of a SOSA submission.

Comments: My main issue is that the problem definition is not clearly stated. Just after you define your model and definitions, in section 2.3, you should explain what your cost functions are, in a section/subsection called "problem definition". This should happen before you start introducing your solution with the Time Hanoi value. It must also be defined correctly mathematically, for example: "We aim to minimize \max_{\breve T \le T} G(\breve T)". Similarly for the other two cost functions.

The figures are overcomplicated. Aim to simplify them as much as possible.

For SOSA, the aim is to make things as simple and easy to read as possible. Don't mix definitions and results, like in section 2.2 you define the time and the talk about which algorithm needs a time limit or not. You talk about the "goal" of an algorithm, which is different from its objective function. "Ingestion time" when you mean what is regularly called "position". And overall I believe the theoretical contribution can be presented way more clearly and concisely, and definitely can fit into 10 pages, far from the 33 pages of the current submission.

Implementation is a plus, but definitely not a big thing in a theory conference like SOSA. A simple paragraph at the end of the submission is enough.

Why do you use so many colors for the symbols you use? If there's a particular reason, you can state it, it seems to me like a lot of effort for not much increased comprehensibility.

mmore500 commented 6 days ago