Open xingularity opened 2 weeks ago
This issue continues the discussion in https://github.com/solvcon/modmesh/discussions/380#discussioncomment-11154329 . @j8xixo12 , I am not entirely sure what "sequence of data" means. Could you please provide a definition to help with the specification?
Hi @yungyuc
I think I did not explain this clearly enough at the beginning of this issue. The data provided by this DataFrame should be ordered by index or timestamp; this is the sequence that users can expect. The current implementation does not guarantee this, so we need to enhance it.
By "ordered by index or timestamp", does it mean that a sorting function should be provided?
Yes and no. Users do not need to be aware of the sorting function. They should always expect to get ordered data when retrieving it. The sorting should be done either before the data is provided to the user or when the data is inserted into this DataFrame. This sorting should be completely transparent to the user.
I see. Then the goal is to provide the API and let application code decide whether to call it. modmesh contains both engine and application code. What we are working on now is the engine part.
I updated the issue description based on the discussions so far.
@j8xixo12 could you please review the discussions so far and share your thoughts?
Time series data should have guaranteed ordering, but we cannot assume that users will provide ordered data, even if the data has timestamps or indices. So I agree with @xingularity’s perspective that sorting should be done either before the data is provided to the user or when the data is inserted into this DataFrame. Thus DataFrame should provide a sorting function.
However, SimpleArray is a container, so I don’t think it should provide a sorting function. The sorting function should be a standalone function, with SimpleArray as an input argument.
Please consider that a time series is a special case of a data frame. By providing a sorting function on the arrays in a data frame and a reordering function on the data frame itself, monotonicity can be realized. The data frame can then be used as a time series.
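As a rough illustration of that point, the sketch below sorts a hypothetical timestamp column, applies the same order to a data column, and then checks that the timestamps are non-decreasing. The helper name `is_monotonic` and the plain Python lists are assumptions made for illustration; they are not part of modmesh.

```python
def is_monotonic(index):
    # True when every element is >= its predecessor (non-decreasing order).
    return all(a <= b for a, b in zip(index, index[1:]))


# Hypothetical columns of a data frame, each stored separately.
timestamps = [5, 1, 3]
values = [50, 10, 30]

# Positions that would sort the timestamp column.
order = sorted(range(len(timestamps)), key=timestamps.__getitem__)

# Apply the same order to every column so that rows stay aligned.
timestamps = [timestamps[i] for i in order]
values = [values[i] for i in order]
```

Once the index column passes such a check, the frame satisfies the ordering that a time series needs.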
An array can certainly use a sorting function, like numpy.ndarray.sort(). It's been there for decades.
It's OK to make free functions for sorting, but that incurs significant maintenance effort. We should not do it right now.
Hi @yungyuc and @j8xixo12
The sorting function we need is different from that. In the current prototype, each column, including the index column, is stored in a different container, and the data columns should be sorted according to the data in the index column. The actual scenario could be like this. I propose that we provide two helper functions: one is argsort, and the other is an interface to retrieve array data with a given index sequence.
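A minimal sketch of the two proposed helpers, written with plain Python lists rather than SimpleArray. The function names `argsort` and `take` and the sample columns are assumptions for illustration only.

```python
def argsort(values):
    # Positions that would sort `values`. sorted() is stable, so equal
    # keys keep their original relative order.
    return sorted(range(len(values)), key=values.__getitem__)


def take(values, order):
    # Retrieve the elements of `values` in the given index sequence.
    return [values[i] for i in order]


# Hypothetical index column and data column, stored in separate containers.
index = [30, 10, 20]
value = [3.0, 1.0, 2.0]

order = argsort(index)
sorted_index = take(index, order)
sorted_value = take(value, order)
```

The same `order` can be reused for every data column, so the index column only has to be sorted once.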
Yes, we need SimpleArray.argsort() (which should provide only an out-of-place mode) working like numpy.argsort(). But the need for argsort() does not remove the need for SimpleArray.sort(). argsort() may be prototyped like:
>>> data = [2, 3, 1]
>>> _tmp = list((v, i) for i, v in enumerate(data))
>>> print(_tmp)
[(2, 0), (3, 1), (1, 2)]
>>> _tmp.sort()
>>> print(_tmp)
[(1, 2), (2, 0), (3, 1)]
>>> argindices = list(i for v, i in _tmp)
>>> print(argindices)
[2, 0, 1]
There should be both SimpleArray.sort() and SimpleArray.argsort() in Python and both SimpleArray::sort() and SimpleArray::argsort() in C++. The Python functions are simply wrappers to the C++ workers. But the two C++ workers should share code.
At this moment I do not want to provide free-function interfaces to the sorting and reordering for maintenance reasons. Keeping them as class member functions takes much less maintenance effort.
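One way the two member functions could share code is sketched below in Python, following the (value, position) pairing from the prototype above. The class `ArraySketch`, its `inplace` flag, and `_sorted_pairs` are hypothetical names used only for illustration; they are not the modmesh API, and the real workers would live in C++.

```python
class ArraySketch:
    """Hypothetical stand-in for SimpleArray; not the modmesh API."""

    def __init__(self, values):
        self._values = list(values)

    def _sorted_pairs(self):
        # Shared worker: pair each value with its position, then sort the pairs.
        return sorted((v, i) for i, v in enumerate(self._values))

    def argsort(self):
        # Out-of-place only: returns a new list of positions.
        return [i for _, i in self._sorted_pairs()]

    def sort(self, inplace=True):
        # Reuses the shared worker and keeps only the values.
        ordered = [v for v, _ in self._sorted_pairs()]
        if inplace:
            self._values[:] = ordered
            return None
        return ArraySketch(ordered)

    def tolist(self):
        return list(self._values)


arr = ArraySketch([2, 3, 1])
indices = arr.argsort()            # the array itself is left unchanged
copy = arr.sort(inplace=False)     # out-of-place: arr is still unsorted
```

Both functions go through the same worker, so a change to the comparison logic only has to be made in one place.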
I agree with the concept shown in the prototype code. SimpleArray.sort() and SimpleArray.argsort() do share a common part, and we should not reinvent the wheel.
Thanks for updating the issue description on 9th Nov. I removed my update on 8th Nov from the description since it's outdated. We can use the latest description to develop.
(Updated on 9th Nov)
To provide ordered data from a data frame potentially storing a large volume of data, an efficient sorting capability needs to be built. It can be built by providing sort() and argsort() helper functions on SimpleArray, and a reorder() function in the data frame class in Python. A proposed sequence of ordering data is here.
SimpleArray.sort() and SimpleArray.argsort() should be provided in both Python and C++, and the Python functions are just wrappers of the C++ implementation. The sorting function should work like numpy.ndarray.sort but provide both in-place and out-of-place options. SimpleArray.argsort() should be provided only out-of-place.
The reordering helper should reorder one, multiple selected, or all columns in a data frame.
SimpleArray.sort()
SimpleArray.argsort()
DataFrame.reorder()
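The reordering helper described above might look roughly like the following sketch. `FrameSketch`, its `names` parameter, and the sample columns are assumptions for illustration; the actual DataFrame.reorder() signature is not specified in this issue.

```python
class FrameSketch:
    """Hypothetical columnar frame; names are illustrative, not modmesh's."""

    def __init__(self, columns):
        # Each column is stored in its own container.
        self._columns = {name: list(col) for name, col in columns.items()}

    def __getitem__(self, name):
        return self._columns[name]

    def reorder(self, order, names=None):
        # names=None reorders every column; a string selects one column;
        # a list of strings selects several.
        if names is None:
            names = list(self._columns)
        elif isinstance(names, str):
            names = [names]
        for name in names:
            column = self._columns[name]
            self._columns[name] = [column[i] for i in order]


frame = FrameSketch({"time": [3, 1, 2], "x": [30.0, 10.0, 20.0]})

# Compute the order once from the index column, then apply it to all columns.
order = sorted(range(len(frame["time"])), key=frame["time"].__getitem__)
frame.reorder(order)
```

Reordering only a subset of columns leaves rows misaligned, so application code would normally pass all columns unless it has a specific reason not to.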
(Initiated on 6th Nov)
Original statement
TimeSeriesDataFrame is to provide data in the correct sequence, ordered by index/timestamp. The current prototype implementation only reads text and organizes data in a columnar format; it does not guarantee the sequence of the data when retrieving it. A sorting algorithm is required to guarantee the order of the data.