Open rlamy opened 4 weeks ago
Good point. We need to keep information about the order and use it in the following statements when it makes sense. We should use the underlying IDs for that.
It does not seem possible to generate a sequential list of IDs in BigQuery (more details here: https://github.com/iterative/studio/issues/10635#issuecomment-2381829809). I am currently stuck on how we can provide deterministic results from all chains in an efficient way.
With that said, is there a clear use case for ordering an intermediate step in a chain? I can find superficial examples in tests such as this:
```python
def name_len(path):
    return (len(posixpath.basename(path)),)

DataChain.from_storage(path, session=session).order_by("file.path").map(
    name_len, params=["file.path"], output={"name_len": int}
).save(ds_name)
```
but do we support cumulative values being calculated across rows? Wouldn't that be done via `group_by`? What is the use case that we are trying to support? Can we justify the complexity added to support this? Note: I can understand `order_by` preceding `limit` (this should be handled by the generated SQL anyway).
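To make the distinction concrete: a cumulative value only has meaning relative to an order, whereas `group_by` aggregates without one. A minimal sqlite3 sketch (table and column names are made up for this example, not DataChain internals):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 3), ("b", 1), ("c", 2)])

# A running total is only well-defined relative to the ORDER BY inside OVER ();
# the outer ORDER BY makes the result row order deterministic too.
rows = conn.execute(
    "SELECT name, SUM(val) OVER (ORDER BY val) AS running FROM t ORDER BY val"
).fetchall()
print(rows)  # [('b', 1), ('c', 3), ('a', 6)]
```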
> For instance, in SQL, the order of the results from a SELECT query is undefined unless there is an ORDER BY clause.
We could patch `order_by` before `select` or `select_except` in the following way:
```python
@detach
def select(self, *args, **kwargs) -> "Self":
    named_args = [v.label(k) for k, v in kwargs.items()]
    query = self.clone()
    last_step = query.steps[-1] if query.steps else None
    query.steps.append(SQLSelect((*args, *named_args)))
    if isinstance(last_step, SQLOrderBy):
        query.steps.append(SQLOrderBy(last_step.args))
    return query
```
but I think it would be better to explicitly state that `.select()` doesn't preserve ordering (as suggested by @rlamy) and then patch `collect` directly.
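For what it's worth, the step-patching idea above can be sketched in plain Python (the classes here are simplified stand-ins for the real `SQLSelect`/`SQLOrderBy` steps, not the actual DataChain code):

```python
from dataclasses import dataclass

@dataclass
class OrderBy:
    args: tuple

@dataclass
class Select:
    args: tuple

def select_preserving_order(steps, *cols):
    """Append a Select step; if the previous step was OrderBy, re-append it
    so the ordering is applied to the outer (wrapping) query as well."""
    last = steps[-1] if steps else None
    steps = steps + [Select(cols)]
    if isinstance(last, OrderBy):
        steps.append(OrderBy(last.args))
    return steps

steps = select_preserving_order([OrderBy(("a",))], "a")
print([type(s).__name__ for s in steps])  # ['OrderBy', 'Select', 'OrderBy']
```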
It does not seem possible to generate a sequential list of IDs in BigQuery (more details https://github.com/iterative/studio/issues/10635#issuecomment-2381829809). I am currently stuck on how we can provide deterministic results from all chains in an efficient way.
I'd also like to understand why we are trying to stamp out the following SQL behaviour:
> For instance, in SQL, the order of the results from a SELECT query is undefined unless there is an ORDER BY clause.
Why must the order of queries be deterministic in DataChain even when an order has not been specified?
Edit: There are remnants in the system of this ordering behaviour but it looks like it is completely untested (see https://github.com/iterative/datachain/pull/489).
It looks like there is some misunderstanding in the ordering.
First, let's clarify basic assumptions:
- Rows are not ordered if the order is not specified. However, a unique `sys.id` is present in any table and is used for UDFs.
  - it's like a row number without ordering - `ROW_NUMBER() OVER () AS row_num`
- If the last statement in a chain before saving/persisting was `order_by` - `sys.id` represents the order.
  - it's like a row number with ordering - `ROW_NUMBER() OVER (ORDER BY aaaa DESC) AS row_num`
- `chain.order_by("a").select("a").save("my_ds")` is not ordered since `order_by` is not the last statement.
- We use order by `sys.id` (like `SELECT * FROM t ORDER BY sys__id LIMIT 20`) for showing datasets in the UI to make the result look stable even for not ordered datasets.
- (I'm not sure it's implemented yet) If a dataset was ordered, e.g. `chain.order_by("val").save("ds_ordered")`, then next time the order can be reused in some queries: `DataChain.from_dataset("ds_ordered").limit(3)` returns the top 3 with respect to the order.
Please LMK if you disagree with some of these or you'd like to add something.
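The two row-numbering modes described above can be sketched with sqlite3 window functions (illustrative only; the actual engines and table are different):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(30,), (10,), (20,)])

# "Row number without ordering": unique numbers with no order guarantee.
unordered = conn.execute(
    "SELECT val, ROW_NUMBER() OVER () AS row_num FROM t"
).fetchall()
assert sorted(rn for _, rn in unordered) == [1, 2, 3]  # unique, arbitrary

# "Row number with ordering": numbering follows the ORDER BY inside OVER ().
ordered = conn.execute(
    "SELECT val, ROW_NUMBER() OVER (ORDER BY val DESC) AS row_num "
    "FROM t ORDER BY row_num"
).fetchall()
print(ordered)  # [(30, 1), (20, 2), (10, 3)]
```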
Thanks for clarifying those basic assumptions. I think I have identified some discrepancies between the assumptions and the current implementation.
> - Rows are not ordered if the order is not specified. However, a unique `sys.id` is present in any table and is used for UDFs.
>   - it's like a row number without ordering - `ROW_NUMBER() OVER () AS row_num`
> - If the last statement in a chain before saving/persisting was `order_by` - `sys.id` represents the order.
>   - it's like a row number with ordering - `ROW_NUMBER() OVER (ORDER BY aaaa DESC) AS row_num`
I don't think this is strictly true on the ClickHouse side because we have this definition for `sys.id`:
```python
Column("sys__id", UInt64, primary_key=True, server_default=func.rand64()),
```
which means that if it is not explicitly set we end up with a random value. We do use `rowNumberInAllBlocks() + 1` (close to `row_number() over ()`) when copying a query into a table, but this is not generally how `sys.id` is created.
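A small Python sketch of why that matters (simulating the two id schemes; `rand64()` and `rowNumberInAllBlocks()` are the real ClickHouse functions, imitated here for illustration):

```python
import random

rows = ["r1", "r2", "r3", "r4", "r5"]

random.seed(42)  # fixed seed, for the illustration only
random_ids = [random.getrandbits(64) for _ in rows]  # like rand64(): unique but arbitrary
sequential_ids = list(range(1, len(rows) + 1))       # like rowNumberInAllBlocks() + 1

# With a random default there is no relationship between id order and
# insertion order, so ORDER BY sys__id cannot reproduce the original order.
# The sequential scheme reproduces it by construction.
print(sequential_ids)  # [1, 2, 3, 4, 5]
```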
> We use order by `sys.id` (like `SELECT * FROM t ORDER BY sys__id LIMIT 20`) for showing datasets in the UI to make the result look stable even for not ordered datasets.
The preview will always appear stable now because we persist the dataset's first 20 rows in the metastore against the dataset version and use that instead of loading the dataset each time. A test of this passes for both SQLite and ClickHouse, BUT it seems that it only passes by chance, because to create the preview we do this:
```python
if not dataset_version.preview:
    values["preview"] = (
        DatasetQuery(name=dataset.name, version=version, catalog=self)
        .limit(20)
        .to_db_records()
    )
```
which roughly translates to `SELECT * FROM table LIMIT 20`, so this is not technically ordered either.
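A minimal sqlite3 sketch of what a deterministic preview query could look like (a hypothetical fix, not the actual DataChain code; the table name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (sys__id INTEGER, val TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)", [(3, "c"), (1, "a"), (2, "b")])

# Current shape is effectively `SELECT * FROM tbl LIMIT 20` (order unspecified).
# Ordering on sys__id before the LIMIT makes the preview deterministic:
preview = conn.execute(
    "SELECT sys__id, val FROM tbl ORDER BY sys__id LIMIT 20"
).fetchall()
print(preview)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```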
> (I'm not sure it's implemented yet) If a dataset was ordered `chain.order_by("val").save("ds_ordered")` then next time the order can be reused in some queries `DataChain.from_dataset("ds_ordered").limit(3)` - returns top 3 with respect to the order.
I have tested this and it seems to work but only by chance as well (no ordering provided in the SQL).
As you can see from https://github.com/iterative/datachain/pull/489 we have some logic in the code base where we order by `sys.id`, but this is not tested and seems broken in places after going through the above.
It makes me think that the fact of ordering, as well as the ordering criteria (columns, DESC/ASC), must be stored together with a dataset during `save()`.
IMO, this is worth investigating, but as stated elsewhere this seems like a difficult problem to solve, especially across databases and at scale (see https://github.com/iterative/studio/issues/10635#issuecomment-2403630921 for some context). Gathering all the obvious use cases (like "streaming a dataset to ML training with a given order") would be a great start, and agreeing that we do not guarantee the order of "unordered chains" would be very helpful.
From a long discussion in the planning meeting:
For now, the order of a dataset will only be guaranteed when an `order_by` statement is present at the end of a chain. We may return to optimising this later.
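The agreed rule can be sketched as a simple predicate over a chain's step names (simplified stand-ins, not the real DataChain step classes):

```python
def order_is_guaranteed(steps: list[str]) -> bool:
    """Order is guaranteed only when `order_by` is the final step of the chain."""
    return bool(steps) and steps[-1] == "order_by"

print(order_is_guaranteed(["order_by", "select"]))  # False: select wraps the ORDER BY
print(order_is_guaranteed(["select", "order_by"]))  # True: ORDER BY is outermost
```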
Perfect, thanks @mattseddon . It should simplify things a lot. Thanks for pushing this forward.
Can we close this / repurpose this btw?
We need to specify clearly how `.order_by()` interacts with other methods. For instance, in SQL, the order of the results from a SELECT query is undefined unless there is an ORDER BY clause. Currently, `chain.order_by("a").select("a")` generates SQL that looks like `SELECT a FROM (SELECT a FROM some_table ORDER BY a)`, which is therefore technically unordered, even though it'll usually return sorted results. While it might be fine to say that `.select()` doesn't preserve ordering, it seems obvious that `chain.order_by("a").collect("a")` should return the "a" column in sorted order, and `chain.collect("a")` is currently implemented as `chain.select("a").collect()`...
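The SQL shape in question can be reproduced with sqlite3 (table/column names as in the example above). An inner ORDER BY is not required to survive the outer SELECT, so only the form with the ORDER BY on the outermost query is guaranteed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (a INTEGER)")
conn.executemany("INSERT INTO some_table VALUES (?)", [(2,), (3,), (1,)])

# Technically unordered: the ORDER BY applies only inside the subquery, so
# the engine is free to return the outer rows in any order.
unordered_sql = "SELECT a FROM (SELECT a FROM some_table ORDER BY a)"

# Guaranteed: ORDER BY on the outermost SELECT.
ordered_sql = "SELECT a FROM (SELECT a FROM some_table) ORDER BY a"

print(conn.execute(ordered_sql).fetchall())  # [(1,), (2,), (3,)]
```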