kamu-data / kamu-cli

Next-generation decentralized data lakehouse and a multi-party stream processing network
https://kamu.dev

Inefficiencies when listing dataset flows via GraphQL query #856

Open zaychenko-sergei opened 1 week ago

zaychenko-sergei commented 1 week ago

A casual dataset flows view that lists about 10 flows runs for ~1.36s and performs highly inefficient repository access operations (see Grafana trace).

There are over 7000 spans, including numerous calls to get_active_polling_source for the very same dataset (the only one). Internally, this causes a lot of metadata chain iteration activity: multiple S3 files are read, and then the cached version is re-used.
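
To illustrate the repeated-lookup pattern described above, here is a minimal sketch of request-scoped memoization in Rust. It is not the kamu-cli implementation: PollingSourceCache, the loader closure, and loads_for_flow_listing are hypothetical names, and String stands in for whatever type get_active_polling_source actually returns. The idea is only that the expensive lookup runs once per dataset within a request, however many flows refer to that dataset.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;

/// Hypothetical request-scoped cache for a per-dataset lookup.
/// `String` is a placeholder for the real polling-source type.
pub struct PollingSourceCache<F>
where
    F: Fn(&str) -> Option<String>,
{
    loader: F,
    cache: RefCell<HashMap<String, Option<String>>>,
    loads: Cell<usize>,
}

impl<F> PollingSourceCache<F>
where
    F: Fn(&str) -> Option<String>,
{
    pub fn new(loader: F) -> Self {
        Self {
            loader,
            cache: RefCell::new(HashMap::new()),
            loads: Cell::new(0),
        }
    }

    /// Returns the cached value if present; otherwise calls the loader once
    /// and remembers the result (including a `None` result).
    pub fn get(&self, dataset_id: &str) -> Option<String> {
        if let Some(hit) = self.cache.borrow().get(dataset_id) {
            return hit.clone();
        }
        self.loads.set(self.loads.get() + 1);
        let value = (self.loader)(dataset_id);
        self.cache
            .borrow_mut()
            .insert(dataset_id.to_owned(), value.clone());
        value
    }

    /// Number of times the underlying loader was actually invoked.
    pub fn loads(&self) -> usize {
        self.loads.get()
    }
}

/// Simulates resolving the polling source once per listed flow,
/// all flows belonging to the same (hypothetical) dataset.
pub fn loads_for_flow_listing(flow_count: usize) -> usize {
    // Assumption: this closure replaces the expensive metadata chain scan.
    let cache = PollingSourceCache::new(|id: &str| Some(format!("polling-source-of-{id}")));
    for _ in 0..flow_count {
        let _ = cache.get("dataset-1");
    }
    cache.loads()
}
```

With this shape, listing 10 flows of a single dataset triggers one loader call instead of ten. The real service is presumably async and shared across resolvers, so an actual fix would need a thread-safe, async-aware variant.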

Possible solutions:

In addition, the same Grafana trace uncovered the need for #850.