molotch commented 2 years ago

Being able to use Azurite as input/output for blob/table streams would be very nice.

Fleid commented 2 years ago

This is one I've also been thinking about. To add to our work item, would you mind sharing what scenario this would enable for you, or what problem it would solve?

molotch commented 2 years ago

My scenario is to try out Table Storage partitioning on both the input and output side. I use WSL/Docker for everything else, so it would be a great fit. Also, WSL2 still does not work with VPNs, which limits the use of Azure. And of course access to Azure is itself a limiting factor.

I can also see the use in CI/CD pipelines to run tests.
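To make the request concrete, here is a minimal sketch of the connection string a client would use to reach Azurite, assuming Azurite's published defaults (account `devstoreaccount1`, its well-known development key, ports 10000/10001/10002 for blob/queue/table). The helper names are hypothetical, and nothing here implies the ASA local tooling can consume such a connection string today; that is exactly what is being asked for.

```python
# Sketch only: values are Azurite's documented development defaults.
# Adjust host/ports if your Docker container maps them differently.
AZURITE_ACCOUNT = "devstoreaccount1"
AZURITE_KEY = (
    "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/"
    "K1SZFPTOtr/KBHBeksoGMGw=="
)
DEFAULT_PORTS = {"Blob": 10000, "Queue": 10001, "Table": 10002}


def azurite_connection_string(host="127.0.0.1", ports=None):
    """Build a storage connection string pointing at a local Azurite instance."""
    ports = {**DEFAULT_PORTS, **(ports or {})}
    parts = [
        "DefaultEndpointsProtocol=http",
        f"AccountName={AZURITE_ACCOUNT}",
        f"AccountKey={AZURITE_KEY}",
    ]
    for service, port in ports.items():
        parts.append(f"{service}Endpoint=http://{host}:{port}/{AZURITE_ACCOUNT}")
    return ";".join(parts) + ";"


def parse_connection_string(conn):
    """Split a connection string into a dict (hypothetical helper for inspection)."""
    # Split on the first '=' only, because the base64 key ends with '=='.
    return dict(item.split("=", 1) for item in conn.split(";") if item)


conn = azurite_connection_string()
parsed = parse_connection_string(conn)
print(parsed["TableEndpoint"])
```

The same string works from WSL/Docker as long as the Azurite ports are published to the host.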

Fleid commented 2 years ago

If I remember correctly, the local testing mode doesn't generate a partitioned topology. It behaves as if it were running on a single node (which is effectively what is happening) and processes all the data regardless of partitioning.

It's a good tool for development and unit testing, but it's not really appropriate for performance or integration testing.

Also, table storage is not a supported input at the moment. From a storage account we only support blob storage, both for streaming and reference data.

As for CI/CD, I'm not sure I see the gap in our current capabilities? You can have a first stage that plays the unit tests on files via the npm package, then builds. A second stage that deploys to a live testing environment and does the integration testing, and a third stage that deploys to production.
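The three stages described above could be sketched roughly like this. It is an illustration, not a recommended pipeline: the `azure-streamanalytics-cicd` command names and flags are assumptions to check against the package's documentation, and `deploy.sh` / `run_integration_tests.sh` are hypothetical placeholders for whatever your deployment uses.

```python
import subprocess

# Assumed CLI shape for the npm package; verify against its docs before use.
CICD = "azure-streamanalytics-cicd"

STAGES = [
    ("unit-test-and-build", [
        [CICD, "test", "-project", "./asaproj.json", "-outputPath", "./test-output"],
        [CICD, "build", "-project", "./asaproj.json", "-outputPath", "./deploy"],
    ]),
    ("integration", [
        ["./deploy.sh", "testing"],        # hypothetical deployment script
        ["./run_integration_tests.sh"],    # hypothetical test script
    ]),
    ("production", [
        ["./deploy.sh", "production"],     # hypothetical deployment script
    ]),
]


def default_runner(cmd):
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode


def run_pipeline(stages, runner=default_runner):
    """Run stages in order and stop at the first failing command.

    Returns the names of the stages that completed fully.
    """
    completed = []
    for name, commands in stages:
        for cmd in commands:
            if runner(cmd) != 0:
                return completed
        completed.append(name)
    return completed
```

Passing a fake `runner` makes it possible to dry-run the stage ordering without the CLI installed; in a real setup each stage would more likely be a separate job in your CI system.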

molotch commented 2 years ago

Haven't looked closely at the npm package yet, so it might cover CI/CD like you say.

Unless there are partition-based functions (like Spark's mapPartitions) to run, I guess single-node input partitioning isn't that important. But output partitioning would still determine how the data is structured on the output. So I guess that's still in play even on a single node?

If I run a query lacking any aggregations (making node colocation of data unnecessary from a logical standpoint), i.e. just parsing events and storing them, on a multi-node cluster where only the output is partitioned (not the input), would that trigger a reshuffle so that all writes are done by one node, or would each node output its own writes to each of the partitions when the trigger criteria are met?

Fleid commented 2 years ago

> Unless there are partition-based functions (like Spark's mapPartitions) to run, I guess single-node input partitioning isn't that important. But output partitioning would still determine how the data is structured on the output. So I guess that's still in play even on a single node?

Yes, but the query runner we ship in the VSCode Extension and the npm package doesn't come with the complete output adapter stack. It can only output to a single JSON file. It is designed for local development and unit testing after all, and that way it's more lightweight.

So you are right for a single node running in the service (1, 3 and 6 SU), but it doesn't apply "offline" (VSCode, npm).
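As a toy illustration of that point (output partitioning still shapes the result even when one node does all the work), here is a small sketch. It is not how the ASA engine is implemented; `run_single_node`, the event shape, and the `deviceId` key are all made up for the example.

```python
from collections import defaultdict


def run_single_node(input_partitions, output_key):
    """Process every input partition in one worker and bucket results by output key."""
    output = defaultdict(list)
    # A single node reads all input partitions, in order of partition id.
    for partition_id, events in sorted(input_partitions.items()):
        for event in events:
            # The input partition id plays no role in where the event lands;
            # only the value of the output partition key does.
            output[event[output_key]].append(event)
    return dict(output)


input_partitions = {
    0: [{"deviceId": "a", "temp": 20}, {"deviceId": "b", "temp": 21}],
    1: [{"deviceId": "a", "temp": 22}],
}

result = run_single_node(input_partitions, "deviceId")
for key, events in result.items():
    print(key, [e["temp"] for e in events])
```

Events for device `a` arrive on two different input partitions but end up in the same output bucket, which is the structure the output side cares about.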

> If I run a query lacking any aggregations (making node colocation of data unnecessary from a logical standpoint), i.e. just parsing events and storing them, on a multi-node cluster where only the output is partitioned (not the input), would that trigger a reshuffle so that all writes are done by one node, or would each node output its own writes to each of the partitions when the trigger criteria are met?

There is no single answer to that. Our approach depends on the type of output (Event Hubs, Cosmos DB, storage...), the query, the current SU, whether we're batching events out or not, etc.

Note that if the input is not partitioned at all, then you won't be able to scale beyond one node (6 SU). So this discussion applies to an input that is partitioned but not aligned to the output partitioning scheme, or one that is completely aligned but where you don't scale to one node per partition.

Fleid commented 2 years ago

microsoft/vscode-asa: Add support for local debugging using Azurite as input/output #59
