nats-io / nats-spark-connector

Apache License 2.0

Add the option to have batch pulls #10

Open AlejandroUPC opened 1 year ago

AlejandroUPC commented 1 year ago

What motivated this proposal?

Is there a way to micro-batch the actual NATS streams? Say I just want to pull 20,000 messages, or however many are in the queue if there are fewer. Something like "pull X or MAX(queue_length)".

It's understandable that the connector is in its early stages, but this would be a common use case. Having PySpark examples would also help.

The idea is that even if NATS is a streaming technology, it makes perfect sense to work with micro-batches and standard Spark DataFrames instead of StreamReaders, etc.

Something similar to what is implemented in this Python library here.

What is the proposed change?

A flavour, or method, to pull a fixed number of messages into a Spark DataFrame, avoiding the streaming APIs entirely once the messages are pulled.

Who benefits from this change?

Anyone who would rather work from examples written in Python.

What alternatives have you evaluated?

Users with the specific use case of batching NATS messages who want to avoid being forced to work with the streaming APIs.

stoddabr commented 2 days ago

This library is a thin layer over the nats.java library. It might be better to use that directly if you don't need streaming: https://github.com/nats-io/nats.java/tree/main?tab=readme-ov-file#fetch

You can run it from Scala fairly easily. Happy to help debug code if you run into issues.
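
For what it's worth, a minimal Scala sketch of that fetch-based approach might look like the following. This is not part of the connector; it assumes a running NATS server at the default URL, the nats.java client and Spark on the classpath, and an existing JetStream stream matching the subject. The subject (`orders.>`), durable name (`spark-batch`), and column names are all illustrative:

```scala
import io.nats.client.{Nats, PullSubscribeOptions}
import java.time.Duration
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.SparkSession

object BatchPullExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nats-batch-pull").getOrCreate()

    val nc = Nats.connect("nats://localhost:4222")
    val js = nc.jetStream()

    // Durable pull consumer; stream/consumer names are placeholders
    val opts = PullSubscribeOptions.builder().durable("spark-batch").build()
    val sub  = js.subscribe("orders.>", opts)

    // fetch() blocks until up to 20,000 messages arrive, or the wait expires,
    // so you get "X or whatever is in the queue" semantics
    val msgs = sub.fetch(20000, Duration.ofSeconds(5)).asScala.toSeq
    msgs.foreach(_.ack())

    // A plain batch DataFrame -- no streaming APIs involved after the pull
    val df = spark
      .createDataFrame(msgs.map(m => (m.getSubject, new String(m.getData, "UTF-8"))))
      .toDF("subject", "payload")
    df.show()

    nc.close()
    spark.stop()
  }
}
```

The key piece is `fetch(batchSize, maxWait)`, which returns a plain list rather than a stream, so the result can be handed to `createDataFrame` like any other batch source.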