numaproj / numaflow

Kubernetes-native platform to run massively parallel data/streaming jobs
https://numaflow.numaproj.io/
Apache License 2.0

ISB fails to recover consumer for vertex stream #2025

Open anotherfiz opened 2 months ago

anotherfiz commented 2 months ago

Summary

ISB does not recover Numaflow pipeline consumers if the system goes down in an unclean state (e.g., power loss). When ISB attempts to recover a consumer for a stream, I get the following error:

Error unmarshalling consumer metafile /meta.inf: unexpected end of JSON input

To Reproduce

  1. apply numaflow pipeline
  2. while numaflow pipeline is being configured, kill system uncleanly
  3. occasionally, ISB will start with the above error

Expected behavior

When the ISB is unable to recover a consumer, I would expect it to kick the associated vertices and restart the pipeline cleanly.

syayi commented 2 months ago

@anotherfiz thanks for raising this. We'll take a look and get back.

Interesting to see you're deploying Numaflow on an edge device. If you're open to sharing, we'd love to hear your feedback.

anotherfiz commented 2 months ago

Sure - we use Numaflow to simplify the interfaces between different signal processing pods (acquisition, demod, decode, etc.). In the past we have used RMQ. Besides this issue, it has been fantastic: lightweight, fast, and easy to set up.

whynowy commented 2 months ago

@anotherfiz - do you mind sharing your ISB Service spec?

anotherfiz commented 2 months ago

  jetstream:
    settings: |
      max_payload: 33554432 # 32 MiB (32 * 1024 * 1024 bytes)
    bufferConfig: |
      stream:
        maxAge: 21600s
    imagePullSecrets:
      - name: private-registry
    replicas: 1
    version: 2.9.21

whynowy commented 2 months ago

  1. Do you have a persistence config?
  2. Could you try a 2.10.x version (the latest is 2.10.20) to see how it works? NATS JetStream has had lots of bugs fixed.

anotherfiz commented 2 months ago

I do not have a persistence config.

Upgrading JetStream is a possibility down the road, but it is not a viable solution at the moment.

whynowy commented 2 months ago

You should at least have a persistence config. It's available even when running in k3s; otherwise, it's purely in memory.
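
For reference, a minimal sketch of what the ISB Service spec shared above might look like with persistence enabled. The metadata name `default`, the `local-path` storage class (the provisioner k3s ships by default), and the 3Gi volume size are assumptions for illustration, not values from this thread; adjust them to the cluster.

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: InterStepBufferService
metadata:
  name: default                      # assumed name
spec:
  jetstream:
    version: 2.9.21                  # version from the spec shared above
    replicas: 1
    persistence:
      storageClassName: local-path   # assumption: k3s default storage class
      accessMode: ReadWriteOnce
      volumeSize: 3Gi                # assumption: size to fit your retention needs
```

With a `persistence` block, the JetStream pods store stream and consumer state on a persistent volume claim rather than on storage that disappears with the pod. Whether this avoids the truncated metafile after a power loss is not confirmed in this thread.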

syayi commented 2 months ago

That's great to hear. Numaflow's vision was to keep it lightweight, simplify event processing, and be closer to where the developers are. If you don't mind, can you add yourself to the Users list if you haven't already?

whynowy commented 2 months ago

@anotherfiz - let me know if the persistence helps.

anotherfiz commented 2 months ago

Sorry for the delay. We have added the persistence config, but it may be some time before we get concrete results, as this is an edge condition.