ngageoint / scale

Processing framework for containerized algorithms
http://ngageoint.github.io/scale/
Apache License 2.0

Node count > 1000 causes failure to schedule #1825

Open gisjedi opened 4 years ago

gisjedi commented 4 years ago

Description

When a Scale instance runs against a pre-existing database that contains more than 1000 nodes, nothing is scheduled if all ready instances fall outside the first 1000 records returned. The cause appears to be that Scale requests only the first 1000 records from the node table, so it cannot match incoming offers to the nodes tracked in memory.

One potential solution is simply to raise the maximum number of records returned; the best solution would be to page over all the active nodes.
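
The paging idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not Scale's actual code: the `node` table layout (`id`, `hostname`, `is_active`), the `iter_active_nodes` helper, and the use of an in-memory SQLite database are all assumptions made for the example. It uses keyset pagination (resuming after the last `id` seen) so that every active node is visited regardless of how many exist.

```python
import sqlite3


def iter_active_nodes(conn, page_size=1000):
    """Yield (id, hostname) for every active node, one page at a time.

    Hypothetical helper: resumes each query after the last id seen
    (keyset pagination), so no row is skipped or repeated.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, hostname FROM node "
            "WHERE is_active = 1 AND id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]


# Demo data: assumed schema, 2500 active nodes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, hostname TEXT, is_active INTEGER)"
)
conn.executemany(
    "INSERT INTO node (hostname, is_active) VALUES (?, 1)",
    [(f"node-{i}",) for i in range(2500)],
)

# A single capped query sees only the first 1000 rows...
capped = conn.execute(
    "SELECT id, hostname FROM node WHERE is_active = 1 ORDER BY id LIMIT 1000"
).fetchall()
# ...while paging visits all of them.
paged = list(iter_active_nodes(conn))
print(len(capped), len(paged))  # 1000 2500
```

Raising the cap would only move the threshold, whereas paging removes it; the trade-off is one extra query per page when the node list is loaded.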

Reproduction Steps

Steps to reproduce the problem:

  1. In the nodes table, create at least 1000 nodes with IPs that aren't present in the cluster
  2. Launch the Scheduler and observe offers coming in from new nodes
  3. Queue a couple test jobs and observe that they are never scheduled.
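
The effect of these steps can be imitated without a cluster. The sketch below is a toy model only: `load_tracked_nodes`, `schedulable_offers`, and the IP addresses are hypothetical, and it assumes the new nodes end up after the first 1000 rows of the node table. It shows how a capped load leaves every incoming offer unmatched.

```python
MAX_RECORDS = 1000  # the cap described in this report


def load_tracked_nodes(node_table, limit=None):
    """Return the set of hostnames held in memory (hypothetical helper)."""
    rows = node_table if limit is None else node_table[:limit]
    return set(rows)


def schedulable_offers(offers, tracked):
    """Keep only the offers whose host matches a tracked node."""
    return [host for host in offers if host in tracked]


# Step 1: 1000 node records whose IPs are not in the cluster.
node_table = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
# Step 2: the real cluster nodes, assumed to sit after the first 1000 rows.
node_table += ["10.9.0.1", "10.9.0.2"]
offers = ["10.9.0.1", "10.9.0.2"]

capped = load_tracked_nodes(node_table, limit=MAX_RECORDS)
full = load_tracked_nodes(node_table)

# Step 3: with the cap, no offer matches a tracked node, so nothing can run.
print(schedulable_offers(offers, capped))  # []
print(schedulable_offers(offers, full))    # ['10.9.0.1', '10.9.0.2']
```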

cshamis commented 4 years ago

This is insane.

  1. Yes. We only care about active nodes.
  2. Inactive nodes are of historical importance only.
  3. We will never have 1000 active nodes.
  4. Ergo, we don't support more than 1000 active nodes.

On Tue, Nov 19, 2019 at 11:44 AM Jonathan Meyer notifications@github.com wrote:

Description When a Scale instance is running with a pre-existing database that includes over 1000 nodes, nothing will schedule if all ready instances are above the initial 1000 records returned. The problem appears to be related to the fact that Scale only requests the first 1000 records from the node table and so it is unable to match the offers to the nodes tracked in memory.

One potential solution is to just update the maximum records returned, the best solution would be to page over all the active nodes.
