Open kylebaron opened 1 month ago
Thanks for capturing this @kylebaron. Do you think we should move this to an internal Metworx ticket, or do you think it's worth looking into whether the way bbr is submitting models is playing into this?
I think this part is relevant to the way that bbr is doing it; you can get some really skewed run times, so this batching strategy will have problems sooner rather than later. It's not terrible for this example, which runs fast and is easy, but it won't work when you get a much more complicated model and the variability in run time is large, with some runs taking very long to finish.
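To illustrate why batch barriers interact badly with skewed run times, here is a small Python sketch (not bbr's actual scheduler; the run-time numbers are made up). It compares total wall time when each batch must fully finish before the next starts versus starting each run as soon as any worker frees up:

```python
import heapq

def continuous_makespan(times, cores):
    """Makespan when each job starts as soon as any worker frees up."""
    free = [0.0] * cores  # min-heap of worker free times
    heapq.heapify(free)
    finish = 0.0
    for t in times:
        start = heapq.heappop(free)
        heapq.heappush(free, start + t)
        finish = max(finish, start + t)
    return finish

def batched_makespan(times, batch_size, cores):
    """Makespan when batch i+1 cannot start until every run in batch i finishes."""
    return sum(
        continuous_makespan(times[i:i + batch_size], cores)
        for i in range(0, len(times), batch_size)
    )

# One 8-hour straggler plus twelve 1-hour runs (hypothetical numbers):
times = [8] + [1] * 12
print(batched_makespan(times, batch_size=4, cores=4))  # → 11.0
print(continuous_makespan(times, cores=4))             # → 8.0
```

With the batch barrier, three of the four cores sit idle for seven hours waiting on the straggler before the next batch can launch; continuous dispatch hides the straggler behind the short runs.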
```
internal:~$ qstat -f | grep amd; qstat | grep -c Run
all.q@ip-10-254-0-108.ec2.inte BIP 0/0/16 5.35 lx-amd64
all.q@ip-10-254-0-19.ec2.inter BIP 0/0/16 5.45 lx-amd64
all.q@ip-10-254-1-149.ec2.inte BIP 0/0/16 5.44 lx-amd64
all.q@ip-10-254-1-3.ec2.intern BIP 0/0/16 6.06 lx-amd64
all.q@ip-10-254-1-68.ec2.inter BIP 0/0/16 5.71 lx-amd64
all.q@ip-10-254-2-163.ec2.inte BIP 0/1/16 5.60 lx-amd64
all.q@ip-10-254-3-159.ec2.inte BIP 0/1/16 5.29 lx-amd64
2
```
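The `0/1/16` column in that `qstat -f` output is, as I understand SGE's full-format output, reserved/used/total slots per queue instance. A quick Python sketch for tallying it (the `slot_summary` helper and the sample lines are just for illustration):

```python
import re

# Two sample lines in the `qstat -f` format shown above
QSTAT_F = """\
all.q@ip-10-254-0-108.ec2.inte BIP 0/0/16 5.35 lx-amd64
all.q@ip-10-254-2-163.ec2.inte BIP 0/1/16 5.60 lx-amd64
"""

def slot_summary(qstat_output):
    """Sum used and total slots across queue instances from `qstat -f` lines."""
    used = total = 0
    for line in qstat_output.splitlines():
        # the slots column is resv/used/total, e.g. 0/1/16
        m = re.search(r"(\d+)/(\d+)/(\d+)", line)
        if m:
            used += int(m.group(2))
            total += int(m.group(3))
    return used, total

print(slot_summary(QSTAT_F))  # → (1, 32)
```

Run against the full output above, this is how you'd see only 2 of 112 slots on the amd64 hosts in use while runs are still queued.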
I think this is submitting in batches of 100; sometimes we have 30 runs going, sometimes we have 100. But there are 192 worker cores available. I'm not sure how we got so many workers.
This set seems to have recruited lots of unneeded compute:

[screenshot: ~30 runs active]

[screenshot: 100 runs active]

This set was appropriately scaled, but no new jobs get scheduled until every run of the previous batch finishes.
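A pool-style dispatch would avoid that barrier. This is a hedged sketch using a Python thread pool as a stand-in for the grid (`run_model` is hypothetical; bbr submits to SGE, not to threads): queued runs start the moment a worker frees up, so a straggler never holds back the rest of the queue.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def run_model(run_id, duration):
    time.sleep(duration)  # stand-in for a model run of variable length
    return run_id

durations = [0.2, 0.05, 0.05, 0.05, 0.05]  # one straggler, four quick runs
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_model, i, d) for i, d in enumerate(durations)]
    finished = [f.result() for f in as_completed(futures)]

# The quick runs all complete while the straggler is still going;
# no worker sits idle waiting on a batch boundary.
print(sorted(finished))  # → [0, 1, 2, 3, 4]
```

On SGE itself, the analogous behavior comes from submitting everything up front and letting the scheduler drain the queue, rather than submitting a fixed-size batch and waiting for all of it.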
The run ended up with additional compute; I'm not sure why. This isn't an issue for bbr to solve, but I wanted to document that it was happening.