kylebaron commented 1 month ago

I think this is submitting in batches of 100; sometimes we have 30 runs going, sometimes we have 100. But there are 192 worker cores available, and I'm not sure how we got so many workers.

This set seems to have recruited lots of unneeded compute

30ish runs active

internal:~/project.mrg/xy/current/models/pk/106-boot$ qstat -f |grep amd; qstat |grep -c Run
all.q@ip-10-254-0-101.ec2.inte BIP   0/0/16         8.05     lx-amd64
all.q@ip-10-254-0-12.ec2.inter BIP   0/4/16         8.19     lx-amd64
all.q@ip-10-254-0-204.ec2.inte BIP   0/3/16         8.05     lx-amd64
all.q@ip-10-254-0-234.ec2.inte BIP   0/3/16         7.84     lx-amd64
all.q@ip-10-254-0-39.ec2.inter BIP   0/2/16         8.14     lx-amd64
all.q@ip-10-254-2-165.ec2.inte BIP   0/3/16         8.07     lx-amd64
all.q@ip-10-254-2-83.ec2.inter BIP   0/1/16         7.68     lx-amd64
all.q@ip-10-254-3-145.ec2.inte BIP   0/3/16         7.61     lx-amd64
all.q@ip-10-254-3-148.ec2.inte BIP   0/4/16         8.02     lx-amd64
all.q@ip-10-254-3-156.ec2.inte BIP   0/4/16         7.82     lx-amd64
all.q@ip-10-254-3-38.ec2.inter BIP   0/2/16         8.00     lx-amd64
all.q@ip-10-254-3-95.ec2.inter BIP   0/5/16         7.78     lx-amd64
34 # <--- number of total runs going
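To put a number on the idle capacity, the `resv/used/tot.` column of the listing above can be summed. This is only a rough sketch: the lines are copied from the listing, the `slot_usage` helper is a made-up name, and it assumes every queue line carries a `resv/used/tot` field in that form.

```python
import re

# Queue lines copied from the "30ish runs active" listing above.
QSTAT_LINES = """\
all.q@ip-10-254-0-101.ec2.inte BIP   0/0/16         8.05     lx-amd64
all.q@ip-10-254-0-12.ec2.inter BIP   0/4/16         8.19     lx-amd64
all.q@ip-10-254-0-204.ec2.inte BIP   0/3/16         8.05     lx-amd64
all.q@ip-10-254-0-234.ec2.inte BIP   0/3/16         7.84     lx-amd64
all.q@ip-10-254-0-39.ec2.inter BIP   0/2/16         8.14     lx-amd64
all.q@ip-10-254-2-165.ec2.inte BIP   0/3/16         8.07     lx-amd64
all.q@ip-10-254-2-83.ec2.inter BIP   0/1/16         7.68     lx-amd64
all.q@ip-10-254-3-145.ec2.inte BIP   0/3/16         7.61     lx-amd64
all.q@ip-10-254-3-148.ec2.inte BIP   0/4/16         8.02     lx-amd64
all.q@ip-10-254-3-156.ec2.inte BIP   0/4/16         7.82     lx-amd64
all.q@ip-10-254-3-38.ec2.inter BIP   0/2/16         8.00     lx-amd64
all.q@ip-10-254-3-95.ec2.inter BIP   0/5/16         7.78     lx-amd64
""".splitlines()

# Matches the resv/used/tot. field, e.g. " 0/4/16 ".
SLOTS = re.compile(r"\s(\d+)/(\d+)/(\d+)\s")


def slot_usage(lines):
    """Return (used, total) slots summed over qstat -f queue lines."""
    used = total = 0
    for line in lines:
        match = SLOTS.search(line)
        if match is None:
            continue  # skip headers, separators, and blank lines
        _resv, used_here, total_here = map(int, match.groups())
        used += used_here
        total += total_here
    return used, total


used, total = slot_usage(QSTAT_LINES)
print(f"{used} of {total} slots in use ({used / total:.0%})")
# prints "34 of 192 slots in use (18%)"
```

The 34 used slots match the `grep -c Run` count for this listing, and 192 is the 12 nodes times 16 slots each.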

100 runs active

$ qstat -f |grep amd; qstat |grep -c Run
all.q@ip-10-254-0-101.ec2.inte BIP   0/8/16         7.59     lx-amd64
all.q@ip-10-254-0-12.ec2.inter BIP   0/4/16         7.67     lx-amd64
all.q@ip-10-254-0-204.ec2.inte BIP   0/5/16         7.50     lx-amd64
all.q@ip-10-254-0-234.ec2.inte BIP   0/8/16         7.56     lx-amd64
all.q@ip-10-254-0-39.ec2.inter BIP   0/7/16         7.32     lx-amd64
all.q@ip-10-254-2-165.ec2.inte BIP   0/5/16         7.59     lx-amd64
all.q@ip-10-254-2-83.ec2.inter BIP   0/7/16         7.28     lx-amd64
all.q@ip-10-254-3-145.ec2.inte BIP   0/6/16         7.44     lx-amd64
all.q@ip-10-254-3-148.ec2.inte BIP   0/5/16         7.41     lx-amd64
all.q@ip-10-254-3-156.ec2.inte BIP   0/5/16         7.43     lx-amd64
all.q@ip-10-254-3-38.ec2.inter BIP   0/8/16         7.98     lx-amd64
all.q@ip-10-254-3-95.ec2.inter BIP   0/4/16         7.56     lx-amd64
100 # <--- number of total runs going

This set was appropriately scaled, but new jobs can't get scheduled until every run of the previous batch finishes

Not terrible for this example, which runs fast and is easy; this won't work with a much more complicated model where the variability in run time is large and some runs take very long to finish

internal:~$ qstat -f |grep amd; qstat |grep -c Run
all.q@ip-10-254-0-108.ec2.inte BIP   0/0/16         5.35     lx-amd64
all.q@ip-10-254-0-19.ec2.inter BIP   0/0/16         5.45     lx-amd64
all.q@ip-10-254-1-149.ec2.inte BIP   0/0/16         5.44     lx-amd64
all.q@ip-10-254-1-3.ec2.intern BIP   0/0/16         6.06     lx-amd64
all.q@ip-10-254-1-68.ec2.inter BIP   0/0/16         5.71     lx-amd64
all.q@ip-10-254-2-163.ec2.inte BIP   0/1/16         5.60     lx-amd64
all.q@ip-10-254-3-159.ec2.inte BIP   0/1/16         5.29     lx-amd64
2
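The cost of waiting on the slowest run in each batch can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a description of how bbr submits jobs: the function names are hypothetical, run times are drawn from a lognormal distribution chosen only to produce a long tail, and the 1000-run count is invented. The batch size of 100 and the 192 slots come from the numbers above.

```python
import heapq
import random


def rolling_makespan(durations, slots):
    """Total wall time when the next run starts as soon as any slot frees up."""
    free_at = [0.0] * min(slots, len(durations))  # time each slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


def batched_makespan(durations, batch_size, slots):
    """Total wall time when each batch must fully finish before the next starts."""
    clock = 0.0
    for first in range(0, len(durations), batch_size):
        batch = durations[first:first + batch_size]
        clock += rolling_makespan(batch, slots)  # barrier at the end of each batch
    return clock


# Hypothetical workload: 1000 runs with skewed (lognormal) run times in minutes.
rng = random.Random(0)
durations = [rng.lognormvariate(1.0, 1.0) for _ in range(1000)]

batched = batched_makespan(durations, batch_size=100, slots=192)
rolling = rolling_makespan(durations, slots=192)
print(f"batched: {batched:.1f} min, rolling: {rolling:.1f} min")
```

With a batch barrier, each batch costs as much as its slowest run, so a single long run holds back the next 100 submissions while most slots sit idle. Rolling submission never finishes later than batching for the same run order, and the gap widens as run times become more skewed.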

The run ended up with additional compute; I'm not sure why. This isn't an issue for bbr to solve, but I wanted to document that this was happening.

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@ip-10-254-0-10.ec2.inter BIP   0/0/16         6.46     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-0-108.ec2.inte BIP   0/0/16         6.79     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-0-19.ec2.inter BIP   0/0/16         6.62     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-0-37.ec2.inter BIP   0/0/16         5.24     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-1-149.ec2.inte BIP   0/0/16         6.46     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-1-3.ec2.intern BIP   0/0/16         6.57     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-1-68.ec2.inter BIP   0/0/16         6.86     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-2-163.ec2.inte BIP   0/0/16         6.58     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-2-65.ec2.inter BIP   0/0/16         5.94     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-3-159.ec2.inte BIP   0/0/16         6.65     lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-254-3-226.ec2.inte BIP   0/0/16         6.00     lx-amd64

seth127 commented 1 month ago

Thanks for capturing this, @kylebaron. Do you think we should move this to an internal Metworx ticket, or is it worth looking into whether the way bbr is submitting models is playing into this?

kylebaron commented 1 month ago

I think this part is relevant to the way that bbr is doing it. You can get some really skewed run times, so I think this batching strategy will have problems sooner rather than later.

This set was appropriately scaled, but new jobs can't get scheduled until every run of the previous batch finishes

Not terrible for this example, which runs fast and is easy; this won't work with a much more complicated model where the variability in run time is large and some runs take very long to finish

internal:~$ qstat -f |grep amd; qstat |grep -c Run
all.q@ip-10-254-0-108.ec2.inte BIP   0/0/16         5.35     lx-amd64
all.q@ip-10-254-0-19.ec2.inter BIP   0/0/16         5.45     lx-amd64
all.q@ip-10-254-1-149.ec2.inte BIP   0/0/16         5.44     lx-amd64
all.q@ip-10-254-1-3.ec2.intern BIP   0/0/16         6.06     lx-amd64
all.q@ip-10-254-1-68.ec2.inter BIP   0/0/16         5.71     lx-amd64
all.q@ip-10-254-2-163.ec2.inte BIP   0/1/16         5.60     lx-amd64
all.q@ip-10-254-3-159.ec2.inte BIP   0/1/16         5.29     lx-amd64
2

metrumresearchgroup / bbr

Submitting bootstrap run in batches can leave lots of unused compute #695
