phetsims / aqua

Automatic QUality Assurance
MIT License

Add dedicated CT server #151

Closed kathy-phet closed 1 year ago

zepumph commented 2 years ago

The goal for this is to have something ordered by July.

It isn't clear to me what the specs are on this. Let's discuss this further!

mattpen commented 2 years ago

phet-server-july_21.pdf

Here is the quote for the recent phet-server2 replacement. Do we want to go with something like this?

Here is the machine on dell.com, where we can see pricing for various memory/cpu/storage options: https://www.dell.com/en-us/shop/cty/pdp/spd/poweredge-r440/pe_r440_tm_vi_vp_sb. The pricing will be slightly different as we'll purchase it through CU Marketplace, which I believe has some discounts but I'm not aware of the details on that. Maybe @oliver-phet could help.

Should we schedule a meeting to discuss internally? Or should I reach out to Jason to schedule a meeting with him to discuss (with @zepumph and optionally @jonathanolson)?

zepumph commented 2 years ago

I'm curious about what memory is right. 8GB felt a bit low, but I'm not really sure:

[screenshot of the memory option in the Dell configurator]

Or is that 8GB per core?

I also didn't really understand the "best practices" notes and if we were doing things correctly.

mattpen commented 2 years ago

That's 8GB per memory stick in the screenshot; you can hit the + and add more sticks. On phet-server2 we have 8x8GB configured.

mattpen commented 2 years ago

Bayes stats:

Memory: 256GB (looks like 8x32GB)
CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz -- 40 threads (looks like 2 cpus x 10 cores/cpu x 2 threads/core)
Storage: Looks like we need at least 11TB plus room to grow? vgs reports how much is physically available, df -h reports how much is allocated and used.

[mape5853@bayes ~]$ sudo vgs
  VG #PV #LV #SN Attr   VSize  VFree
  os   1   5   0 wz--n- 10.91t 3.38t
[mape5853@bayes ~]$ df -h
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs             126G     0  126G   0% /dev
tmpfs                126G   14G  112G  12% /dev/shm
tmpfs                126G  131M  126G   1% /run
tmpfs                126G     0  126G   0% /sys/fs/cgroup
/dev/mapper/os-root   10G  6.5G  3.6G  65% /
/dev/sda2            477M  164M  285M  37% /boot
/dev/mapper/os-home   15G   11G  4.7G  69% /home
/dev/mapper/os-var   6.0G  3.7G  2.4G  62% /var
/dev/mapper/os-data  7.5T  4.9T  2.6T  66% /data
tmpfs                 26G     0   26G   0% /run/user/17931
tmpfs                 26G     0   26G   0% /run/user/380503
tmpfs                 26G     0   26G   0% /run/user/454144
tmpfs                 26G     0   26G   0% /run/user/584799
tmpfs                 26G     0   26G   0% /run/user/451065
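
As a rough cross-check of the numbers above, here is a minimal sketch that redoes the arithmetic by hand. The constants are copied from the vgs and df -h output; the variable names are only illustrative, and nothing here queries the machine.

```python
# Figures copied from the `sudo vgs` and `df -h` output above.
VG_SIZE_T = 10.91   # VSize of volume group "os"
VG_FREE_T = 3.38    # VFree: space not yet assigned to a logical volume

# Space already assigned to logical volumes (/, /home, /var, /data, ...).
allocated_t = round(VG_SIZE_T - VG_FREE_T, 2)

# /data is the large volume; how full is it?
# (df rounds from exact block counts, so its Use% can differ slightly.)
DATA_SIZE_T = 7.5
DATA_USED_T = 4.9
data_used_fraction = DATA_USED_T / DATA_SIZE_T

# Thread count as read off the CPU line: sockets x cores x threads per core.
threads = 2 * 10 * 2

print(f"allocated to logical volumes: {allocated_t}T of {VG_SIZE_T}T")
print(f"/data used: {data_used_fraction:.0%}")
print(f"hardware threads: {threads}")
```

The allocated figure comes out close to the size of /data alone, which is consistent with /data being where nearly all of the storage goes.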

This should be a good starting point for a conversation with Jason.

Just matching these specs with a PowerEdge R440 would cost $25,389.82.

mattpen commented 2 years ago

@jonathanolson @zepumph - do we know what our limiting factors are for CT? What resources should we prioritize increasing?

According to https://bayes.colorado.edu/xymon/ (creds are in the doc), bayes.colorado.edu is almost always reported as maxed out on CPU. However, that report isn't always measuring things correctly, so we should use it as a guide for investigation, not as a final answer.
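
One way to cross-check a monitoring report like this is to look at the load average relative to the number of logical cores directly on the machine. The following is only a sketch under assumptions: `load_per_core` is a hypothetical helper, not part of any PhET tooling, and `os.getloadavg()` is available on Unix-like systems only.

```python
import os


def load_per_core(load_average: float, logical_cores: int) -> float:
    """Return the load average divided by the number of logical cores."""
    if logical_cores <= 0:
        raise ValueError("logical_cores must be positive")
    return load_average / logical_cores


if __name__ == "__main__":
    # 1-, 5- and 15-minute load averages (Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    for label, load in (("1m", one_min), ("5m", five_min), ("15m", fifteen_min)):
        print(f"{label}: load {load:.2f} -> {load_per_core(load, cores):.2f} per core")
```

A sustained value near or above 1.0 per core suggests the machine is saturated, though on Linux the load average also counts tasks waiting on I/O, so it is a hint rather than proof that CPU is the limit.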

kathy-phet commented 2 years ago

Thanks, Matt, for getting this issue rolling. Maybe one 30-minute internal meeting with you, JO, and MK, and then a meeting with Jason.

Also @oliver-phet - Do we have the original bayes machine quote from CU OIT? Can you attach it here?

oliver-phet commented 2 years ago

> Also @oliver-phet - Do we have the original bayes machine quote from CU OIT? Can you attach it here?

Is this what you were asking for? The 2015 machine specs? Dell912687874.pdf

zepumph commented 2 years ago

We have set up a meeting for this afternoon and will report back.

mattpen commented 2 years ago

Configuration that we agreed upon in a meeting with @jonathanolson @oliver-phet @kathy-phet @zepumph and @mattpen

https://www.dell.com/en-us/shop/cty/pdp/spd/poweredge-r440/pe_r440_tm_vi_vp_sb?configurationid=d8f3f518-daf6-460b-b82e-bac10d0b9dbc

zepumph commented 2 years ago

@oliver-phet pointed out that the "max" memory slots per CPU for this processor is 6:

https://ark.intel.com/content/www/us/en/ark/products/192437/intel-xeon-gold-6230-processor-27-5m-cache-2-10-ghz.html

mattpen commented 2 years ago

I sent an email to Jason requesting a meeting time.

oliver-phet commented 2 years ago

Maybe there isn't anything we can do about this, but the processor we have selected is "2nd Gen", and Intel launched their "3rd Gen" Xeon Scalable processors in Q2'21. https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable/gold/products.html

mattpen commented 2 years ago

I imagine we'd have to select a different motherboard or housing. I think Jason could help us with this if we're interested in using 3rd gen instead of 2nd.

zepumph commented 2 years ago

Looks like the 6230 (what we were looking at, with 40 threads per CPU) was released Q2'19.

kathy-phet commented 2 years ago

A different PowerEdge has those newer options: https://www.dell.com/en-us/shop/servers-storage-and-networking/poweredge-r550-rack-server/spd/poweredge-r550/pe_r550_tm_vi_vp_sb

It offers this one: Intel® Xeon® Gold 5318Y 2.1G, 24C/48T, 11.2GT/s, 36M Cache, Turbo, HT (165W) DDR4-2933

kathy-phet commented 2 years ago

This one has Platinum processor options with a crazy number of threads: https://www.dell.com/en-us/shop/servers-storage-and-networking/poweredge-r650-rack-server/spd/poweredge-r650/pe_r650_14796_vi_vp

zepumph commented 2 years ago

I got pretty far configuring the R650, but realized it doesn't have an option for HDDs (only SSDs), which added ~$5000 to the final price:

https://www.dell.com/en-us/shop/servers-storage-and-networking/poweredge-r650-rack-server/spd/poweredge-r650/pe_r650_tm_vi_vp_sb?view=configurations&configurationid=6da9fe31-eaf5-4551-8a80-1574b75d0e07

kathy-phet commented 2 years ago

650 is showing SAS HD options for me?

kathy-phet commented 2 years ago

Bonus of the R650 is that it's in stock instead of out of stock. https://www.dell.com/en-us/shop/servers-storage-and-networking/poweredge-r650-rack-server/spd/poweredge-r650/pe_r650_14796_vi_vp?configurationid=fd2911a3-0ef7-4116-ae2b-1f08c6b9bb1b With Platinum 64T processors ... but maybe not everything we need.

oliver-phet commented 2 years ago

Clicking around a bit more... I think the R440 chassis was limiting our CPU choices. This R450 dual CPU rack allows 3rd gen processors: https://www.dell.com/en-us/shop/servers-storage-and-networking/poweredge-r450-rack-server/spd/poweredge-r450/pe_r450_15127_vi_vp?view=configurations&configurationid=782abc84-b21c-4b11-baf8-39b7219aa6c2

zepumph commented 2 years ago

Questions I have for our next meeting:

  1. Ideal memory per CPU
  2. Whether one CPU with X threads is better or worse than 2 CPUs each with X/2 threads
  3. Whether there is a specific space requirement on the "rack" we are going to mount to

samreid commented 2 years ago

I recently had a good experience using GitHub Codespaces while investigating https://github.com/phetsims/chipper/issues/1353 and it seemed like it might also work for CT since you can clone many repos, install things and run programs, including web servers. Running an unbuilt sim over port forwarding didn't work, but it's unclear whether that would affect CT self-loading (no port forwarding). Of course it would be possible to run into other incompatibility problems. We previously (a few years ago) determined that cloud hosting would be too expensive (AWS, I believe), but now that Codespaces came out I wanted to double check that quote.

The pricing is listed at https://docs.github.com/en/billing/managing-billing-for-github-codespaces/about-billing-for-github-codespaces, which shows that 1 hour on a 32 core machine is $2.88. A month has 730 hours, so that works out to about $2100/month, which sounds pretty expensive. To match the phet-server2 quote above, the break-even point would be at $12,125.34/$2100, which is only about 5.8 months. I don't think we would want to do something like this without a break-even point at 4+ years. But to get to that price, we would have to sacrifice cores or not run it 24/7. Storage is quoted at $0.07 per 1GB/month, and I did not count it in the calculation. Likewise I did not check AWS to see how their prices compare today.
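
For the paper trail, the same arithmetic as a small sketch. The rate, hours per month, and quote amount are the figures from this comment; the function names and the 48-month target (standing in for "4+ years") are illustrative assumptions, and storage is left out as above.

```python
HOURLY_RATE_USD = 2.88        # 32-core Codespaces machine, per hour
HOURS_PER_MONTH = 730         # running 24/7
HARDWARE_QUOTE_USD = 12125.34 # phet-server2 quote used for comparison


def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cloud cost for one month at the given hourly rate."""
    return hourly_rate * hours


def break_even_months(hardware_cost: float, cloud_monthly: float) -> float:
    """Months of cloud usage that cost as much as buying the hardware."""
    return hardware_cost / cloud_monthly


def hours_per_month_for_target(hardware_cost: float, hourly_rate: float,
                               target_months: float) -> float:
    """Monthly cloud hours affordable if break-even should land at target_months."""
    return hardware_cost / target_months / hourly_rate


monthly = monthly_cost(HOURLY_RATE_USD)
print(f"monthly cost: ${monthly:,.2f}")
print(f"break-even: {break_even_months(HARDWARE_QUOTE_USD, monthly):.2f} months")
print(f"hours/month for a 48-month break-even: "
      f"{hours_per_month_for_target(HARDWARE_QUOTE_USD, HOURLY_RATE_USD, 48):.1f}")
```

Running 24/7 breaks even in a little under six months, and stretching that to four years would leave only a small fraction of the month's hours, which is the "sacrifice cores or not run it 24/7" trade-off.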

Anyways, just wanted to jot down a paper trail in case someone (like future me) asks about cloud computing.

mattpen commented 2 years ago

---------- Forwarded message ---------
From: Jason Edward Hill jason@colorado.edu
Date: Tue, Nov 8, 2022 at 1:51 PM
Subject: Re: Purchase advice on a new PhET server for a testing machine (re BAYES recent failure)
To: Kathy Perkins katherine.perkins@colorado.edu, Matthew Pennington Matthew.Pennington@colorado.edu
Cc: Jonathan B Olson jonathan.olson@colorado.edu, Michael J Kauzmann Michael.Kauzmann@colorado.edu, Oliver Pascal Nix Oliver.Nix@colorado.edu

Hi all,

Again, my apologies for the delay on this. I've attached a quote. It includes all of what you specified, and I added the 2-port SFP28 NIC, got rid of the power cords, added a ready-rails mount without cable management.

Please look it over and let me know what you think.

Cheers, Jason

Phet-Dev.pdf

@kathy-phet would like us all to review Jason's comments in his latest email and the quote he provided (included above), and either comment with changes that should be made or approve the purchase.

oliver-phet commented 2 years ago

I went through the quote process and generated an identical server with the same price. Looks good to me.

kathy-phet commented 2 years ago

I sent Jason an email asking if we should run this by the Dell Rep.

zepumph commented 1 year ago

The server has been purchased and acquired. Next steps are in https://github.com/phetsims/special-ops/issues/234.