tumblr / collins

groovy kind of love
tumblr.github.com/collins
Apache License 2.0
572 stars 99 forks source link

Add gpu support #537

Closed jyundt closed 7 years ago

jyundt commented 7 years ago

This pull request adds support for capturing GPU information in collins. Currently, it is hardcoded for NVIDIA GPUs.

I wasn't clear if the definitions in AsssetMeta were still in use, so I added GPU_COUNT and GPU_DESCRIPTION to both AssetMeta.scala and the asset_meta table.

michaeljs1990 commented 7 years ago

Since we couldn't find a great way to separate the onboard graphics card from the external GPUs (The lshw xml is identical) we do have it hard coded currently. I am not sure how this should be handled since if we just allow it to parse all the GPUs every box would show up with a count of one and then you get into the case of checking NUM_NVIDIA_GPUS and NUM_AMD_GPUS and whatever other flavor you have.

The whitelist was an easy fix for us but possibly not a great way to handle it for a wider audience.

cc @byxorna @roymarantz @defect

byxorna commented 7 years ago

@jyundt @michaeljs1990 a common pattern is to make a configuration array of strings to match against LSHW output to identify external graphics adapters (by product, vendor, etc). Up to you if you wanna make it opt-in or opt-out. I would love this to be configurable via production.conf, with a sane default like matching on Manufacturer field with "AMD","NVIDIA"

michaeljs1990 commented 7 years ago

https://github.com/tumblr/collins/blob/master/support/ruby/collins-client/lib/collins/asset.rb#L206 will want to add gpus to this as well for consistency and so it's easy to access from the ruby client. Not sure if other stuff will need to be updated as well.

michaeljs1990 commented 7 years ago

Not sure what this is about... but I was testing this PR inside the provided dockerfile for this repo and even adding the additional config needed for gpuVendors to the production.conf file it will not intake GPUs but does show the additional field in hardware.

Likely doing something really dumb but should likely figure out exactly what that dumb thing is :/

jyundt commented 7 years ago

Not sure what this is about... but I was testing this PR inside the provided dockerfile for this repo and even adding the additional config needed for gpuVendors to the production.conf file it will not intake GPUs but does show the additional field in hardware.

I think I found the culprit, the docker specific validations.conf file is missing the GpuConfig.

I'll validate and update my PR.

michaeljs1990 commented 7 years ago

FWIW we ended up having to deploy this which just happens to be the current master branch with this deployed on-top of it. It's working well so far with the exception that we had to install the NVIDIA driver for our genesis iso to get access to some extra information about the GPUs in /proc/sys by default the lshw returns "NVIDIA Corporation" for product and vendor.

Our current plan is spicing up the lshw xml for product by injecting the actual model name found in /proc/sys/**/information (can't remember exact location right now).

I should say anything older than the Titan X models do return same product information.

michaeljs1990 commented 7 years ago

Something strange is happening on update. I checked by uploading a valid XML file with GPUs and then changing the name of a few things such as the NIC and GPU product. The NIC is updated fine and has a new name however it seems GPU is never updated.

The farthest I have been able to trace it back is to LshwHelper.updateAsset however it seems to have the correct XML at this point.

byxorna commented 7 years ago

@jyundt is your latest update fixing the odd behavior that @michaeljs1990 pointed out? Is this diff ready for review now? I want to make sure people dont spend time reviewing a WIP

Thanks for the diff, this is gonna be awesome!!!

jyundt commented 7 years ago

@jyundt is your latest update fixing the odd behavior that @michaeljs1990 pointed out? Is this diff ready for review now? I want to make sure people dont spend time reviewing a WIP

Yes, I incorporated the change from @michaeljs1990 with f301fd99f9f703ab51b302ccb44ce79d828403ce.

My most recent commit (d019c27fb30c1541dbb31de075a834283154060a) should incorporate all of the requested changes except cleaning up reconstructGpu (it's still messy).

byxorna commented 7 years ago

I think this is good to land. @jyundt can you open a new diff for gh-pages to document the new configuration bits for the GPU section?

jyundt commented 7 years ago

I think this is good to land. @jyundt can you open a new diff for gh-pages to document the new configuration bits for the GPU section?

551, let me know if you'd like anything documented/worded differently.

byxorna commented 7 years ago

@jyundt looks great! Lets wait for @defect and @roymarantz to review before we land?

defect commented 7 years ago

One teeny tiny comment, but looks good to me :)

byxorna commented 7 years ago

@jyundt ill land this unless you have any last change you need to make, cool?

jyundt commented 7 years ago

No, no additional changes. Thanks for your help with this!