sandialabs / sceptre-phenix

phenix is an orchestration tool and GUI for Sandia's minimega platform
https://sandialabs.github.io/sceptre-docs/
GNU General Public License v3.0

Add Ability to Mount VMs and Download/Upload Files #109

Closed · jacdavi closed this 2 years ago

jacdavi commented 2 years ago

When interacting with a VM in a running experiment, users can now access files within the VM directly from the UI by clicking the mount vm button within the VM's informational modal.

RBAC policies are applied to limit users' ability to mount and unmount VMs, as well as to list, download, and upload files to and from VMs.

Note that this feature only works if the miniccc agent is active within the VM. If it's not, the mount vm button will be disabled.

Note that this feature is currently behind a feature flag in the phenix UI named vm-mount, which is not enabled by default. To enable it, pass the --features=vm-mount flag to the phenix ui command when starting the UI process (e.g., `phenix ui --features=vm-mount`).

Additional notes and changes:

activeshadow commented 2 years ago

Dude @jacdavi this sounds awesome! Thank you! I'll test this and your other PR ASAP.

jacdavi commented 2 years ago

Also working on some fixes for multiple users/browsers interacting with mounts. Right now, a second user cannot mount a VM if another user is currently downloading/uploading a file on any other VM. The best solution seems to be a per-VM mutex.

Another issue is that if two users are browsing the same VM and the second user closes the modal, that unmounts the VM, breaking the first user's session when they try to do anything. Still thinking of the best way to prevent this. Maybe track users/sessions in the mount call and remove them in unmount, then only actually run the cc unmount once there are no users/sessions remaining.

activeshadow commented 2 years ago

> Also working on some fixes for multiple users/browsers interacting with mounts. Right now, a second user cannot mount a VM if another user is currently downloading/uploading a file on any other VM. The best solution seems to be a per-VM mutex.

Yes, I was going to suggest having a mutex per VM when I did a deeper review of the HTTP handlers code.

> Another issue is that if two users are browsing the same VM and the second user closes the modal, that unmounts the VM, breaking the first user's session when they try to do anything. Still thinking of the best way to prevent this. Maybe track users/sessions in the mount call and remove them in unmount, then only actually run the cc unmount once there are no users/sessions remaining.

If it were me coding this, I would have probably gone down the route of having a resource counter per VM that got incremented/decremented as users opened and closed the file browser modal. One of the issues I can think of here is making sure the count gets decremented if a user browses away from the UI without explicitly closing the modal.
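
For illustration, here is a rough Go sketch of the approach being discussed (a per-VM mutex plus a per-VM session count, so the underlying cc mount only runs for the first browser session and cc unmount only runs once the last one is gone). All of the names are made up for the sketch; this is not phenix's actual handler code.

```go
package main

import (
	"fmt"
	"sync"
)

// mountTracker is a rough sketch of the approach discussed above: one mutex
// per VM (so a transfer on one VM never blocks mounts on another) plus a
// session count per VM (so `cc mount` only runs for the first browser
// session and `cc unmount` only runs once the last one is gone). The names
// are illustrative only.
type mountTracker struct {
	mu       sync.Mutex
	vmLocks  map[string]*sync.Mutex
	sessions map[string]int
}

func newMountTracker() *mountTracker {
	return &mountTracker{
		vmLocks:  make(map[string]*sync.Mutex),
		sessions: make(map[string]int),
	}
}

// lock returns the per-VM mutex, creating it on first use.
func (t *mountTracker) lock(vm string) *sync.Mutex {
	t.mu.Lock()
	defer t.mu.Unlock()

	if _, ok := t.vmLocks[vm]; !ok {
		t.vmLocks[vm] = new(sync.Mutex)
	}
	return t.vmLocks[vm]
}

// addSession records a new user/browser session for the VM and reports
// whether this is the first one (i.e., the caller should actually mount).
func (t *mountTracker) addSession(vm string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	t.sessions[vm]++
	return t.sessions[vm] == 1
}

// removeSession drops a session for the VM and reports whether none remain
// (i.e., the caller should actually unmount).
func (t *mountTracker) removeSession(vm string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	if t.sessions[vm] > 0 {
		t.sessions[vm]--
	}
	return t.sessions[vm] == 0
}

func main() {
	t := newMountTracker()

	fmt.Println(t.addSession("vm-0"))    // true: first session, run cc mount
	fmt.Println(t.addSession("vm-0"))    // false: already mounted
	fmt.Println(t.removeSession("vm-0")) // false: one session still open
	fmt.Println(t.removeSession("vm-0")) // true: last session gone, run cc unmount
}
```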

jacdavi commented 2 years ago

I did test setting myself to global-viewer on the backend and things seemed to work fine. Correct me if I'm wrong, but both already have star policies that should match for list/get.

activeshadow commented 2 years ago

> I did test setting myself to global-viewer on the backend and things seemed to work fine. Correct me if I'm wrong, but both already have star policies that should match for list/get.

Whoops, you're right! I failed to review the other portions of those default policies before commenting.

jacdavi commented 2 years ago

> Also working on some fixes for multiple users/browsers interacting with mounts. Right now, a second user cannot mount a VM if another user is currently downloading/uploading a file on any other VM. The best solution seems to be a per-VM mutex.
>
> Yes, I was going to suggest having a mutex per VM when I did a deeper review of the HTTP handlers code.
>
> Another issue is that if two users are browsing the same VM and the second user closes the modal, that unmounts the VM, breaking the first user's session when they try to do anything. Still thinking of the best way to prevent this. Maybe track users/sessions in the mount call and remove them in unmount, then only actually run the cc unmount once there are no users/sessions remaining.
>
> If it were me coding this, I would have probably gone down the route of having a resource counter per VM that got incremented/decremented as users opened and closed the file browser modal. One of the issues I can think of here is making sure the count gets decremented if a user browses away from the UI without explicitly closing the modal.

OK, I made the change to a per-VM mutex and tracking the number of users who have mounted. Tracking unmounts is still somewhat unsolved. The user can only close the modal using the (x), and that correctly calls unmount.

However, that doesn't account for closing the browser/tab or refreshing. I added a listener for beforeunload, but it seems inconsistent: it didn't work at all in Firefox and only worked in Chrome for refreshing.

activeshadow commented 2 years ago

> Also working on some fixes for multiple users/browsers interacting with mounts. Right now, a second user cannot mount a VM if another user is currently downloading/uploading a file on any other VM. The best solution seems to be a per-VM mutex.
>
> Yes, I was going to suggest having a mutex per VM when I did a deeper review of the HTTP handlers code.
>
> Another issue is that if two users are browsing the same VM and the second user closes the modal, that unmounts the VM, breaking the first user's session when they try to do anything. Still thinking of the best way to prevent this. Maybe track users/sessions in the mount call and remove them in unmount, then only actually run the cc unmount once there are no users/sessions remaining.
>
> If it were me coding this, I would have probably gone down the route of having a resource counter per VM that got incremented/decremented as users opened and closed the file browser modal. One of the issues I can think of here is making sure the count gets decremented if a user browses away from the UI without explicitly closing the modal.
>
> OK, I made the change to a per-VM mutex and tracking the number of users who have mounted. Tracking unmounts is still somewhat unsolved. The user can only close the modal using the (x), and that correctly calls unmount.
>
> However, that doesn't account for closing the browser/tab or refreshing. I added a listener for beforeunload, but it seems inconsistent: it didn't work at all in Firefox and only worked in Chrome for refreshing.

@jacdavi ugh... okay I was afraid of that. It's great how all the browsers work so differently! /s

activeshadow commented 2 years ago

@jacdavi have you tested this at all with phenix and minimega running in Docker? @glattercj and I are seeing some issues with files actually showing up in the file browser modal and we're assuming it has something to do with how the resulting mount from cc mount is being handled by Docker, but if you've tested it using Docker to run phenix and minimega then that wouldn't be our issue.

jacdavi commented 2 years ago

> @jacdavi have you tested this at all with phenix and minimega running in Docker? @glattercj and I are seeing some issues with files actually showing up in the file browser modal and we're assuming it has something to do with how the resulting mount from cc mount is being handled by Docker, but if you've tested it using Docker to run phenix and minimega then that wouldn't be our issue.

I haven't done any testing with docker. I have seen a few issues with individual files (e.g., can't download, wrong size), but those issues also existed when using minimega from the CLI. So due to time constraints, I considered them out of scope. Is this something similar or do the problems only occur with the new GUI?

activeshadow commented 2 years ago

> @jacdavi have you tested this at all with phenix and minimega running in Docker? @glattercj and I are seeing some issues with files actually showing up in the file browser modal and we're assuming it has something to do with how the resulting mount from cc mount is being handled by Docker, but if you've tested it using Docker to run phenix and minimega then that wouldn't be our issue.
>
> I haven't done any testing with docker. I have seen a few issues with individual files (e.g., can't download, wrong size), but those issues also existed when using minimega from the CLI. So due to time constraints, I considered them out of scope. Is this something similar or do the problems only occur with the new GUI?

@jacdavi I just tested again using Docker containers to run minimega and phenix and it does seem to be working as expected now. I'm going to do some additional review and testing today to see if we can get a bow on this and get it merged in. This is awesome!

activeshadow commented 2 years ago

@jacdavi I noticed an edge case where I killed the miniccc agent in a VM, but the UI still saw it as active so it let me attempt to mount the VM's disk. After attempting to mount, phenix became unresponsive for any actions that required a call to minimega, and eventually I got a toast saying something about it not being able to unmount the VM.

activeshadow commented 2 years ago

I think we need to update the RunningExperiment UI component to do an extra call to get the VM info when the VM details modal is opened so we can get an up-to-date status of the CC active state. Right now, the CC active state is grabbed only when the running experiment table is first loaded. So if someone is on that table for a while it could get stale.

In general I think it would be nice to also periodically update the entire table to keep the uptime column updated and the CC active status updated if we show that in the table view. Perhaps this is ripe for a future effort though.

jacdavi commented 2 years ago

> @jacdavi I noticed an edge case where I killed the miniccc agent in a VM, but the UI still saw it as active so it let me attempt to mount the VM's disk. After attempting to mount, phenix became unresponsive for any actions that required a call to minimega, and eventually I got a toast saying something about it not being able to unmount the VM.

So this is probably the call to mm.ExecC2Command hanging, since I didn't provide a timeout and I think the default is 5m. What do you think a sensible timeout would be? Also, the call to mm.ExecC2Command should call IsC2ClientActive before continuing, so that check might not have caught that cc was inactive.
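
For context, a minimal Go sketch of the guard being described here: fail fast if the agent isn't active, and bound the command with an explicit timeout instead of relying on the long default. The helper functions and the 30-second value are assumptions for illustration only; the real code would go through mm.IsC2ClientActive and mm.ExecC2Command.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// isC2Active and execMountCommand are stand-ins for the real calls mentioned
// above (mm.IsC2ClientActive and the mm.ExecC2Command call that performs
// `cc mount`); their signatures here are invented for the sketch.
func isC2Active(vm string) bool { return false }

func execMountCommand(ctx context.Context, vm string) error {
	// Simulate an agent that never responds, as in the edge case described.
	<-ctx.Done()
	return ctx.Err()
}

// mountVM sketches the guard being discussed: bail out early if the agent
// isn't active, and cap the command with a short timeout (30s is an
// arbitrary choice here) instead of the roughly 5m default, so the HTTP
// handler can't hang.
func mountVM(vm string) error {
	if !isC2Active(vm) {
		return fmt.Errorf("miniccc agent not active for VM %s", vm)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := execMountCommand(ctx, vm); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return fmt.Errorf("timed out mounting VM %s", vm)
		}
		return err
	}

	return nil
}

func main() {
	fmt.Println(mountVM("win10-1"))
}
```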

jacdavi commented 2 years ago

> I think we need to update the RunningExperiment UI component to do an extra call to get the VM info when the VM details modal is opened so we can get an up-to-date status of the CC active state. Right now, the CC active state is grabbed only when the running experiment table is first loaded. So if someone is on that table for a while it could get stale.
>
> In general I think it would be nice to also periodically update the entire table to keep the uptime column updated and the CC active status updated if we show that in the table view. Perhaps this is ripe for a future effort though.

I did add a call to update the modal when it's opened: https://github.com/sandia-minimega/phenix/blob/180a1c5bda418ed5fbb473baed04800142dc406d/src/js/src/components/RunningExperiment.vue#L1590 Otherwise, when you started an experiment, it would never show cc as active until you reloaded the page.

Agree about the periodic update!

activeshadow commented 2 years ago

> I think we need to update the RunningExperiment UI component to do an extra call to get the VM info when the VM details modal is opened so we can get an up-to-date status of the CC active state. Right now, the CC active state is grabbed only when the running experiment table is first loaded. So if someone is on that table for a while it could get stale. In general I think it would be nice to also periodically update the entire table to keep the uptime column updated and the CC active status updated if we show that in the table view. Perhaps this is ripe for a future effort though.
>
> I did add a call to update the modal when it's opened:
>
> https://github.com/sandia-minimega/phenix/blob/180a1c5bda418ed5fbb473baed04800142dc406d/src/js/src/components/RunningExperiment.vue#L1590
>
> Otherwise, when you started an experiment, it would never show cc as active until you reloaded the page. Agree about the periodic update!

@jacdavi oh sweet! I wonder why I still saw that edge case then where I killed the miniccc agent in a VM but it still showed as active when I opened the modal. I'll test it again to see if I can reproduce. Thanks!

activeshadow commented 2 years ago

> @jacdavi I noticed an edge case where I killed the miniccc agent in a VM, but the UI still saw it as active so it let me attempt to mount the VM's disk. After attempting to mount, phenix became unresponsive for any actions that required a call to minimega, and eventually I got a toast saying something about it not being able to unmount the VM.
>
> So this is probably the call to mm.ExecC2Command hanging, since I didn't provide a timeout and I think the default is 5m. What do you think a sensible timeout would be? Also, the call to mm.ExecC2Command should call IsC2ClientActive before continuing, so that check might not have caught that cc was inactive.

This is a good lead to go down... I'll take a look now. Thanks!

activeshadow commented 2 years ago

> @jacdavi I noticed an edge case where I killed the miniccc agent in a VM, but the UI still saw it as active so it let me attempt to mount the VM's disk. After attempting to mount, phenix became unresponsive for any actions that required a call to minimega, and eventually I got a toast saying something about it not being able to unmount the VM.
>
> So this is probably the call to mm.ExecC2Command hanging, since I didn't provide a timeout and I think the default is 5m. What do you think a sensible timeout would be? Also, the call to mm.ExecC2Command should call IsC2ClientActive before continuing, so that check might not have caught that cc was inactive.
>
> This is a good lead to go down... I'll take a look now. Thanks!

@jacdavi okay so I updated the mount/unmount calls to use a timeout and that seems to help with the edge case. The only issue I'm hitting now is Docker-specific... if I try to mount a VM that doesn't have cc active (but the UI still thinks it does), it times out as I mentioned, but when I try to shut down the minimega container Docker fails to kill it. Perhaps we can address this issue at a different time since it's specific to Docker.

activeshadow commented 2 years ago

@jacdavi at some point soon I'd like to rebase your branch on the current master, but that will mean doing a force push to your branch. Let me know when you're okay with me doing that since you'll have to delete your branch locally and then pull it down again from GitHub.

jacdavi commented 2 years ago

> @jacdavi at some point soon I'd like to rebase your branch on the current master, but that will mean doing a force push to your branch. Let me know when you're okay with me doing that since you'll have to delete your branch locally and then pull it down again from GitHub.

I'm good whenever you are :). Both of your commits look great

activeshadow commented 2 years ago

> Also working on some fixes for multiple users/browsers interacting with mounts. Right now, a second user cannot mount a VM if another user is currently downloading/uploading a file on any other VM. The best solution seems to be a per-VM mutex.
>
> Yes, I was going to suggest having a mutex per VM when I did a deeper review of the HTTP handlers code.
>
> Another issue is that if two users are browsing the same VM and the second user closes the modal, that unmounts the VM, breaking the first user's session when they try to do anything. Still thinking of the best way to prevent this. Maybe track users/sessions in the mount call and remove them in unmount, then only actually run the cc unmount once there are no users/sessions remaining.
>
> If it were me coding this, I would have probably gone down the route of having a resource counter per VM that got incremented/decremented as users opened and closed the file browser modal. One of the issues I can think of here is making sure the count gets decremented if a user browses away from the UI without explicitly closing the modal.
>
> OK, I made the change to a per-VM mutex and tracking the number of users who have mounted. Tracking unmounts is still somewhat unsolved. The user can only close the modal using the (x), and that correctly calls unmount. However, that doesn't account for closing the browser/tab or refreshing. I added a listener for beforeunload, but it seems inconsistent: it didn't work at all in Firefox and only worked in Chrome for refreshing.
>
> @jacdavi ugh... okay I was afraid of that. It's great how all the browsers work so differently! /s

@jacdavi given how you wrote the mount/unmount code, I don't think it's too big of a deal if a user browses away from the UI with the file browser modal open. Yes, the mount will end up staying active until the experiment is stopped and the VM is shut down but I think that's acceptable.

jacdavi commented 2 years ago

> Also working on some fixes for multiple users/browsers interacting with mounts. Right now, a second user cannot mount a VM if another user is currently downloading/uploading a file on any other VM. The best solution seems to be a per-VM mutex.
>
> Yes, I was going to suggest having a mutex per VM when I did a deeper review of the HTTP handlers code.
>
> Another issue is that if two users are browsing the same VM and the second user closes the modal, that unmounts the VM, breaking the first user's session when they try to do anything. Still thinking of the best way to prevent this. Maybe track users/sessions in the mount call and remove them in unmount, then only actually run the cc unmount once there are no users/sessions remaining.
>
> If it were me coding this, I would have probably gone down the route of having a resource counter per VM that got incremented/decremented as users opened and closed the file browser modal. One of the issues I can think of here is making sure the count gets decremented if a user browses away from the UI without explicitly closing the modal.
>
> OK, I made the change to a per-VM mutex and tracking the number of users who have mounted. Tracking unmounts is still somewhat unsolved. The user can only close the modal using the (x), and that correctly calls unmount. However, that doesn't account for closing the browser/tab or refreshing. I added a listener for beforeunload, but it seems inconsistent: it didn't work at all in Firefox and only worked in Chrome for refreshing.
>
> @jacdavi ugh... okay I was afraid of that. It's great how all the browsers work so differently! /s
>
> @jacdavi given how you wrote the mount/unmount code, I don't think it's too big of a deal if a user browses away from the UI with the file browser modal open. Yes, the mount will end up staying active until the experiment is stopped and the VM is shut down but I think that's acceptable.

Ultimately I agree. Plus minimega will clear any mounts when you kill a VM. So shutting down the VM/experiment in phenix will clean everything up.

I did discover that things hang if you "pause" a VM and try to mount it. However, the same is true if you "stop" a VM in minimega and try to ls the mount directory. So I'm planning to open a bug for minimega on that one.

activeshadow commented 2 years ago

Okay @jacdavi I just did a force push. Please get your local branch in sync and do another round of testing to make sure the merge conflicts I fixed didn't end up breaking anything. All of the conflicts were in the UI javascript code.

activeshadow commented 2 years ago

I can see issues arising when, for example, a Windows VM that has to reboot a couple of times gets mounted by a user in between reboots. I just had this happen to me while testing. This is partly caused by using the VM UUID to detect if the cc agent is active, but using the host name won't always prevent the issue. This is also partly caused by minimega's default frequency for reaping dead cc agents being 30s, but again, having it happen more often won't help prevent all edge cases.

activeshadow commented 2 years ago

Also, I think I may have borked the Scorch table when addressing merge conflicts.

jacdavi commented 2 years ago

> Also, I think I may have borked the Scorch table when addressing merge conflicts.

I might not have updated that Buefy table correctly. I've never used Scorch, so I wasn't sure how to check it...

activeshadow commented 2 years ago

@jacdavi I think this might be ready... however, I'd like to look into putting the feature behind a feature flag just in case it's causing issues for someone. If you want to go ahead and do one more round of testing with the latest commits that would be great. I can then squash all the commits and see where I'm at with the feature flag stuff.

jacdavi commented 2 years ago

> @jacdavi I think this might be ready... however, I'd like to look into putting the feature behind a feature flag just in case it's causing issues for someone. If you want to go ahead and do one more round of testing with the latest commits that would be great. I can then squash all the commits and see where I'm at with the feature flag stuff.

I like the feature flag; that makes a lot of sense. Is there some documentation where it would be appropriate to mention the flag? Also, thanks for separating the mount functions into their own file. I wanted to do that, but wasn't sure if there was a convention I would be breaking.

Tested things out again, and they look good to me!

aherna commented 2 years ago

> I can see issues arising when, for example, a Windows VM that has to reboot a couple of times gets mounted by a user in between reboots. I just had this happen to me while testing. This is partly caused by using the VM UUID to detect if the cc agent is active, but using the host name won't always prevent the issue. This is also partly caused by minimega's default frequency for reaping dead cc agents being 30s, but again, having it happen more often won't help prevent all edge cases.

@jacdavi @activeshadow We need to make sure UUID is still used, as using hostnames with our experiments is not workable. Hostnames are dynamically generated at boot time, and we won't know what they will be.

We use UUID for SoH.

activeshadow commented 2 years ago

> I can see issues arising when, for example, a Windows VM that has to reboot a couple of times gets mounted by a user in between reboots. I just had this happen to me while testing. This is partly caused by using the VM UUID to detect if the cc agent is active, but using the host name won't always prevent the issue. This is also partly caused by minimega's default frequency for reaping dead cc agents being 30s, but again, having it happen more often won't help prevent all edge cases.
>
> @jacdavi @activeshadow We need to make sure UUID is still used, as using hostnames with our experiments is not workable. Hostnames are dynamically generated at boot time, and we won't know what they will be.
>
> We use UUID for SoH.

@aherna the convention we use for phenix in almost all cases but yours is to use the host name to tell when cc is ready, because Windows VMs have typically gone through all their reboots once the host name reported by miniccc matches the host name used in the topology file.

SoH, by default, uses host names. You must set the option to use UUIDs instead.

I'm wondering if it would make sense to add a setting to the topology node schema that signifies whether or not the host name of the VM will end up matching the host name used in the topology. If not, then we'd know to use UUIDs for that particular VM.

I'm open to other ideas as well, but I will keep pushing against using UUIDs (by default anyway) when determining if cc is active for VMs, especially when it comes to allowing a VM's file system to be mounted.
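
To make the proposal above concrete, here is a small Go sketch of a per-VM topology setting that says whether the reported host name will ever match the topology host name, falling back to the UUID when it won't. The field and function names are hypothetical and not part of phenix's actual schema.

```go
package main

import "fmt"

// vmIdentity sketches the proposed per-VM topology setting: if the VM's
// reported host name will never match the topology host name (e.g., it is
// generated dynamically at boot), fall back to the UUID when deciding
// whether the cc agent is active. These names are illustrative only.
type vmIdentity struct {
	Hostname        string // host name as defined in the topology
	UUID            string // UUID assigned to the VM by minimega
	HostnameMatches bool   // proposed flag: reported name will match the topology name
}

// ccIdentifier returns the value to compare against what miniccc reports.
func ccIdentifier(vm vmIdentity) string {
	if vm.HostnameMatches {
		return vm.Hostname
	}
	return vm.UUID
}

func main() {
	static := vmIdentity{Hostname: "win10-dc", UUID: "aaaa-1111", HostnameMatches: true}
	dynamic := vmIdentity{Hostname: "unknown-at-boot", UUID: "bbbb-2222", HostnameMatches: false}

	fmt.Println(ccIdentifier(static))  // win10-dc
	fmt.Println(ccIdentifier(dynamic)) // bbbb-2222
}
```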

jacdavi commented 2 years ago

> I can see issues arising when, for example, a Windows VM that has to reboot a couple of times gets mounted by a user in between reboots. I just had this happen to me while testing. This is partly caused by using the VM UUID to detect if the cc agent is active, but using the host name won't always prevent the issue. This is also partly caused by minimega's default frequency for reaping dead cc agents being 30s, but again, having it happen more often won't help prevent all edge cases.
>
> @jacdavi @activeshadow We need to make sure UUID is still used, as using hostnames with our experiments is not workable. Hostnames are dynamically generated at boot time, and we won't know what they will be. We use UUID for SoH.
>
> @aherna the convention we use for phenix in almost all cases but yours is to use the host name to tell when cc is ready, because Windows VMs have typically gone through all their reboots once the host name reported by miniccc matches the host name used in the topology file.
>
> SoH, by default, uses host names. You must set the option to use UUIDs instead.
>
> I'm wondering if it would make sense to add a setting to the topology node schema that signifies whether or not the host name of the VM will end up matching the host name used in the topology. If not, then we'd know to use UUIDs for that particular VM.
>
> I'm open to other ideas as well, but I will keep pushing against using UUIDs (by default anyway) when determining if cc is active for VMs, especially when it comes to allowing a VM's file system to be mounted.

Yeah Arthur is right that mounting won't work with our Windows VMs. I see you changed the cc check, but we'd also need to change the actual mount/unmount exec commands.

If you'd rather not use UUID for the general release, would it make sense for me to make another fork with UUIDs for our setup use? Would that reintroduce any bugs you fixed? Then the PR can be merged as-is or with the schema change you suggested.

jacdavi commented 2 years ago

I added 6063b0b547ae4f00b2c5ee45e4dae69d25544e96 since I was unable to start up our experiment that has a number of interfaces without an address like below. I can undo it if this is an issue with our setup.

          network:
            interfaces:
                - name: eth0
                  proto: dhcp
                  type: ethernet
                  vlan: vlan

activeshadow commented 2 years ago

> I can see issues arising when, for example, a Windows VM that has to reboot a couple of times gets mounted by a user in between reboots. I just had this happen to me while testing. This is partly caused by using the VM UUID to detect if the cc agent is active, but using the host name won't always prevent the issue. This is also partly caused by minimega's default frequency for reaping dead cc agents being 30s, but again, having it happen more often won't help prevent all edge cases.
>
> @jacdavi @activeshadow We need to make sure UUID is still used, as using hostnames with our experiments is not workable. Hostnames are dynamically generated at boot time, and we won't know what they will be. We use UUID for SoH.
>
> @aherna the convention we use for phenix in almost all cases but yours is to use the host name to tell when cc is ready, because Windows VMs have typically gone through all their reboots once the host name reported by miniccc matches the host name used in the topology file. SoH, by default, uses host names. You must set the option to use UUIDs instead. I'm wondering if it would make sense to add a setting to the topology node schema that signifies whether or not the host name of the VM will end up matching the host name used in the topology. If not, then we'd know to use UUIDs for that particular VM. I'm open to other ideas as well, but I will keep pushing against using UUIDs (by default anyway) when determining if cc is active for VMs, especially when it comes to allowing a VM's file system to be mounted.
>
> Yeah Arthur is right that mounting won't work with our Windows VMs. I see you changed the cc check, but we'd also need to change the actual mount/unmount exec commands.
>
> If you'd rather not use UUID for the general release, would it make sense for me to make another fork with UUIDs for our setup use? Would that reintroduce any bugs you fixed? Then the PR can be merged as-is or with the schema change you suggested.

@jacdavi whoops good catch. I just pushed a commit to use UUIDs when mounting. We're now skipping cc checks when unmounting so it's not needed there.

I'm fine with using UUIDs for this PR, but I would like to figure out a way to make it configurable in the (very near) future.

activeshadow commented 2 years ago

> I added 6063b0b since I was unable to start up our experiment that has a number of interfaces without an address like below. I can undo it if this is an issue with our setup.
>
>           network:
>             interfaces:
>                 - name: eth0
>                   proto: dhcp
>                   type: ethernet
>                   vlan: vlan

Ah yes, good catch! We don't use DHCP much, as you have probably noticed. :joy:

activeshadow commented 2 years ago

@jacdavi I did some additional testing with auth enabled and on a URL path that isn't / and everything seemed to work as expected. As such, I think I'm ready to squash and merge this in. Let me know how testing has gone for you today.

jacdavi commented 2 years ago

Things look good to me, I'm ready for the merge!