intel / pmem-csi

Persistent Memory Container Storage Interface Driver
Apache License 2.0

cache data in LVM device manager? #413

Open pohly opened 5 years ago

pohly commented 5 years ago

In https://github.com/intel/pmem-csi/pull/408#issuecomment-532540904, @okartau pointed out that https://github.com/intel/pmem-csi/blob/e09d8dae00fcd126011c6bbe66951f90b35fec30/pkg/pmem-device-manager/pmd-lvm.go caches information about the LVM state on the node. @avalluri explained that this is to avoid expensive invocation of LVM commands and parsing of their output.

Let's continue the discussion whether that caching is worth the extra complexity. It already was the source of one bug (the one fixed by https://github.com/intel/pmem-csi/pull/408).

Do we have data on how expensive the LVM command invocation+parsing really is?
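
For context, the uncached path under discussion would look roughly like the sketch below: shell out to an LVM command for every query and parse its text output. This is an illustration only, not the driver's actual code; the volume group name and the timing printout are assumptions added to show where the cost would come from and how one could measure it.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// listLogicalVolumes is a stand-in for the uncached path: every query
// shells out to LVM and parses the textual output.
func listLogicalVolumes(vg string) ([]string, error) {
	start := time.Now()
	// "--noheadings -o lv_name" limits the output to one LV name per line.
	out, err := exec.Command("lvs", "--noheadings", "-o", "lv_name", vg).Output()
	if err != nil {
		return nil, fmt.Errorf("lvs failed: %w", err)
	}
	var names []string
	for _, line := range strings.Split(string(out), "\n") {
		if name := strings.TrimSpace(line); name != "" {
			names = append(names, name)
		}
	}
	// Timing the call is one simple way to answer the "how expensive is it?"
	// question on real hardware.
	fmt.Printf("lvs invocation+parsing took %v\n", time.Since(start))
	return names, nil
}

func main() {
	// "pmem-vg" is a hypothetical volume group name.
	if names, err := listLogicalVolumes("pmem-vg"); err == nil {
		fmt.Println(names)
	}
}
```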

okartau commented 5 years ago

I repeat my reasoning here, which I posted earlier in a PR. Currently, when a volume is created (in LVM mode), we create entries recording that volume in the following places:

  1. LVM volume itself
  2. cache entry in lvm.devices
  3. entry in Node Volume registry
  4. entry in Controller Volume registry

Node, Controller, and k8s (and possibly even more, some sidecars) are exchanging messages. With rapid RPC, messages get repeated, delayed, or lost, and mutexes (in multiple players and layers) are waited on (sometimes for a long time during a larger-volume shred). PMEM-CSI will face so many combinations of possible errors that it is impossible to test everything, and in the long run those 4 entities may get out of sync. Debugging those problems is difficult or impossible without detailed logs (which are not enabled in production). That's why I propose to look for ways to reduce the amount of state we maintain. If an LVM operation takes some (minor) extra time, that is no big problem, as provisioning a volume is generally not a time-critical operation. My idea is to prioritize reliability over performance.

avalluri commented 5 years ago

Let's continue the discussion whether that caching is worth the extra complexity.

I don't agree that it adds any extra complexity :). It's just one more data structure that we add values to and delete values from. One can argue that it needs mutex locking, but all LVM calls are already synchronized.

Do we have data on how expensive the LVM command invocation+parsing really is?

No. I would see this more as generic program design, where a simple (inexpensive) data structure can do the job rather than a context-switched exec call.
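
For comparison, the cached approach being defended here boils down to a mutex-protected map that is only touched when the driver itself creates or deletes a volume. This is a minimal sketch, not the actual pmd-lvm.go code; the type and field names are made up:

```go
package lvmcache

import "sync"

// volumeCache sketches the kind of bookkeeping under discussion:
// an in-memory map of volume name -> size, updated only when the
// driver itself creates or deletes a volume.
type volumeCache struct {
	mutex   sync.Mutex
	devices map[string]uint64 // volume name -> size in bytes
}

func newVolumeCache() *volumeCache {
	return &volumeCache{devices: map[string]uint64{}}
}

func (c *volumeCache) add(name string, size uint64) {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	c.devices[name] = size
}

func (c *volumeCache) remove(name string) {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	delete(c.devices, name)
}

// get answers lookups without invoking any LVM command (the performance
// argument); the reliability argument is that this copy can drift from
// the real LVM state.
func (c *volumeCache) get(name string) (uint64, bool) {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	size, ok := c.devices[name]
	return size, ok
}
```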

avalluri commented 5 years ago

I repeat my reasoning here, which I posted earlier in a PR. Currently, when a volume is created (in LVM mode), we create entries recording that volume in the following places:

  1. LVM volume itself
  2. cache entry in lvm.devices
  3. entry in Node Volume registry
  4. entry in Controller Volume registry

Node, Controller, and k8s (and possibly even more, some sidecars) are exchanging messages. With rapid RPC, messages get repeated, delayed, or lost, and mutexes (in multiple players and layers) are waited on (sometimes for a long time during a larger-volume shred). PMEM-CSI will face so many combinations of possible errors that it is impossible to test everything, and in the long run those 4 entities may get out of sync. Debugging those problems is difficult or impossible without detailed logs (which are not enabled in production). That's why I propose to look for ways to reduce the amount of state we maintain. If an LVM operation takes some (minor) extra time, that is no big problem, as provisioning a volume is generally not a time-critical operation. My idea is to prioritize reliability over performance.

I agree with your argument, and it brings up the question of whether PMEM-CSI really needs any caching even in step 3. It could get all the needed information by querying its StateManager and/or DeviceManager.

pohly commented 5 years ago

Amarnath Valluri notifications@github.com writes:

Let's continue the discussion whether that caching is worth the extra complexity.

I don't agree that it adds any extra complexity :). It's just one more data structure that we add values to and delete values from. One can argue that it needs mutex locking, but all LVM calls are already synchronized.

Do we have data on how expensive the LVM command invocation+parsing really is?

No. I would see this more as generic program design, where a simple (inexpensive) data structure can do the job rather than a context-switched exec call.

If maintaining that data structure is simple, then why was there a bug in the code? Code that doesn't exist can't have bugs. That's also a design principle. For PMEM-CSI, correctness and robustness are more important than performance.

avalluri commented 5 years ago

If maintaining that data structure is simple, then why was there a bug in the code?

The bug was not because of the complexity; it was the result of a human error in writing and reviewing the code.

pohly commented 5 years ago

Amarnath Valluri notifications@github.com writes:

If maintaining that data structure is simple, then why was there a bug in the code?

The bug was not because of the complexity; it was the result of a human error in writing and reviewing the code.

That just proves my point :-) Even if the code is simple, mistakes happen, and it is better not to have the code in the first place if it is not absolutely needed.

Regarding robustness: what outside events could lead to the cached data becoming stale, and how would PMEM-CSI recover from that?

avalluri commented 5 years ago

what outside events could lead to the cached data becoming stale

None. The cache is updated only by calls (Create/Delete volume) initiated by Kubernetes or the PMEM-CSI node controller. Changes happening to LVM volumes outside of the driver's context (say, manually removing an LVM volume) are not tracked by the driver anyway.

pohly commented 5 years ago

Amarnath Valluri notifications@github.com writes:

what outside events could lead to the cached data becoming stale

None. The cache is updated only by calls (Create/Delete volume) initiated by Kubernetes or the PMEM-CSI node controller. Changes happening to LVM volumes outside of the driver's context (say, manually removing an LVM volume) are not tracked by the driver anyway.

What happens if the admin manually does an lvremove for a volume that he wants to get rid of quickly?

avalluri commented 5 years ago

What happens if the admin manually does an lvremove for a volume that he wants to get rid of quickly?

This is not the expected way for an admin to remove a volume; he has to remove the appropriate PVC/PV. But anyway, in this case our driver cache (not just the LVM cache) still treats the volume as existing until we get a request from Kubernetes to delete that volume.

pohly commented 5 years ago

Amarnath Valluri notifications@github.com writes:

What happens if the admin manually does an lvremove for a volume that he wants to get rid of quickly?

This is not the expected way for an admin to remove a volume; he has to remove the appropriate PVC/PV.

I know and you know, but an admin might not care and this might even be justified when the cluster is screwed and he needs to clean up. He might first manually delete the volume and then delete the PVC.

But anyway, in this case our driver cache (not just the LVM cache) still treats the volume as existing until we get a request from Kubernetes to delete that volume.

And will that then work?

avalluri commented 5 years ago

But anyway, in this case our driver cache (not just the LVM cache) still treats the volume as existing until we get a request from Kubernetes to delete that volume. And will that then work?

As per the current code, DeleteVolume in this case fails at ClearDevice().
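
One way to make that scenario survivable would be to treat a logical volume that is already gone as successfully deleted, so that Kubernetes can finish removing the PV. This is only a sketch of the idea, not the driver's current behavior; matching on the lvremove error message is an assumption:

```go
package lvm

import (
	"fmt"
	"os/exec"
	"strings"
)

// deleteVolume sketches an idempotent delete: if the logical volume is
// already gone (for example after a manual lvremove by the admin),
// report success so that Kubernetes can finish deleting the PV instead
// of retrying forever.
func deleteVolume(vg, name string) error {
	out, err := exec.Command("lvremove", "-f", vg+"/"+name).CombinedOutput()
	if err != nil {
		// Matching on this message is an assumption about LVM's output;
		// "Failed to find logical volume" is what lvremove prints for a
		// missing LV.
		if strings.Contains(string(out), "Failed to find logical volume") {
			return nil // already removed, treat as success
		}
		return fmt.Errorf("lvremove failed: %q: %w", string(out), err)
	}
	return nil
}
```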

okartau commented 5 years ago

Ironically, just after I started this "let's reduce bookkeeping entries" discussion, I see the need to add more of it. The node server should register published volumes so that it can check in repeated calls whether some attribute (like read-only) stayed the same. Currently we don't keep that information. To implement the CSI spec better, we have to start recording those bits of information. And as the node server has no access to previously existing records (other than the DeviceManager), it likely needs another registry (i.e. a 5th copy of volume information added to the 4 listed above).

BTW, while looking around in that code, I see there is already a VolInfo field designed into the nodeServer struct, but we never store anything there; it looks like the original idea was never fully implemented. We do try to read entries from that struct, however, and always get an empty value (which is correct) and log it.
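
A sketch of what such a published-volume registry (as described above) could look like; the type and field names are hypothetical, not the driver's actual code. The idea is to record the attributes of each successful NodePublishVolume and compare them on repeated calls, as the CSI idempotency rules require:

```go
package node

import (
	"fmt"
	"sync"
)

// publishedVolume records the attributes of a successful
// NodePublishVolume call that must match on a repeated call.
type publishedVolume struct {
	targetPath string
	readOnly   bool
}

// publishRegistry sketches the "5th copy" of volume state discussed
// above, keyed by volume ID.
type publishRegistry struct {
	mutex   sync.Mutex
	volumes map[string]publishedVolume
}

// checkOrAdd registers a volume on first publish and, on a repeated
// call, verifies that the attributes are unchanged.
func (r *publishRegistry) checkOrAdd(volumeID string, v publishedVolume) error {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	if r.volumes == nil {
		r.volumes = map[string]publishedVolume{}
	}
	if existing, ok := r.volumes[volumeID]; ok {
		if existing != v {
			return fmt.Errorf("volume %s already published with different attributes", volumeID)
		}
		return nil // idempotent repeat, nothing to do
	}
	r.volumes[volumeID] = v
	return nil
}
```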