tumblr / collins

groovy kind of love
tumblr.github.com/collins
Apache License 2.0
572 stars 99 forks source link

NICs in wrong order unless memory banks meet specific criteria #534

Open brandt opened 7 years ago

brandt commented 7 years ago

Here's a weird bug.

When processing uploaded lshw data, Collins gets the NICs backwards unless there are more than 5 memory banks, of which the last three must be occupied.

Reproducing

I've included some usable example lshw files to reproduce the behavior. (They're missing a lot of details, but Collins will accept them.)

I used the tumblr/collins:latest Docker image.

To verify order, go to the #hardware-details tab of the asset page. The correct order should be:

  1. 52:54:00:00:00:01
  2. 52:54:00:00:00:02

Example Failure

If a host (VM for example) has only one memory bank, its NICs will be listed in reverse order in Collins:

<!-- NICs in reverse order -->
<node class="system">
  <node id="core">
    <node class="processor" handle="DMI:1400" id="cpu">
      <product>Foo</product>
    </node>
    <node class="bridge">
      <node class="network" id="network:0">
        <serial>52:54:00:00:00:01</serial>
        <capacity>1</capacity>
      </node>
      <node class="network" id="network:1">
        <serial>52:54:00:00:00:02</serial>
        <capacity>1</capacity>
      </node>
    </node>
    <node class="memory" id="memory">
      <node class="memory" id="bank">
        <size>1</size>
      </node>
    </node>
  </node>
</node>

Example Success

However, if there are at least 5 memory banks (of which the last 3 must be occupied), it will be in the correct order:

<!-- NICs in correct order -->
<node class="system">
  <node id="core">
    <node class="processor" handle="DMI:1400" id="cpu">
      <product>Foo</product>
    </node>
    <node class="bridge">
      <node class="network" id="network:0">
        <serial>52:54:00:00:00:01</serial>
        <capacity>1</capacity>
      </node>
      <node class="network" id="network:1">
        <serial>52:54:00:00:00:02</serial>
        <capacity>1</capacity>
      </node>
    </node>
    <node class="memory" id="memory">
      <node class="memory" id="bank:0">
        <size>0</size>
      </node>
      <node class="memory" id="bank:1">
        <size>0</size>
      </node>
      <node class="memory" id="bank:2">
        <size>1</size>
      </node>
      <node class="memory" id="bank:3">
        <size>1</size>
      </node>
      <node class="memory" id="bank:4">
        <size>1</size>
      </node>
    </node>
  </node>
</node>

With any fewer banks, the NICs are in the wrong order. If any of the last 3 banks are empty, the NICs will also be in the wrong order.

michaeljs1990 commented 7 years ago

I tested this out locally with the data you provided above and everything is returned in the correct order. I need to test with some real LSHW outputs though that feature both cases to make sure nothing else strange changes.

michaeljs1990 commented 7 years ago

Out of curiosity was this causing any issues or was it simply a UI bug for you?

brandt commented 7 years ago

Our provisioning system assumes that the order in Collins corresponds with the order of the NICs as they appear on the running machine. It uses that info to create an ephemeral system entry in Cobbler.

It's always worked properly on metal, but when we dual-homed a VM it would configure the NICs backwards.

michaeljs1990 commented 7 years ago

Well if you are feeling adventurous you can patch in the PR above it doesn't modify how data is saved only how it is returned after being fetched from the DB so no risk of messing anything up permanently.

brandt commented 7 years ago

Awesome, thank you!

byxorna commented 7 years ago

@brandt id be careful with that... Relying on the ordering in collins can cause biosdevname ordering issues, if for example your NIC has 2 ports, but the 2 mac addresses enumerate backwards. I.e. interface index in kernel => mac: 0 => AA:BB:CC:DD:EE:F1, 1 => AA:BB:CC:DD:EE:F0. Some tooling naievely sorts interfaces by mac, and others use ifindex like the kernel does

brandt commented 7 years ago

@byxorna yeah, the assumptions about NIC order have been fragile.

Collins is our "source of truth," so the design goal behind this was to eliminate an unnecessary duplication of information. We'll be able to get by with this fix for now, but I have some ideas on how to do it more safely using the LLDP data.