openshift-kni / debug-tools

debug and troubleshooting tools to check CNF workload configuration - unsupported
Apache License 2.0
0 stars 10 forks source link

Re enable machineinfo test #60

Closed marioferh closed 2 years ago

marioferh commented 2 years ago

Re-enable machineinfo test with new updated data of #59 Add clean of CPUVendorID to cleanProcfs to avoid system dependence Remove -P from machineinfo command. Not used procfs in machineinfotest and can lead to confusion

openshift-ci[bot] commented 2 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marioferh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift-kni/debug-tools/blob/main/OWNERS)~~ [marioferh] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
ffromani commented 2 years ago

/lgtm again extremely terse commit message (for example: reference the PR updating the e2e data; declaring that thanks to that we can now re-enable the tests), but good enough for now.

yanirq commented 2 years ago

Did you rebase on top of https://github.com/openshift-kni/debug-tools/pull/59 ? I see one test failing

marioferh commented 2 years ago

Did you rebase on top of #59 ? I see one test failing

Yep, rebased it, but its failing, checking it

openshift-ci[bot] commented 2 years ago

New changes are detected. LGTM label has been removed.

marioferh commented 2 years ago

there is some info in the machineinfo arch dependent, no all date comes from /proc /sys

ffromani commented 2 years ago

there is some info in the machineinfo arch dependent, no all date comes from /proc /sys

which infos?

marioferh commented 2 years ago

I can see many differences between run in the same machine as generated json and in the ci:

e.g.:

     -  "memory_by_type": map[string]interface{}{},
    +   "memory_by_type": map[string]interface{}{
    +       "Unbuffered-DDR4": map[string]interface{}{"capacity": float64(1.03079215104e+11), "dimm_count": float64(12)},
    +   }, 
ffromani commented 2 years ago

I can see many differences between run in the same machine as generated json and in the ci:

e.g.:

     -    "memory_by_type": map[string]interface{}{},
    +     "memory_by_type": map[string]interface{}{
    +         "Unbuffered-DDR4": map[string]interface{}{"capacity": float64(1.03079215104e+11), "dimm_count": float64(12)},
    +     }, 

this data is collected from sysinfo: https://github.com/google/cadvisor/blob/master/machine/machine.go#L157 and https://github.com/google/cadvisor/blob/master/machine/info.go#L77 . Let's make sure we collect this data ourselves (in gather-sysinfo, though).

ffromani commented 2 years ago

or, if the data is not relevant for us, let's make sure we never collect it: https://github.com/openshift-kni/debug-tools/blob/main/pkg/machineinformer/informer.go#L49

marioferh commented 2 years ago

or, if the data is not relevant for us, let's make sure we never collect it: https://github.com/openshift-kni/debug-tools/blob/main/pkg/machineinformer/informer.go#L49

I have to investigate which data is relevant

marioferh commented 2 years ago

/test ci-e2e-test

marioferh commented 2 years ago

@fromanirh there are three differences in machineinfo.json "vendor_id": "AuthenticAMD", "num_cores": 104, this can be hardcoded

And the topology that is different in local cluster and ci cluster.

I don't know if can be hard-coded or not

marioferh commented 2 years ago

typo: should read machineinfo

sorry, where?

ffromani commented 2 years ago

typo: should read machineinfo

sorry, where?

in the commit message proper image

ffromani commented 2 years ago

@fromanirh there are three differences in machineinfo.json "vendor_id": "AuthenticAMD", "num_cores": 104, this can be hardcoded

And the topology that is different in local cluster and ci cluster.

I don't know if can be hard-coded or not

the number of cores and the architecture should totally be read from procfs and sysfs so they should work there. We need to dig deeper to avoid hiding bugs.

Additionally, it would be nice to have docs to explain how to collect the machineinfo for the e2e tests, or next time we need to update we will end up in this stage again

marioferh commented 2 years ago

ok, let me check why is taking this data from system and no from local data

ffromani commented 2 years ago

before you submit the final version, note you still have the macineinfo typo in the commit message

marioferh commented 2 years ago

before you submit the final version, note you still have the macineinfo typo in the commit message

already fixed

openshift-ci[bot] commented 2 years ago

@marioferh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ci 892fda62310dd3511909020b24d6060b315efd3e link true /test ci

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
marioferh commented 2 years ago

@fromanirh fixed it. The initial issue was: tests launched in local extract info and then delete it, when I generate the output json with local command it was taking the local sys, not the uncompressed one from the tar :S The second was, CPUVendorID was not cleaned with the --clean-procfs-info. I've already added that cleanup. The command has -P ./test/data/dell_2_numa/proc then I though that proc data comes from local folder not from system proc, that confused me.

ffromani commented 2 years ago

ok, this is more complex than we initially thought. We can pause this work and resume later when there is more bandwidth and we can sort out and tackle the dependencies.