oamg / leapp

Leapp - Application & OS Modernization Framework (For in-place upgrades, follow leapp-repository)
https://oamg.github.io/leapp/
Apache License 2.0
86 stars 69 forks source link

Save actor tracebacks in leapp.db #832

Closed fernflower closed 1 year ago

fernflower commented 1 year ago

Now when an exception is thrown during actor execution it won't be shown only on stdout, but will appear in leapp.db and leapp report as high risk report message.

OAMG-4186

github-actions[bot] commented 1 year ago

Thank you for contributing to the Leapp project!

Please note that every PR needs to comply with the Leapp Guidelines and must pass all tests in order to be mergeable. If you want to request a review or rebuild a package in copr, you can use following commands as a comment:

To launch regression testing public members of oamg organization can leave the following comment:

Please open ticket in case you experience technical problem with the CI. (RH internal only)

Note: In case there are problems with tests not being triggered automatically on new PR/commit or pending for a long time, please consider rerunning the CI by commenting leapp-ci build (might require several comments). If the problem persists, contact leapp-infra.

fernflower commented 1 year ago

/rerun

github-actions[bot] commented 1 year ago

Copr build succeeded: https://copr.fedorainfracloud.org/coprs/build/6020934

github-actions[bot] commented 1 year ago

Testing Farm request for RHEL-8.6-rhui/5990856;6020934 regression testing has been created. Once finished, results should be available here. Full pipeline log.

github-actions[bot] commented 1 year ago

Testing Farm request for RHEL-7.9-rhui/5990856;6020934 regression testing has been created. Once finished, results should be available here. Full pipeline log.

fernflower commented 1 year ago

/rerun

github-actions[bot] commented 1 year ago

Copr build succeeded: https://copr.fedorainfracloud.org/coprs/build/6091405

github-actions[bot] commented 1 year ago

Testing Farm request for RHEL-8.6-rhui/6080984;6091405 regression testing has been created. Once finished, results should be available here. Full pipeline log.

github-actions[bot] commented 1 year ago

Testing Farm request for RHEL-7.9-rhui/6080984;6091405 regression testing has been created. Once finished, results should be available here. Full pipeline log.

PeterMocary commented 1 year ago

LGTM, although I agree with the suggestion from @pirat89 in previous comment - changing the immediate CLI output to something similar to the StopActorExecutionError would be nice.

Tried on rhel7 vm by raising an exception in scancpu actor.

Leapp report: /var/log/leapp/leapp-report.txt contains expected output:

Risk Factor: high
Title: Actor scancpu unexpectedly terminated with exit code: 1
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/leapp/repository/actor_definition.py", line 74, in _do_run
    actor_instance.run(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/leapp/actors/__init__.py", line 289, in run
    self.process(*args)
  File "/usr/share/leapp-repository/repositories/system_upgrade/common/actors/scancpu/actor.py", line 16, in process
    scancpu.process()
  File "/usr/share/leapp-repository/repositories/system_upgrade/common/actors/scancpu/libraries/scancpu.py", line 142, in process
    raise Exception("Testing exception")
Exception: Testing exception

Summary:
Key: d6c2efd3b09890ac78b14f0fb428e5236fe1c36c
----------------------------------------

Database: leapp-inspector messages --actor=scancpu:

######################################################################
                          PRODUCED MESSAGES
######################################################################
Stamp: 2023-06-19T11:37:56.658310Z
Actor: scancpu
Phase: FactsCollection
Type: ErrorModel
Message_data:
{
    "actor": "scancpu",
    "details": null,
    "message": "Actor scancpu unexpectedly terminated with exit code: 1\nTraceback (most recent call last):\n  File \"/usr/lib/python2.7/site-packages/leapp/repository/actor_definition.py\", line 74, in _do_run\n    actor_instance.run(*args, **kwargs)\n  File \"/usr/lib/python2.7/site-packages/leapp/actors/__init__.py\", line 289, in run\n    self.process(*args)\n  File \"/usr/share/leapp-repository/repositories/system_upgrade/common/actors/scancpu/actor.py\", line 16, in process\n    scancpu.process()\n  File \"/usr/share/leapp-repository/repositories/system_upgrade/common/actors/scancpu/libraries/scancpu.py\", line 142, in process\n    raise Exception(\"Testing exception\")\nException: Testing exception\n",
    "severity": "fatal",
    "time": "2023-06-19T11:37:56.657966Z"
}
######################################################################

leapp-inspector actors --actor=scancpu

Actor: scancpu
Executed: True
Phase: FactsCollection
Started: 2023-06-19T11:37:56.593592Z
Produced messages:
    - ErrorModel
Executed shell commands:
    - ['lscpu']
Logs:
--- {"message": "2023-06-19 11:37:56.627 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: External command has started: ['lscpu']", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.633 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Architecture:          x86_64", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.634 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: CPU op-mode(s):        32-bit, 64-bit", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.635 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Byte Order:            Little Endian", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.636 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: CPU(s):                2", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.637 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: On-line CPU(s) list:   0,1", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.638 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Thread(s) per core:    1", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.639 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Core(s) per socket:    1", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.639 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Socket(s):             2", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.640 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: NUMA node(s):          1", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.641 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Vendor ID:             GenuineIntel", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.642 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: CPU family:            6", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.642 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Model:                 142", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.643 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Model name:            Intel(R) Core(TM) i7-10610U CPU @ 1.80GHz", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.644 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Stepping:              12", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.645 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: CPU MHz:               2303.996", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.645 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: BogoMIPS:              4607.99", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.646 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Virtualization:        VT-x", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.647 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Hypervisor vendor:     KVM", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.648 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Virtualization type:   full", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.648 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: L1d cache:             32K", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.649 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: L1i cache:             32K", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.650 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: L2 cache:              4096K", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.650 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: L3 cache:              16384K", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.651 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: NUMA node0 CPU(s):     0,1", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.652 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 arat umip md_clear spec_ctrl intel_stibp arch_capabilities", "level": "DEBUG"}
--- {"message": "2023-06-19 11:37:56.653 DEBUG    PID: 8240 leapp.workflow.FactsCollection.scancpu: External command has finished: ['lscpu']", "level": "DEBUG"}

I'd suggest to add an ERROR log as well, if that makes sense and would be doable.

fernflower commented 1 year ago

@PeterMocary @pirat89 Thanks for the review! If I understand you guys correctly, you would like to wrap the hard crash into a leapp exception and put the traceback in the details, right? (I honestly prefer the hard crash way, as any actor crashing is a severe thing that has to be as obvious as possible, but I will implement it the way you suggest and we'll see :)

pirat89 commented 1 year ago

@fernflower In case of the report I am more wondering WDYT about putting it into the report like this:

Risk Factor: high
Title: Actor scancpu unexpectedly terminated with exit code: 1
Summary:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/leapp/repository/actor_definition.py", line 74, in _do_run
    actor_instance.run(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/leapp/actors/__init__.py", line 289, in run
    self.process(*args)
  File "/usr/share/leapp-repository/repositories/system_upgrade/common/actors/scancpu/actor.py", line 16, in process
    scancpu.process()
  File "/usr/share/leapp-repository/repositories/system_upgrade/common/actors/scancpu/libraries/scancpu.py", line 142, in process
    raise Exception("Testing exception")
Exception: Testing exception

Key: d6c2efd3b09890ac78b14f0fb428e5236fe1c36c
----------------------------------------
######################################################################
                          PRODUCED MESSAGES
######################################################################
Stamp: 2023-06-19T11:37:56.658310Z
Actor: scancpu
Phase: FactsCollection
Type: ErrorModel
Message_data:
{
    "actor": "scancpu",
    "details": "{'traceback': 'Traceback (most recent call last):\n  File \"/usr/lib/python2.7/site-packages/leapp/repository/actor_definition.py\", line 74, in _do_run\n    actor_instance.run(*args, **kwargs)\n  File \"/usr/lib/python2.7/site-packages/leapp/actors/__init__.py\", line 289, in run\n    self.process(*args)\n  File \"/usr/share/leapp-repository/repositories/system_upgrade/common/actors/scancpu/actor.py\", line 16, in process\n    scancpu.process()\n  File \"/usr/share/leapp-repository/repositories/system_upgrade/common/actors/scancpu/libraries/scancpu.py\", line 142, in process\n    raise Exception(\"Testing exception\")\nException: Testing exception\n'}"
    "message": "Actor scancpu unexpectedly terminated with exit code: 1",
    "severity": "fatal",
    "time": "2023-06-19T11:37:56.657966Z"
}
######################################################################

I mean to keep the title as shorter msg and use details/summary/... for the rest. But as I told I am ok with this solution also. Btw, for the console output it's same as before:

Failed to get unit file state for ip6tables.service: No such file or directory
2023-06-20 13:16:00.228 DEBUG    PID: 3771 leapp.workflow.FactsCollection.system_facts: Command ['systemctl', 'is-enabled', 'ip6tables'] failed with exit code 1.
2023-06-20 13:16:00.230 DEBUG    PID: 3771 leapp.workflow.FactsCollection.system_facts: External command has finished: ['systemctl', 'is-enabled', 'ip6tables']
2023-06-20 13:16:00.231 DEBUG    PID: 3771 get_firewalld_status: The ip6tables service is likely not enabled nor running
Process Process-220:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/site-packages/leapp/repository/actor_definition.py", line 74, in _do_run
    actor_instance.run(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/leapp/actors/__init__.py", line 289, in run
    self.process(*args)
  File "/usr/share/leapp-repository/repositories/system_upgrade/common/actors/systemfacts/actor.py", line 62, in process
    if not architecture.matches_architecture(architecture.ARCH_S390X).mooo():
AttributeError: 'bool' object has no attribute 'mooo'

=============================================================================================
Actor system_facts unexpectedly terminated with exit code: 1 - Please check the above details
=============================================================================================

I actually realized the "banner" statement just now :-D It's cool.

pirat89 commented 1 year ago

@fernflower @PeterMocary btw, I've realized that for error models in the report.txt file, we do not print that it's error, inhibitor, ... it looks just a high severity msg. I am thinking we should improve it in future somehow to make it clear that it's error, but I do not want to change the structure/schema because of that.

pirat89 commented 1 year ago

@PeterMocary @pirat89 Thanks for the review! If I understand you guys correctly, you would like to wrap the hard crash into a leapp exception and put the traceback in the details, right? (I honestly prefer the hard crash way, as any actor crashing is a severe thing that has to be as obvious as possible, but I will implement it the way you suggest and we'll see :)

Just realized you summarized that right! Sorry, my brain is slow today :/

fernflower commented 1 year ago

/rerun

github-actions[bot] commented 1 year ago

Copr build succeeded: https://copr.fedorainfracloud.org/coprs/build/6107459

github-actions[bot] commented 1 year ago

Testing Farm request for RHEL-8.6-rhui/6096363;6107459 regression testing has been created. Once finished, results should be available here. Full pipeline log.

github-actions[bot] commented 1 year ago

Testing Farm request for RHEL-7.9-rhui/6096363;6107459 regression testing has been created. Once finished, results should be available here. Full pipeline log.

fernflower commented 1 year ago

@PeterMocary @pirat89 All your comments should be resolved - moved trace to the message details (so that a nice Summary with shorter message name will be shown in the report), error is logged at the time of the crash as well. Could you please re-review? (leapp-report.txt)

----------------------------------------
Risk Factor: high
Title: Actor bacula_check unexpectedly terminated with exit code: 1
Summary: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/leapp/repository/actor_definition.py", line 74, in _do_run
    actor_instance.run(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/leapp/actors/__init__.py", line 289, in run
    self.process(*args)
  File "/etc/leapp/repos.d/system_upgrade/common/actors/baculacheck/actor.py", line 20, in process
    42 / 0
ZeroDivisionError: division by zero

Key: 4117a75983f26b1c6987c1bcc84ac31f9a2d8480

(except from leapp-upgrade.log)

2023-06-23 09:09:23.81  INFO     PID: 235468 leapp.workflow.Checks: Executing actor bacula_check 
2023-06-23 09:09:23.131 ERROR    PID: 235468 leapp.workflow.Checks: Actor bacula_check has crashed: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/leapp/repository/actor_definition.py", line 74, in _do_run
    actor_instance.run(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/leapp/actors/__init__.py", line 289, in run
    self.process(*args)
  File "/etc/leapp/repos.d/system_upgrade/common/actors/baculacheck/actor.py", line 20, in process
    42 / 0
ZeroDivisionError: division by zero
pirat89 commented 1 year ago

@fernflower thanks! lgtm and tests passed. Merging.