Niels instructions - Githubissues

A ton of instructions that Niels gave me:
Line 9. I recommend sticking with the following indentation method, as it makes it more easy when a segment (such as a set of function arguments) ends.

setup(
    ...
)
The following makes it more difficult to see if you put the ) there, especially if the arguments are long (longer than a, b, c at least).

function(a
         b
         c)
Line 18. It is more common to use LICENSE than LICENSE.txt. I don't know why, but it's probably best to stick to standards.

Line 23. Remove spaces, such as "plotille>=3.7" (you do it correctly in tests_require)

htanalyser/htanalyser.py
Line 11. Remove import htcondor, modify line 12 to instead read from htcondor import JobEventType as jet, JobEventLog as jel. Otherwise you can call htcondor.JobEventType and jet to do the same thing.

Line 26. Don't provide a docstring for the class and the init, it is either/or because they both refer to the same functionality (class initialization). This may lead to problems if you ever decide to use a tool that automatically parses docstrings for online documentation.

Line 50. You don't have to explain why you are using default values for parameters. It's a given.

Line 53. You list the parameters, but don't describe them. If they do not need description, then leave them out. However, if you describe even one of them, describe them all.

Line 64. If it fits, you can shorten such brief conditional statements to: self.show_list = [] if show_list is None else show_list

Line 97. This fits on one line. It's better to add the comment on the line before it, provided the comment is necessary.

Line 99. It is much more convenient to loop through the dictionary, with:

for key, value in dict.items():
    ...
You can also add an enumerator (does what you try to accomplish with range), with:

for i, (key, value) in enumerate(dict.items()):
    ...
Note that it seems the dictionary contains lists, and the lists are associated by location in that list. It is probably a good idea, for clarity's sake, to "refactor" the dictionary, so that it would be a list of dictionaries, like:

resources = [
    {'resources': 1, 'requested': 2, 'usage': 3},
    {'resources': 1, 'requested': 5, 'usage': 2}
]
If the original structure must be maintained for other reasons, then ignore what I said. You can still use enumerate instead of range, though.

If you are ever afraid a key might not exist, you can use dict.get('foo', None). This will return the value associated with the key 'foo', or return None if it does not exist.

Line 106. Comments not necessary, you can easily interpret from the code what you are doing. Same for the later comments. Generally, whenever you place a comment, first think if you can make the code more readable by changing variable names or splitting the section up into more lines. If that doesn't improve clarity enough you can write a comment. Mostly you'll use it to explain specialized (often external) know-how, e.g., "API does not allow X, hence Y was used".

Line 111. You don't have to do a type conversion to string when using f-strings, because f-strings by default convert everything between brackets to strings. Also, when lines get lengthy, you can split them up into the several actions composing them:

msg = f'[red]{resources["Usage"][i]}[/red]'
resources['Usage'][i] = msg
Line 152. You can use f-strings here too, so: rprint('[red]The smart_output_error method requires a {self.std_err} file as parameter[/red]'). Note that you switch between f-strings and 'foo' + bar + 'baz' uses throughout the script. Try to stick with a method, unless the other has clear advantages.

Line 155. When you are working with file paths, use Python's os library (os.path in particular) to do the job. This tool ensures that things will function regardless of OS, preventing many headaches.

Line 160. Try to put spaces at the end of a line, rather than at the beginning. This way all lines line up.

Line 16. ) on the next line.

Line 249. That's a huge Try block. You expect the possibility of an OS error, but this possibility is likely limited to one line. Therefore, encapsulate only the part that is relevant. The more code you add in here, the more unexpected things can happen. What if another line that you did not expect to trigger an error suddenly errors out with an OSError? You'll never know now. Note that you can break out of loops (to cancel them entirely) or use continue to move to the next iteration immediately without continuing the current.

Line 313. You're emulating a function that dict.get already has (in fact, there's no other reason to use dict.get than to use its default functionality): cpu_usage = event.get('CpusUsage', np.nan), also see the comment for line 99.

Line 337. Not strictly necessary, but try to use numpy dtypes (e.g., np.float64 instead of float)

Line 419. This will crash if the first if-clause above it triggers, because then state is not defined.

Line 461. For clarity, use get_host_by_addr instead of gethostbyaddr. Classes use CamelCase, whereas functions use snake_case without capitalization. The purpose of snake case is to separate words by an underscore to maintain their readability.

Line 471. Redundant use of list, dict.keys() returns an iterable that works with in, that is:

if ip in self.store_dns_lookups.keys():
    ...
Line 475. You are outputting to the root logger, it is better to make a custom logger (see HTCrystalBall code)

Line 502. While not strictly necessary, it is better to split this in two lines. That makes it more clear which part of the code triggers the error: the for loop (e.g. if tracker is not an iterator) or the track function.

tracker = track(log_files, transient=True, description='Processing...')
for file in tracker:
    ...
Line 512. What's the purpose of f""? If you don't have enough space, either split into msg = f'foo' and then assigning msg to the result_dict, or use shorter names.

Line 556. Be careful, you are returning a string but your function specifies it should be a List[dict]. It is better to log this, than to return an error message. If you expect multiple output types, you can use:

from typing import Union

def analyse(self, log_files: list_of_logs, show_legend=True) -> Union[log_inf_list, str]:
    ...
Importantly, you're returning what is likely an error or output message as a return value of the string. This requires some mental gymnastics to use. It is better to return False, and then check if the return object is True. Remember: every object that contains content will return True, unless that content is anything that resolves to False (i.e., 0, None, np.nan, etc.). Thus, if output = ['a', 'b'] in case 1, and output = [] in case 2, then if output: resolves for case 1, and does not resolve for case 2.

Line 576. An empty dictionary equals False. The same thing applies to lists and other iterables. So you can do:

if job_dict:
    ...
Line 614. When you need to resort to such nasty line breaks, you are nesting too much. You can do:

if ram_history:
    ram = np.array(...)

if ram_history and len(ram) > 1:
    fig = Figure()
...
rather than nest if len(ram) > 1 into the previous statement. Technically this slows down your code because of the additional if-statement, but this is so incredibly light-weight you won't even notice. Such a level of optimization is not necessary here, but readability is much more important.

Line 797. Are you sure this works? Because average_job_duration is a float, and a float does not have the .days or .seconds method.

Line 833. No reason to make valid_files. You can do:

if not log_files:
    return "foo"
Line 1139. Why did you encapsulate a type error into a function that doesn't do much else than the type error itself? Just use raise TypeError('Expecting a list or a string'). Note that it is fine to error-out a script if something unexpected happens, but in this case since you are expecting it, you could end more gracefully with a log message and sys.exit.

General tips. Shorten this code up a bit. If there are code blocks that perform a specific function that can be understood outside of the context it is currently placed in, don't hesitate to turn it into a function even if you only use it once. This is all for the sake of brevity and readability. That class you've made is very large and unwieldy, anything to shorten it would be an improvement.

You could separate the logic in a "model, view, controller"-style. You have a separate part that obtains the data from HTCondor (model), a part that validates and modifies the data (controller), and then finally a part that formats this data for printing purposes (view).
This also closes #101
Impacted Files	Coverage Δ
setup.py	`0.00% <0.00%> (ø)`
htcanalyze/main.py	`77.70% <58.33%> (-1.00%)`	:arrow_down:
htcanalyze/logvalidator.py	`67.77% <80.00%> (ø)`
htcanalyze/htcanalyze.py	`77.57% <85.36%> (-0.05%)`	:arrow_down:
psyinfra / HTCAnalyze

Niels instructions #111

Codecov Report