vocalpy / hybrid-vocal-classifier

a Python machine learning library for animal vocalizations and bioacoustics
http://hybrid-vocal-classifier.readthedocs.io
BSD 3-Clause "New" or "Revised" License

CLN/DEV: refactor codebase #122

Open NickleDave opened 3 years ago

NickleDave commented 3 years ago

Starting this issue where I will make notes on how to possibly refactor the hvc codebase so it is less spaghetti code. I may break some comments out into separate issues.

NickleDave commented 3 years ago

There should be a config sub-package with the current parseconfig module inside of it. Currently there's a parse sub-package; the naming is not quite explicit enough.

Edit: I would prefer a sub-package named config with a module named parse that has a function parse_yaml_config.

A lot of the logic around configuration parsing could be moved into this module, so that other functions just do straightforward iteration over some sort of attrs class / dataclass instance: e.g., a class called Config with an attribute todolist that is literally a list of instances of a class called TodoItem or something, each item having spect_params, seg_params, etc. If an item in the .yml file declares its own spect_params that need to take precedence over the top-level spect_params, that precedence can be resolved during config parsing when building the todo list. That would replace what currently happens, where each function internally determines whether there are spect_params for the current item in the to-do list that need to take precedence over the global spect_params.
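A minimal sketch of what that could look like, using stdlib dataclasses; all names here (Config, TodoItem, parse_yaml_config, the keys in the .yml) are hypothetical, not the current hvc API:

```python
# Hypothetical sketch, not the actual hvc API: dataclasses that a
# config.parse.parse_yaml_config function could return, with per-item
# spect_params / seg_params already resolved against the top-level
# defaults at parse time.
from dataclasses import dataclass, field

import yaml


@dataclass
class TodoItem:
    data_dirs: list[str]
    spect_params: dict
    seg_params: dict


@dataclass
class Config:
    spect_params: dict
    seg_params: dict
    todolist: list[TodoItem] = field(default_factory=list)


def parse_yaml_config(config_path: str) -> Config:
    """Parse a .yml config into a Config, resolving precedence up front."""
    with open(config_path) as fp:
        raw = yaml.safe_load(fp)

    top_spect = raw.get("spect_params", {})
    top_seg = raw.get("seg_params", {})

    todolist = []
    for item in raw.get("todo_list", []):
        todolist.append(
            TodoItem(
                data_dirs=item["data_dirs"],
                # per-item params override the top-level ones here, once,
                # so downstream functions can just iterate over the items
                spect_params={**top_spect, **item.get("spect_params", {})},
                seg_params={**top_seg, **item.get("seg_params", {})},
            )
        )
    return Config(spect_params=top_spect, seg_params=top_seg, todolist=todolist)
```

Downstream code would then only ever see resolved TodoItem instances, never the raw .yml.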

Some of the "private" helper functions for parsing configs might be useful to a general user. Of course the trade-off is that those would then definitely need to be unit-tested and generally better maintained. Is there key functionality that's worth factoring out? E.g., I am using hvc.parse.extract._validate_feature_group_and_convert_to_list for the tweetynet paper just to get the list of svm features so I can iterate over them. It seems like a general model_name_to_features_list function could be useful, especially under a framework where a "model" is specified in a way that includes its set of features.
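A rough sketch of what a public model_name_to_features_list could look like; the function, the mapping, and the feature names below are hypothetical placeholders, not hvc's actual feature groups:

```python
# Hypothetical sketch, not existing hvc code: a public mapping from model
# name to the list of feature names its feature group uses.
FEATURE_GROUPS: dict[str, list[str]] = {
    "svm": ["feature_a", "feature_b", "feature_c"],  # placeholder names
    "knn": ["feature_a", "feature_d"],               # placeholder names
}


def model_name_to_features_list(model_name: str) -> list[str]:
    """Return the list of feature names associated with a model name."""
    try:
        return list(FEATURE_GROUPS[model_name])
    except KeyError as err:
        raise ValueError(
            f"no feature group found for model name: {model_name!r}"
        ) from err
```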

NickleDave commented 3 years ago

The core code in FeatureExtractor is basically unreadable, in particular the method _from_file. All the weird stuff I did with `if x in locals()` and np.concatenate really obscures what's going on. It also prevents any kind of optimization that uses parallelization, e.g. dask. I think the easiest thing to do would be to re-write this in a possibly slow but readable way, make sure I didn't break it, and then optimize from there.
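For reference, a minimal sketch of the "slow but readable" shape I have in mind; the function names and the feature callables are hypothetical, not the current FeatureExtractor internals. The idea is to accumulate per-file results in a list and concatenate once at the end, instead of conditionally building arrays with `if x in locals()`. Written this way, the per-file step is a pure function, so it could later be parallelized with e.g. dask.delayed without restructuring.

```python
# Hypothetical sketch of a readable rewrite, not the current FeatureExtractor:
# extract features one file at a time with a pure function, collect results
# in a list, and concatenate per feature in one place at the end.
import numpy as np


def extract_features_from_file(audio_file: str, feature_funcs: dict) -> dict:
    """Compute each feature for one file; returns {feature name: 1-d array}."""
    # placeholder: real code would load audio / compute a spectrogram here
    data = np.zeros(100)
    return {name: np.atleast_1d(func(data)) for name, func in feature_funcs.items()}


def extract_features(audio_files: list[str], feature_funcs: dict) -> dict:
    """Map over files, then concatenate each feature's arrays in one step."""
    per_file = [extract_features_from_file(f, feature_funcs) for f in audio_files]
    return {
        name: np.concatenate([d[name] for d in per_file])
        for name in feature_funcs
    }
```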