Closed LeeBergstrand closed 7 months ago
It should be noted that I did this refactoring blindly and may have made a few mistakes along the way. I don't have a test suite for this code yet. This is where unit tests are essential. They let you rapidly make changes without fear of breaking anything.
@jmtsuji, what are your thoughts? I suggest using this refactoring as a starting point, and I can guide you through the process. Does that work?
@LeeBergstrand Thanks for these suggestions and changes! Indeed, I've been thinking that adding some objects into repair.py
could help to simplify the code. (Starting to learn OOP a couple weeks ago.) Your other refactoring suggestions are also much appreciated. It might take me a few days to get to this PR, but soon I'll run this updated code through some end-to-end tests, check the code, and add further changes as needed.
@jmtsuji For `subset_sequences` in `repair.py`:

```python
sequence_names = []
for record in load_fasta_sequences(input_fasta_filepath):
    sequence_names.append(record.name)
    if record.name in subset_sequence_ids:
        yield record

# Raise an error if there are duplicate sequence names
if len(set(sequence_names)) < len(sequence_names):
    sequence_names_series = pd.Series(sequence_names)
    duplicates_names = set(sequence_names_series[sequence_names_series.duplicated()])
    error = RuntimeError(f'Duplicate sequence IDs were detected in the input FastA file '
                         f'"{input_fasta_filepath}": {", ".join(duplicates_names)}')
    logger.error(error)
    raise error
```
@jmtsuji I will double-check to ensure the code after the `yield` (the `if len(set(sequence_names)) < len(sequence_names):` block) is executed. I know that if you have a `return`, none of the code below the `return` is executed.
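For reference, the behavior in question can be checked with a tiny standalone generator (toy names below, not the actual `repair.py` code): code placed after the loop in a generator does run, but only when the caller fully exhausts the generator.

```python
def subset_demo(names, subset_ids, log):
    """Toy stand-in for subset_sequences: yields matching names, then logs."""
    seen = []
    for name in names:
        seen.append(name)
        if name in subset_ids:
            yield name
    # This line runs only if the caller fully exhausts the generator
    log.append('duplicate check would run here')

# Fully consuming the generator reaches the post-loop code:
full_log = []
subset = list(subset_demo(['a', 'b', 'c'], {'b'}, full_log))

# Abandoning the generator early never reaches it:
partial_log = []
gen = subset_demo(['a', 'b', 'c'], {'a'}, partial_log)
first = next(gen)
```

So the post-`yield` check works as long as every consumer iterates the generator to completion.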
@LeeBergstrand Would you mind starting a review of this PR? For some reason, I am unable to request your review from my end (could be a temporary glitch in GitHub). I tried sending in an "approval" after self review to see if I could reset the system, but it didn't work... that's why you will see an "approval" on my end. Thanks in advance for your review!
@jmtsuji Yeah, I don't think you can do a review on your own pull request. Feel free to comment when you want me to have a look at your changes.
@LeeBergstrand Thanks again for starting this round of refactoring. I tested your changes using end-to-end tests, and they passed after I cleaned up a couple of very minor issues. I then did additional refactoring of other long functions in `repair.py`, following some of the concepts you introduced. A couple of the functions, like `iterate_linking_contig_ends` and `run_end_repair`, are still a bit long, but they are more compact and readable than before, I think. The current version is still passing end-to-end tests. Feel free to let me know if you have any further feedback on this version. I look forward to your review.
@LeeBergstrand Thanks for your review! Very helpful. I'll do further refactoring next week and will get back to you once finished.
> @jmtsuji For `subset_sequences` in `repair.py`: I will double-check to ensure the code after the `yield` (the `if len(set(sequence_names)) < len(sequence_names):` block) is executed. I know that if you have a `return`, none of the code below the `return` is executed.

Good point. To help make sure the code after `yield` is executed, I added a `try`-`finally` clause, as you'll see in the updated code:
```python
try:
    sequence_names = []
    for record in load_fasta_sequences(input_fasta_filepath):
        sequence_names.append(record.name)
        if record.name in subset_sequence_ids:
            yield record
finally:
    # Raise an error if there are duplicate sequence names
    if len(set(sequence_names)) < len(sequence_names):
        sequence_names_series = pd.Series(sequence_names)
        duplicates_names = set(sequence_names_series[sequence_names_series.duplicated()])
        error = RuntimeError(f'Duplicate sequence IDs were detected in the input FastA file '
                             f'"{input_fasta_filepath}": {", ".join(duplicates_names)}')
        logger.error(error)
        raise error
```
Sounds like the `finally` clause will run once `try` is finished, whether or not an exception is produced. Using `finally` has some caveats when combined with a `yield` statement (see the comments on https://stackoverflow.com/a/67429966) -- I think the `finally` clause will only be run after the generator closes. However, for the way the `subset_sequences` function is used in `repair.py` (within `while` loops), I imagine the `finally` statement should work.

Let me know if you think this is still not good practice, and we can consider other avenues to check that the FastA file doesn't contain duplicate sequence IDs. Thanks.
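The caveat about `finally` in generators can be demonstrated with a minimal standalone example (toy names, not project code): `finally` runs as soon as a generator is exhausted, but for a partially-consumed generator it is deferred until `close()` is called (or the generator is garbage-collected).

```python
def yield_with_finally(log):
    try:
        for x in [1, 2, 3]:
            yield x
    finally:
        log.append('finally ran')

# Exhausting the generator runs the finally block right away:
exhausted_log = []
list(yield_with_finally(exhausted_log))

# A partially-consumed generator defers finally until close():
deferred_log = []
gen = yield_with_finally(deferred_log)
next(gen)
before_close = list(deferred_log)  # snapshot: finally has not run yet
gen.close()  # raises GeneratorExit inside the generator, triggering finally
```

Note that in the deferred case the duplicate check would run long after iteration stopped, and any exception it raises would surface at `close()` time rather than during the loop.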
@LeeBergstrand Refactoring is close to done. Will push the code to this branch and request another review soon, after I have a chance to do some final edits and test the code.
> Good point. To help make sure the code after `yield` is executed, I added a `try`-`finally` clause ... I think the `finally` clause will only be run after the generator closes. However, for the way the `subset_sequences` function is used in `repair.py` (within `while` loops), I imagine the `finally` statement should work. Let me know if you think this is still not good practice, and we can consider other avenues to check that the FastA file doesn't contain duplicate sequence IDs.
Okay, so one generally only uses `try` statements when trying to catch errors:

```python
try:
    code  # try this code; it might encounter an error
except:
    code  # run this code if the error happens
finally:
    code  # run this code whether or not the error happens
```
So, we should only use `try` statements when catching errors, as this is their intended purpose. Using them in your suggested context is a bit of a corner-case usage.
I had a conversation with the JetBrains AI about generators (`yield`s). The prior code above might work even without the `finally`: all the local variables in the generator function are frozen between iterations.
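That freezing behavior is easy to verify with a toy generator (illustrative only, not project code): locals keep their values while the generator is suspended between `next()` calls.

```python
def running_total():
    total = 0
    for value in [10, 20, 30]:
        total += value
        yield total  # `total` is preserved ("frozen") while the generator is suspended

gen = running_total()
results = [next(gen), next(gen), next(gen)]
```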
One design pattern I follow is never having an exit statement (`return` or `yield`) in the middle of a function. An exit statement in the middle makes it harder to debug when using step-through debugger software, so I generally like to have a single exit statement at the end of a function. This makes it easy to step through and double-check the function's logic when debugging. Often, I will assign a return value to a variable and return it at the end.
One of the things the AI suggested, and I had thought of beforehand, was to check for duplicate names on the fly rather than waiting until the end of the function execution. For example:

```python
sequence_names = set()  # A set gives fast (O(1)) membership checks for the duplicate test below.
for record in load_fasta_sequences(input_fasta_filepath):
    new_sequence_name = record.name
    # Raise an error here so the generator stops as soon as we find a duplicate ID
    # (i.e., an ID that has been seen before).
    if new_sequence_name in sequence_names:
        error = RuntimeError(f'Duplicate sequence ID was detected in the input FastA file '
                             f'"{input_fasta_filepath}": {new_sequence_name}')
        logger.error(error)
        raise error
    sequence_names.add(new_sequence_name)
    if record.name in subset_sequence_ids:
        yield record
```
This has the following benefits:
@jmtsuji Thoughts?
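The fail-fast behavior of this on-the-fly check can be sketched in a self-contained form (stub names below replace `load_fasta_sequences`, the record objects, and the logger):

```python
def subset_sequences_demo(names, subset_ids):
    """Toy version of the on-the-fly duplicate check over plain name strings."""
    seen = set()
    for name in names:
        if name in seen:
            raise RuntimeError(f'Duplicate sequence ID was detected: {name}')
        seen.add(name)
        if name in subset_ids:
            yield name

# Unique IDs pass through normally:
ok = list(subset_sequences_demo(['a', 'b', 'c'], {'b'}))

# A duplicate ID stops the generator immediately, even mid-file:
try:
    list(subset_sequences_demo(['a', 'b', 'a'], {'c'}))
    raised = False
except RuntimeError:
    raised = True
```

Because the error is raised during iteration, it surfaces even if the caller never exhausts the generator past the duplicate.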
> One of the things the AI suggested and I thought of beforehand was to check for duplicate names on the fly rather than waiting until the end of the function execution.
>
> @jmtsuji Thoughts?
@LeeBergstrand Thanks for looking into this. I was also considering this check-on-the-fly approach but didn't want to waste CPU time by running the ID check every single iteration (I assume this will eventually become computationally expensive, e.g., for metagenomes). However, given that we are just using this tool for genomes at the moment, I agree that the code you added is much more straightforward, and it includes many of the other benefits you mentioned (I learned from the list you sent -- thanks). I've added in the refactored version you suggested.
@LeeBergstrand Thanks for the update in #13 . Now that #13 is merged in, would you like to further review this branch, or do you think this branch is OK to merge to develop? Thanks.
@jmtsuji We've made a lot of changes. Let's merge into develop so we are all synced up. Other than a few formatting things here and there the code looks good.
@LeeBergstrand Sounds good, and thanks once again for all your input.
@jmtsuji Okay, so I have started a refactoring of the rotary-utils code. I am presenting this as a guide for you. Potential changes to repair.py and the rest of the code have been made to act as a starting point and to introduce some concepts.
Changes:

- Subcommand calls now live in `external.py`. This forces all the code that involves subcommands to be in one place. If we choose to use a different CLI program library to call the subcommands down the road, there will be only one place to look.
- The `run_end_repair` function has been broken up. It's generally a good rule of thumb that a Python function should fit on a single page; if it doesn't, it should be broken down into a series of child functions. I have done this, though the child functions might not make the most sense to you. Feel free to refactor them in different ways.
- New objects, `RepairPaths`, `AssemblyInfo`, and `ContigInfo`, represent the many variables contained within `run_end_repair`. These objects allow easier refactoring, letting you pass bundles of info to functions instead of passing many variables. This is the core of object-oriented programming (OOP). Some of these objects could be generated outside of `run_end_repair` and perhaps passed to it and other functions such as `stitch_all_contigs`. I also think that `AssemblyInfo` and `ContigInfo` could be merged in some ways, as contigs are part of an assembly.

To do this refactoring, I made extensive use of PyCharm's automated refactoring tools: https://www.jetbrains.com/help/pycharm/refactoring-source-code.html