rotary-genomics / spokewrench

Toolkit for manipulating circular DNA sequence elements
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

Develop refactor #6

Closed LeeBergstrand closed 7 months ago

LeeBergstrand commented 7 months ago

@jmtsuji Okay, so I have started a refactoring of the rotary-utils code. I am presenting this as a guide for you. Potential changes to repair.py and the rest of the code have been made to act as a starting point and to introduce some concepts.

Changes:

  1. I moved all the code that calls external programs to a single file called external.py. This forces all the code that involves subcommands to be in one place. If we choose to use a different CLI program library to call the subcommands down the road, then there will be only one place to look.
  2. Refactored run_end_repair function. It's generally a good rule of thumb that a Python function should fit on a single page. It should be broken down into a series of child functions if it doesn't fit on one page. I have done this. Though they might not make the most sense to you. Feel free to refactor them in different ways.
  3. I created a series of data classes, RepairPaths, AssemblyInfo, and ContigInfo, that represent the many variables contained within run_end_repair. These objects allow easier refactoring, allowing you to pass bundles of info to functions instead of passing many variables. This is the core of object-oriented programming (OOP). Some of these objects could be generated outside of the run_end_repair and perhaps passed to it and other functions such as stitch_all_contigs. I also think that AssemblyInfo and ContigInfo could be merged in some ways as contigs are part of an assembly.

To do this refactoring, I made extensive use of Pycharm's automated refactoring tools: https://www.jetbrains.com/help/pycharm/refactoring-source-code.html

LeeBergstrand commented 7 months ago

It should be noted that I did this refactoring blindly and may have made a few mistakes along the way. I don't have a test suite for this code yet. This is where unit tests are essential. They let you rapidly make changes without fear of breaking anything.

@jmtsuji, what are your thoughts? I suggest using this refactoring as a starting point, and I can guide you through the process. Does that work?

jmtsuji commented 7 months ago

@LeeBergstrand Thanks for these suggestions and changes! Indeed, I've been thinking that adding some objects into repair.py could help to simplify the code. (Starting to learn OOP a couple weeks ago.) Your other refactoring suggestions are also much appreciated. It might take me a few days to get to this PR, but soon I'll run this updated code through some end-to-end tests, check the code, and add further changes as needed.

LeeBergstrand commented 7 months ago

@jmtsuji For subset_sequences in repair.py:

sequence_names = []
    for record in load_fasta_sequences(input_fasta_filepath):
        sequence_names.append(record.name)
        if record.name in subset_sequence_ids:
            yield record
    # Raise an error if there are duplicate sequence names
    if len(set(sequence_names)) < len(sequence_names):
        sequence_names_series = pd.Series(sequence_names)
        duplicates_names = set(sequence_names_series[sequence_names_series.duplicated() == True])
        error = RuntimeError(f'Duplicate sequence IDs were detected in the input FastA file "{input_fasta_filepath}": '
                             f'{", ".join(duplicates_names)}')
        logger.error(error)
        raise error

@jmtsuji I will double-check to ensure the code after the yield ( if len(set(sequence_names)) < len(sequence_names):) is executed. I know that if you have a return none of the code below the return is executed.

LeeBergstrand commented 7 months ago

@LeeBergstrand Would you mind starting a review of this PR? For some reason, I am unable to request your review from my end (could be a temporary glitch in GitHub). I tried sending in an "approval" after self review to see if I could reset the system, but it didn't work... that's why you will see an "approval" on my end. Thanks in advance for your review!

@jmtsuji Ya I don't think you can do a review on your own pull request. Feel free to comment when you want me to have a look at your comments.

LeeBergstrand commented 7 months ago

@LeeBergstrand Thanks again for starting this round of refactoring. I tested your changes using end-to-end tests, and they passed after I cleaned up a couple very minor issues. I then did additional refactoring of other long functions in repair.py following some of the concepts you introduced. A couple of the functions, like iterate_linking_contig_ends and run_end_repair, are still a bit long, but they are more compact and readable than before, I think. The current version is still passing end-to-end tests. Feel free to let me know if you have any further feedback on this version. Look forward to your review.

jmtsuji commented 7 months ago

@LeeBergstrand Thanks for your review! Very helpful. I'll do further refactoring next week and will get back to you once finished.

jmtsuji commented 7 months ago

@jmtsuji For subset_sequences in repair.py: I will double-check to ensure the code after the yield ( if len(set(sequence_names)) < len(sequence_names):) is executed. I know that if you have a return none of the code below the return is executed.

Good point. To help make sure the code after yield is executed, I added a try-finally clause, as you'll see in the updated code:

    try:
        sequence_names = []
        for record in load_fasta_sequences(input_fasta_filepath):
            sequence_names.append(record.name)

            if record.name in subset_sequence_ids:
                yield record
    finally:
        # Raise an error if there are duplicate sequence names
        if len(set(sequence_names)) < len(sequence_names):
            sequence_names_series = pd.Series(sequence_names)
            duplicates_names = set(sequence_names_series[sequence_names_series.duplicated() == True])

            error = RuntimeError(f'Duplicate sequence IDs were detected in the input FastA file '
                                 f'"{input_fasta_filepath}": {", ".join(duplicates_names)}')
            logger.error(error)
            raise error

Sounds like the finally clause will run once try is finished, whether or not an exception is produced. Using finally has some caveats when combined with a yield statement (see https://stackoverflow.com/a/67429966 comments) -- I think the finally clause will only be run after the generator closes. However, for the way the subset_sequences function is used in repair.py (within while loops), I imagine the finally statement should work.

Let me know if you think this is still not good practice, and we can consider other avenues to check that the FastA file doesn't contain duplicate sequence IDs. Thanks.

jmtsuji commented 7 months ago

@LeeBergstrand Refactoring is close to done. Will push the code to this branch and request another review soon, after I have a chance to do some final edits and test the code.

LeeBergstrand commented 7 months ago

@jmtsuji For subset_sequences in repair.py: I will double-check to ensure the code after the yield ( if len(set(sequence_names)) < len(sequence_names):) is executed. I know that if you have a return none of the code below the return is executed.

Good point. To help make sure the code after yield is executed, I added a try-finally clause, as you'll see in the updated code:

    try:
        sequence_names = []
        for record in load_fasta_sequences(input_fasta_filepath):
            sequence_names.append(record.name)

            if record.name in subset_sequence_ids:
                yield record
    finally:
        # Raise an error if there are duplicate sequence names
        if len(set(sequence_names)) < len(sequence_names):
            sequence_names_series = pd.Series(sequence_names)
            duplicates_names = set(sequence_names_series[sequence_names_series.duplicated() == True])

            error = RuntimeError(f'Duplicate sequence IDs were detected in the input FastA file '
                                 f'"{input_fasta_filepath}": {", ".join(duplicates_names)}')
            logger.error(error)
            raise error

Sounds like the finally clause will run once try is finished, whether or not an exception is produced. Using finally has some caveats when combined with a yield statement (see https://stackoverflow.com/a/67429966 comments) -- I think the finally clause will only be run after the generator closes. However, for the way the subset_sequences function is used in repair.py (within while loops), I imagine the finally statement should work.

Let me know if you think this is still not good practice, and we can consider other avenues to check that the FastA file doesn't contain duplicate sequence IDs. Thanks.

Okay, so one generally only uses try statements when trying to catch errors.

try:
  code (try this code, it might encounter an error)
except:
  code (run this code if the error happens)
finally:
  code (run this code if the error happens or not)

So, we should only use try statements when catching errors, as this is their intended purpose. Using them in your suggested context is a bit of a corner-case usage.

I had a conversation with the jet brains AI about generators (yields). The prior code above might work even without the finally. What happens is that all the variables in the generator function are frozen between iterations.

One design pattern I follow is never having an exit statement (return or yield) in the middle of a function. The exit statement in the middle makes it harder to debug when using step-through debugger software. So I generally like to have an exist statement at the end of a function. This makes it easy to step through and double-check the function's logic when debugging. Often, I will assign a return value to a variable and return it at the end.

One of the things the AI suggested and I thought of beforehand was to check for duplicate names on the fly rather than waiting until the end of the function execution. For example:

    sequence_names = set() # Sets use less memory than lists as they do not keep order.
    for record in load_fasta_sequences(input_fasta_filepath):
        new_sequence_name = record.name

        # Add duplicate error handling here so the generator breaks when we find any duplicate ID (the ID has been seen before).
        if new_sequence_name in sequence_names:
            error = RuntimeError(f'Duplicate sequence ID was detected in the input FastA file "{input_fasta_filepath}": '
                             f'{new_sequence_name}')
           logger.error(error)
           raise error

        sequence_names.add(new_sequence_name)

        if record.name in subset_sequence_ids:
            yield record

This has the following benefits:

  1. Yield is at the end, making debugging easier.
  2. You never yield a bad record (so if you got a catch statement (https://en.wikipedia.org/wiki/Exception_handling_(programming)) in a function above, your code never gets passed bad data). You can catch raised errors and keep the programming running.
  3. The code errors earlier before you spend all the CPU time iterating through a potentially faulty file (if there is one duplicate, there will likely be more).
  4. We don't have to use pandas at all.
  5. Less code overall.

@jmtsuji Thoughts?

jmtsuji commented 7 months ago

One of the things the AI suggested and I thought of beforehand was to check for duplicate names on the fly rather than waiting until the end of the function execution.

@jmtsuji Thoughts?

@LeeBergstrand Thanks for looking into this. I was also considering this check-on-the-fly approach but didn't want to waste CPU time by running the ID check every single iteration (I assume this will eventually become computationally expensive, e.g., for metagenomes). However, given that we are just using this tool for genomes at the moment, I agree that the code you added is much more straightforward and includes many other benefits that you mentioned (I learned from the list you sent -- thanks). I've added in the refactored version you suggested.

jmtsuji commented 7 months ago

@LeeBergstrand Thanks for the update in #13 . Now that #13 is merged in, would you like to further review this branch, or do you think this branch is OK to merge to develop? Thanks.

LeeBergstrand commented 7 months ago

@LeeBergstrand Thanks for the update in #13 . Now that #13 is merged in, would you like to further review this branch, or do you think this branch is OK to merge to develop? Thanks.

@jmtsuji We've made a lot of changes. Let's merge into develop so we are all synced up. Other than a few formatting things here and there the code looks good.

jmtsuji commented 7 months ago

@LeeBergstrand Sounds good, and thanks once again for all your input.