Basic working version of rotary-utils

jmtsuji commented 4 months ago

In this PR, the basic skeleton of the rotary-utils package was set up with two utilities, the repair utility and the rotate utility. (Main, where I want to merge this PR, currently just contains the raw repair.py file ported from rotary.)

Major updates included in this PR:

meta.yml, pyproject.toml, and environment.yml files are now present for installation.
The package and its installation and usage are described in the README. (FYI, some README text was ported over from the rotary README, because rotary-utils code is being brought over from there. The rotary README will eventually be simplified, to remove some of the ported text, after rotary-utils starts to be used for the utility code.)
A main CLI script, rotary_utils.py, was created.
Some basic utilities (to be shared between future modules) were moved from repair.py to utils.py
A new module, rotate.py, was added. This module handles basic operations for rotating circular DNA elements and will be helpful for future code modules I will add. For now, a basic working version is included with a couple stub functions that will be expanded in future. Some functionality from rotate.py is used in repair.py.

rotary-utils repair and rotary-utils rotate pass end-to-end tests on my machine.

@LeeBergstrand Looking forward to review. Once this basic framework for rotary-utils is in place, I can start to add other modules into rotary-utils.

LeeBergstrand commented 4 months ago

@jmtsuji, I noticed that you tended to space out some of your loops and statements.

   # Get the original (pre-stitched) contig sequence as a SeqRecord
    with open(input_fasta_filepath) as fasta_handle:

        for record in SeqIO.parse(fasta_handle, 'fasta'):

            sequence_names.append(record.name)

            if record.name in subset_sequence_ids:

                yield record

or

    if circlator_info['circularised'].sum() / circlator_info.shape[0] == 1:

        logger.debug('Everything is stitched.')
        result = True

    elif circlator_info['circularised'].sum() >= 0:

        logger.debug('Not everything is stitched.')
        result = False

    else:

        logger.error('File processing error. # of non-stitched contigs is not >=0. Exiting...')
        raise RuntimeError

When indenting, most developers I know do not put blank lines between each line of code. Instead, they blank lines to break out the code into logical blocks. The extra blanks make the code longer to scroll through, which takes time.

For Example:

There is only a blank line above the if statement to break it out from the rest of the loop. In this case, you could even remove that blank line, and the code would still be readable.

   # Get the original (pre-stitched) contig sequence as a SeqRecord
    with open(input_fasta_filepath) as fasta_handle:
        for record in SeqIO.parse(fasta_handle, 'fasta'):
            sequence_names.append(record.name)

            if record.name in subset_sequence_ids:
                yield record

or

Here, we remove the blank lines from the if elif else clause and break it out from the other code with lines. This makes it look like a switch statement, which it functionally is. As you can see, the indenting makes the code fairly readable, even with the blank lines removed.

    circlator_info = pd.read_csv(circlator_logfile, sep='\t')[['#Contig', 'circularised']]

    if circlator_info['circularised'].sum() / circlator_info.shape[0] == 1:
        logger.debug('Everything is stitched.')
        result = True
    elif circlator_info['circularised'].sum() >= 0:
        logger.debug('Not everything is stitched.')
        result = False
    else:
        logger.error('File processing error. # of non-stitched contigs is not >=0. Exiting...')
        raise RuntimeError

    return result

jmtsuji commented 4 months ago

@jmtsuji, I noticed that you tended to space out some of your loops and statements.

@LeeBergstrand Thanks for these detailed tips -- they are very helpful! I've tried making my code styling more compact in my new updates to this branch -- let me know your thoughts.

jmtsuji commented 4 months ago

@LeeBergstrand Thanks for your helpful last round of comments! I've made several updates to this PR after receiving your advice from the last code review and from #2. Most previous comments are now addressed (although you'll see a couple discussion threads above that are still active).

As one unexpected update, I ended up completing a basic working version of the rotate.py module and added it to this PR. (Because I accidentally introduced an incomplete version of this module into #2 as mentioned previously, I figured it would be better to just get a basic working version of the rotate.py module into this PR before this PR gets merged to main, rather than leave an incomplete version of this module in the code base.) The repair and rotate modules pass end-to-end testing on my end, based on almost the latest commit. I've updated the description at the top of this PR to reflect that rotate is now included.

Look forward to your review of this updated version -- thanks again for your input.

LeeBergstrand commented 4 months ago

def generate_bed_file(contig_seqrecord: SeqIO.SeqRecord, bed_filepath: str, length_threshold: int = 100000):
    """
    Generates a BED file for a desired region around the ends of a contig.

    :param contig_seqrecord: SeqRecord of the contig
    :param bed_filepath: desired output filepath for the BED file
    :param length_threshold: length (bp) around the contig end to target in the BED file.
                             Half of this length will be returned around each end of the contig.
    :return: writes the BED file to the bed_filepath
    """

    contig_name = contig_seqrecord.name
    contig_length = len(contig_seqrecord.seq)

    logger.debug('Making BED file with end proximity threshold of ' + str(length_threshold))

    if contig_length < length_threshold:
        logger.warning(f'Contig length ({contig_length}) is less than the supplied length threshold '
                       f'({length_threshold}), so the BED file will be for the whole contig.')

The :return: is generally only specified when the function returns a return value (i.e., the function returns a value later assigned to a variable in the calling program). Above and in several other places, you specify return value descriptions for functions that do not return values (i.e., there is no return at the end of the function). If we decide to do readthedocs down the road, the code for readthedocs parses the contents of the function doc string and turns it into a website. It could get confusing if the documentation starts specifying return values that don't exist. Instead, I would write the text you specify in the return value description to the function description.

def map_long_reads(contig_filepath: str, long_read_filepath: str, output_bam_filepath: str, log_filepath: str,
                   append_log: bool = True, threads: int = 1, threads_mem_mb: float = 1):
    """
    Maps long reads (via minimap2) to contigs and sorts/indexes the resulting BAM file. Writes output_bam_filepath and log_filepath to disk.
    """

jmtsuji commented 4 months ago

The :return: is generally only specified when the function returns a return value (i.e., the function returns a value later assigned to a variable in the calling program). Above and in several other places, you specify return value descriptions for functions that do not return values (i.e., there is no return at the end of the function). If we decide to do readthedocs down the road, the code for readthedocs parses the contents of the function doc string and turns it into a website. It could get confusing if the documentation starts specifying return values that don't exist. Instead, I would write the text you specify in the return value description to the function description.

@LeeBergstrand Thanks for the advice! I've now cleaned up the way that return is described in the docstrings, as you've suggested.

jmtsuji commented 4 months ago

@LeeBergstrand I've implemented the code changes you've suggested and look forward to receiving your updated review. I would be OK to wait on changing the package name until a separate PR, if this is OK with you. Thanks!

LeeBergstrand commented 4 months ago

@LeeBergstrand I've implemented the code changes you've suggested and look forward to receiving your updated review. I would be OK to wait on changing the package name until a separate PR, if this is OK with you. Thanks!

Changing the name in another PR works for me.

jmtsuji commented 4 months ago

Merging into master as discussed in #14 . Thanks for your help with these improvements @LeeBergstrand !

rotary-genomics / spokewrench

Basic working version of rotary-utils #1