Log dnadiff from the private-cli, and call this via snakemake

peterjc commented 3 days ago

Main points:

Expand https://github.com/pyani-plus/pyani-plus/blob/main/pyani_plus/private_cli.py with another function log_dnadiff (like the existing log_anim, log_anib, log_fastani) which will take the database, query FASTA, subject fasta, dnadiff output, and any parameters as argument. Test this.
Call this from snakemake in https://github.com/pyani-plus/pyani-plus/blob/main/pyani_plus/workflows/snakemake_dnadiff.smk (like the existing methods do), AND insert DB creation into https://github.com/pyani-plus/pyani-plus/blob/main/tests/snakemake/test_snakemake_dnadiff_workflow.py and confirm the final matrices match (like the existing snakemake tests)

kiepczi commented 3 days ago

I've been working on this on branch issue_131, and it made me wonder few things.

in method_dnadiff.py (and method_anim.py) there is code that I have written at earlier stages that I needed for testing, eg.:

class ComparisonResultDnadiff(NamedTuple):
    """A single comparison result for dnadiff."""

    qname: str  # query sequence name
    rname: str  # reference sequence name
    q_aligned_bases_with_gaps: int  # aligned base count of reference sequence with gaps
    avg_id: float  # average nucleotide identity (as a percentage)
    q_length: int  # total base count of query sequence
    r_length: int  # total base count of reference sequence
    alignment_gaps: int  # length of the gaps in the query alignment
    aligned_bases: int  # tota base count in an alignment
    q_cov: float  # query coverage

    def item(self, attribute: str) -> Any:  # noqa: ANN401
        """Return the value of the specified attribute."""
        try:
            return getattr(self, attribute)
        except AttributeError:
            msg = (
                f"Invalid attribute '{attribute}'. Valid attributes are: {self._fields}"
            )
            raise ValueError(msg)  # noqa: B904

or

def collect_dnadiff_results(
    mcoords_file: Path,
    qdiff_file: Path,
    indir: Path,
) -> ComparisonResultDnadiff:
    """Return a ComparisonResultDnadiff for a completed dnadiff comparison.

    The passed args should contain the Path to the mcoords and qdiff output files
    comparisons for a sungle run, and a set of input sequences.
    Files are parsed to obtain the count of aligned bases (with and without gaps)
    in the query, average nucleotide identity, and query genome coverage for each
    comparison.
    """
    files = {fasta.stem: fasta for fasta in indir.iterdir() if fasta.is_file()}

    rname, qname = mcoords_file.stem.split("_vs_")
    r_genome_length = get_genome_length(files[rname])
    q_genome_length = get_genome_length(files[qname])
    avg_identity, aligned_bases_with_gaps = parse_mcoords(mcoords_file)
    gaps = parse_qdiff(qdiff_file)

    return ComparisonResultDnadiff(
        rname=rname,
        qname=qname,
        avg_id=avg_identity,
        q_aligned_bases_with_gaps=aligned_bases_with_gaps,
        r_length=r_genome_length,
        q_length=q_genome_length,
        alignment_gaps=gaps,
        aligned_bases=aligned_bases_with_gaps - gaps,
        q_cov=(aligned_bases_with_gaps - gaps) / q_genome_length * 100,
    )

Since we don't use the above code to log our calculations/values to the database, and we test the congruence between the expected (matrices) and calculated values elsewhere, this code becomes somewhat repetitive. I think this could be removed.

While expanding the private-cli to record dnadiff comparisons in the database, I noticed a potential swap between the query and subject in log_anim. For example, it currently assumes that deltafilter.stem follows the format <query>_vs_<subject>.

# Allowing for some variation in the filename paths here... should we?
used_query, used_subject = deltafilter.stem.split("_vs_")

However, both delta and deltafilter actually follow the <subject>_vs_<query> format. In the rule_delta section of snakemake_anim.smk, we call the following, where genomeA is the subject and genomeB is the query, according to the delta-filter documentation:

USAGE: nucmer [options] <Reference> <Query>

peterjc commented 3 days ago

Yes, I was going to suggest removing this soon
That's the niggle I had to tweak in #129 to match the expected matrix output, it may need some more variable renaming?

kiepczi commented 3 days ago

Yes, I was going to suggest removing this soon

I will take care of this.

That's the niggle I had to tweak in Include ANIm self vs self in test fixtures; log ANIm from snakemake #129 to match the expected matrix output, it may need some more variable renaming?

I wasn't sure if you were aware of this, glad to hear its being dealt with. I have no suggestions how other naming could improve this 🤔

peterjc commented 3 days ago

I note the calculation needs both qdiff and mcoords files (see #133), and https://github.com/pyani-plus/pyani-plus/blob/main/pyani_plus/workflows/snakemake_dnadiff.smk calculates those two separately with two different and independent rules which I think can be run in either order (as they just need the .filter input).

We could I think either:

make one rule depend on the other, and add the DB logging to that rule so that it have access to both files. e.g. Say mcoords needs the qdiff, and it will also do DB logging.
combine the two rules into one, and add the DB logging to it. i.e. A single rule which generates .qdiff and .mcoords and does logging - and from the point of view of snakemake the target would be the second file generated

In the later case, we might go even further and use a single rule which runs nucmer, delta_filter, show_diff, show_coords and the DB logging. It seems simpler to me (solving the snakemake DAG ought to be faster, perhaps important on large runs?).

kiepczi commented 2 days ago

I see your point. When I was working on the dnadiff workflow, @widdowquinn and I discussed the best approach for handling this. We decided it might be better to have a single rule per file (e.g., delta, filter, mcoords, qdiff), as this would be easier to debug if something breaks or if a file is generated with different content. But... this was well before we were even thinking about DB logging.

peterjc commented 2 days ago

So we're agreed on a single rule for dnadiff? Unlike say ANIb, there is nothing which is query specific (fragmenting FASTA files), nor reference specific (making BLAST databases), everything is specific to the pair 😄

kiepczi commented 2 days ago

I'm easy on this one. I will put all my trust in the process and we can combine the two rules (.qdiff and .mcoords). Is it something that you are working on just now, or you want me to deal with this? I stil need to update the dnadiff snamake file and associated tests to call private_cli.py.

It would make sense to combine the two rules first. What do you think?

peterjc commented 2 days ago

Go for it. I'll have to add dnadiff to the public API at some point but that can wait.

pyani-plus / pyani-plus

Log dnadiff from the private-cli, and call this via snakemake #131