Open peterjc opened 3 days ago
I've been working on this on branch issue_131
, and it made me wonder few things.
method_dnadiff.py
(and method_anim.py
) there is code that I have written at earlier stages that I needed for testing, eg.:class ComparisonResultDnadiff(NamedTuple):
"""A single comparison result for dnadiff."""
qname: str # query sequence name
rname: str # reference sequence name
q_aligned_bases_with_gaps: int # aligned base count of reference sequence with gaps
avg_id: float # average nucleotide identity (as a percentage)
q_length: int # total base count of query sequence
r_length: int # total base count of reference sequence
alignment_gaps: int # length of the gaps in the query alignment
aligned_bases: int # tota base count in an alignment
q_cov: float # query coverage
def item(self, attribute: str) -> Any: # noqa: ANN401
"""Return the value of the specified attribute."""
try:
return getattr(self, attribute)
except AttributeError:
msg = (
f"Invalid attribute '{attribute}'. Valid attributes are: {self._fields}"
)
raise ValueError(msg) # noqa: B904
or
def collect_dnadiff_results(
mcoords_file: Path,
qdiff_file: Path,
indir: Path,
) -> ComparisonResultDnadiff:
"""Return a ComparisonResultDnadiff for a completed dnadiff comparison.
The passed args should contain the Path to the mcoords and qdiff output files
comparisons for a sungle run, and a set of input sequences.
Files are parsed to obtain the count of aligned bases (with and without gaps)
in the query, average nucleotide identity, and query genome coverage for each
comparison.
"""
files = {fasta.stem: fasta for fasta in indir.iterdir() if fasta.is_file()}
rname, qname = mcoords_file.stem.split("_vs_")
r_genome_length = get_genome_length(files[rname])
q_genome_length = get_genome_length(files[qname])
avg_identity, aligned_bases_with_gaps = parse_mcoords(mcoords_file)
gaps = parse_qdiff(qdiff_file)
return ComparisonResultDnadiff(
rname=rname,
qname=qname,
avg_id=avg_identity,
q_aligned_bases_with_gaps=aligned_bases_with_gaps,
r_length=r_genome_length,
q_length=q_genome_length,
alignment_gaps=gaps,
aligned_bases=aligned_bases_with_gaps - gaps,
q_cov=(aligned_bases_with_gaps - gaps) / q_genome_length * 100,
)
Since we don't use the above code to log our calculations/values to the database, and we test the congruence between the expected (matrices) and calculated values elsewhere, this code becomes somewhat repetitive. I think this could be removed.
log_anim
. For example, it currently assumes that deltafilter.stem
follows the format <query>_vs_<subject>
.# Allowing for some variation in the filename paths here... should we?
used_query, used_subject = deltafilter.stem.split("_vs_")
However, both delta
and deltafilter
actually follow the <subject>_vs_<query>
format. In the rule_delta
section of snakemake_anim.smk
, we call the following, where genomeA
is the subject and genomeB
is the query, according to the delta-filter documentation:
USAGE: nucmer [options] <Reference> <Query>
- Yes, I was going to suggest removing this soon
I will take care of this.
- That's the niggle I had to tweak in Include ANIm self vs self in test fixtures; log ANIm from snakemake #129 to match the expected matrix output, it may need some more variable renaming?
I wasn't sure if you were aware of this, glad to hear its being dealt with. I have no suggestions how other naming could improve this 🤔
I note the calculation needs both qdiff
and mcoords
files (see #133), and https://github.com/pyani-plus/pyani-plus/blob/main/pyani_plus/workflows/snakemake_dnadiff.smk calculates those two separately with two different and independent rules which I think can be run in either order (as they just need the .filter
input).
We could I think either:
.qdiff
and .mcoords
and does logging - and from the point of view of snakemake the target would be the second file generatedIn the later case, we might go even further and use a single rule which runs nucmer, delta_filter, show_diff, show_coords and the DB logging. It seems simpler to me (solving the snakemake DAG ought to be faster, perhaps important on large runs?).
I see your point. When I was working on the dnadiff workflow, @widdowquinn and I discussed the best approach for handling this. We decided it might be better to have a single rule per file (e.g., delta, filter, mcoords, qdiff), as this would be easier to debug if something breaks or if a file is generated with different content. But... this was well before we were even thinking about DB logging.
So we're agreed on a single rule for dnadiff? Unlike say ANIb, there is nothing which is query specific (fragmenting FASTA files), nor reference specific (making BLAST databases), everything is specific to the pair 😄
I'm easy on this one. I will put all my trust in the process and we can combine the two rules (.qdiff and .mcoords). Is it something that you are working on just now, or you want me to deal with this? I stil need to update the dnadiff snamake file and associated tests to call private_cli.py
.
It would make sense to combine the two rules first. What do you think?
Go for it. I'll have to add dnadiff to the public API at some point but that can wait.
Main points:
Expand https://github.com/pyani-plus/pyani-plus/blob/main/pyani_plus/private_cli.py with another function
log_dnadiff
(like the existinglog_anim
,log_anib
,log_fastani
) which will take the database, query FASTA, subject fasta, dnadiff output, and any parameters as argument. Test this.Call this from snakemake in https://github.com/pyani-plus/pyani-plus/blob/main/pyani_plus/workflows/snakemake_dnadiff.smk (like the existing methods do), AND insert DB creation into https://github.com/pyani-plus/pyani-plus/blob/main/tests/snakemake/test_snakemake_dnadiff_workflow.py and confirm the final matrices match (like the existing snakemake tests)