taxprofiler / taxpasta

TAXnomic Profile Aggregation and STAndardisation
https://taxpasta.readthedocs.io/
Apache License 2.0
34 stars 7 forks source link

[BUG] taxpasta deletes nodes/names.dmp from --taxonomy dir #87

Closed kdm9 closed 1 year ago

kdm9 commented 1 year ago

Is there an existing issue for this?

Problem description

Twice now, I've Ctrl-C'd taxpasta when I made some simple mistake, and then found that both nodes.dmp and names.dmp were deleted from the dir provided with --taxonomy. This does not happen when I let it complete successfully, and might happen upon a crash?

I dug in to see if I could find the issue, but AFAICT taxpasta uses the keep_files=True option to taxopy's TaxDB(), so in theory no files should be deleted. Perhaps there's a issue within taxopy whereby a KeyboardInterrupt causes the deletion to be triggered even when not asked for (e.g. an incorrect try/except block). Again I couldn't see anything obvious, but perhaps the author (@apcamargo) has wise words

Code sample

Code run:

taxpasta merge \
    --long  \
    --add-name  \
    --taxonomy /db/NCBI_taxonomy/2023-04-18  \
    --summarise-at species  \
    --output-format csv  \
    -o all.csv  \
    --profiler bracken output/kraken/*bracken.txt

Traceback:

(none)

Environment

version 0.3.0 from conda

Anything else?

No response

kdm9 commented 1 year ago

I should have made this clear: this doesn't happen every single time I ctrl-c taxpasta. I'd estimate about a quarter of the times

Midnighter commented 1 year ago

It indeed sounds like an upstream issue in taxopy to me. Not sure why this would happen when only reading content. @maxibor do you have some capacity to investigate?

apcamargo commented 1 year ago

I'm not exactly sure how taxopy is called within taxpasta, but this is the part of the code that controls the deletion of the files: https://github.com/apcamargo/taxopy/blob/437f94be6b592cc5638e45c10804760191ddffc8/taxopy/core.py#L128-L129

If nodes.dmp or names.dmp are manually supplied, the files shouldn't be deleted.

Midnighter commented 1 year ago

The taxopy.TaxDb is created in this piece of code:

class TaxopyTaxonomyService(TaxonomyService):
    """Define the taxonomy service based on taxopy."""

    def __init__(self, *, tax_db: taxopy.TaxDb, **kwargs) -> None:
        """Initialize a taxonomy service instance with a taxopy database."""
        super().__init__(**kwargs)
        self._tax_db = tax_db

    @classmethod
    def from_taxdump(cls, source: Path) -> TaxopyTaxonomyService:
        """Create a service instance from a directory path containing taxdump info."""
        merged = source / "merged.dmp"
        return cls(
            tax_db=taxopy.TaxDb(
                names_dmp=str(source / "names.dmp"),
                nodes_dmp=str(source / "nodes.dmp"),
                merged_dmp=str(merged) if merged.is_file() else None,
                keep_files=True,
            )
        )

So I don't see why the files should ever get deleted. Unless, @kdm9 you were using an earlier version of taxpasta which indeed had a bug in this code.

kdm9 commented 1 year ago

@Midnighter hmm, I did check the version of taxprofiler was up-to-date, but I had initially installed v0.2.3 and so it is possible that I checked the version e.g. in a different conda env from the jobs that rm'd the files.

Given that this was a recently fixed bug, I suspect that is what happened. I've recreated all conda envs to update taxprofiler, so we'll see if the issues recur.

In the mean time I'm happy to close this issue, and will reopen if I find this happening again with v0.3.0 or higher.