rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Re-consider the symlinking paradigm used in rotary #139

Open jmtsuji opened 5 months ago

jmtsuji commented 5 months ago

From #132 @LeeBergstrand

@jmtsuji I noticed that you use a symlinking paradigm throughout the pipeline. For example:

# Conditional based on whether short read polishing was performed
rule pre_coverage_filter:
    input:
        "{sample}/polish/medaka/{sample}_consensus.fasta" if POLISH_WITH_SHORT_READS == False else "{sample}/polish/polca/{sample}_polca.fasta"
    output:
        temp("{sample}/polish/cov_filter/{sample}_pre_filtered.fasta")
    run:
        source_relpath = os.path.relpath(str(input), os.path.dirname(str(output)))
        os.symlink(source_relpath, str(output))

This was quite troublesome when I wrapped file paths in temp(), as Snakemake would start deleting the files that the symlinks pointed to. I have found that Snakemake doesn't handle symlinking well because it relies on the presence and absence of files.
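The failure mode described above can be reproduced outside Snakemake. The following is a minimal sketch, not code from rotary: it creates a relative symlink the same way the rule's `run` block does, then deletes the target to stand in for temp() cleanup (an assumption about when Snakemake removes the file). The function names and the `sample1` paths are hypothetical.

```python
import os
import tempfile


def make_relative_symlink(source, link_path):
    """Create link_path as a relative symlink to source, as the rule's run block does."""
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    source_relpath = os.path.relpath(source, os.path.dirname(link_path))
    os.symlink(source_relpath, link_path)


def demo_dangling_symlink():
    """Return (target reachable before deletion, link still present, target reachable after)."""
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical sample layout that mirrors the paths in the rule above
        source = os.path.join(workdir, "sample1", "polish", "medaka", "sample1_consensus.fasta")
        link = os.path.join(workdir, "sample1", "polish", "cov_filter", "sample1_pre_filtered.fasta")

        os.makedirs(os.path.dirname(source))
        with open(source, "w") as handle:
            handle.write(">contig1\nACGT\n")

        make_relative_symlink(source, link)
        reachable_before = os.path.exists(link)  # follows the symlink to the target

        # Stand-in for temp() cleanup: the target is deleted while the link remains
        os.remove(source)

        return reachable_before, os.path.islink(link), os.path.exists(link)
```

After the target is removed, the link itself is still on disk, but any check that follows the link reports the file as missing, which is consistent with the behaviour described above.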

Perhaps the rebuilding issues that came up above are related to the symlinking? For example, Snakemake may be having trouble tracking which files are used by which rules. I'm not sure. I would normally have brought up this symlinking issue later, but I wanted to put it here now to make sure you know about it.

Initial response (@jmtsuji)

I've also run into symlinking issues while editing the pipeline, and I agree that it would be better to streamline these parts if possible. Reducing the number of steps in the analysis this way might help with consistent DAG construction. (Just for future reference: for the polish steps, I think it's necessary to either copy or symlink the files before running the polish rules so that the rules are compatible with inputs from multiple modules. I don't think this is the case for other rules, so the symlinking steps can probably be streamlined in many other cases.) ...
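One possible way to streamline a case like `pre_coverage_filter`, where the symlink only selects between two upstream outputs, is to resolve the path in a helper and pass it to the downstream rule through a Snakemake input function. This is a sketch under assumptions, not the approach adopted in rotary: the helper name `polished_assembly` is hypothetical, and the flag is assumed to be a plain boolean.

```python
def polished_assembly(sample, polish_with_short_reads):
    """Return the path of the final polished assembly for a sample.

    Hypothetical helper: it mirrors the conditional in pre_coverage_filter,
    but returns the upstream path directly instead of creating a symlink.
    """
    if polish_with_short_reads:
        # Short-read polishing was run, so the POLCA output is the final assembly
        return f"{sample}/polish/polca/{sample}_polca.fasta"
    # Otherwise the medaka consensus is the final assembly
    return f"{sample}/polish/medaka/{sample}_consensus.fasta"
```

In a Snakefile, a downstream rule could then declare something like `input: lambda wildcards: polished_assembly(wildcards.sample, POLISH_WITH_SHORT_READS)`, so that Snakemake tracks the real upstream file rather than an intermediate link.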