running insect after initial run through to LCA taxonomy does not use cached blast and lulu

cajwalsh commented 11 months ago

When I run eDNAFlow, I run the LCA taxonomy assignment with 4 different sets of values on 4 separate runs. After the first run, everything except the LCA step with the new parameters is found in the cache, so only the new LCA process is run. When I tried a fifth run using insect, with otherwise exactly the same code, it used the cache for everything up to blast and lulu, both of which it tried to run again.

Two workarounds for this:

If you know this is going to happen before you start, use --skip-blast and --skip-lulu.
If you have already tried the normal way, find your old blast results file (either still symlinked in your results folder or in the work directory) and do a --standalone-taxonomy run using it together with the cached zotu_table in the results directory, which shouldn't have changed.
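
For the second workaround, a small helper can save some digging through the work directory. This is only a sketch in Python: the function name `find_latest_result` is made up for illustration, and the file pattern in the usage note is a guess, since the actual name of the blast results file is not given here.

```python
from pathlib import Path
from typing import Optional


def find_latest_result(search_root: str, pattern: str) -> Optional[Path]:
    """Return the most recently modified file under search_root that matches pattern.

    search_root would be the results folder or the work directory.
    pattern is a glob such as "*.tsv" (hypothetical; adjust to the real file name).
    """
    # rglob walks every subdirectory; is_file() follows symlinks and
    # skips broken ones, which matters for symlinked results folders.
    candidates = [p for p in Path(search_root).rglob(pattern) if p.is_file()]
    if not candidates:
        return None
    # The newest modification time is assumed to be the most recent run.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Something like `find_latest_result("work", "*blast*")` would then return a candidate path to hand to the --standalone-taxonomy run, assuming the file name contains "blast".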

A few other minor things I noticed during this process:

The --standalone-taxonomy run still requires you to specify your run type (e.g. --paired), although this seems unnecessary and could be confusing.
In the second paragraph of the phyloseq section of the readme, there is a typo: "--asign-taxonomy".
While I appreciate the effort put into numbering processes according to only the steps of the pipeline you actually run, this gets confusing when you want to reuse the same earlier cached steps to do multiple things later, as above. In the situation where I "skipped" blast and lulu, I now have three taxonomy folders (08_taxonomy, 09_taxonomy, and plain taxonomy from the standalone run), as well as two 08s (blast and taxonomy). I think a system where every process is guaranteed its own unique number is easier, even if the file names that end up in each folder differ between runs. This matters especially once you get into the habit of grabbing the contents of a folder by the number or beginning of the name you remember from previous runs, only to find it has changed in a later run and you have to search for it again, which is a small but possibly annoying cost. If you don't like the idea of numbers potentially jumping in a results folder, lettering could make beginners less likely to think they are missing something (a skipped step 2 might look more like a gap than an unneeded process b?). I do like the folders having some order, as with the numbers or letters (rather than just derep, taxonomy, etc.), but I think a constant system where a process doesn't change numbers between runs would be best.
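
To make the numbering point concrete, here is a minimal Python sketch comparing the two schemes. The step list is invented for illustration and does not match the pipeline's real processes or numbers; only the behaviour matters.

```python
# Hypothetical step list, for illustration only.
STEPS = ["filter", "merge", "derep", "denoise", "blast", "lulu", "taxonomy"]


def dynamic_labels(steps, skipped=()):
    """Number only the steps that actually run, so numbers shift when steps are skipped."""
    run = [s for s in steps if s not in skipped]
    return {s: f"{i:02d}_{s}" for i, s in enumerate(run, start=1)}


def fixed_labels(steps, skipped=()):
    """Give every step a permanent number; skipped steps just leave a gap."""
    return {s: f"{i:02d}_{s}" for i, s in enumerate(steps, start=1) if s not in skipped}


full_dynamic = dynamic_labels(STEPS)["taxonomy"]                     # 07_taxonomy
skip_dynamic = dynamic_labels(STEPS, ("blast", "lulu"))["taxonomy"]  # 05_taxonomy
full_fixed = fixed_labels(STEPS)["taxonomy"]                         # 07_taxonomy
skip_fixed = fixed_labels(STEPS, ("blast", "lulu"))["taxonomy"]      # 07_taxonomy
```

With dynamic numbering the taxonomy folder changes name between runs; with fixed numbering it stays the same and the skipped steps simply leave a gap.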

mhoban commented 8 months ago

Still looking into this, but noting that the comments about needing to pass run type and the readme typo were broken off into #43 and #44, and fixed.

mhoban commented 8 months ago

@cajwalsh and @vwishingrad see #45 for a question about output folder/process numbering

mhoban commented 7 months ago

Thanks to the ability of nextflow to spit out a visual representation of its dependency graph, I think I have resolved this. There were some weirdly circular dependencies which should no longer exist. This is done in fd36970.
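
For anyone wondering how a stray dependency can cause the symptom reported here, the following is a toy Python model of input-based caching. It is not Nextflow's actual implementation, and it does not reproduce the specific dependencies that were removed (those are not described in this thread). The names `task_key` and `blast_key` and the `leaky` flag are made up for illustration.

```python
import hashlib


def task_key(name: str, inputs: dict) -> str:
    """Toy cache key: a hash of the task name plus its sorted inputs."""
    payload = name + "|" + "|".join(f"{k}={inputs[k]}" for k in sorted(inputs))
    return hashlib.sha256(payload.encode()).hexdigest()


def blast_key(zotus: str, method: str, leaky: bool) -> str:
    """Cache key for a hypothetical blast task.

    With leaky=True the taxonomy method is (wrongly) part of blast's
    inputs, standing in for an unintended dependency in the graph.
    """
    inputs = {"zotus": zotus}
    if leaky:
        inputs["method"] = method
    return task_key("blast", inputs)


# Clean graph: switching the taxonomy method leaves blast's key unchanged,
# so the cached result can be reused.
clean_hit = blast_key("zotus.fasta", "lca", False) == blast_key("zotus.fasta", "insect", False)

# Leaky graph: the same switch changes blast's key and forces a re-run.
leaky_hit = blast_key("zotus.fasta", "lca", True) == blast_key("zotus.fasta", "insect", True)
```

In this model `clean_hit` is `True` and `leaky_hit` is `False`, which matches the pattern of blast and lulu re-running only when the taxonomy method changed.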

mhoban / rainbow_bridge #39