Build Nextflow Pipeline

glstott commented 2 years ago

- [x] Generate Alignment
- [x] Generate Trees (two options: IQ-TREE and FastTree; the latter will be used for most testing)
- [x] Generate CSVs from Trees
- [x] Check whether a database is given. If not, instantiate a new database with plugins, etc.
  - [x] Make directories
  - [x] Set administrator password
  - [x] Generate indices
  - [x] Attach plugins (note: downloading plugins automatically was a headache and is probably not preferable for security reasons, so the jar files are included in a directory in git instead)
- [x] Check whether a given file is in the source list within the database.
  - [x] If not, load new nodes
  - [x] Load new edges
- [x] Calculate patristic distances.
  - [x] #14
- [x] Generate MSTN
  - [x] Generate MSTs
  - [x] Merge MSTs
  - [x] Delete individual MSTs from network

glstott commented 2 years ago

Updated Nextflow pipeline.

Now working for the alignment and tree generation steps. It uses SLURM config options because Nextflow struggled to detect all the threads available to it when running on an interactive node. The error message reported only 1 thread available for the process when there should have been 16.
It'll be worth building a local option as well that doesn't require SLURM configuration options, especially given that the graph database is currently not on the HPC cluster.

glstott commented 2 years ago

Updated to include an IQ-TREE option.
Corrected a strange bug which led to 16*16=256 thread jobs instead of the 16 desired.
Began research into setting up the Neo4j environment using the Nextflow pipeline. It will require some configuration, but the plan is to have Nextflow pull down a Docker image, spin it up, mount a user-supplied directory for files, execute all import and transformation tasks on the Neo4j database, then spin down the server. A user can then spin up the server when they want to access their data or do something outside the standard pipeline.

glstott commented 2 years ago

Progress made on the Singularity/Docker image front. A minimum viable set of commands is below. Note: for our environment, I ended up needing to forward both port 7687 and port 7474 to access the browser. Neither one individually was sufficient.

singularity instance start \
    --bind $HOME/neo4j_empty/data:/data \
    --bind $HOME/neo4j_empty/logs:/logs \
    --bind $HOME/neo4j_empty/import:/var/lib/neo4j/import \
    --bind $HOME/neo4j_empty/plugins:/plugins \
    --bind $HOME/neo4j_empty/conf:/var/lib/neo4j/conf \
    --bind $HOME/neo4j_empty/var:/var/lib/neo4j/run \
    --env-file env-file \
    docker://neo4j:latest test_neo
# Open a shell inside the running instance, then start the server from that shell
singularity shell instance://test_neo
neo4j start
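For reference, the port forwarding mentioned above can be sketched as a single SSH tunnel. Both ports are needed because the Neo4j Browser page is served over HTTP on 7474 and then connects to the database over Bolt on 7687. The function name `neo4j_tunnel`, the `DRY_RUN` switch, and the host names in the usage line are placeholders for illustration, not values from this project.

```shell
# Forward the Neo4j HTTP (7474) and Bolt (7687) ports through a login host.
# Hypothetical helper: both arguments are placeholders supplied by the caller.
neo4j_tunnel() {
    login_host="$1"
    compute_node="$2"
    # Build the ssh command as the function's positional parameters.
    set -- ssh -N \
        -L "7474:${compute_node}:7474" \
        -L "7687:${compute_node}:7687" \
        "$login_host"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # print the command instead of opening the tunnel
    else
        "$@"         # open the tunnel; blocks until interrupted
    fi
}
```

Something like `neo4j_tunnel user@login.example.org node01` would then make the browser reachable at http://localhost:7474 on the local machine, assuming the server runs on a compute node reachable from the login host.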

In addition, to get plugins to work as expected, I needed to manually download jar files and place them in the plugin folder and whitelist the procedures in the conf file. The env-file contains the following for a test case:

NEO4J_AUTH=neo4j/test  
NEO4J_ACCEPT_LICENSE_AGREEMENT=yes

The Neo4j plugin line, which would load these automatically, is not yet working, so it is omitted here. I opened a discussion on the community forums to see whether the Neo4j community has already resolved this.
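The manual allowlisting step described above could be scripted roughly as follows. This is only a sketch under assumptions: the setting name `dbms.security.procedures.allowlist` is the one used by Neo4j 4.2 and later (older releases call it `dbms.security.procedures.whitelist`), the helper name `allow_procedures` is made up, and the `apoc.*` pattern in the usage line is an example rather than the plugin set used here.

```shell
# Append a procedure allowlist entry to a neo4j.conf file unless one already exists.
# Assumes the Neo4j 4.2+ setting name; adjust the key for older releases.
allow_procedures() {
    conf_file="$1"
    pattern="$2"
    key="dbms.security.procedures.allowlist"
    # Only append when no line for this key is present, so reruns are harmless.
    if ! grep -q "^${key}=" "$conf_file" 2>/dev/null; then
        printf '%s=%s\n' "$key" "$pattern" >> "$conf_file"
    fi
}
```

A call such as `allow_procedures "$HOME/neo4j_empty/conf/neo4j.conf" 'apoc.*'` would target the conf directory bound in the commands above. Note that it leaves an existing allowlist line untouched instead of merging patterns into it.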

glstott commented 2 years ago

Another strange integration bug. The script within the Nextflow pipeline does not wait for the previous command to finish. I wrapped it up manually in a script and it functions as expected. It seems like the script element in Nextflow ignores sleep and wait for some reason.
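One generic way to make a wrapper script block until a service is ready is to poll with a retry loop instead of relying on a fixed sleep. The sketch below is not the script used in this pipeline; `wait_for` is a hypothetical helper, and the readiness check in the usage line assumes `curl` is available and the HTTP port is the default 7474.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
    attempts="$1"
    delay="$2"
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Succeed as soon as the probe command exits with status 0.
        "$@" >/dev/null 2>&1 && return 0
        i=$((i + 1))
        # Pause between attempts, but not after the final one.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
    done
    return 1
}
```

For example, `wait_for 30 2 curl -sf http://localhost:7474` would give the server up to about a minute to come up before any import step runs. Keeping a helper like this in a separate script file also avoids having to escape bash `$` variables inside a Nextflow script block, where Nextflow would otherwise try to interpolate them.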

glstott commented 2 years ago

This occurs because the Docker image we're using treats the database as an existing one if you supply the data folder on the command line. Singularity requires me to bind this data folder, so this may not be resolvable. I think this may also be the source of our plugin problem, so manual integration may be needed there as well.

glstott commented 2 years ago

Since Singularity has a limited view of the filesystem in which it runs, the default symbolic links to output files are insufficient. I changed publishDir to copy each file so that it exists in both the work directory and the target directory.

publishDir(path: dbDir + '/import', 
        mode: 'copy', 
        overwrite: true)

This was able to correct the issue. We now have a pipeline which starts to load the database!

Next, we need to write to stdout a list of the elements which haven't been fully analyzed, then build the last step of the pipeline. Another to-do item: maybe automate unzipping the GISAID tarballs and concatenating the resulting CSV and FASTA files.
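A possible starting point for the tarball to-do item is sketched below. It rests on several assumptions: archives match `*.tar*`, each holds `*.csv` files sharing one header row plus `*.fasta` files, and file names contain no newlines. The function name `merge_gisaid` and the output names `merged.csv` and `merged.fasta` are hypothetical.

```shell
# Unpack every tarball in SRC_DIR and merge the CSV and FASTA files into OUT_DIR.
# Usage: merge_gisaid SRC_DIR OUT_DIR
merge_gisaid() {
    src_dir="$1"
    out_dir="$2"
    extract_dir="$(mktemp -d)" || return 1
    mkdir -p "$out_dir" || return 1

    # Unpack each archive into its own subdirectory so same-named files do not collide.
    n=0
    for archive in "$src_dir"/*.tar*; do
        [ -e "$archive" ] || continue
        n=$((n + 1))
        mkdir -p "$extract_dir/$n"
        tar -xf "$archive" -C "$extract_dir/$n" || return 1
    done

    # Concatenate FASTA files unchanged.
    find "$extract_dir" -type f -name '*.fasta' | sort | while IFS= read -r f; do
        cat "$f"
    done > "$out_dir/merged.fasta"

    # Concatenate CSV files, keeping the header row from the first file only.
    first=1
    find "$extract_dir" -type f -name '*.csv' | sort | while IFS= read -r f; do
        if [ "$first" -eq 1 ]; then
            cat "$f"
            first=0
        else
            tail -n +2 "$f"
        fi
    done > "$out_dir/merged.csv"

    rm -rf "$extract_dir"
}
```

The header handling assumes every CSV uses the same columns in the same order; if the downloads differ in layout, the merge would need a column-aware tool instead.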

glstott commented 2 years ago

Wrapped up all of the most important elements. I need to do more thorough unit testing. I'm not a fan of how I ensure execution order at the moment, and I think moving to DSL2 may be necessary sooner rather than later. Execution order is particularly important for the first install of the database.

Important note: there is a bug which forces indexed relationship properties to behave as floating-point values, which prevents the use of =. For now, I'm just using a conditional on either end, but I should look into how to fix this more cleanly. It may involve removing the index.

glstott commented 2 years ago

Successful end-to-end run. MST Network included. Slowest elements are extrinsic to the workflow. Now I just need to make the script more user-friendly and comment before pushing back up to main. Note: new commits not yet added. Waiting to push them to main until colleagues wrap up their elements.

glstott / PMeND

Build Nextflow Pipeline #10