glstott / PMeND

Phylogeny and Metadata Network Database

Build Nextflow Pipeline #10

Closed glstott closed 2 years ago

glstott commented 2 years ago

Updated Nextflow pipeline.

glstott commented 2 years ago

Progress made on the Singularity/Docker image front. A minimum viable set of commands is below. Note: in our environment, I needed to forward both ports 7687 and 7474 to access the browser. Neither one alone was sufficient.

singularity instance start \
    --bind $HOME/neo4j_empty/data:/data \
    --bind $HOME/neo4j_empty/logs:/logs \
    --bind $HOME/neo4j_empty/import:/var/lib/neo4j/import \
    --bind $HOME/neo4j_empty/plugins:/plugins \
    --bind $HOME/neo4j_empty/conf:/var/lib/neo4j/conf \
    --bind $HOME/neo4j_empty/var:/var/lib/neo4j/run \
    --env-file env-file \
    docker://neo4j:latest test_neo
singularity shell instance://test_neo
neo4j start
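
For reference, forwarding both ports from a workstation could look like the sketch below. It assumes the machine running the instance is reached over SSH; the host name is a placeholder, not our actual environment. Port 7474 serves the Neo4j Browser over HTTP, while 7687 is the Bolt port the Browser uses to run queries, which is consistent with needing both.

```shell
#!/usr/bin/env bash
# Sketch: build an SSH command that forwards both Neo4j ports.
# The host argument is a placeholder (assumption), not a real machine.

# 7474 = HTTP (serves the Neo4j Browser), 7687 = Bolt (queries sent by the Browser).
build_tunnel_cmd() {
    local host="$1"
    echo "ssh -N -L 7474:localhost:7474 -L 7687:localhost:7687 ${host}"
}

# Print the command rather than running it, so it can be inspected first.
build_tunnel_cmd "user@login.example.org"
```

With the tunnel open, the browser should be reachable at http://localhost:7474 on the workstation.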

In addition, to get plugins to work as expected, I needed to manually download the jar files, place them in the plugins folder, and whitelist their procedures in the conf file. For a test case, the env-file contains the following:

NEO4J_AUTH=neo4j/test  
NEO4J_ACCEPT_LICENSE_AGREEMENT=yes 

The Neo4j plugin line that would load these automatically is not yet working, so it is omitted here. I opened a discussion in the community forums to see whether the Neo4j community has already resolved this.
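
The manual conf edit could be scripted roughly as follows. This is a sketch under assumptions: the setting name is the one used by recent Neo4j releases (older releases spell it with "whitelist"), and the apoc.* pattern is only an example, since the actual plugins are not listed here. The helper name allow_procedures is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: append a procedure allowlist entry to neo4j.conf if none exists.
# ASSUMPTIONS: the setting name matches recent Neo4j releases (older ones
# used "whitelist"); the "apoc.*" pattern below is only an example plugin.

allow_procedures() {
    local conf_file="$1" pattern="$2"
    local key="dbms.security.procedures.allowlist"
    # Leave an existing allowlist line alone to avoid duplicate settings.
    if grep -q "^${key}=" "$conf_file" 2>/dev/null; then
        echo "allowlist already set in ${conf_file}; edit it by hand"
        return 0
    fi
    echo "${key}=${pattern}" >> "$conf_file"
}

# Hypothetical usage against the conf directory bound into the instance:
# allow_procedures "$HOME/neo4j_empty/conf/neo4j.conf" "apoc.*"
```

The jar files still have to be placed in $HOME/neo4j_empty/plugins by hand before the instance starts.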

glstott commented 2 years ago

Another strange integration bug: the script within the Nextflow pipeline does not wait for the previous command to finish. When I wrap the same commands manually in a standalone script, they work as expected. It seems the script element in Nextflow ignores sleep and wait for some reason.
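
One alternative to a fixed sleep is to poll until the service answers. The sketch below is a generic retry helper, not what the pipeline currently does; the function name wait_until and the curl check in the usage comment are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: retry a readiness check until it succeeds or a timeout is reached.

wait_until() {
    local timeout="$1"; shift
    local waited=0
    # "$@" is the check command; it is retried about once per second.
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Hypothetical usage: block until the Neo4j HTTP port answers, then continue.
# wait_until 120 curl -sf http://localhost:7474 && echo "database is up"
```

If this is embedded in a Nextflow script block written as a double-quoted string, Bash variables such as $waited would need escaping (\$waited), because Nextflow interpolates $ itself. Keeping the helper in a standalone script file avoids that.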

glstott commented 2 years ago

This occurs because the Docker image we're using treats the database as an existing one if you provide the data folder on the command line. Singularity requires me to include this data folder, so this may not be resolvable. I think this may also be the source of our plugin problem, so manual integration may be needed there as well.

glstott commented 2 years ago

Since Singularity has a limited view into the filesystem in which it is run, the default symbolic links to output files are insufficient. I modified the publishDir directive to copy the file, so it exists in both the work directory and the target directory.

publishDir(path: dbDir + '/import', 
        mode: 'copy', 
        overwrite: true)

This corrected the issue. We now have a pipeline which starts to load the database!
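
The difference between linking and copying can be reproduced outside Nextflow. The sketch below uses placeholder directories and a made-up helper name, publish_copy. It publishes a file by copying its contents, so the result does not depend on the link target being visible inside the container.

```shell
#!/usr/bin/env bash
# Sketch: publish by copy so the file survives a restricted filesystem view.
# Directory names are placeholders (assumptions), not the pipeline's real paths.

# Copy the file contents; -L follows a symlink through to the real data.
publish_copy() {
    local src="$1" dest_dir="$2"
    mkdir -p "$dest_dir"
    cp -L "$src" "$dest_dir/"
}

# Hypothetical usage:
# publish_copy work/ab/cdef/nodes.csv "$HOME/neo4j_empty/import"
```

A symlink in the import folder would point back into the work directory, which is not among the paths bound into the instance, so it would dangle from the container's point of view. A copy has no such dependency.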

Next, we need to create a stdout output that lists elements which haven't been fully analyzed, then build the last step of the pipeline. Another to-do item: possibly automate unzipping the GISAID tarballs and concatenating the resulting CSV and FASTA files.
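
A rough sketch of the tarball to-do item follows. It rests on several assumptions that would need checking against real downloads: archives match *.tar* and can be read by tar -xf, each contains *.csv and *.fasta files, and every CSV shares the same header row. The function name merge_tarballs and the output file names are made up.

```shell
#!/usr/bin/env bash
# Sketch: unpack every tarball in a directory and merge the contents.
# ASSUMPTIONS: archives match *.tar*, each holds *.csv and *.fasta files,
# and all CSVs share one header row. Adjust for the real downloads.

merge_tarballs() {
    local in_dir="$1" out_dir="$2"
    local scratch tarball i=0
    scratch=$(mktemp -d)
    mkdir -p "$out_dir"
    : > "$out_dir/merged.csv"
    : > "$out_dir/merged.fasta"

    # Extract each archive into its own folder so file names cannot collide.
    for tarball in "$in_dir"/*.tar*; do
        [ -e "$tarball" ] || continue      # the glob matched nothing
        i=$((i + 1))
        mkdir -p "$scratch/$i"
        tar -xf "$tarball" -C "$scratch/$i"
    done

    # FASTA records can be appended as-is; awk 1 guarantees a final newline.
    find "$scratch" -type f -name '*.fasta' | sort | while read -r f; do
        awk 1 "$f" >> "$out_dir/merged.fasta"
    done

    # Keep the header from the first CSV only, then append data rows.
    find "$scratch" -type f -name '*.csv' | sort | while read -r f; do
        if [ ! -s "$out_dir/merged.csv" ]; then
            awk 1 "$f" >> "$out_dir/merged.csv"
        else
            awk 'NR > 1' "$f" >> "$out_dir/merged.csv"
        fi
    done

    rm -rf "$scratch"
}

# Hypothetical usage:
# merge_tarballs ./downloads ./merged
```

This does not deduplicate records that appear in more than one archive; that would need a separate pass.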

glstott commented 2 years ago

Wrapped up all of the most important elements. I still need to do more thorough unit testing. I'm not a fan of how I ensure execution order at the moment; I think moving to DSL2 may be necessary sooner rather than later. Step ordering is particularly important for the first install of the database.

Important note: there is a bug which forces indexed relationship properties to behave as floating-point values, preventing the use of =. For now, I'm using a conditional on either end, but I should look into a cleaner fix. It may involve removing the index.

glstott commented 2 years ago

Successful end-to-end run, MST network included. The slowest elements are extrinsic to the workflow. Now I just need to make the script more user-friendly and add comments before pushing back up to main. Note: the new commits are not yet pushed. I'm waiting to push them to main until colleagues wrap up their elements.