Updated Nextflow pipeline.
Progress made on the Singularity/Docker image front. A minimum viable set of commands can be found below. Note: in our environment, I ended up needing to forward both port 7687 and port 7474 to access the browser; neither one alone was sufficient.
singularity instance start --bind $HOME/neo4j_empty/data:/data --bind $HOME/neo4j_empty/logs:/logs \
--bind $HOME/neo4j_empty/import:/var/lib/neo4j/import \
--bind $HOME/neo4j_empty/plugins:/plugins --bind $HOME/neo4j_empty/conf:/var/lib/neo4j/conf \
--bind $HOME/neo4j_empty/var:/var/lib/neo4j/run --env-file env-file docker://neo4j:latest test_neo
singularity shell instance://test_neo
neo4j start
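For the record, the port forwarding mentioned above was just an SSH tunnel from my workstation to the node running the instance; a minimal sketch (user and hostname are placeholders):

ssh -L 7474:localhost:7474 -L 7687:localhost:7687 user@compute-node

7474 serves the Neo4j browser over HTTP, while 7687 is the Bolt port the browser connects back through, which is presumably why neither port alone was enough.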
In addition, to get plugins to work as expected, I needed to manually download the jar files, place them in the plugins folder, and whitelist the procedures in the conf file. The env-file contains the following for a test case:
NEO4J_AUTH=neo4j/test
NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
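For reference, the whitelisting mentioned above amounts to a couple of lines in neo4j.conf; a sketch assuming the APOC plugin (substitute whatever procedures your jars provide, and note that newer Neo4j versions spell the second key dbms.security.procedures.allowlist):

dbms.security.procedures.unrestricted=apoc.*
dbms.security.procedures.whitelist=apoc.*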
The Neo4J plugin line which would automatically load these is not yet working, so it is omitted here. I opened a discussion in the community forums to see if the neo4j community has already resolved this.
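For context, the line in question is the image's automatic plugin installer, which (if I have the 4.x-era variable name right) would be a single extra env-file entry along these lines; the quoting may need adjusting for how Singularity sources the env-file:

NEO4JLABS_PLUGINS='["apoc"]'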
Another strange integration bug: the script within the Nextflow pipeline fails to wait for the previous command to finish. I wrapped it up manually in a shell script and it functions as expected. It seems like the script element in Nextflow ignores sleep and wait for some reason.
This occurs because the docker image we're using will treat the database as an existing one if you bind in the data folder as part of the command. Singularity requires me to include this data folder, so this may not be resolvable. I think this may also be the source of our plugin problem, so manual integration may be needed there as well.
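Going back to the waiting workaround: the manual wrapping is just a small shell script that starts the server and blocks until it answers queries. A sketch, with credentials matching the test env-file above and an arbitrary no-op query as the readiness probe:

#!/bin/bash
# start the server, then poll with cypher-shell until it responds
neo4j start
until cypher-shell -u neo4j -p test "RETURN 1;" > /dev/null 2>&1; do
    sleep 5
done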
Since Singularity has a limited view into the filesystem in which it is run, the default symbolic links to output files are insufficient. I modified the publishDir directive to duplicate the file in both the work and target directories:
publishDir(path: dbDir + '/import',
           mode: 'copy',
           overwrite: true)
This was able to correct the issue. We now have a pipeline which starts to load the database!
Next, we need to emit to stdout a list of elements which haven't been fully analyzed, then build the last step of the pipeline. Another to-do item: maybe automate the unzipping of GISAID tarballs and the concatenation of the resulting CSV and fasta files.
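If we do automate that, the step is simple enough to sketch in shell (tarball naming and extensions are assumptions about the GISAID export format, and a naive cat would duplicate CSV headers, hence the head/tail dance):

# unpack each downloaded tarball; tar -xf auto-detects compression
for t in gisaid_*.tar*; do
    tar -xf "$t"
done
# capture the file lists before creating outputs so the globs
# don't pick up the combined files themselves
csvs=(*.csv)
fastas=(*.fasta)
head -n 1 "${csvs[0]}" > combined.csv        # header from the first file only
tail -q -n +2 "${csvs[@]}" >> combined.csv   # data rows from every file
cat "${fastas[@]}" > combined.fasta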
Wrapped up all of the most important elements. I need to do more thorough unit testing. I'm not a fan of how I ensure execution order at the moment; moving to DSL2 may be necessary sooner rather than later. Sequencing is particularly important for the first install of the database.
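The reason DSL2 helps: execution order falls out of the channel wiring between named processes rather than ad hoc tricks. A minimal sketch with hypothetical stand-in processes, not our real pipeline steps:

nextflow.enable.dsl=2

// placeholder processes standing in for the real pipeline steps
process BUILD_IMPORT {
    input:
    path x
    output:
    path 'import.csv'
    script:
    "cp $x import.csv"
}

process LOAD_DB {
    input:
    path csv
    script:
    "echo loading $csv"
}

workflow {
    // LOAD_DB cannot start until BUILD_IMPORT emits its output,
    // so the first-install sequencing is guaranteed by the dataflow
    BUILD_IMPORT(Channel.fromPath(params.input))
    LOAD_DB(BUILD_IMPORT.out)
}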
Important note: there is a bug which forces indexed relationship properties to act as floating points, preventing the use of = in queries. For now, I'm just using a conditional on either end, but I should look into how to fix this more cleanly; it may involve removing the index.
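Concretely, the conditional-on-either-end workaround looks like this in Cypher (relationship type and property name are made up for illustration):

// equality on the indexed float property fails, so bound it instead:
MATCH ()-[r:CONNECTED]-()
WHERE r.dist >= $d AND r.dist <= $d
RETURN r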
Successful end-to-end run, MST network included. The slowest elements are extrinsic to the workflow. Now I just need to make the script more user-friendly and add comments before pushing back up to main. Note: the new commits have not been pushed yet; I'm waiting to push to main until colleagues wrap up their elements.