nanoporetech / DTR-phage-pipeline

Mozilla Public License 2.0

Bioinformatics Pipeline Inoperative: Package Dependencies Updated? #13

Open ttschulze opened 3 months ago

ttschulze commented 3 months ago

Hi all,

We are trying to get this pipeline working with the included control dataset so we can use it to characterize a variety of novel bacteriophages that we think may be circularly permuted with terminal repeats.

However, after cloning the pipeline from git and following the tutorial to test the included control dataset, we are running into errors indicating the pipeline may be broken.

Our first error (see screenshot):

snakemake: error: unrecognized arguments: -r

This seems to be caused by the conda environment requesting the newest version of snakemake, in which the "-r" flag is no longer supported (see screenshot).

We did find that by editing the environment.yml file to pin the originally intended version of snakemake (changing ">= 5.14.0" to "= 5.14.0"), we were able to get past this error.
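For reference, this is roughly what the edited dependency looks like in environment.yml (the surrounding entries here are illustrative placeholders; only the snakemake line reflects our actual change):

channels:
  - conda-forge
  - bioconda
dependencies:
  - snakemake=5.14.0  # was snakemake>=5.14.0, which pulls in a release that dropped the -r flag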

However, we are currently stuck at the following error (see screenshot).

The log file shows the following (see screenshot).

We are assuming this is an issue with newer packages breaking the pipeline, but any thoughts on how to resolve these issues would be greatly appreciated. Please let me know if I can detail anything further.

mubashirhanif commented 3 months ago

Hey, just saw your Upwork post. I think the problem is with the dependencies. I managed to get the test to start and complete about 10%, but then I receive an error about the data being corrupted because the first job is unable to process the sequences into the .tsv file; I suspect this is also a dependency issue, with the minimap2 package. I am also using the arm64/aarch64 versions. Furthermore, I did not use conda but plain old virtualenv with Python 3.9 on Ubuntu. These are the steps I followed to get where I am right now:

sudo apt -y update && sudo apt install seqkit
pip install virtualenv
python -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt  # See the attached requirements.txt

Here is where I am at (screenshot below).

Screenshot 2024-03-09 at 13 08 03

requirements.txt

jmeppley commented 3 months ago

Thanks for checking out our workflow. I'm sorry it's not working for either of you. I've only just begun to look into your issues, but it looks like there are at least two different problems.

It looks like there's a dependency problem caused by continuing development on snakemake. I will try to get a more current set of dependencies posted this week.

It looks like @mubashirhanif found a workaround with virtualenv and ran into a separate issue where their run failed to find any concatemers. @mubashirhanif, did you use the test data for this or your own data?

mubashirhanif commented 3 months ago

@jmeppley I used the test data. But it seems like yet another dependency issue, relating to seqkit/minimap2.

jmeppley commented 3 months ago

@ttschulze , I updated the environment definition files, but I don't seem to have write access to this repository anymore. I'm no longer actively collaborating with ONT, so I guess my privileges expired.

So the pull request is not yet merged; in the meantime, you can try cloning my version of the repo: https://github.com/jmeppley/DTR-phage-pipeline
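If it helps, the steps would look roughly like this (the environment name and core count are placeholders; please follow the README for the exact setup and snakemake invocation):

git clone https://github.com/jmeppley/DTR-phage-pipeline
cd DTR-phage-pipeline
conda env create -f environment.yml
conda activate dtr-phage-pipeline  # placeholder name; use whatever the environment file defines
snakemake --use-conda -j 4         # -j value is a placeholder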

I had to fall back to an older version of snakemake, which is a little flaky. Sometimes, you have to run the workflow twice for it to get all the way through. I would like to update it to the latest snakemake, but that's going to take more time.

@mubashirhanif, I can't reproduce your error. Can you try the updated conda environments and let me know if that changes anything?

Update: I was able to get it working on snakemake 8. It seems a bit more stable. (This is still only on my fork.)

dnev1551 commented 3 months ago

Hi @jmeppley , I am working with @ttschulze on this. I tried to clone your version of the repo...

This did get us further than we had gotten before, but it eventually returned this error (log file: 2024-03-12T143718.253640.snakemake.log):

[screenshot]

Were you able to successfully run?

jmeppley commented 3 months ago

@dnev1551 OK, that's progress, but I'm sorry it's still not working. It looks like the conda installer for medaka is not pulling in the correct dependencies for you. Can you give me a little more information? What does this directory contain:

ls /home/aneville/DTR-phage-pipeline/.snakemake/conda/8dcd12a1216777f640f664fa75df4bc8_/conda-meta

I'm particularly interested in the versions of numpy, h5py, and python, but it would be best to just cut and paste the whole list.
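If the full listing is long, something like this would also do, just to pull out the packages I'm after (the grep pattern is only a suggestion):

ls /home/aneville/DTR-phage-pipeline/.snakemake/conda/8dcd12a1216777f640f664fa75df4bc8_/conda-meta | grep -E 'python|numpy|h5py'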

dnev1551 commented 3 months ago

@jmeppley, let me know if you need any other info. Thank you!

ls /home/aneville/DTR-phage-pipeline/.snakemake/conda/8dcd12a1216777f640f664fa75df4bc8_/conda-meta

[screenshot]

jmeppley commented 3 months ago

That's odd. The versions of h5py and numpy in that environment should be compatible. I just tested on my system.

What version of conda are you using?

What's the output of this:

conda activate /home/aneville/DTR-phage-pipeline/.snakemake/conda/8dcd12a1216777f640f664fa75df4bc8_
python -c "import numpy; print(numpy.__version__); from importlib import reload; print(reload(numpy));"

Try replacing the contents of envs/medaka-0.11.0.yml with:

channels:
  - conda-forge
  - bioconda
dependencies:
  - medaka>=1.4.3
  - h5py>=3.9
  - numpy>=1.24

then try running snakemake again.
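If snakemake doesn't rebuild the environment on its own after the edit, clearing the old one before rerunning should force it. These commands are just illustrative; launch snakemake however you have been running it:

rm -rf /home/aneville/DTR-phage-pipeline/.snakemake/conda/8dcd12a1216777f640f664fa75df4bc8_*
snakemake --use-conda -j 8  # -j value is a placeholder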

dnev1551 commented 3 months ago

@jmeppley See my responses below. That change to envs/medaka-0.11.0.yml allowed the entire pipeline to run successfully on the test data. Thank you so much! I am relatively inexperienced with the command line, so please bear with me.

One issue that I saw in many instances while the pipeline was executing:

No kaiju DB found in /media/sfodata/databases/kaiju/nr_euk. Skipping Kaiju

Are there any potential solutions? There is an image of a few instances at the end of this post.

That's odd. The versions of h5py and numpy in that environment should be compatible. I just tested on my system.

What version of conda are you using?

Before attempting any fixes, I installed Anaconda3-2024.02-1-Linux-x86_64.sh

However, when I ran conda -V it returned:

Error while loading conda entry point: conda-libmamba-solver (libarchive.so.13: cannot open shared object file: No such file or directory)
conda 23.7.4

What's the output of this:

conda activate /home/aneville/DTR-phage-pipeline/.snakemake/conda/8dcd12a1216777f640f664fa75df4bc8_
python -c "import numpy; print(numpy.__version__); from importlib import reload; print(reload(numpy));"

The output was:

1.24.4
/home/aneville/DTR-phage-pipeline/.snakemake/conda/8dcd12a1216777f640f664fa75df4bc8_/lib/python3.8/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
  _bootstrap._exec(spec, module)
<module 'numpy' from '/home/aneville/.local/lib/python3.8/site-packages/numpy/__init__.py'>

Try replacing the contents of envs/medaka-0.11.0.yml with:

channels:
  - conda-forge
  - bioconda
dependencies:
  - medaka>=1.4.3
  - h5py>=3.9
  - numpy>=1.24

then try running snakemake again.

This allowed the pipeline to execute without erroring out! However, as mentioned above, it is skipping Kaiju.

Thank you for your time and help; we really appreciate it!!

[screenshot]

jmeppley commented 3 months ago

One issue that I saw in many instances while the pipeline was executing: No kaiju DB found in /media/sfodata/databases/kaiju/nr_euk. Skipping Kaiju

The workflow can run kaiju on your reads to estimate the composition of your sample. It's purely informational, but it can be useful. However, the kaiju database is too big to distribute with the code, so it's not part of the test. To configure it, you'll have to download a Kaiju database (there should be a link in the README) and update config.yml.
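Roughly, config.yml just needs to point at the directory where you unpacked the database. The key names below are from memory and may not match exactly, so please check them against the config.yml in the repo:

# hypothetical excerpt of config.yml; verify the exact key names in the repo's file
KAIJU:
  db: /path/to/your/kaiju/nr_euk  # directory containing the downloaded Kaiju index files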

Before attempting any fixes, I installed Anaconda3-2024.02-1-Linux-x86_64.sh

However, when I ran conda -V it returned: Error while loading conda entry point: conda-libmamba-solver (libarchive.so.13: cannot open shared object file: No such file or directory) conda 23.7.4

What's the output of this: conda activate /home/aneville/DTR-phage-pipeline/.snakemake/conda/8dcd12a1216777f640f664fa75df4bc8_; python -c "import numpy; print(numpy.__version__); from importlib import reload; print(reload(numpy));"

The output was:

1.24.4
/home/aneville/DTR-phage-pipeline/.snakemake/conda/8dcd12a1216777f640f664fa75df4bc8_/lib/python3.8/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
<module 'numpy' from '/home/aneville/.local/lib/python3.8/site-packages/numpy/__init__.py'>

There is something funky with your conda. The numpy module should have been imported from within the conda environment (.../.snakemake/conda/8d...), not from your user site-packages (.../.local/lib/...). The "imported a second time" warning is most likely a symptom of the same problem. I would guess it's connected to the warning you got from conda --version.

The updated medaka.yaml works because it's using a suite of package versions that happen to work fine with the numpy you have installed in your base environment. Normally, this wouldn't be necessary and the original package specifications would have worked.

It's possible you have two competing conda installations configured in your shell. If you are using bash, check your ~/.bashrc and ~/.bash_profile files. There should only be one block of code setting up conda. (Although I use miniconda and am not as familiar with Anaconda.) It might be worth deleting everything conda-related in your bash config files and installing miniconda from scratch. You could also reach out to the conda team for help.
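A quick way to check is to look for the conda setup markers in your shell config; conda init brackets its block with "conda initialize" comment lines, so markers pointing at more than one installation (or repeated blocks) would suggest the competing-setup problem:

grep -n "conda initialize" ~/.bashrc ~/.bash_profile 2>/dev/null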

dnev1551 commented 3 months ago

@jmeppley Okay, I think I understand where that competing conda issue arose from. Thank you so much for your time and help! We really appreciate it. I am wondering whether a possible error might arise because we used the newer basecaller, Dorado, as opposed to the older Guppy. I will reach out if we run into trouble, but again, we REALLY appreciate all of your time and help!!!

dnev1551 commented 3 months ago

Hello @jmeppley, we ran an entire ONT sequencing run, consisting of just one specific novel phage, on a MinION R9.4.1 flow cell with the LSK-109 kit, which you had used in your publication (obviously this was overkill, but it was our first time doing gDNA with the MinION). We previously sequenced this novel phage with Illumina short reads and analyzed that data (which showed convincing evidence of concatemers and DTRs, and a genome size of ~97 kb).

If I am understanding correctly, this pipeline is mostly for metagenomic analysis (we had previously read your publication and all the detailed supplementary info/methods as well, which is why we chose to start with this pipeline for the ONT long reads). Very informative and impressive publication! However, I thought it would still work even when running a single isolated phage. I was able to run the entire pipeline (including Kaiju), but am wondering about the following (see screenshot):

[screenshot]

Thank you! Sorry for all of the questions, just want to make sure I am using the pipeline correctly, so I can interpret the results accurately!

jbeaulaurier commented 3 months ago

Hi Andrew,

Thanks for your interest in the pipeline. I would highly recommend just running the output of your sequencing through the Flye assembler and working with what comes out of that. I don't think our pipeline makes a ton of sense for working with a single phage and it will likely upset certain assumptions made in the workflow.

If you're looking for the genome and annotation of genes, etc., a simple de novo assembly of the sequencing data should be sufficient. Once you have an assembly, our pipeline might be of help in just suggesting what steps you might want to run manually on your assembly (like Prodigal, etc.).
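For reference, a basic Flye run looks something like this (the file names and thread count are placeholders, and you'd pick --nano-raw or --nano-hq depending on how the reads were basecalled):

flye --nano-raw your_reads.fastq.gz --out-dir flye_assembly --threads 16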

Hope this helps, John

dnev1551 commented 3 months ago

@jbeaulaurier Okay, thank you for clarifying. I appreciate it!

ttschulze commented 3 months ago

@jmeppley Thank you so much for your assistance with the technical troubleshooting; we were able to make it all the way through. I was hoping to ask a follow-up about the DTR sequences. In our case, the outputs do support the presence of a DTR in the polished genomes. I am curious whether there is a method in the pipeline to obtain the actual DTR sequence (basically, to isolate it so we can examine the sequence itself). I am expecting a 566 bp DTR, and I see the outputs that bin the reads containing the DTR, but I can't find the DTR sequence by itself for mapping/analysis, etc. I would very much appreciate any thoughts here, in case I'm simply missing it.

Example output for the DTR alignment is attached (0_62 ref_read dtr aligns). It seems to support a fixed DTR well; we are curious whether anything looks out of the ordinary to you.

jmeppley commented 3 months ago

No, I'm sorry. If I recall correctly, we do not extract the DTR sequence. The pipeline just runs minimap2 and inspects the PAF output (which does not include sequences, only alignment locations).
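That said, since the PAF does record the alignment coordinates on each read, you could pull the repeat region out yourself with something along these lines (the paths are placeholders and the exact PAF file name depends on your output directory layout; seqkit subseq would work in place of samtools):

# PAF columns 1, 3 and 4 are the read name and the 0-based start/end of the aligned block.
# Take the coordinates from the first record and extract that region from the binned reads.
PAF=path/to/your/dtr_alignments.paf      # placeholder
READS=path/to/your/binned_reads.fasta    # placeholder
read NAME START END < <(awk 'NR==1{print $1, $3+1, $4}' "$PAF")
samtools faidx "$READS" "${NAME}:${START}-${END}" > dtr_candidate.fasta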