gmkov opened this issue 5 months ago
Thanks for writing. I'd like to diagnose this with you, and it seems to be two parts:

**Problem 1: r-flexdashboard is not being detected**

This is really weird because the package is installed into the workflow's R environment, and I'm not sure why this would be happening. Neither is it an issue anywhere in Harpy's test suite (example). I could try to add, as a failsafe, a `require` section to the top of the Rmd files that will install it under the hood just in case, although that's a pretty hacky solution. All the logic currently suggests it works, and I'm at a loss as to why your system is fussing with it. Of the environments snakemake creates in the .snakemake/ directory, can you check in the .snakemake/conda folder that the R environment (it will have some kind of checksum name) has r-flexdashboard listed in the deps? Or the contents of .harpy_envs/r-env.yml. Also, can you verify that you installed harpy using both the bioconda and conda-forge channels via mamba/conda create -n envname -c bioconda -c conda-forge harpy?
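If it helps, that check can be scripted rather than eyeballed. A small sketch (paths as described above; the `name: r-env` match is an assumption about the spec contents) that scans the checksum-named env specs for the dependency:

```python
from pathlib import Path

# Scan the env specs snakemake materialized under .snakemake/conda and
# report whether each R environment lists r-flexdashboard in its deps.
for spec in sorted(Path(".snakemake/conda").glob("*.yaml")):
    text = spec.read_text()
    if "name: r-env" in text:
        status = "present" if "r-flexdashboard" in text else "MISSING"
        print(f"{spec.name}: r-flexdashboard {status}")
```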
**Problem 2: the IsADirectoryError with the {outputdir}/workflow/report/ directory**

To diagnose that, can you please provide me with the basic directory structure you have? From the harpy call, I can tell you're not specifying an output directory, meaning the output should be Impute/. Therefore:
1. What is the directory you're running harpy impute from?
2. What does that directory look like inside?
3. Does Impute/ already exist, and what is inside of it?

I'm sure we can get to the bottom of this!
Hi, thank you very much for the fast reply!
- This arose while running the preflight bam check. I did solve it by installing flexdashboard manually within the conda environment, but as you say, this might point towards installation issues.
- Looking into the snakemake conda dir, this is the R env dependencies list, and r-flexdashboard is there:
(harpy) [mgm49@login-n-1 harpy]$ less .snakemake/conda/6f556b53477f5b6b15e08502da2b35c8_.yaml
name: r-env
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- bioconductor-complexheatmap
- r-circlize
- r-dt
- r-flexdashboard
- r-ggplot2
- r-ggridges
- r-plotly
- r-tidyr
- r-stitch
and checking .harpy_envs/r-env.yaml, flexdashboard is also there:
(harpy) [mgm49@login-n-1 harpy]$ less .harpy_envs/r-env.yaml
name: r-env
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- bioconductor-complexheatmap
- r-circlize
- r-dt
- r-flexdashboard
- r-ggplot2
- r-ggridges
- r-plotly
- r-tidyr
- r-stitch
- I've checked in R within the harpy env, and none of the packages above are installed unless I had them installed previously. So yes, something went wrong during the installation of the R dependencies.
### Installation
I had a couple of issues installing mamba, but my successful installation commands were as follows.
Using conda 24.3.0 and mamba 1.5.8:
conda install conda-forge::mamba
mamba create -n harpy -c bioconda -c conda-forge harpy
mamba activate harpy
During the first preflight checks I got this error: "Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorit…"
Which I fixed as advised:
less ~/.condarc
conda config --set channel_priority strict
conda update --all
### Problem 2 - directory error
I attach a tree of the current harpy dir; I'm testing with 6 bam files: [harpy.dir.tree.txt](https://github.com/pdimens/harpy/files/14947832/harpy.dir.tree.txt)
1. the directory from which I'm running harpy impute is called... harpy
2. see the attached tree
3. the dir Impute/ exists, but I rm -R it before each run (I know this seems a bit dodgy) and still get the error

Note that I first ran harpy impute in a subdirectory of harpy called harpy/stitch/, but I have since deleted that and now run everything in the base harpy/, since it looked like that's how it prefers to work and where SNP/ and bams/ are.
Could I clean up snakemake somehow to restart harpy impute from scratch? Thanks again!
As an additional test, I ran the module harpy phase with the following command:
nice harpy phase --threads 20 --vcf SNP/mpileup/variants.raw.bcf bams 2> log.phase &
And it exited with
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-04-11T162236.855891.snakemake.log
WorkflowError:
At least one job did not complete successfully.
Found an error in the snakelog at the linkFragments step:
Error in rule linkFragments:
jobid: 34
input: Phase/workflow/input/CAM046072.bam, Phase/input/CAM046072.het.bcf, Phase/extractHairs/CAM046072.unlinked.frags
output: Phase/linkFragments/CAM046072.linked.frags
log: Phase/linkFragments/logs/CAM046072.linked.log (check log file(s) for error details)
26c75f5fe6300b0faecde6366_
shell:
ags --out Phase/linkFragments/CAM046072.linked.frags -d 100000 > Phase/linkFragments/logs/CAM046072.linked.log 2>&1
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
So I went to the linked log and found that the issue is pysam!! An issue with dependencies again.
(harpy) [mgm49@login-n-1 harpy]$ less Phase/linkFragments/logs/CAM046072.linked.log
Traceback (most recent call last):
File "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/.snakemake/conda/0dd50bd26c75f>
import pysam
ImportError: No module named 'pysam'
I'd be happy to make a clean install of harpy if you could advise on how to do that without risking incompatibilities. Thanks again!
I'm still thinking about the first issue, but I think the issue with pysam etc. is that I have a shebang at the top of the scripts that tells it to use the system's python interpreter, not the one in the conda env. I'm making those changes to the development version right now. I can follow up this afternoon on installing the dev version to test whether removing that fixes it.
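To illustrate the shebang problem (a generic sketch, not harpy's actual script): a script beginning with `#!/usr/bin/python` always runs under the system interpreter, which can't see a conda env's site-packages (hence `No module named 'pysam'`), whereas `#!/usr/bin/env python` resolves `python` through PATH, so the active conda env's interpreter wins:

```python
#!/usr/bin/env python
# With "#!/usr/bin/python" the script is pinned to the system interpreter,
# which does not know about packages installed only in a conda env.
# With "#!/usr/bin/env python" the interpreter is looked up on PATH, so an
# activated conda env's python (and its site-packages) is used instead.
import shutil
import sys

print(sys.executable)           # the interpreter actually running this script
print(shutil.which("python"))   # what an env-style shebang would resolve to
```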
@gmkov there are some huge internal API changes in that branch I was working with. I'm trying to merge it into dev, but it's failing several tests. Once I hammer out these issues (I'll ping you), you can install the dev branch locally and we can see what comes of it then. To install harpy locally:
# get the git repo
> git clone --depth 1 https://github.com/pdimens/harpy.git
> cd harpy
# switch to dev branch
> git checkout dev
# create conda/mamba env
> mamba env create --name harpydev --file resources/harpyenv.yaml
> mamba activate harpy
# install everything
> bash resources/buildlocal.sh
@gmkov Your mamba installation was not done the recommended way. See the official mamba docs on how to install it correctly via miniforge or micromamba: https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html#fresh-install-recommended It could be that the conda install is the issue. You can also skip mamba altogether (remove it) and use the latest version of conda, which uses the libmamba solver.
Try starting from there and see if it still creates issues. I'm almost ready with the developmental version.
Hi
Yes, I (and someone else) tried to install mamba that way and kept getting errors (sorry, I didn't keep track of those), hence my attempt with conda-forge::mamba.
I have uninstalled the conda-forge mamba by running conda uninstall mamba, and it seems to have worked.
Then I updated conda; I now have conda 24.3.0 (although I think it was already updated):
(base) [mgm49@login-n-1 vcf]$ conda update -n base conda
Channels:
- conda-forge
- bioconda
- defaults
- anaconda
Then I ran this to install harpy with conda:
conda create -n harpy-conda -c bioconda -c conda-forge harpy
And the following errors came up, something with libmamba:
(base) [mgm49@login-n-1 vcf]$ conda create -n harpy-conda -c bioconda -c conda-forge harpy
Channels:
- bioconda
- conda-forge
- defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: - warning libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
...
warning libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY failed
LibMambaUnsatisfiableError: Encountered problems while solving:
- package harpy-0.8.0-py312h9ee0642_0 requires bcftools 1.19.*, but none of the providers can be installed
Could not solve for environment specs
The following packages are incompatible
└─ harpy is not installable because there are no viable options
├─ harpy [0.1.1|0.2.0|0.3.0|0.4.0] would require
│ └─ r-stitch, which requires
│ └─ r-data.table >=1.11.8 , which conflicts with any installable versions previously reported;
├─ harpy [0.5.0|0.6.0|0.6.1|0.7.0|0.7.3] would require
│ └─ pysam 0.22.* , which requires
│ └─ libdeflate >=1.18,<1.19.0a0 , which conflicts with any installable versions previously reported;
└─ harpy [0.8.0|0.9.1] would require
└─ bcftools 1.19.* but there are no viable options
├─ bcftools 1.19 would require
│ └─ htslib >=1.19,<1.20.0a0 but there are no viable options
│ ├─ htslib 1.19 would require
│ │ └─ libdeflate >=1.18,<1.20.0a0 , which conflicts with any installable versions previously reported;
│ └─ htslib 1.19.1 would require
│ └─ libdeflate >=1.18,<1.19.0a0 , which conflicts with any installable versions previously reported;
└─ bcftools 1.19 would require
└─ htslib >=1.19.1,<1.20.0a0 , which cannot be installed (as previously explained).
I can try with harpy dev as explained above. Thank you!
Woah, that's a new one. Can you undo the conda strict priority that you implemented before? Snakemake likes to complain about that, but it honestly hasn't been an issue, so I never actually set it to strict. Also, remove anaconda from the channels, I think.
OK, changed strict priority again and updated all:
(base) [mgm49@login-n-1 home]$ conda config --set channel_priority flexible
(base) [mgm49@login-n-1 home]$ conda update --all
Retrieving notices: ...working... done
Channels:
- conda-forge
- bioconda
- defaults
- anaconda
Platform: linux-64
so now my condarc looks like this:
(base) [mgm49@login-n-1 home]$ less ~/.condarc
channels:
- conda-forge
- bioconda
- defaults
channel_priority: flexible
Weirdly, the anaconda channel is NOT in ~/.condarc. Now I run the conda install again:
conda create -n harpy-conda -c bioconda -c conda-forge harpy
and things installed nicely!
TEST 1: impute
Now I re-ran stitch in the new harpy-conda env:
nice harpy impute --threads 10 --parameters stitch.params \
  --vcf SNP/mpileup/variants.raw.bcf \
  bams 2> log.impute &
but sadly I still get "IsADirectoryError: [Errno 21] Is a directory:"
TEST 2: preflight check
New problems:
(harpy-conda) [mgm49@login-n-1 harpy]$ harpy preflight bam bams
╭─ Harpy preflight bam ───────────────────────────────────────────────────╮
│ Samples: 6 │
│ Output Directory: Preflight/bam/ │
╰─────────────────────────── Running Workflow ────────────────────────────╯
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
CreateCondaEnvironmentException:
The 'mamba' command is not available in the shell /usr/bin/bash that will be used by Snakemake. You have to ensure that it is in your PATH, e.g., first activating the conda base environment with `conda activate base`.The mamba package manager (https://github.com/mamba-org/mamba) is a fast and robust conda replacement. It is the recommended way of using Snakemake's conda integration. It can be installed with `conda install -n base -c conda-forge mamba`. If you still prefer to use conda, you can enforce that by setting `--conda-frontend conda`.
It's realised there is no mamba and is not happy about it. Could I try installing it in the way described in the error and reinstall? conda install -n base -c conda-forge mamba
That's progress, of a sort!
Try adding -s "--conda-frontend conda" to the harpy call. Let's see what happens.
nice harpy impute --threads 10 --parameters stitch.params \
-s "--conda-frontend conda" \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute &
Hi, unfortunately the same IsADirectoryError appears, even though I delete the whole Impute/ dir before running harpy, just in case.
(harpy-conda) [mgm49@login-n-1 harpy]$
[1]+ Exit 1 nice harpy impute --threads 10 --parameters stitch.params -s "--conda-frontend conda" --vcf SNP/mpileup/variants.raw.bcf bams 2> log.impute
(harpy-conda) [mgm49@login-n-1 harpy]$ less log.impute
Traceback (most recent call last):
File "/home/mgm49/miniconda3/envs/harpy-conda/bin/harpy", line 10, in <module>
sys.exit(main())
^^^^^^
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/harpy/__main__.py", line 383, in main
cli()
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/rich_click/rich_command.py", line 126, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/harpy/impute.py", line 66, in impute
fetch_file(f"{i}.Rmd", f"{workflowdir}/report/")
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/harpy/helperfunctions.py", line 124, in fetch_file
shutil.copy2(result, destination)
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/shutil.py", line 475, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/shutil.py", line 260, in copyfile
with open(src, 'rb') as fsrc:
^^^^^^^^^^^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/Impute'
but when running the preflight with the conda frontend, it runs fine but complains about channel priorities (which I reverted to flexible), and fails at the end because flexdashboard is not installed:
(harpy-conda) [mgm49@login-n-1 harpy]$ harpy preflight bam bams -s "--conda-frontend conda"
╭─ Harpy preflight bam ───────────────────────────────────────────────────╮
│ Samples: 6 │
│ Output Directory: Preflight/bam/ │
╰─────────────────────────── Running Workflow ────────────────────────────╯
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job count
------------ -------
all 1
checkBam 6
createReport 1
log_runtime 1
mergeChecks 1
total 10
preflight bams error:
Activating conda environment: .snakemake/conda/6f556b53477f5b6b15e08502da2b35c8_
Error in loadNamespace(name) : there is no package called ‘flexdashboard’
Calls: <Anonymous> ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted
Any other ideas? Thanks!
This is so very confusing, honestly. For starters, and this may not be relevant, the bams folder (and any unnamed inputs) must be the last part of the command line call to harpy.
Second, I merged my branch into dev. Try to install the dev version of harpy into a fresh conda environment like I described above and see what happens. I'm almost positive the issue is with your conda installation but can't quite put my finger on where exactly. Hopefully the dev branch fixes the IsADirectory error with the new package structure.
You can also try to install mamba into whatever conda env harpy is installed into and scrap using --conda-frontend.
I need to caveat that the dev branch is very much a developmental branch; there are things on it I broke (the snp modules) and things that don't work yet (simulate reads). The newer internal logic may overcome the Impute issue, though.
Changing the preflight bams folder input to the end of the command did not fix the issue with flexdashboard not being installed. But if I install it manually, it works fine.
Sadly, I can't seem to get onto the dev branch following the steps above:
[mgm49@login-n-1 programmes]$ git clone --depth 1 https://github.com/pdimens/harpy.git
[mgm49@login-n-1 programmes]$ cd harpy/
[mgm49@login-n-1 harpy]$ git checkout dev
error: pathspec 'dev' did not match any file(s) known to git
Anything I can change? Thank you.
EDIT 1
I removed the --depth 1 and it seems to have worked??
[mgm49@login-n-1 programmes]$ git clone https://github.com/pdimens/harpy.git
[mgm49@login-n-1 programmes]$ cd harpy/
[mgm49@login-n-1 harpy]$ git checkout dev
branch 'dev' set up to track 'origin/dev'.
Switched to a new branch 'dev'
I had to change a few things from the commands you had sent to install dev, so I put them all below:
# get the git repo
git clone https://github.com/pdimens/harpy.git
cd harpy
# switch to dev branch
git checkout dev
# create conda/mamba env - I'm using conda, it's harpy.yaml (not harpyenv.yaml), and activate harpydev (not harpy)
conda env create --name harpydev --file resources/harpy.yaml
# it asked me to run conda init before conda activate, weird
conda init
# it showed that conda init had changed my bashrc and added lines; had to source it
source ~/.bashrc
# now can activate env
conda activate harpydev
# install everything - Successfully installed harpy-0.10.0
bash resources/buildlocal.sh
going to test imputation
Sorry about that, the --depth 1 wasn't necessary, as you discovered (a shallow clone only fetches the default branch, which is why git couldn't find dev). I think it's very weird/suspicious that you had to conda init.
@gmkov Please let me know if the IsADirectoryError persists.
Hi, I'm back in action! Sorry, I had to change priorities. I also had to reinstall miniconda3 in a completely new location (and get rid of the old install), as our home directory on the cluster is only 50 GB, and with all these environments I was quickly running out of quota. So now we have a fresh start to install harpy. I still don't want to use mamba, sorry; it messed things up last time and I'm scared.
#### installed fresh
conda create -n harpy -c bioconda -c conda-forge harpy
#### run preflight test. note that adding --conda-frontend conda without the -s didn't work.
harpy preflight bam -s "--conda-frontend conda" bams
# success (no issues with flexdashboard, yay). only complained about conda channels, so changed:
conda config --set channel_priority strict
#### call variants with mpileup, tell it to use conda. 6 samples, 20 threads - took 1h 20min
# it needs bam.bai as well in the input directory
genome=/home/mgm49/rds/rds-cj107-jiggins-rds/genomes/Harm_2021_China/sequences.fa
nohup nice harpy snp mpileup -s "--conda-frontend conda" --threads 20 --genome $genome bams/ > log.snp &
#### manually curate vcf
# No duplicate positions
# get positions
bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\n" variants.raw.bcf > variants.raw.bcf.pos
# Keep only biallelic positions. (Delete the allele with lower frequency)
# goes down from 430M to 427M
sed -i 's/,.//g' variants.raw.bcf.pos
# No duplicate sample names - check - nope
bcftools query -l SNP/mpileup/variants.raw.bcf
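As an aside on the sed step above: `s/,.//g` deletes each comma plus exactly one following character, which keeps only the first ALT allele. That is safe for single-base (SNP) alleles but not for multi-base ones. A tiny Python mirror of that edit (illustrative line only):

```python
import re

# Positions-file lines look like CHROM<TAB>POS<TAB>REF<TAB>ALT, with a
# comma-separated ALT field for multiallelic records (e.g. "A,T").
# `sed 's/,.//g'` removes every ",X" pair, leaving only the first allele;
# this is only correct when each alternate allele is a single base.
line = "chr1\t1000\tG\tA,T"
print(re.sub(r",.", "", line))   # keeps only "A" in the ALT column
```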
#### run impute
# add positions file with only biallelic sites
nice harpy impute --threads 20 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
-x '--posfile SNP/mpileup/variants.raw.bcf.pos' \
-s "--conda-frontend conda" \
bams > log.impute &
# noooo :( "IsADirectoryError: [Errno 21] Is a directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/Impute'"
Sadly hitting the IsADirectoryError again :( but everything else ran well, so I'm hoping harpy dev works.
### install harpydev
# get the git repo
git clone https://github.com/pdimens/harpy.git
cd harpy
# switch to dev branch
git checkout dev
# create conda/mamba env - I'm using conda, it's harpy.yaml (not harpyenv.yaml), and activate harpydev (not harpy)
conda env create --name harpydev --file resources/harpy.yaml
# it NO LONGER asks me to run conda init before conda activate- GOOD (conda behaving correctly now)
# now can activate env
conda activate harpydev
# install everything - Successfully installed harpy-0.10.0
bash resources/buildlocal.sh
#### preflight checks
harpy preflight bam -s "--conda-frontend conda" bams
# "Error in loadNamespace(name) : there is no package called ‘flexdashboard’
# Calls: <Anonymous> ... loadNamespace → withRestarts → withOneRestart → doWithOneRestart
# Execution halted
# install flexdashboard manually in R....
# try preflight bams again.
# new error, harpydev struggling to create reports
Error in parse_block(g[-1], g[1], params.src, markdown_mode) :
Duplicate chunk label 'status_df', which has been used for the chunk:
data2 <- data %>%
mutate(
nameMismatch = ifelse(nameMismatch > 0, "fail", "pass"),
noMI = ifelse(noMI == alignments, "fail", "pass"),
noBX = ifelse(noBX == alignments, "fail", "pass"),
badBX = ifelse(badBX > 0, "fail", "pass"),
bxNotLast = ifelse(bxNotLast > 0, "fail", "pass")
) %>%
select(-3) %>%
#rename("1" = "nameMismatch", "2" = "noMI", "3" = "noBX", "4" = "badBX", "5" = "bxNotLast") %>%
pivot_longer(-1, names_to = "category", values_to = "Status")
#data2$category <- as.integer(data2$category)
Calls: <Anonymous> ... process_file -> split_file -> lapply -> FUN -> parse_block
Execution halted
RuleException:
CalledProcessError in file /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Preflight/bam/workflow/preflight-bam.smk, line 90:
Command 'source /home/mgm49/rds/hpc-work/home/miniconda3/envs/harpydev/bin/activate '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/0362ca51d52d7128b23d37b26ad58301_'; set -euo pipefail; Rscript --vanilla -e 'rmarkdown::render("/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmpmp793p2d.PreflightBam.Rmd", output_file="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Preflight/bam/filecheck.bam.html", quiet=TRUE, knit_root_dir = "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev", params = list(rmd="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmpmp793p2d.PreflightBam.Rmd"))'' returned non-zero exit status 1.
#### jump ahead to call variants with mpileup
# it needs bam.bai as well in the input directory
genome=/home/mgm49/rds/rds-cj107-jiggins-rds/genomes/Harm_2021_China/sequences.fa
nohup nice harpy snp mpileup -s "--conda-frontend conda" --threads 20 --genome $genome bams/ > log.snp &
# error with some other library not installed
Quitting from lines at lines 59-70 [load environment] (tmp0ss9qmfd.BcftoolsStats.Rmd)
Error in `library()`:
! there is no package called 'highcharter'
Backtrace:
1. base::library(highcharter)
Execution halted
RuleException:
CalledProcessError in file /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/SNP/mpileup/workflow/snp-mpileup.smk, line 282:
Command 'source /home/mgm49/rds/hpc-work/home/miniconda3/envs/harpydev/bin/activate '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/0362ca51d52d7128b23d37b26ad58301_'; set -euo pipefail; Rscript --vanilla -e 'rmarkdown::render("/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmp0ss9qmfd.BcftoolsStats.Rmd", output_file="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/SNP/mpileup/reports/variants.raw.html", quiet=TRUE, knit_root_dir = "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev", params = list(rmd="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmp0ss9qmfd.BcftoolsStats.Rmd"))'' returned non-zero exit status 1.
[Mon Apr 29 13:59:09 2024]
Error in rule bcf_report:
jobid: 14254
input: SNP/mpileup/reports/data/variants.raw.stats
output: SNP/mpileup/reports/variants.raw.html
conda-env: /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/0362ca51d52d7128b23d37b26ad58301_
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
╭─ harpy snp mpileup ──────────────────────────────────────────────────────────╮
│ The workflow has terminated due to an error. See the log file below for more │
│ details. │
╰──────────────────────────────────────────────────────────────────────────────╯
snps is gonna take a while and I need to change computer. Will update this comment once the snp module of harpydev has run, and if possible I will test impute.
Thanks!
Can you try deleting the .harpy_envs folder and rerunning whichever modules are giving library errors? Harpy won't overwrite that folder, but I think maybe I should change that behavior. That should fix the highcharter issue.
The dev branch is a combination of an emotional rollercoaster and the wild west right now. Not all parts of it work yet.
@gmkov The current version of dev now:
- overwrites the .harpy_envs contents on a module run
- should address the preflight issue
- dependencies in .harpy_envs are now explicitly listed with channel::package, e.g. bioconda::pysam
You're welcome to try the changes by doing:
> git pull
# make sure you are in the harpytest environment
> bash resources/buildlocal.sh
FWIW, I never configure the conda strict thing and just ignore the messages about it.
Also, the workflow/module should index the input BAM files if indices aren't present.
Hello
OK, so I've done the following to carry on testing harpydev.
Note: I first tried the steps below while the channel priority was set to STRICT, and when running preflight it gave errors saying it couldn't find packages or create environments (e.g. "package r-flexdashboard-0.5.1.1-r341h6115d3f_0 requires r-htmlwidgets >=0.6, but none of the providers can be installed" and "nothing provides r 3.2.2* needed by r-magrittr-1.5-r3.2.2_0"). So I changed the channel priority back to flexible:
# unstrict
conda config --set channel_priority flexible
# removed environments to start fresh
rm -R .harpy_envs
# pull updates from harpydev
git pull
# cd to ~/rds/hpc-work/home/bin/harpy
bash resources/buildlocal.sh
# manually uninstalled flexdashboard from R to test
#(harpydev) [mgm49@login-n-1 harpy-dev]$ which harpy
#~/rds/hpc-work/home/miniconda3/envs/harpydev/bin/harpy
#### preflight checks
harpy preflight bam -s "--conda-frontend conda" bams
It is stuck at the "Downloading and installing remote packages." step. I checked .harpy_envs and the last file was updated 6 minutes ago, so I think it's a dead end. I'm gonna leave it running in a screen and will update later.
thank you
Eventually it ran and showed a couple of errors:
[W::hts_idx_load3] The index file is older than the data file: Preflight/bam/workflow/input/CAM046075.bam.bai
- I don't think this affects the run
and eventually it hit the flexdashboard issue again:
Error in loadNamespace(name) : there is no package called ‘flexdashboard’
Calls: <Anonymous> ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted
Sometimes on HPCs it takes unusually long to do the internal conda things. I'm not sure why that is; it might have to do with how many files are being changed at once and how the RAID array has to manage that over however many drives.
The R package situation is annoying, to say the least. I'm sorry about that, and I'm still not sure why it's lying to you about package absence. I'm not crazy about this solution, but I'll write some require statements in the Rmd files to get them installed one way or another. You can opt to --skip-reports in the meantime if your focus is more on the processing and less on the reports for the other workflows.
@gmkov I added a section to the R code in dev to install any missing packages. I don't love this solution, but it might help your edge case. All the testing on GitHub Actions, which uses fresh environments each time without any preexisting programs, works just fine, so I'm convinced it's something specific about your system, but I can't quite figure out what it might be. Hopefully the new code helps your case; please let me know how that plays out. You won't need to delete .harpy_envs anymore. It should just be git pull, then bash resources/buildlocal.sh, then harpy ....
@gmkov as a heads up, the most recent commit to dev removes the short flag -s for --snakemake, for two reasons:
1. --snakemake is an option that, if used, should be very unambiguous, much like --skipreports
2. -s is too precious a letter to reserve for --snakemake
Hi
Re: R packages. Well, the weird thing is that the fresh harpy bioconda installation is NOT throwing any errors around libraries, whereas harpy dev is :/ Sadly, I pulled the latest changes and harpydev still can't find flexdashboard. Now running imputation with harpydev:
# pull updates from harpydev
git pull
# cd to ~/rds/hpc-work/home/bin/harpy
bash resources/buildlocal.sh
# manually uninstalled flexdashboard from R to test
#### preflight checks
rm -R Preflight
harpy preflight bam --snakemake "--conda-frontend conda" bams
# faster module loading
# but STILL no flexdashboard :/
# use --skip-reports next
#### jump ahead to call variants with mpileup
# already done previously
#### manually curate vcf
# No duplicate positions
# get positions
bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\n" SNP/mpileup/variants.raw.bcf > SNP/mpileup/variants.raw.bcf.pos
# Keep only biallelic positions. (Delete the allele with lower frequency)
# goes down from 430M to 427M
sed -i 's/,.//g' SNP/mpileup/variants.raw.bcf.pos
# No duplicate sample names - check - nope
bcftools query -l SNP/mpileup/variants.raw.bcf
#### run imputation
# add positions file with only biallelic sites
nice harpy impute --threads 20 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
-x '--posfile SNP/mpileup/variants.raw.bcf.pos' \
--skipreports --snakemake "--conda-frontend conda" \
bams > log.impute &
harpydev imputation threw an error about indexes:
[E::idx_find_and_load] Could not retrieve index file for 'Impute/workflow/input/vcf/input.sorted.bcf'
Failed to read from Impute/workflow/input/vcf/input.sorted.bcf: could not load index
Failed to read from standard input: unknown file type
And now I realise that my harpy impute command was incorrect (I was adding the positions file as an extra parameter, when this is the MAIN file for stitch, and one that presumably harpy produces internally). harpy wants bcf/vcf input even though stitch only requires a positions file. I assume this is because harpy wants to curate the vcf to an extent (biallelic SNPs, and sorting the input VCF file) but not to check for duplicate positions/sample names. So instead now:
# No duplicate positions - check - nope
bcftools norm --rm-dup all variants.raw.bcf -o variants.raw.nodups.bcf
# Lines total/split/joined/realigned/skipped: 29222149/0/0/0/0
# No duplicate sample names - check - nope
bcftools query -l SNP/mpileup/variants.raw.bcf
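The duplicate-sample check above can be made to print the offenders directly rather than listing every name for eyeballing. A small sketch, with an invented samples.txt standing in for the output of `bcftools query -l`:

```shell
# Invented sample list; CAM046071 appears twice on purpose
printf 'CAM046070\nCAM046071\nCAM046071\nCAM046072\n' > samples.txt

# sort so identical names are adjacent, then print each name that
# occurs more than once (empty output means no duplicates)
sort samples.txt | uniq -d
# prints: CAM046071
```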
#### run imputation
# run imputation with raw bcf
nice harpy impute --threads 20 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
--skipreports --snakemake "--conda-frontend conda" \
bams > log.impute &
But I am still getting this issue:
[E::idx_find_and_load] Could not retrieve index file for 'Impute/workflow/input/vcf/input.sorted.bcf'
Failed to read from Impute/workflow/input/vcf/input.sorted.bcf: could not load index
Failed to read from standard input: unknown file type
What could be going on? I'm going to try other modules with the fresh bioconda harpy install.
I can also see that the harpydev Impute/workflow dir structure looks different from the harpy Impute/workflow one.
harpydev:
Impute/
└── workflow
├── config.yml
├── impute.smk
├── input
│ ├── alignments
│ │ ├── CAM046070.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046070.bam
│ │ ├── CAM046070.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046070.bam.bai
│ │ ├── CAM046071.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046071.bam
│ │ ├── CAM046071.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046071.bam.bai
│ │ ├── CAM046072.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046072.bam
│ │ ├── CAM046072.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046072.bam.bai
│ │ ├── CAM046074.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046074.bam
│ │ ├── CAM046074.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046074.bam.bai
│ │ ├── CAM046075.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046075.bam
│ │ ├── CAM046075.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046075.bam.bai
│ │ ├── CAM046076.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046076.bam
│ │ └── CAM046076.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046076.bam.bai
│ ├── samples.list
│ └── vcf
│ ├── input.sorted.bcf
│ └── input.sorted.log
├── report
│ ├── Impute.Rmd
│ └── StitchCollate.Rmd
├── scripts
│ └── stitch_impute.R
└── variants.raw.bcf.biallelic
harpy (could that difference be the issue?):
Impute/
└── workflow
├── impute.smk
├── input
│ ├── CAM046070.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046070.bam
│ ├── CAM046070.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046070.bam.bai
│ ├── CAM046071.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046071.bam
│ ├── CAM046071.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046071.bam.bai
│ ├── CAM046072.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046072.bam
│ ├── CAM046072.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046072.bam.bai
│ ├── CAM046074.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046074.bam
│ ├── CAM046074.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046074.bam.bai
│ ├── CAM046075.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046075.bam
│ ├── CAM046075.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046075.bam.bai
│ ├── CAM046076.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046076.bam
│ └── CAM046076.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046076.bam.bai
├── report
└── stitch_impute.R
3 directories, 14 files
There are a few things that might be going on here. Also, is the bcf file small enough that I could download it and do some testing on my end (assuming you have permission to share it)?
The dev version of the documentation, which doesn't get published yet, clarifies that. I recognize that isn't super helpful now, but it will hopefully prevent that kind of misunderstanding in the future.
The impute workflow directory structure is indeed different in the dev version. It's more consistent with other workflows and a bit cleaner than it was. To avoid unexpected errors, you'll want to delete the Impute directory and run the workflow again, or set -o some_new_folder.
The requires logic in the Rmarkdown files isn't playing nice with your system. The setup now installs the R packages into that harpy sub-environment (as defined by .harpy_envs/r-env.yaml), which has been, up until this point, foolproof, and additionally has the R install inside that environment manually install the packages in the unlikely event they are absent. It might be worth emailing one of your system admins to ask them what might be up, because that is truly peculiar behavior.
Regarding bcftools norm --rm-dup all variants.raw.bcf -o variants.raw.nodups.bcf: make sure you specify -Ob to force output to BCF format. It feels like overkill, but I sometimes run into an issue (here and elsewhere) where my output file ends in .bcf but is in fact an uncompressed VCF file, which throws everything into chaos.
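One hedged way to catch the ".bcf that is actually plain VCF" situation: BCF (like vcf.gz) is BGZF-compressed, so a real one starts with the gzip magic bytes 1f 8b, while a plain-text VCF starts with "##". A sketch using a deliberately fake file (fake.bcf is invented for illustration):

```shell
# A plain-text file masquerading as BCF, to illustrate the failure mode
printf '##fileformat=VCFv4.2\n' > fake.bcf

# Real BCF/vcf.gz files begin with the gzip magic bytes 1f 8b;
# read the first two bytes as hex and compare
magic=$(head -c 2 fake.bcf | od -An -tx1 | tr -d ' ')
if [ "$magic" = "1f8b" ]; then
  echo "fake.bcf: BGZF-compressed (plausibly a real BCF/vcf.gz)"
else
  echo "fake.bcf: plain text, not BCF"
fi
# prints: fake.bcf: plain text, not BCF
```

If htslib's htsfile utility happens to be available in the environment, it reports the detected format of a file directly, which serves the same purpose.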
@gmkov I think I found why the .csi file was missing. Can you please try pulling the dev branch and running that version of harpy? Again, for simplicity:
git pull
bash resources/buildlocal.sh
nice harpy impute --threads 20 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
--skipreports --snakemake "--conda-frontend conda" \
bams > log.impute &
hi,
thanks for explaining how it works internally. I had actually understood from the current version of the documentation that those two steps were done by harpy, and that I had to do the other two. However, no duplicates were found, so I was still using the raw bcf.
Running impute now after pulling the harpydev updates: IT'S RUNNING!!! And as I was writing this, it crashed:
RuleException:
CalledProcessError in file /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Impute/workflow/impute.smk, line 129:
Command 'source /home/mgm49/rds/hpc-work/home/miniconda3/envs/harpydev/bin/activate '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/247c51c9e7b19ce3bdff0f0b16c2ebc3_'; set -euo pipefail; Rscript --vanilla /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmp2nfwa97x.stitch_impute.R' returned non-zero exit status 1.
......
Exiting because a job execution failed. Look above for error message
if i look in the stitch log
tail Impute/modeldiploid_usebxTrue_bxlimit50000_k40_s1_ngen500/contigs/17/17.log
Error in check_mclapply_OK(single_iteration_results) :
An error occured during STITCH. The first such error is above
Calls: do.call ... <Anonymous> -> completeSampleIteration -> check_mclapply_OK
In addition: Warning messages:
1: In mclapply(sampleRanges, mc.cores = nCores, FUN = subset_of_complete_iteration, :
scheduled core 4 did not deliver a result, all values of the job will be affected
2: In mclapply(sampleRanges, mc.cores = nCores, FUN = subset_of_complete_iteration, :
scheduled core 3 encountered error in user code, all values of the job will be affected
Execution halted
It was using a TON of memory, so maybe it just maxed out and crashed? Going to try with fewer threads.
PS: thank you for all your help. I'm super excited about your pipeline and really hope we can get it to work, as it'd be super useful for me and others (feels like we are close).
I'm really glad we are moving the needle on this 🙂. Thanks for sticking this out with me, it's resulted in some very good software improvements. Also, thank you for your kind words of support.
Looking at the STITCH docs, there are options to only look at specific regions on a chromosome. While adopting that setting would likely reduce memory consumption, it would likely hamper the size of the haplotypes STITCH builds internally and may reduce the quality of the results. I'll reach out to Robbie about it and tag you in it so we can both get some insight about it.
Another option might be to set this via extra-params; from the STITCH docs:
inputBundleBlockSize How many sample input files to bundle together to reduce number of temporary files. Default NA or not used. Recommended to set to 100 or greater when using large sample sizes (> ~5000)
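If harpy's -x flag forwards extra arguments to STITCH the same way the earlier --posfile attempt did (an assumption on my part, not something the maintainer confirms here), the suggestion might look like:

```shell
# Hypothetical: pass inputBundleBlockSize through harpy's extra-params flag
nice harpy impute --threads 20 --parameters stitch.params \
    --vcf SNP/mpileup/variants.raw.bcf \
    -x '--inputBundleBlockSize 100' \
    --skipreports --snakemake "--conda-frontend conda" \
    bams > log.impute &
```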
Hi
I reduced the threads to 4 and it ran fine for a bit.
so my command was: nice harpy impute --threads 4 --parameters stitch.params --vcf SNP/mpileup/variants.raw.bcf --skipreports --snakemake "--conda-frontend conda" bams > log.impute &
error was:
Error in loadNamespace(name) : there is no package called ‘flexdashboard’
Calls: <Anonymous> ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted
RuleException:
CalledProcessError in file /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Impute/workflow/impute.smk, line 155:
Command 'source /home/mgm49/rds/hpc-work/home/miniconda3/envs/harpydev/bin/activate '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/247c51c9e7b19ce3bdff0f0b16c2ebc3_'; set -euo pipefail; Rscript --vanilla -e 'rmarkdown::render("/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmpfwcj6w4t.StitchCollate.Rmd", output_file="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Impute/modeldiploid_usebxFalse_bxlimit50000_k30_s1_ngen500/contigs/20/20.STITCH.html", quiet=TRUE, knit_root_dir = "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev", params = list(rmd="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmpfwcj6w4t.StitchCollate.Rmd"))'' returned non-zero exit status 1.
[Wed May 1 12:48:50 2024]
Error in rule collate_stitch_reports:
jobid: 269
input: Impute/modeldiploid_usebxFalse_bxlimit50000_k30_s1_ngen500/contigs/20/20.stats
output: Impute/modeldiploid_usebxFalse_bxlimit50000_k30_s1_ngen500/contigs/20/20.STITCH.html
conda-env: /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/247c51c9e7b19ce3bdff0f0b16c2ebc3_
output:
Impute/
├── modeldiploid_usebxFalse_bxlimit50000_k30_s1_ngen500
│ └── contigs
│ └── 20
│ ├── 20.log
│ ├── 20.stats
│ ├── 20.vcf.gz
│ ├── 20.vcf.gz.tbi
│ ├── input
│ │ ├── sample.1.input.20.RData
│ │ ├── sample.2.input.20.RData
│ │ ├── sample.3.input.20.RData
│ │ ├── sample.4.input.20.RData
│ │ ├── sample.5.input.20.RData
│ │ └── sample.6.input.20.RData
│ ├── plots
│ │ ├── alphaMat.20.all.s.1.png
│ │ ├── alphaMat.20.normalized.s.1.png
│ │ ├── hapSum.20.s.1.png
│ │ ├── hapSum_log.20.s.1.png
│ │ ├── metricsForPostImputationQC.20.sample.jpg
│ │ ├── metricsForPostImputationQCChromosomeWide.20.sample.jpg
│ │ └── r2.20.goodonly.jpg
│ ├── RData
│ │ ├── EM.all.20.RData
│ │ ├── end.20.RData
│ │ ├── endEM.20.RData
│ │ ├── sampleNames.20.RData
│ │ ├── start.20.RData
│ │ └── startEM.20.RData
│ └── tmp
└── workflow
├── config.yml
├── impute.smk
├── input
│ ├── alignments
│ │ ├── CAM046070.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046070.bam
│ │ ├── CAM046070.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046070.bam.bai
│ │ ├── CAM046071.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046071.bam
│ │ ├── CAM046071.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046071.bam.bai
│ │ ├── CAM046072.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046072.bam
│ │ ├── CAM046072.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046072.bam.bai
│ │ ├── CAM046074.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046074.bam
│ │ ├── CAM046074.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046074.bam.bai
│ │ ├── CAM046075.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046075.bam
│ │ ├── CAM046075.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046075.bam.bai
│ │ ├── CAM046076.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046076.bam
│ │ └── CAM046076.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046076.bam.bai
│ ├── samples.list
│ ├── stitch
│ │ ├── 13.stitch
│ │ ├── 14.stitch
│ │ ├── 15.stitch
│ │ ├── 20.stitch
│ │ ├── 28.stitch
│ │ ├── 30.stitch
│ │ └── 5.stitch
│ └── vcf
│ ├── input.sorted.bcf
│ ├── input.sorted.bcf.csi
│ └── input.sorted.log
├── report
│ ├── Impute.Rmd
│ └── StitchCollate.Rmd
├── scripts
│ └── stitch_impute.R
└── variants.raw.bcf.biallelic
14 directories, 52 files
So it seems that the skipreports option is not working, and because it can't put together the reports it leaves all the plots behind. I'm going to manually install flexdashboard and rerun. I don't think it has run fully; weirdly, it seems to have started on chromosome 20 (this might just be chance due to snakemake) and with the second model I specified in the stitch.params table:
(harpydev) [mgm49@login-n-1 harpy-dev]$ less stitch.params
model usebx bxlimit k s ngen
diploid TRUE 50000 30 1 500
diploid FALSE 50000 30 1 500
diploid TRUE 50000 40 1 500
diploid FALSE 50000 40 1 500
diploid FALSE 50000 30 2 500
At least it seems to have produced an imputed vcf for contig 20 :)
@gmkov before I post anything on the STITCH GH, does this help at all? https://github.com/rwdavies/STITCH/issues/56#issuecomment-949375537
Snakemake, as you've said, will randomly choose things to run. So, there won't ever be a sensible run order.
Interesting about collate_reports not working. That rule doesn't so much create reports by analyzing data; it's more just stitching the graphics STITCH produces into a single document to make life easier, and deleting all the other stuff to remove clutter. That's why I didn't configure it to be disabled by --skipreports. I can configure the workflow to skip that with --skipreports too.
@gmkov the dev branch now skips collating STITCH metrics into a report with --skipreports.
Good morning from Cambridge.
Ok, so I pulled the harpydev git last night, reinstalled, and first ran impute without skipping reports after having manually installed flexdashboard (just in case), and a new error came up:
Quitting from lines 80-98 [setup environment] (tmpsegmxwof.StitchCollate.Rmd)
Error in `contrib.url()`:
! trying to use CRAN without setting a mirror
Backtrace:
1. global using(...)
2. utils::install.packages(need)
4. utils::contrib.url(repos, type)
Execution halted
But then I ran impute with 4 threads only, k=3 or 4 in the stitch parameters, and --skipreports, and voila!!! It's running! It's nearly done (it ran on most contigs with all four models I set in the stitch parameters). This is wonderful.
1) Reports: I would love to be able to assemble the reports, as it is a really useful feature for parameter tuning. I have all the plots, but of course one plot per chromosome x 4 models is quite hard on the eye. Do you have any ideas on how to fix the above?
2) Memory: 4 models with low k=3/4, 4 threads and 6 samples (x 31 chromosomes) took about 12h. It could be because the program 'is struggling' to do its job; with only 6 individuals it must be hard to impute! I have 60 individuals, so I will test whether increasing the number from 6 to the whole dataset actually makes it faster. I'd also like to increase k, because this is a wild population with huge effective population sizes. I'll test and let you know :D
This is big, we've made progress! I know how to fix the R issue and will let you know when it's ready. That'll be in a few hours because it's only 5 a.m. here and I happened to check my phone when I woke up to get some water. I think you can bump up the thread count, though.
It's so strange that the other R packages are installed (clearly STITCH works), but flexdashboard won't be.
@gmkov you can try the dev branch now; I hardcoded a repository for the package installs.
Testing with preflight bams. Silly question: can I run harpydev simultaneously on the same conda env for different jobs, in different directories? Or will snakemake get confused... thanks!
Harpy passes the --nolock option to the internal snakemake call, making it possible to run multiple Harpy things simultaneously in your project directory. As an example, I'm currently demultiplexing 4 different library pools by running harpy demultiplex 4x out of the same directory. Just make sure that if you're running the same module, you set different output directories for concurrent runs with -o dirname.
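Concretely, concurrent runs of the same module might look like the following (file and directory names are invented; this assumes -o behaves as described above):

```shell
# Hypothetical: two imputation runs on different sample sets, side by side,
# each with its own output directory so they don't collide
harpy impute --parameters stitch.params --vcf pop1.bcf -o Impute_pop1 bams_pop1 > log.pop1 &
harpy impute --parameters stitch.params --vcf pop2.bcf -o Impute_pop2 bams_pop2 > log.pop2 &
```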
Ok great, good to know -o can help with concurrent runs (I want to divide some steps by population, and it might help when running analyses by chromosome on the cluster for speed). And running harpy in different project directories shouldn't be an issue either, right, because .snakemake is local to the project directory? I think the reason I just had issues is that I pulled the harpydev version while some harpydev jobs were running; I knew this was potentially dangerous.
Bad news: harpydev preflight bams ran perfectly until the end...
Error in loadNamespace(name) : there is no package called ‘flexdashboard’
Calls: <Anonymous> ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted
If I install it manually it's fine. Sorry this is so complicated!
And I tested preflight fastq after having manually installed flexdashboard and got this, in case it's helpful (some issue with libxml, which I also had when trying to install highcharter/XML manually):
trying URL 'https://cloud.r-project.org/src/contrib/XML_3.99-0.16.1.tar.gz'
Content type 'application/x-gzip' length 971985 bytes (949 KB)
==================================================
downloaded 949 KB
trying URL 'https://cloud.r-project.org/src/contrib/rlist_0.4.6.2.tar.gz'
Content type 'application/x-gzip' length 72870 bytes (71 KB)
==================================================
downloaded 71 KB
trying URL 'https://cloud.r-project.org/src/contrib/highcharter_0.9.4.tar.gz'
Content type 'application/x-gzip' length 1073325 bytes (1.0 MB)
==================================================
downloaded 1.0 MB
* installing *source* package ‘XML’ ...
** package ‘XML’ successfully unpacked and MD5 sums checked
** using staged installation
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether the compiler supports GNU C... yes
checking whether gcc accepts -g... yes
checking for gcc option to enable C11 features... none needed
checking how to run the C preprocessor... gcc -E
checking for sed... /usr/bin/sed
checking for pkg-config... /usr/bin/pkg-config
checking for xml2-config... /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/bin/xml2-config
USE_XML2 = yes
SED_EXTENDED_ARG: -E
Minor 12, Patch 6 for 2.12.6
Located parser file -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/include/libxml2 -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/include/parser.h
Checking for 1.8: -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/include/libxml2 -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/include
Using libxml2.*
checking for gzopen in -lz... yes
checking for xmlParseFile in -lxml2... no
checking for xmlParseFile in -lxml... no
configure: error: "libxml not found"
ERROR: configuration failed for package ‘XML’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/XML’
ERROR: dependency ‘XML’ is not available for package ‘rlist’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/rlist’
ERROR: dependency ‘rlist’ is not available for package ‘highcharter’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/highcharter’
The downloaded source packages are in
'/tmp/RtmpoJCGsN/downloaded_packages'
Quitting from lines 58-87 [load environment] (tmp1bnmdn9s.PreflightFastq.Rmd)
Error in `library()`:
! there is no package called 'highcharter'
Backtrace:
1. base::library(highcharter)
Execution halted
I added libxml2 to the r-env metadata. Hopefully conda will install that and make the problems go away?
FWIW, a long-term solution will eventually be to containerize the environments and have the software distribution method be Singularity/Docker.
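For reference, the updated .harpy_envs/r-env.yaml would presumably look something like the dependency list quoted earlier in this thread, with libxml2 added (the exact contents are the maintainer's to confirm; this is only a sketch):

```yaml
name: r-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - libxml2
  - bioconductor-complexheatmap
  - r-circlize
  - r-dt
  - r-flexdashboard
  - r-ggplot2
  - r-ggridges
  - r-plotly
```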
Argh, sorry, same error! I've pulled the new changes. After preflight fastq:
(harpydev) [mgm49@login-n-1 47.all.brazil.haplotagging.n93]$
project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/6bc0ac5f6a7d973c423a2cee60cee52a_/include/parser.h
Checking for 1.8: -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/6bc0ac5f6a7d973c423a2cee60cee52a_/include/libxml2 -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/6bc0ac5f6a7d973c423a2cee60cee52a_/include
Using libxml2.*
checking for gzopen in -lz... yes
checking for xmlParseFile in -lxml2... no
checking for xmlParseFile in -lxml... no
configure: error: "libxml not found"
ERROR: configuration failed for package ‘XML’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/XML’
ERROR: dependency ‘XML’ is not available for package ‘rlist’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/rlist’
ERROR: dependency ‘rlist’ is not available for package ‘highcharter’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/highcharter’
The downloaded source packages are in
'/tmp/Rtmp55Isqu/downloaded_packages'
Quitting from lines 58-87 [load environment] (tmpw8s3x_7g.PreflightFastq.Rmd)
Error in `library()`:
! there is no package called 'highcharter'
Backtrace:
1. base::library(highcharter)
Execution halted
I swapped the xml dependency as described here, from conda-forge::libxml2 to r-xml. Does that improve things?
Hi
New error when running preflight bams, slightly different:
Could not create conda environment from /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.harpy_envs/r-env.yaml:
Command:
conda env create --quiet --no-default-packages --file "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_.yaml" --prefix "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_"
Output:
Channels:
- bioconda
- conda-forge
- defaults
- r
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
ERROR conda.core.link:_execute(949): An error occurred while installing package 'conda-forge::r-rcpp-1.0.12-r42h7df8631_0'.
Rolling back transaction: ...working... done
[Errno 2] No such file or directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_/lib/R/library/Rcpp/include/Rcpp/platform/compiler.h'
When running align bwa, it runs well (produces bams) but also struggles to make the reports:
Activating conda environment: .snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_
EnvironmentLocationNotFound: Not a conda environment: /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_
Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).
Execution halted
RuleException:
Bit weird, will try again.
I'm at a loss with this. I've begun trying to figure out how to achieve this using apptainer, but it might be a while because my initial attempts are not successful.
Describe the bug
Thank you for this great resource. I realise it is still under development, so I'm unsure whether to expect modules to run smoothly.
After some initial teething issues with sample names in my own vcf file (created with bcftools mpileup outside harpy), I ran into a cryptic conda issue. So I went a step back and used the harpy snp mpileup module to obtain a bcf, which finished with an error (I think it was just unable to compile the bcf html report), but the files within SNP/mpileup/ looked ok. So then I tried imputation with harpy impute, using this file as input.
My command is:
My stitch param file is:
This produces the same cryptic conda error as before, which I paste below.
I also ran
harpy preflight bam bams
and snakemake built the DAG correctly, seemed to create/activate the conda environments fine, and ran correctly (after I manually ran install.packages("flexdashboard"); that was the error initially). So the preflight checks with the bam files were ok.
Harpy Version
0.9.1
File that triggers the error (if applicable)
No response
Harpy error log