pdimens / harpy

Process raw haplotagging data, from raw sequences to phased haplotypes, batteries included.
https://pdimens.github.io/harpy
GNU General Public License v3.0

harpy impute error IsADirectoryError: #68

Open gmkov opened 5 months ago

gmkov commented 5 months ago

Describe the bug

Thank you for this great resource. I realise it is still under development, so I'm unsure whether to expect modules to run smoothly.

After some initial teething issues with sample names in my own vcf file (created with bcftools mpileup outside harpy), I ran into a cryptic conda issue. So I went a step back and used the harpy snp mpileup module to obtain a bcf, which finished with an error (I think it was just unable to compile the bcf html report), but the files within SNP/mpileup/ looked ok. So then I tried imputation with harpy impute, using this file as input.

my command is:

harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute

my stitch param file is:

model   usebx   bxlimit k       s       ngen
diploid TRUE    50000   30      1       500
diploid FALSE   50000   30      1       500
diploid TRUE    50000   40      1       500
diploid FALSE   50000   40      1       500

This produces the same cryptic conda error as before, which I paste below.

I also ran harpy preflight bam bams; the whole snakemake run built a DAG correctly, seemed to create/activate conda environments fine, and completed correctly (after I manually ran install.packages("flexdashboard"), which was the initial error). So the preflight checks with the bam files were ok.

Harpy Version

0.9.1

File that triggers the error (if applicable)

No response

Harpy error log

(harpy) [mgm49@login-n-1 harpy]$ harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute

(harpy) [mgm49@login-n-1 harpy]$ less log.impute 
Traceback (most recent call last):
  File "/home/mgm49/miniconda3/envs/harpy/bin/harpy", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/__main__.py", line 383, in main
    cli()
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/impute.py", line 66, in impute
    fetch_file(f"{i}.Rmd", f"{workflowdir}/report/")
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/site-packages/harpy/helperfunctions.py", line 124, in fetch_file
    shutil.copy2(result, destination)
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 475, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/mgm49/miniconda3/envs/harpy/lib/python3.12/shutil.py", line 260, in copyfile
    with open(src, 'rb') as fsrc:
         ^^^^^^^^^^^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/Impute'


pdimens commented 5 months ago

Thanks for writing. I'd like to diagnose this with you, and it seems to have two parts:

  1. r-flexdashboard is not being detected. This is really weird because the package is installed into the workflow's R environment.

I'm not sure why this would be happening, nor is it an issue anywhere in Harpy's test suite (example). I could try to add, as a failsafe, a required section to the top of the Rmd files that will install it under the hood, just in case, although that's a pretty hacky solution. All the logic currently suggests it works and I'm at a loss as to why your system is fussing with it. Of the environments snakemake creates in the .snakemake/ directory, can you check in the .snakemake/conda folder that the R environment (it will have some kind of checksum name) has r-flexdashboard listed in the deps? Or the contents of .harpy_envs/r-env.yml? (A sketch of these checks follows after this list.) Also, can you verify that you installed harpy using both bioconda and conda-forge channels via mamba/conda create -n envname -c bioconda -c conda-forge harpy?

  2. The requisite files are not being copied/linked into the {outputdir}/workflow/report/ directory. To diagnose that, can you please provide me with the basic directory structure you have? From the harpy call, I can tell you're not specifying an output directory, meaning the output should be Impute/. Therefore:
    1. what is the name of the current working directory from where you called harpy impute?
    2. what other directories are in that folder?
    3. does Impute/ already exist and what is inside of it?
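For reference, a hedged sketch of those checks, run from the project directory (the checksum-named env files are whatever snakemake generated on your system):

# look for r-flexdashboard in the env specs snakemake keeps alongside each environment
grep -i flexdashboard .snakemake/conda/*.yaml
# and in harpy's own environment definition
grep -i flexdashboard .harpy_envs/r-env.yml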

I'm sure we can get to the bottom of this!

gmkov commented 5 months ago

Hi- thank you very much for the fast reply!

### Problem 1 - flexdashboard

This arose while running the preflight bam check. I did solve it by installing flexdashboard manually within the conda environment, but as you say this might point towards installation issues.

- I've checked in R within the harpy env and none of the packages above are installed unless I had installed them previously. So yes, something went wrong during the installation of the R dependencies.

### Installation

I had a couple of issues installing mamba, but my successful installation commands were as follows (using conda 24.3.0, mamba 1.5.8):

conda install conda-forge::mamba
mamba create -n harpy -c bioconda -c conda-forge harpy
mamba activate harpy


During the first preflight checks I got this error: "Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities"

Which I fixed as advised:

less ~/.condarc
conda config --set channel_priority strict
conda update --all


### Problem 2 - directory error

I attach a tree of the current harpy dir; I'm testing with 6 bam files: [harpy.dir.tree.txt](https://github.com/pdimens/harpy/files/14947832/harpy.dir.tree.txt)

1. the directory from which I'm running harpy impute is called... harpy
2. see the attached tree
3. the dir Impute exists, but I rm -R it before each run (I know this seems a bit dodgy) and still get the error

Note that I first ran harpy impute in a subdirectory of harpy called harpy/stitch/, but I have since deleted that and run everything in the base harpy/, since it looked like that's how it prefers to work and where the SNP/ and bams/ are.

Could I clean up snakemake somehow to restart harpy impute from scratch? Thanks again!
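For what it's worth, one way to restart a module completely (a rough sketch; note that removing .snakemake/ also deletes the cached conda environments, which snakemake will rebuild on the next run) would be:

rm -rf Impute/ .snakemake/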
gmkov commented 5 months ago

As an additional test, I ran the module harpy phase with the following command:

nice harpy phase --threads 20 --vcf SNP/mpileup/variants.raw.bcf bams 2> log.phase &

And it exited with

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-04-11T162236.855891.snakemake.log
WorkflowError:
At least one job did not complete successfully.

Found an error in the snakelog in the linkFragments step:

Error in rule linkFragments:
    jobid: 34
    input: Phase/workflow/input/CAM046072.bam, Phase/input/CAM046072.het.bcf, Phase/extractHairs/CAM046072.unlinked.frags
    output: Phase/linkFragments/CAM046072.linked.frags
    log: Phase/linkFragments/logs/CAM046072.linked.log (check log file(s) for error details)
26c75f5fe6300b0faecde6366_
    shell:
ags --out Phase/linkFragments/CAM046072.linked.frags -d 100000 > Phase/linkFragments/logs/CAM046072.linked.log 2>&1
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

So I went to the linked log, and found that the issue is pysam!! An issue with dependencies again.

(harpy) [mgm49@login-n-1 harpy]$ less Phase/linkFragments/logs/CAM046072.linked.log
Traceback (most recent call last):
  File "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/.snakemake/conda/0dd50bd26c75f>
    import pysam
ImportError: No module named 'pysam'

I'd be happy to make a clean install of harpy if you could advise on how to do that without risking incompatibilities. Thanks again!

pdimens commented 5 months ago

I'm still thinking about the first issue, but I think the issue with pysam etc. is that I have a shebang at the top of the scripts that tells them to use the system's python interpreter, not the one in the conda env. I'm making those changes to the development version right now. I can follow up this afternoon on installing the dev version to test if removing that fixes it.
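To illustrate the suspected problem (a rough sketch; the exact script paths vary between runs):

head -1 .snakemake/scripts/*.py
# #!/usr/bin/python      <- hardcodes the system interpreter, which can't see the conda env's pysam
# #!/usr/bin/env python  <- resolves python from PATH, i.e. the currently active conda env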

pdimens commented 5 months ago

@gmkov there are some huge internal API changes in that branch I was working with. I'm trying to merge it into dev but it's failing several tests. Once I hammer out these issues (I'll ping you), you can install the dev branch locally and we can see what comes from it then. To install harpy locally:

# get the git repo
> git clone --depth 1 https://github.com/pdimens/harpy.git
> cd harpy

# switch to dev branch
> git checkout dev

# create conda/mamba env
> mamba env create --name harpydev --file resources/harpyenv.yaml
> mamba activate harpy

# install everything 
bash resources/buildlocal.sh

pdimens commented 5 months ago

@gmkov Your mamba installation was not done the recommended way. See the official mamba docs on how to install it correctly via miniforge or micromamba: https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html#fresh-install-recommended It could be that the conda install is the issue. You can also skip mamba altogether (remove it) and use the latest version of conda, which uses the libmamba solver.

Try starting from there and see if it still creates issues. I'm almost ready with the developmental version.

gmkov commented 5 months ago

Hi

Yes, I (and someone else) tried to install mamba that way and kept getting errors (sorry, I didn't keep track of those) - hence my attempt with conda-forge::mamba.

I have uninstalled the conda-forge mamba by running conda uninstall mamba, and it seems to have worked.

Then I updated conda; I now have conda 24.3.0 (although I think it was already up to date):

(base) [mgm49@login-n-1 vcf]$ conda update -n base conda
Channels:
 - conda-forge
 - bioconda
 - defaults
 - anaconda

Then ran this to install harpy with conda:

(base) [mgm49@login-n-1 vcf]$ conda create -n harpy-conda -c bioconda -c conda-forge harpy

And the following errors come up, something with libmamba:

(base) [mgm49@login-n-1 vcf]$ conda create -n harpy-conda -c bioconda -c conda-forge harpy
Channels:
 - bioconda
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: - warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
...
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY                                                                                                                            failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - package harpy-0.8.0-py312h9ee0642_0 requires bcftools 1.19.*, but none of the providers can be installed

Could not solve for environment specs
The following packages are incompatible
└─ harpy is not installable because there are no viable options
   ├─ harpy [0.1.1|0.2.0|0.3.0|0.4.0] would require
   │  └─ r-stitch, which requires
   │     └─ r-data.table >=1.11.8 , which conflicts with any installable versions previously reported;
   ├─ harpy [0.5.0|0.6.0|0.6.1|0.7.0|0.7.3] would require
   │  └─ pysam 0.22.* , which requires
   │     └─ libdeflate >=1.18,<1.19.0a0 , which conflicts with any installable versions previously reported;
   └─ harpy [0.8.0|0.9.1] would require
      └─ bcftools 1.19.*  but there are no viable options
         ├─ bcftools 1.19 would require
         │  └─ htslib >=1.19,<1.20.0a0  but there are no viable options
         │     ├─ htslib 1.19 would require
         │     │  └─ libdeflate >=1.18,<1.20.0a0 , which conflicts with any installable versions previously reported;
         │     └─ htslib 1.19.1 would require
         │        └─ libdeflate >=1.18,<1.19.0a0 , which conflicts with any installable versions previously reported;
         └─ bcftools 1.19 would require
            └─ htslib >=1.19.1,<1.20.0a0 , which cannot be installed (as previously explained).

I can try with harpy dev as explained above. Thank you.

pdimens commented 5 months ago

Woah, that's a new one. Can you undo the conda strict priority that you implemented before? Snakemake likes to complain about that but it honestly hasn't been an issue so I never actually set it to strict. Also, remove anaconda from the channels, I think.

gmkov commented 5 months ago

OK, changed the channel priority back again and updated all:

(base) [mgm49@login-n-1 home]$ conda config --set channel_priority flexible
(base) [mgm49@login-n-1 home]$ conda update --all
Retrieving notices: ...working... done
Channels:
 - conda-forge
 - bioconda
 - defaults
 - anaconda
Platform: linux-64

so now my condarc looks like this:

(base) [mgm49@login-n-1 home]$ less ~/.condarc
channels:
  - conda-forge
  - bioconda
  - defaults
channel_priority: flexible

Weirdly, the anaconda channel is NOT in ~/.condarc. Now I run the conda install again:

conda create -n harpy-conda -c bioconda -c conda-forge harpy

and things installed nicely!


TEST1: impute

Now I re-ran stitch in the new harpy-conda env:

nice harpy impute --threads 10 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
bams 2> log.impute &

but sadly still get "IsADirectoryError: [Errno 21] Is a directory:"


TEST2: preflight check

New problems:

(harpy-conda) [mgm49@login-n-1 harpy]$ harpy preflight bam bams

╭─ Harpy preflight bam ───────────────────────────────────────────────────╮
│ Samples: 6                                                              │
│ Output Directory: Preflight/bam/                                        │
╰─────────────────────────── Running Workflow ────────────────────────────╯
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
CreateCondaEnvironmentException:
The 'mamba' command is not available in the shell /usr/bin/bash that will be used by Snakemake. You have to ensure that it is in your PATH, e.g., first activating the conda base environment with `conda activate base`.The mamba package manager (https://github.com/mamba-org/mamba) is a fast and robust conda replacement. It is the recommended way of using Snakemake's conda integration. It can be installed with `conda install -n base -c conda-forge mamba`. If you still prefer to use conda, you can enforce that by setting `--conda-frontend conda`.

It's realised there is no mamba and is not happy about it. Could I try to install it in the way described in the error and reinstall? conda install -n base -c conda-forge mamba

pdimens commented 5 months ago

That's progress, of a sort! Try adding -s "--conda-frontend conda" to the harpy call. Let's see what happens.

nice harpy impute --threads 10 --parameters stitch.params  \ 
    -s "--conda-frontend conda"  \
    --vcf SNP/mpileup/variants.raw.bcf  \
    bams 2> log.impute &

gmkov commented 5 months ago

Hi, unfortunately the same IsADirectoryError appears, even though I delete the whole Impute/ dir before running harpy, just in case.

(harpy-conda) [mgm49@login-n-1 harpy]$
[1]+  Exit 1                  nice harpy impute --threads 10 --parameters stitch.params -s "--conda-frontend conda" --vcf SNP/mpileup/variants.raw.bcf bams 2> log.impute
(harpy-conda) [mgm49@login-n-1 harpy]$ less log.impute
Traceback (most recent call last):
  File "/home/mgm49/miniconda3/envs/harpy-conda/bin/harpy", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/harpy/__main__.py", line 383, in main
    cli()
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/harpy/impute.py", line 66, in impute
    fetch_file(f"{i}.Rmd", f"{workflowdir}/report/")
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/site-packages/harpy/helperfunctions.py", line 124, in fetch_file
    shutil.copy2(result, destination)
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/shutil.py", line 475, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/mgm49/miniconda3/envs/harpy-conda/lib/python3.12/shutil.py", line 260, in copyfile
    with open(src, 'rb') as fsrc:
         ^^^^^^^^^^^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/Impute'

But when running the preflight with the conda frontend, it runs fine but complains about channel priorities (which I reverted to flexible), and fails at the end because flexdashboard is not installed.


(harpy-conda) [mgm49@login-n-1 harpy]$ harpy preflight bam bams  -s "--conda-frontend conda"

╭─ Harpy preflight bam ───────────────────────────────────────────────────╮
│ Samples: 6                                                              │
│ Output Directory: Preflight/bam/                                        │
╰─────────────────────────── Running Workflow ────────────────────────────╯
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job             count
------------  -------
all                 1
checkBam            6
createReport        1
log_runtime         1
mergeChecks         1
total              10

preflight bams error:

Activating conda environment: .snakemake/conda/6f556b53477f5b6b15e08502da2b35c8_
Error in loadNamespace(name) : there is no package called ‘flexdashboard’
Calls: <Anonymous> ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted

Any other ideas? Thanks.

pdimens commented 5 months ago

This is so very confusing, honestly. For starters, and this may not be relevant, the bams folder (and any unnamed inputs) must be the last part of the command line call to harpy.

Second, I merged my branch into dev. Try to install the dev version of harpy into a fresh conda environment like I described above and see what happens. I'm almost positive the issue is with your conda installation but can't quite put my finger on where exactly. Hopefully the dev branch fixes the IsADirectoryError with the new package structure.

You can also try to install mamba into whatever conda env harpy is installed into and scrap using --conda-frontend.

I need to caveat that the dev branch is very much a developmental branch and there are things on it I broke (the snp modules) and things that don't work yet (simulate reads). The newer internal logic may overcome the Impute issue, though.

gmkov commented 5 months ago

Changing the preflight bams folder input to the end of the command did not fix the issue with flexdashboard not being installed. But if I install it manually, it works fine.

Sadly I can't seem to get onto the dev branch following the steps above:

[mgm49@login-n-1 programmes]$ git clone --depth 1 https://github.com/pdimens/harpy.git
[mgm49@login-n-1 programmes]$ cd harpy/
[mgm49@login-n-1 harpy]$ git checkout dev
error: pathspec 'dev' did not match any file(s) known to git

Anything I can change? Thank you.


EDIT 1

I removed the --depth 1 and it seems to have worked ??

[mgm49@login-n-1 programmes]$ git clone  https://github.com/pdimens/harpy.git
[mgm49@login-n-1 programmes]$ cd harpy/
[mgm49@login-n-1 harpy]$ git checkout dev
branch 'dev' set up to track 'origin/dev'.
Switched to a new branch 'dev'
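(For context, --depth 1 creates a shallow clone that only fetches the repository's default branch, so dev wasn't known to git locally. A shallow clone of just the dev branch should also work, assuming the branch exists on the remote:)

git clone --depth 1 --branch dev https://github.com/pdimens/harpy.git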

I had to change a few things from the commands you had sent to install dev, so I put them all below:

# get the git repo
git clone https://github.com/pdimens/harpy.git
cd harpy

# switch to dev branch
git checkout dev

# create conda/mamba env - I'm using conda; it's harpy.yaml (no "env"), and activate harpydev (not harpy)
conda env create --name harpydev --file resources/harpy.yaml

# it asked me to run conda init before conda activate, weird.
conda init

# it showed that conda init had changed my bashrc and added lines; had to source it
source ~/.bashrc

# now can activate env
conda activate harpydev

# install everything - Successfully installed harpy-0.10.0
bash resources/buildlocal.sh

Going to test imputation.

pdimens commented 5 months ago

Sorry about that, the --depth 1 wasn't necessary, as you discovered. I think it's very weird/suspicious that you had to conda init.

pdimens commented 5 months ago

@gmkov Please let me know if the IsADirectoryError persists.

gmkov commented 4 months ago

Hi, I'm back in action! Sorry, I had to change priorities. I also had to reinstall miniconda3 in a completely new location (and get rid of the old install), as our home directory on the cluster is only 50 GB and with all these environments I was quickly running out of quota. So now we have a fresh start to install harpy. I still don't want to use mamba, sorry; it messed things up last time and I'm scared.

Testing harpy (not dev)

#### installed fresh
conda create -n harpy -c bioconda -c conda-forge harpy

#### run preflight test. note that adding --conda-frontend conda without the -s didn't work.
harpy preflight bam -s "--conda-frontend conda" bams 

# success (no issues with flexdashboard, yay). only complains about conda channels, so changed
conda config --set channel_priority strict

#### call variants with mpileup, tell it use conda. 6 samples, 20 threads - took 1h 20min
# it needs bam.bai as well in the input directory
genome=/home/mgm49/rds/rds-cj107-jiggins-rds/genomes/Harm_2021_China/sequences.fa
nohup nice harpy snp mpileup -s "--conda-frontend conda" --threads 20 --genome $genome bams/ > log.snp &

#### manually curate vcf
# No duplicate positions
# get positions 
bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\n" variants.raw.bcf > variants.raw.bcf.pos

# Keep only biallelic positions (delete the allele with lower frequency; an alternative sketch follows below)
# goes down from 430M to 427M
sed -i 's/,.//g' variants.raw.bcf.pos

# No duplicate sample names - check - nope
bcftools query -l SNP/mpileup/variants.raw.bcf
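As an aside, bcftools can also subset to biallelic SNPs directly, which may be more robust than the sed edit on the positions file above:

# keep only biallelic SNPs, writing compressed BCF (-Ob)
bcftools view -m2 -M2 -v snps -Ob -o SNP/mpileup/variants.raw.biallelic.bcf SNP/mpileup/variants.raw.bcf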

#### run impute
# add positions file with only biallelic sites
nice harpy impute --threads 20 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
-x '--posfile SNP/mpileup/variants.raw.bcf.pos' \
-s "--conda-frontend conda" \
bams > log.impute &

# noooo: :( "IsADirectoryError: [Errno 21] Is a directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/Impute'"

Sadly hitting IsADirectoryError again :( but everything else ran well, so I'm hoping harpy dev works.

Testing harpy dev

### install harpydev
# get the git repo
git clone https://github.com/pdimens/harpy.git
cd harpy

# switch to dev branch
git checkout dev

# create conda/mamba env - I'm using conda; it's harpy.yaml (no "env"), and activate harpydev (not harpy)
conda env create --name harpydev --file resources/harpy.yaml

# it NO LONGER asks me to run conda init before conda activate- GOOD (conda behaving correctly now)

# now can activate env
conda activate harpydev

# install everything - Successfully installed harpy-0.10.0
bash resources/buildlocal.sh

#### preflight checks
harpy preflight bam -s "--conda-frontend conda" bams 

# "Error in loadNamespace(name) : there is no package called ‘flexdashboard’
# Calls: <Anonymous> ... loadNamespace → withRestarts → withOneRestart → doWithOneRestart
# Execution halted

# install flexdashboard manually in R....

# try preflight bams again.
# new error, harpydev struggling to create reports
Error in parse_block(g[-1], g[1], params.src, markdown_mode) : 
  Duplicate chunk label 'status_df', which has been used for the chunk:
data2 <- data %>% 
  mutate(
    nameMismatch = ifelse(nameMismatch > 0, "fail", "pass"),
    noMI = ifelse(noMI == alignments, "fail", "pass"),
    noBX = ifelse(noBX == alignments, "fail", "pass"),
    badBX = ifelse(badBX > 0, "fail", "pass"),
    bxNotLast = ifelse(bxNotLast > 0, "fail", "pass")
    ) %>% 
  select(-3) %>% 
  #rename("1" = "nameMismatch", "2" = "noMI", "3" = "noBX", "4" = "badBX", "5" = "bxNotLast") %>% 
  pivot_longer(-1, names_to = "category", values_to = "Status")
#data2$category <- as.integer(data2$category)
Calls: <Anonymous> ... process_file -> split_file -> lapply -> FUN -> parse_block
Execution halted
RuleException:
CalledProcessError in file /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Preflight/bam/workflow/preflight-bam.smk, line 90:
Command 'source /home/mgm49/rds/hpc-work/home/miniconda3/envs/harpydev/bin/activate '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/0362ca51d52d7128b23d37b26ad58301_'; set -euo pipefail;  Rscript --vanilla -e 'rmarkdown::render("/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmpmp793p2d.PreflightBam.Rmd", output_file="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Preflight/bam/filecheck.bam.html", quiet=TRUE, knit_root_dir = "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev", params = list(rmd="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmpmp793p2d.PreflightBam.Rmd"))'' returned non-zero exit status 1.

#### jump ahead to call variants with mpileup 
# it needs bam.bai as well in the input directory
genome=/home/mgm49/rds/rds-cj107-jiggins-rds/genomes/Harm_2021_China/sequences.fa
nohup nice harpy snp mpileup -s "--conda-frontend conda" --threads 20 --genome $genome bams/ > log.snp &

# error with some other library not installed
Quitting from lines 59-70 [load environment] (tmp0ss9qmfd.BcftoolsStats.Rmd)
Error in `library()`:
! there is no package called 'highcharter'
Backtrace:
 1. base::library(highcharter)
Execution halted
RuleException:
CalledProcessError in file /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/SNP/mpileup/workflow/snp-mpileup.smk, line 282:
Command 'source /home/mgm49/rds/hpc-work/home/miniconda3/envs/harpydev/bin/activate '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/0362ca51d52d7128b23d37b26ad58301_'; set -euo pipefail;  Rscript --vanilla -e 'rmarkdown::render("/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmp0ss9qmfd.BcftoolsStats.Rmd", output_file="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/SNP/mpileup/reports/variants.raw.html", quiet=TRUE, knit_root_dir = "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev", params = list(rmd="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmp0ss9qmfd.BcftoolsStats.Rmd"))'' returned non-zero exit status 1.
[Mon Apr 29 13:59:09 2024]
Error in rule bcf_report:
    jobid: 14254
    input: SNP/mpileup/reports/data/variants.raw.stats
    output: SNP/mpileup/reports/variants.raw.html
    conda-env: /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/0362ca51d52d7128b23d37b26ad58301_

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
╭─ harpy snp mpileup ──────────────────────────────────────────────────────────╮
│ The workflow has terminated due to an error. See the log file below for more │
│ details.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────╯

The snps run is gonna take a while and I need to change computer. Will update this comment once the snp module of harpydev has run, and if possible will test impute.

thanks

pdimens commented 4 months ago

  1. Try deleting your .harpy_envs folder and rerunning whichever modules are giving library errors. Harpy won't overwrite that folder, but I think maybe I should change that behavior. That should fix the highcharter issue.
  2. The dev branch is a combination of an emotional rollercoaster and the wild west right now. Not all parts of it work yet.
  3. I get the impression the conda strict thing is creating problems. I'm gonna try to force channels in the environment files to hopefully make it as unambiguous to the solver as possible.
  4. Thanks for your continued investment in getting this solved!
pdimens commented 4 months ago

@gmkov The current version of dev now:

  1. overwrites the .harpy_envs contents on a module run
  2. fixed the preflight issue
  3. the packages in .harpy_envs are now explicitly listed with channel::package e.g. bioconda::pysam

You're welcome to try the changes by doing:

> git pull

# make sure you are in the harpydev environment
> bash resources/buildlocal.sh

pdimens commented 4 months ago

FWIW, I never configure the conda --strict thing and just ignore the messages about that

Also, the workflow/module should index the input BAM files if indices aren't present.

gmkov commented 4 months ago

Hello

OK, so I've done the following to carry on testing harpydev.

Note: I first tried the steps below while the channel priority was set to STRICT, and when running preflight it gave errors saying it can't find packages or create environments (e.g. - package r-flexdashboard-0.5.1.1-r341h6115d3f_0 requires r-htmlwidgets >=0.6, but none of the providers can be installed - nothing provides r 3.2.2* needed by r-magrittr-1.5-r3.2.2_0). So I changed the channel priority back to flexible.

# unstrict
conda config --set channel_priority flexible

# removed environments to start fresh
rm -R .harpy_envs

# pull updates from harpydev
git pull

# cd to ~/rds/hpc-work/home/bin/harpy
bash resources/buildlocal.sh

# manually uninstalled flexdashboard from R to test

#(harpydev) [mgm49@login-n-1 harpy-dev]$ which harpy
#~/rds/hpc-work/home/miniconda3/envs/harpydev/bin/harpy

#### preflight checks
harpy preflight bam -s "--conda-frontend conda" bams 

It is stuck at the "Downloading and installing remote packages." step. I checked .harpy_envs and the last file was updated 6 minutes ago, so I think it's a dead end. I'm gonna leave it running in a screen and will update later.

thank you

gmkov commented 4 months ago

Eventually it ran and showed a couple of errors:

[W::hts_idx_load3] The index file is older than the data file: Preflight/bam/workflow/input/CAM046075.bam.bai - I don't think this affects the run
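(A hedged fix for that warning would be to regenerate the indices so they are newer than the BAMs, e.g.:)

for b in bams/*.bam; do samtools index "$b"; done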

and eventually hit the flexdashboard issue again

Error in loadNamespace(name) : there is no package called ‘flexdashboard’
Calls: <Anonymous> ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted
pdimens commented 4 months ago

Sometimes on HPCs it takes unusually long to do the internal conda things. I'm not sure why that is; it might have to do with how many files are being changed at once and how the RAID array has to manage that over however many drives.

pdimens commented 4 months ago

The R package situation is annoying to say the least. I'm sorry about that and I'm still not sure why it's lying to you about package absence. I'm not crazy about this solution, but I'll write some require statements in the Rmd files to get them installed one way or another. You can opt to --skip-reports in the meantime if your focus is more the processing and less the reports for the other workflows.

pdimens commented 4 months ago

@gmkov I added a section to the R code in dev to install any missing packages. I don't love this solution, but it might help your edge case. All the testing on GitHub Actions, which uses fresh environments each time without any preexisting programs, works just fine, so I'm convinced it's something specific about your system, but I can't quite figure out what it might be. Hopefully the new code helps your case; please let me know how that plays out. You won't need to delete .harpy_envs anymore. It should just be git pull, then bash resources/buildlocal.sh, then harpy ....

pdimens commented 4 months ago

@gmkov as a heads up, the most recent commit to dev removes the short-flag -s for --snakemake for two reasons:

gmkov commented 4 months ago

Hi

Re: R packages. Well, the weird thing is that the fresh harpy bioconda installation is NOT throwing any errors around libraries, whereas harpy dev is :/ Sadly I pulled the latest changes and harpydev still can't find flexdashboard. Now running imputation with harpydev:

# pull updates from harpydev
git pull

# cd to ~/rds/hpc-work/home/bin/harpy
bash resources/buildlocal.sh

# manually uninstalled flexdashboard from R to test

#### preflight checks
rm -R Preflight
harpy preflight bam --snakemake "--conda-frontend conda" bams

# faster module loading
# but STILL no flexdashboard :/

# use --skip-reports next

#### jump ahead to call variants with mpileup 
# already done previously

#### manually curate vcf
# No duplicate positions
# get positions 
bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\n" SNP/mpileup/variants.raw.bcf > SNP/mpileup/variants.raw.bcf.pos

# Keep only biallelic positions. (Delete the allele with lower frequency)
# goes down from 430M to  427M
sed -i 's/,.//g' SNP/mpileup/variants.raw.bcf.pos

# No duplicate sample names - check - nope
bcftools query -l SNP/mpileup/variants.raw.bcf

#### run imputation 
# add positions file with only biallelic sites
nice harpy impute --threads 20 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
-x '--posfile SNP/mpileup/variants.raw.bcf.pos' \
--skipreports --snakemake "--conda-frontend conda" \
bams > log.impute &

harpydev imputation threw an error about indexes:

[E::idx_find_and_load] Could not retrieve index file for 'Impute/workflow/input/vcf/input.sorted.bcf'
Failed to read from Impute/workflow/input/vcf/input.sorted.bcf: could not load index
Failed to read from standard input: unknown file type

And now I realised that my harpy impute command was incorrect (I was adding the positions file as an extra parameter, when this is the MAIN file for stitch, and one that presumably harpy produces internally). harpy wants bcf/vcf input even though stitch only requires a position file. I assume this is because harpy wants to curate the vcf to an extent (biallelic SNPs and sorting the input VCF file) but not to check for duplicate positions/sample names. So instead now:

# No duplicate positions - check - nope
bcftools norm --rm-dup all variants.raw.bcf -o variants.raw.nodups.bcf
# Lines total/split/joined/realigned/skipped: 29222149/0/0/0/0

# No duplicate sample names - check - nope
bcftools query -l SNP/mpileup/variants.raw.bcf

#### run imputation 
# run imputation with raw bcf
nice harpy impute --threads 20 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
--skipreports --snakemake "--conda-frontend conda" \
bams > log.impute &

But still getting this issue:

[E::idx_find_and_load] Could not retrieve index file for 'Impute/workflow/input/vcf/input.sorted.bcf'
Failed to read from Impute/workflow/input/vcf/input.sorted.bcf: could not load index
Failed to read from standard input: unknown file type
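(A hedged manual workaround might be to index the intermediate BCF yourself; bcftools index writes the .csi the workflow failed to find:)

bcftools index Impute/workflow/input/vcf/input.sorted.bcf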

What could be going on? I'm going to try other modules with the fresh bioconda harpy install.

gmkov commented 4 months ago

I can also see that the harpydev Impute/workflow dir structure looks different from the harpy Impute/workflow:

harpydev:

Impute/
└── workflow
    ├── config.yml
    ├── impute.smk
    ├── input
    │   ├── alignments
    │   │   ├── CAM046070.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046070.bam
    │   │   ├── CAM046070.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046070.bam.bai
    │   │   ├── CAM046071.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046071.bam
    │   │   ├── CAM046071.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046071.bam.bai
    │   │   ├── CAM046072.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046072.bam
    │   │   ├── CAM046072.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046072.bam.bai
    │   │   ├── CAM046074.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046074.bam
    │   │   ├── CAM046074.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046074.bam.bai
    │   │   ├── CAM046075.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046075.bam
    │   │   ├── CAM046075.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046075.bam.bai
    │   │   ├── CAM046076.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046076.bam
    │   │   └── CAM046076.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046076.bam.bai
    │   ├── samples.list
    │   └── vcf
    │       ├── input.sorted.bcf
    │       └── input.sorted.log
    ├── report
    │   ├── Impute.Rmd
    │   └── StitchCollate.Rmd
    ├── scripts
    │   └── stitch_impute.R
    └── variants.raw.bcf.biallelic

harpy:

Impute/
└── workflow
    ├── impute.smk
    ├── input
    │   ├── CAM046070.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046070.bam
    │   ├── CAM046070.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046070.bam.bai
    │   ├── CAM046071.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046071.bam
    │   ├── CAM046071.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046071.bam.bai
    │   ├── CAM046072.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046072.bam
    │   ├── CAM046072.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046072.bam.bai
    │   ├── CAM046074.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046074.bam
    │   ├── CAM046074.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046074.bam.bai
    │   ├── CAM046075.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046075.bam
    │   ├── CAM046075.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046075.bam.bai
    │   ├── CAM046076.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046076.bam
    │   └── CAM046076.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy/bams/CAM046076.bam.bai
    ├── report
    └── stitch_impute.R

3 directories, 14 files

Could that be the issue?
pdimens commented 4 months ago

There are a few things that might be going on here. Also, is the bcf file sufficiently small that I could download it and do some testing on my end (assuming you have permission to share it)?

  1. The curation of the bcf/vcf file only requires you to have a resulting vcf file, as you've surmised, not the kind of input STITCH expects. Harpy will internally take your input V/BCF and get things prepped for and ultimately run STITCH. It's a lot more convenient that way b/c it lets us parallelize over contigs and parameter-sets. The dev version of the documentation, which doesn't get published, clarifies that, which I recognize isn't super helpful now but will hopefully prevent that kind of misunderstanding in the future:

[screenshot of the dev documentation]

  2. The impute workflow directory structure is indeed different in the dev version. It's more consistent with other workflows and a bit cleaner than it was. To avoid unexpected errors, you'll want to delete the Impute directory and run the workflow again, or set -o some_new_folder.
  3. It's still very perplexing why the new requires logic in the Rmarkdown files isn't playing nice with your system. The setup now installs the R packages into that harpy sub-environment (as defined by .harpy_envs/r-env.yaml), which has been foolproof up until this point, and additionally has the R install inside that environment manually install the packages in the unlikely event they are absent. It might be worth emailing one of your system admins to ask what might be up, because that is truly peculiar behavior.
  4. In the event you are using the file from

    bcftools norm --rm-dup all variants.raw.bcf -o variants.raw.nodups.bcf

    make sure you specify -Ob to force output to BCF format. I feel like it's overkill, but I sometimes run into an issue (here and elsewhere) where my output file ends in .bcf but is in fact an uncompressed VCF file, which throws everything into chaos.
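For example, a hedged rewrite of the dedup command above with the output type made explicit:

# -Ob forces compressed BCF output; indexing afterwards also confirms the file type is valid
bcftools norm --rm-dup all -Ob -o variants.raw.nodups.bcf variants.raw.bcf
bcftools index variants.raw.nodups.bcf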

pdimens commented 4 months ago

@gmkov I think I found why the .csi file was missing. Could you please try pulling the dev branch and running that version of harpy? Again, for simplicity:

git pull

bash resources/buildlocal.sh

nice harpy impute --threads 20 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
--skipreports --snakemake "--conda-frontend conda" \
bams > log.impute &

gmkov commented 4 months ago

hi,

Thanks for explaining how it works internally. I had actually understood from the current version of the documentation that those two steps were done by harpy, and I had to do the other two. However, no duplicates were found, so I was still using the raw bcf.

Running impute now after pulling the harpydev updates - IT'S RUNNING!!! ...and as I was writing this, it crashed:

RuleException:
CalledProcessError in file /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Impute/workflow/impute.smk, line 129:
Command 'source /home/mgm49/rds/hpc-work/home/miniconda3/envs/harpydev/bin/activate '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/247c51c9e7b19ce3bdff0f0b16c2ebc3_'; set -euo pipefail;  Rscript --vanilla /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmp2nfwa97x.stitch_impute.R' returned non-zero exit status 1.

......
Exiting because a job execution failed. Look above for error message

If I look in the stitch log:

tail Impute/modeldiploid_usebxTrue_bxlimit50000_k40_s1_ngen500/contigs/17/17.log

Error in check_mclapply_OK(single_iteration_results) : 
  An error occured during STITCH. The first such error is above
Calls: do.call ... <Anonymous> -> completeSampleIteration -> check_mclapply_OK
In addition: Warning messages:
1: In mclapply(sampleRanges, mc.cores = nCores, FUN = subset_of_complete_iteration,  :
  scheduled core 4 did not deliver a result, all values of the job will be affected
2: In mclapply(sampleRanges, mc.cores = nCores, FUN = subset_of_complete_iteration,  :
  scheduled core 3 encountered error in user code, all values of the job will be affected
Execution halted

It was using a TON of memory, so maybe it just maxed out and crashed? Gonna try with fewer threads.

PS: thank you for all your help. I'm super excited about your pipeline and really hope we can get it to work, as it'd be super useful for me and others (feels like we are close).

pdimens commented 4 months ago

I'm really glad we are moving the needle on this 🙂. Thanks for sticking this out with me, it's resulted in some very good software improvements. Also, thank you for your kind words of support.

Looking at the STITCH docs, there are options to only look at specific regions on a chromosome. While adopting that setting would likely reduce memory consumption, it would also limit the size of the haplotypes STITCH builds internally and may reduce the quality of the results. I'll reach out to Robbie about it and tag you in it so we can both get some insight about it.

Another option might be to set this via extra-params; from the STITCH docs (a sketch of passing it through follows the quote):

inputBundleBlockSize How many sample input files to bundle together to reduce number of temporary files. Default NA or not used. Recommended to set to 100 or greater when using large sample sizes (> ~5000)
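If that's worth a try, a hedged sketch of how it might be passed through (assuming harpy's extra-params flag forwards options to STITCH the same way the --posfile example earlier in this thread did):

nice harpy impute --threads 4 --parameters stitch.params \
--vcf SNP/mpileup/variants.raw.bcf \
-x '--inputBundleBlockSize 100' \
--skipreports --snakemake "--conda-frontend conda" \
bams > log.impute &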

gmkov commented 4 months ago

Hi

I reduced the threads to 4 and it ran fine for a bit.

So my command was:

nice harpy impute --threads 4 --parameters stitch.params --vcf SNP/mpileup/variants.raw.bcf --skipreports --snakemake "--conda-frontend conda" bams > log.impute &

The error was:

Error in loadNamespace(name) : there is no package called ‘flexdashboard’
Calls: <Anonymous> ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted

RuleException:
CalledProcessError in file /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Impute/workflow/impute.smk, line 155:
Command 'source /home/mgm49/rds/hpc-work/home/miniconda3/envs/harpydev/bin/activate '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/247c51c9e7b19ce3bdff0f0b16c2ebc3_'; set -euo pipefail;  Rscript --vanilla -e 'rmarkdown::render("/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmpfwcj6w4t.StitchCollate.Rmd", output_file="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/Impute/modeldiploid_usebxFalse_bxlimit50000_k30_s1_ngen500/contigs/20/20.STITCH.html", quiet=TRUE, knit_root_dir = "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev", params = list(rmd="/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/scripts/tmpfwcj6w4t.StitchCollate.Rmd"))'' returned non-zero exit status 1.
[Wed May  1 12:48:50 2024]
Error in rule collate_stitch_reports:
    jobid: 269
    input: Impute/modeldiploid_usebxFalse_bxlimit50000_k30_s1_ngen500/contigs/20/20.stats
    output: Impute/modeldiploid_usebxFalse_bxlimit50000_k30_s1_ngen500/contigs/20/20.STITCH.html
    conda-env: /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/.snakemake/conda/247c51c9e7b19ce3bdff0f0b16c2ebc3_

output:

Impute/
├── modeldiploid_usebxFalse_bxlimit50000_k30_s1_ngen500
│   └── contigs
│       └── 20
│           ├── 20.log
│           ├── 20.stats
│           ├── 20.vcf.gz
│           ├── 20.vcf.gz.tbi
│           ├── input
│           │   ├── sample.1.input.20.RData
│           │   ├── sample.2.input.20.RData
│           │   ├── sample.3.input.20.RData
│           │   ├── sample.4.input.20.RData
│           │   ├── sample.5.input.20.RData
│           │   └── sample.6.input.20.RData
│           ├── plots
│           │   ├── alphaMat.20.all.s.1.png
│           │   ├── alphaMat.20.normalized.s.1.png
│           │   ├── hapSum.20.s.1.png
│           │   ├── hapSum_log.20.s.1.png
│           │   ├── metricsForPostImputationQC.20.sample.jpg
│           │   ├── metricsForPostImputationQCChromosomeWide.20.sample.jpg
│           │   └── r2.20.goodonly.jpg
│           ├── RData
│           │   ├── EM.all.20.RData
│           │   ├── end.20.RData
│           │   ├── endEM.20.RData
│           │   ├── sampleNames.20.RData
│           │   ├── start.20.RData
│           │   └── startEM.20.RData
│           └── tmp
└── workflow
    ├── config.yml
    ├── impute.smk
    ├── input
    │   ├── alignments
    │   │   ├── CAM046070.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046070.bam
    │   │   ├── CAM046070.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046070.bam.bai
    │   │   ├── CAM046071.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046071.bam
    │   │   ├── CAM046071.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046071.bam.bai
    │   │   ├── CAM046072.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046072.bam
    │   │   ├── CAM046072.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046072.bam.bai
    │   │   ├── CAM046074.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046074.bam
    │   │   ├── CAM046074.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046074.bam.bai
    │   │   ├── CAM046075.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046075.bam
    │   │   ├── CAM046075.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046075.bam.bai
    │   │   ├── CAM046076.bam -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046076.bam
    │   │   └── CAM046076.bam.bai -> /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/bams/CAM046076.bam.bai
    │   ├── samples.list
    │   ├── stitch
    │   │   ├── 13.stitch
    │   │   ├── 14.stitch
    │   │   ├── 15.stitch
    │   │   ├── 20.stitch
    │   │   ├── 28.stitch
    │   │   ├── 30.stitch
    │   │   └── 5.stitch
    │   └── vcf
    │       ├── input.sorted.bcf
    │       ├── input.sorted.bcf.csi
    │       └── input.sorted.log
    ├── report
    │   ├── Impute.Rmd
    │   └── StitchCollate.Rmd
    ├── scripts
    │   └── stitch_impute.R
    └── variants.raw.bcf.biallelic

14 directories, 52 files

So it seems that the skipreports option is not working, and because it can't put together the reports it leaves all the plots behind. I'm going to manually install flexdashboard and rerun. I don't think it has run fully; weirdly it seems to have started on chromosome 20 (this might just be chance due to snakemake) and on the second model I specified in the stitch.params table:

(harpydev) [mgm49@login-n-1 harpy-dev]$ less stitch.params 
model   usebx   bxlimit k       s       ngen
diploid TRUE    50000   30      1       500
diploid FALSE   50000   30      1       500
diploid TRUE    50000   40      1       500
diploid FALSE   50000   40      1       500
diploid FALSE   50000   30      2       500

At least it seems to have produced an imputed vcf for contig 20 :)

pdimens commented 4 months ago

@gmkov before I post anything on the STITCH GH, does this help at all? https://github.com/rwdavies/STITCH/issues/56#issuecomment-949375537

pdimens commented 4 months ago

Snakemake, as you've said, will randomly choose things to run. So, there won't ever be a sensible run order.

Interesting that collate_reports isn't working. That rule doesn't so much create reports by analyzing data; it's more just stitching the graphics STITCH produces into a single document to make life easier, and deleting all the other stuff to remove clutter. That's why I didn't configure it to be disabled by --skipreports. I can configure the workflow to skip that with --skipreports too.

pdimens commented 4 months ago

@gmkov the dev branch now skips collating STITCH metrics into a report with --skipreports
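
For reference, a reports-free run would then be invoked roughly like this (a sketch: --skipreports plus the stitch.params/bams inputs discussed in this thread; keep any other required inputs from your original command):

harpy impute --threads 4 --parameters stitch.params --skipreports bams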

gmkov commented 4 months ago

Good morning from Cambridge.

harpydev impute --skipreports works!

Ok, so I pulled the harpydev git last night, reinstalled, and first ran impute without skipping reports after having manually installed flexdashboard (just in case), and a new error came up:

Quitting from lines 80-98 [setup environment] (tmpsegmxwof.StitchCollate.Rmd)
Error in `contrib.url()`:
! trying to use CRAN without setting a mirror
Backtrace:
 1. global using(...)
 2. utils::install.packages(need)
 4. utils::contrib.url(repos, type)
Execution halted

But then I ran impute with 4 threads only, k=3 or 4 in the stitch parameters, and --skipreports, and voila!!! It's running! It's nearly done (it ran on most contigs with all four models I set in the stitch parameters). This is wonderful.

Next steps

1) Reports: I would love to be able to assemble the reports, as they are a really useful feature for parameter tuning. I have all the plots, but of course one plot per chromosome x 4 models is quite hard on the eye. Do you have any ideas on how to fix the above?

2) Memory. 4 models with low k=3/4, 4 threads, and 6 samples (x 31 chromosomes) have taken about 12h. It could be because the program 'is struggling' to do its job, as with 6 individuals it must be hard to impute! I have 60 individuals, so I will test whether increasing the number from 6 to the whole dataset actually makes it faster. I'd also like to increase k, because this is a wild population with huge effective population sizes. I'll test and let you know :D

pdimens commented 4 months ago

This is big, we've made progress! I know how to fix the R issue and will let you know when it's ready. That'll be in a few hours because it's only 5 AM here and I happened to check my phone when I woke up to get some water. I think you can bump up the thread count, though.

pdimens commented 4 months ago

It's so strange that the other R packages are installed (clearly STITCH works), but flexdashboard won't install.

pdimens commented 4 months ago

@gmkov you can try the dev branch now; I hardcoded a repository for the package installs.
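
To illustrate, the fix amounts to giving install.packages() an explicit mirror in the report's setup chunk, roughly like the sketch below (based on the using()/install.packages() calls visible in the backtrace above; the actual workflow code may differ):

# install any missing packages against a fixed CRAN mirror, so that
# contrib.url() is never reached without a repository being set
using <- function(...) {
    pkgs <- unlist(list(...))
    need <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)]
    if (length(need) > 0)
        install.packages(need, repos = "https://cloud.r-project.org")
    invisible(lapply(pkgs, library, character.only = TRUE))
}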

gmkov commented 4 months ago

Testing with preflight bams. Silly question: can I run harpydev simultaneously in the same conda env for different jobs, in different directories? Or will Snakemake get confused... thanks!

pdimens commented 4 months ago

Harpy passes the --nolock option to the internal Snakemake call, making it possible to run multiple Harpy things simultaneously in your project directory. As an example, I'm currently demultiplexing 4 different library pools by running harpy demultiplex 4x out of the same directory. Just make sure that if you're running the same module, you set different output directories for concurrent runs with -o dirname.
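
For example (hypothetical directory names), two concurrent runs of the same module stay separated like so:

harpy impute -o impute_pop1 [usual options] bams_pop1
harpy impute -o impute_pop2 [usual options] bams_pop2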

gmkov commented 4 months ago

Ok great, good to know -o can help with concurrent runs (I want to divide some steps by populations, AND it might help when running analyses by chromosome on the cluster for speed). And running harpy in different project directories shouldn't be an issue either, right, because .snakemake is local to the project directory? I think the reason I just had issues is because I pulled the harpydev version while some harpydev jobs were running - I knew this was potentially dangerous.

gmkov commented 4 months ago

Bad news: harpydev preflight bams ran perfectly until the end...

Error in loadNamespace(name) : there is no package called ‘flexdashboard’
Calls: <Anonymous> ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
Execution halted

If I install it manually it's fine. Sorry this is so complicated!
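
For anyone hitting the same wall, the manual workaround is a one-liner along these lines, run with the workflow's conda environment activated (the mirror URL is the same one the install logs below use):

Rscript -e 'install.packages("flexdashboard", repos = "https://cloud.r-project.org")'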

gmkov commented 4 months ago

And I tested preflight fastq after having manually installed flexdashboard and got this, in case it's helpful (some issue with libxml, which I also had when trying to install highcharter / XML manually):

trying URL 'https://cloud.r-project.org/src/contrib/XML_3.99-0.16.1.tar.gz'
Content type 'application/x-gzip' length 971985 bytes (949 KB)
==================================================
downloaded 949 KB

trying URL 'https://cloud.r-project.org/src/contrib/rlist_0.4.6.2.tar.gz'
Content type 'application/x-gzip' length 72870 bytes (71 KB)
==================================================
downloaded 71 KB

trying URL 'https://cloud.r-project.org/src/contrib/highcharter_0.9.4.tar.gz'
Content type 'application/x-gzip' length 1073325 bytes (1.0 MB)
==================================================
downloaded 1.0 MB

* installing *source* package ‘XML’ ...
** package ‘XML’ successfully unpacked and MD5 sums checked
** using staged installation
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether the compiler supports GNU C... yes
checking whether gcc accepts -g... yes
checking for gcc option to enable C11 features... none needed
checking how to run the C preprocessor... gcc -E
checking for sed... /usr/bin/sed
checking for pkg-config... /usr/bin/pkg-config
checking for xml2-config... /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/bin/xml2-config
USE_XML2 = yes
SED_EXTENDED_ARG: -E
Minor 12, Patch 6 for 2.12.6
Located parser file -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/include/libxml2 -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/include/parser.h
Checking for 1.8:  -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/include/libxml2 -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/1686eeb18ea1bc448d81b17caa922fb1_/include
Using libxml2.*
checking for gzopen in -lz... yes
checking for xmlParseFile in -lxml2... no
checking for xmlParseFile in -lxml... no
configure: error: "libxml not found"
ERROR: configuration failed for package ‘XML’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/XML’
ERROR: dependency ‘XML’ is not available for package ‘rlist’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/rlist’
ERROR: dependency ‘rlist’ is not available for package ‘highcharter’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/highcharter’

The downloaded source packages are in
        '/tmp/RtmpoJCGsN/downloaded_packages'

Quitting from lines 58-87 [load environment] (tmp1bnmdn9s.PreflightFastq.Rmd)
Error in `library()`:
! there is no package called 'highcharter'
Backtrace:
 1. base::library(highcharter)
Execution halted

pdimens commented 4 months ago

I added libxml2 to the r-env metadata. Hopefully conda will install that and make the problems go away?
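
In other words, the r-env recipe gains a system library next to the R packages, something like this excerpt (a sketch; the real .harpy_envs/r-env.yaml has more entries):

dependencies:
  - conda-forge::libxml2
  # ...existing R package dependencies...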

pdimens commented 4 months ago

FWIW, a long-term solution will eventually be to containerize the environments and have the software distribution method be Singularity/Docker.
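
As a rough sketch of what that could eventually look like for a user (hypothetical image name; no such container exists at this point in the thread):

# pull a prebuilt image once, then run modules inside it,
# bypassing the per-rule conda environments entirely
apptainer pull harpy.sif docker://pdimens/harpy
apptainer exec harpy.sif harpy preflight bam bams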

gmkov commented 4 months ago

Argh, sorry, same error! I've pulled the new changes.

After preflight fastq:

(harpydev) [mgm49@login-n-1 47.all.brazil.haplotagging.n93]$ 
project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/6bc0ac5f6a7d973c423a2cee60cee52a_/include/parser.h
Checking for 1.8:  -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/6bc0ac5f6a7d973c423a2cee60cee52a_/include/libxml2 -I/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/6bc0ac5f6a7d973c423a2cee60cee52a_/include
Using libxml2.*
checking for gzopen in -lz... yes
checking for xmlParseFile in -lxml2... no
checking for xmlParseFile in -lxml... no
configure: error: "libxml not found"
ERROR: configuration failed for package ‘XML’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/XML’
ERROR: dependency ‘XML’ is not available for package ‘rlist’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/rlist’
ERROR: dependency ‘rlist’ is not available for package ‘highcharter’
* removing ‘/home/mgm49/R/x86_64-pc-linux-gnu-library/4.0/highcharter’

The downloaded source packages are in
        '/tmp/Rtmp55Isqu/downloaded_packages'

Quitting from lines 58-87 [load environment] (tmpw8s3x_7g.PreflightFastq.Rmd)
Error in `library()`:
! there is no package called 'highcharter'
Backtrace:
 1. base::library(highcharter)
Execution halted

pdimens commented 4 months ago

I swapped the xml dependency as described here from conda-forge::libxml2 to r-xml. Does that improve things?

gmkov commented 4 months ago

Hi

New error when running preflight bams, slightly different:

Could not create conda environment from /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.harpy_envs/r-env.yaml:
Command:
conda env create --quiet --no-default-packages --file "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_.yaml" --prefix "/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_"
Output:
Channels:
 - bioconda
 - conda-forge
 - defaults
 - r
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
ERROR conda.core.link:_execute(949): An error occurred while installing package 'conda-forge::r-rcpp-1.0.12-r42h7df8631_0'.
Rolling back transaction: ...working... done

[Errno 2] No such file or directory: '/rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_/lib/R/library/Rcpp/include/Rcpp/platform/compiler.h'

When running align bwa, it runs well (produces bams) but also struggles to make reports:

Activating conda environment: .snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_

EnvironmentLocationNotFound: Not a conda environment: /rds/project/cj107/rds-cj107-jiggins-rds/projects/mgm49/helicoverpa/09.snakemake.batch1.haplo.parse.barcodes/harpy-dev/47.all.brazil.haplotagging.n93/.snakemake/conda/0ed40cb3b057aecef3c4eec51d32d608_

Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).
Execution halted
RuleException:

Bit weird, will try again.

pdimens commented 4 months ago

I'm at a loss with this. I've begun trying to figure out how to achieve this using apptainer, but it might be a while because my initial attempts have not been successful.