omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

Issue when computing PRS via polypred #108

Closed mkoromina closed 2 years ago

mkoromina commented 2 years ago

Hi @omerwe ,

Quick question with regards to the calculation of PRS via polypred. All is running well until I reach the final step, where we compute PRS by using the combined SNP effect sizes. Specifically, I run the following:

python /path/to/polypred.py \
    --predict \
    --betas /path/to/output/polypred.betas \
    --output-prefix /path/to/output/polypred.predictions \
    --plink-exe ~/path/to/plink/plink \
    /path/to/my/.*.bed

However, the jobs exit with the error 'duplicate snps found in the input files'. Do you know how we can efficiently address this issue? There is a high chance that these are multi-allelic SNPs, which could nonetheless contribute to the PRS scores. Any thoughts on that?
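
For reference, here is one quick way to check whether the same SNP ID appears more than once in the PLINK .bim files I am passing in (just a sketch with placeholder paths; column 2 of a .bim file is the SNP ID, so multi-allelic sites typically show up as repeated IDs):

# list SNP IDs that occur more than once in each .bim file (placeholder paths)
for bim in /path/to/my/*.bim; do
    echo "=== ${bim} ==="
    cut -f2 "${bim}" | sort | uniq -d | head
done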

Many thanks,
Maria

P.S. One more thing: do you know if PolyFun will be available as a Singularity container via Docker Hub?

omerwe commented 2 years ago

Hi @mkoromina,

The code should handle multi-allelic SNPs fine... Can you run the example in the wiki without a problem? Unfortunately, it will be difficult to diagnose without a reproducible example. Can you please try to create one?

You could start by e.g. using only the first 100 SNPs in each chromosome and see if the problem persists. If it doesn't, you can expand to the top 1000, the top 10000, etc., until the problem replicates. Perhaps it will help you figure out where the problem is coming from. If you can create a short reproducible example, can you please send it to me at oweissbrod@hsph.harvard.edu?
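
For example, a reduced fileset with the first 100 SNPs of a chromosome could be built roughly like this (a sketch with placeholder file names, assuming per-chromosome PLINK files; --extract and --make-bed are standard plink flags):

# take the first 100 SNP IDs from a per-chromosome .bim file
head -n 100 /path/to/my/chr1.bim | cut -f2 > chr1_first100.snps

# build a reduced PLINK fileset containing only those SNPs
~/path/to/plink/plink \
    --bfile /path/to/my/chr1 \
    --extract chr1_first100.snps \
    --make-bed \
    --out chr1_first100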

Re Docker: It's a good idea, but I'm not sure it would work well in practice because Conda packages are (unfortunately) not sufficiently cross-platform. I guess I could restrict it to a specific platform, but I don't have the bandwidth right now. If any PolyFun users can help, it would be highly appreciated!

mkoromina commented 2 years ago

Hi @omerwe,

Many thanks for your quick reply! I was able to get the wiki example running fine, so it is probably an issue with one of the .bed files that I am using (I will check that there are no other issues with it and come back to you if needed).

One more (a bit more generic) question: in wiki steps 3 and 4 of computing PRS with PolyPred, do the .bed files and the pheno.txt correspond to the examined cohort (i.e. the cohort for which we are computing PRS scores)? Or is step 3 more like a "tuning" step for which we use the .bed files of a 'training' cohort?

The reason I am asking is that in step 3 of the wiki, I run:

python /path/to/polypred.py \
    --combine-betas \
    --betas /path/to/other/method/effect_sizes/stats.gz, /path/to/polyfun.agg.txt.gz \
    --pheno /path/to/pheno.txt  \
    --output-prefix polypred_combined_effects \
    --plink-exe ~/plink/plink \
    /path/to/cohort.bed

However, this crashes with the following error: 'polypred.py unrecognized argument: /path/to/cohort.bed'. It seems like something very simple that could easily be fixed, but could you kindly let me know what might be wrong here?

With my best wishes, Maria

jdblischak commented 2 years ago

@mkoromina How did you want to use the Dockerfile? Do you need it to only provide the software dependencies? Or do you want to also be able to include the scripts inside the Dockerfile so that you can call them directly?

This Dockerfile installs the conda environment provided in polyfun.yml. It uses the condaforge/mambaforge base image.

FROM condaforge/mambaforge:4.12.0-0

COPY polyfun.yml .

RUN mamba env create --file polyfun.yml

RUN echo ". ${CONDA_DIR}/etc/profile.d/conda.sh && conda activate polyfun" >> ~/.bashrc

I tested it locally to confirm that the polyfun environment is activated. The trickiest part is the last line, which I adapted from here.
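
One way to build and test it locally is roughly the following (the tag "polyfun-env" is just a placeholder):

# build the image from the root of the polyfun repo
docker build -t polyfun-env .

# start an interactive shell; inside the container,
# "conda env list" should mark the polyfun environment as active
docker run --rm -it polyfun-env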

mkoromina commented 2 years ago

Hi @jdblischak ,

Many thanks for this! I actually want the latter option, i.e. both the software dependencies and the scripts inside the Dockerfile. I shall test the instructions you kindly provided above and come back to you asap!

Many thanks, Maria

jdblischak commented 2 years ago

Running the scripts from inside the container gets more complicated since the conda environment needs to be activated properly. I found this post from pythonspeed.com to be very helpful.

Here's the Dockerfile:

FROM condaforge/mambaforge:4.12.0-0

COPY polyfun.yml .

RUN mamba env create --file polyfun.yml && conda clean -ay

COPY *.py LICENSE ukb_regions.tsv.gz ./

COPY ldsc_polyfun ./ldsc_polyfun

COPY ldstore ./ldstore

RUN echo ". ${CONDA_DIR}/etc/profile.d/conda.sh && conda activate polyfun" >> ~/.bashrc && \
    echo ". ${CONDA_DIR}/etc/profile.d/conda.sh && conda activate polyfun" >> /etc/skel/.bashrc

ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "polyfun"]

CMD ["/bin/bash"]

If you run it interactively with -it, the "polyfun" environment is automatically activated upon login. If you instead run a command, it will run inside the "polyfun" environment thanks to the conda run ENTRYPOINT.

# Clone the repo
git clone https://github.com/omerwe/polyfun.git
cd polyfun

# Build the container (after saving the Dockerfile above to root of polyfun repo)
docker build -t pf .

# Run it interactively
docker run --rm -it pf

# Execute a single command
docker run --rm pf python finemapper.py --help
# (note: the above is only really useful if you mount a local directory with -v)

Note that I only copied over the files required to run the PolyFun scripts. Copying the example data files would unnecessarily increase the size of the Docker image.

mkoromina commented 2 years ago

Many many thanks @jdblischak for all these instructions! I will go through these within the next 2 days and come back to you asap! Truly appreciated!

mkoromina commented 2 years ago

Hi @jdblischak, Docker is not supported as an option on the server I am using, but Singularity is. Would the steps then be the same? Many thanks, Maria

jdblischak commented 2 years ago

You'll need to build the Docker image on your personal computer. This will be easiest with some variety of Linux (e.g. Ubuntu, WSL2, etc.). Docker Desktop is easy to run on Windows and macOS, but it has license restrictions (that I can't advise you on). You could also try an alternative like minikube.

Once you've created a Docker image, you have multiple options for using it on your HPC:

  1. You could upload it to DockerHub, and then have it automatically converted to a Singularity container with something like singularity run docker://account/name:version, e.g. singularity run docker://condaforge/mambaforge:4.12.0-0 (see the sketch below this list). You'll likely need to log in to DockerHub since their API has download limits.
  2. You could convert it to a Singularity image on your local machine with singularity build --remote (this will also require creating an account on https://cloud.sylabs.io/), and then scp the Singularity image to your HPC.
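
For option 1, a minimal sketch on the HPC could look like this (the image name youraccount/polyfun:latest is purely hypothetical):

# pull the image from DockerHub and convert it to a .sif file
singularity pull polyfun.sif docker://youraccount/polyfun:latest

# open an interactive shell inside the container
singularity shell polyfun.sif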

Needless to say, there are lots of options, and the best solution for you will depend on your employer (i.e. whether you can use Docker Desktop or not) and your infrastructure (local computer and HPC server).

Thus it might be a good idea to step back and ask: what exactly are you trying to accomplish by running PolyFun in a singularity container? Is this absolutely required for your computational setup?

mkoromina commented 2 years ago

Hi @jdblischak,

Many thanks for all this useful advice! I will see if I can proceed with my analysis by running PolyFun via a conda environment and, if not, follow your instructions and do this via Singularity. In any case, thank you once again for all your advice; it is truly appreciated!