statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.
MIT License

Easier Way to Use Custom LD? #204

Open ttbek opened 1 year ago

ttbek commented 1 year ago

If I'm reading the code correctly, in order to use my own LD I would need to change the source passed to locuszoom.js, find the calls it uses while populating, and then set up our own RESTful API server to return the JSON response?

Does the following match the call format used in locuszoom.js? That is, does it use the region call like that?
https://portaldev.sph.umich.edu/ld/genome_builds/GRCh37/references/1000G/populations/AFR/regions?correlation=r&chrom=X&start=67544032&stop=67544350

Is there an easier way to load from a local precalculated LD file?

Maybe we could customize locuszoom.js (which means loading the modified version from our server instead of the currently configured source) by changing its populate function?

I'm a bit unsure what the best approach would be here. Preferably we would still be able to load the 1000 Genomes LD values, but we also want to display some custom ones.

abought commented 1 year ago

Thanks for your question.

First, the fancy answer to your question: if you're comfortable running the infrastructure, then the code to calculate custom LD is open source and you can run a compatible API locally: https://github.com/statgen/LDServer/ And you can experiment with API query syntax for that LD server here: https://portaldev.sph.umich.edu/playground
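As a rough illustration of the query syntax, here is a minimal Python sketch that assembles a region LD URL in the same shape as the one quoted in the question. The function name `region_ld_url` and its parameters are hypothetical; the path segments and query keys are copied from that quoted URL, not from the LDServer documentation, so check them against the playground before relying on them.

```python
from urllib.parse import urlencode


def region_ld_url(base, build, reference, population, chrom, start, stop,
                  correlation="r"):
    """Build a region LD query URL shaped like the one quoted in this thread.

    `base` is the root of the LD API, e.g. the public server or a local one.
    """
    path = (
        f"{base.rstrip('/')}/genome_builds/{build}"
        f"/references/{reference}/populations/{population}/regions"
    )
    # Dict order is preserved, so the query keys come out in this order.
    query = urlencode(
        {"correlation": correlation, "chrom": chrom, "start": start, "stop": stop}
    )
    return f"{path}?{query}"


url = region_ld_url(
    "https://portaldev.sph.umich.edu/ld", "GRCh37", "1000G", "AFR",
    "X", 67544032, 67544350,
)
print(url)
```

Pointing `base` at a locally hosted server instead of the public one is the only change needed to build the equivalent local query.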

If you only want to use precalculated LD from a file: it is possible, but it's kind of unwieldy. The main issue is that pre-calculated LD files can be extremely large to store and query; you can try to reduce the size by only including LD relative to a preset list of lead variants... but for PheWAS-scale data, you might have a lot of lead variants (=bigger files).

The newest version of LocusZoom.js includes some features designed to help read such LD files (it's what we use for LocalZoom & http://my.locuszoom.org/). But PheWeb uses an older version of LocusZoom (see a rough draft of what PheWeb code would need to change, https://github.com/statgen/pheweb/pull/185, to work with LocusZoom.js 0.14.0, which added the "use LD from local file" helper code). There's no strong technical reason why PheWeb isn't updated for LZ.js 0.14, except that the original developer of PheWeb moved on and the task got lost in the shuffle.

Some LZ.js demos show how you would modify the plot creation code to specify LD from a local file. We use tabix to make the queries (slightly) more manageable, but custom LD can still be a very big file. PLINK can be kind of slow calculating that much LD the first time, but the demo shows the expected file format so you can substitute other tools of your choosing. https://statgen.github.io/locuszoom/examples/ext/tabix_tracks.html https://github.com/statgen/locuszoom/blob/develop/examples/ext/tabix_tracks.html#L203-L207
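To make the "LD relative to a preset lead variant" idea concrete, here is a minimal Python sketch that pulls pairwise r² values for one lead variant out of PLINK-style text. It assumes the whitespace-delimited columns that PLINK 1.9 `--r2` writes (`CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2`); the exact format the LocusZoom demo expects should be checked against the linked example. The sample rows and the function name `ld_for_lead` are made up for illustration.

```python
import io

# Made-up sample rows in the assumed PLINK 1.9 `--r2` column layout.
SAMPLE = """\
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
22 100 rsA 22 150 rsB 0.91
22 100 rsA 22 300 rsC 0.12
22 150 rsB 22 300 rsC 0.40
"""


def ld_for_lead(lines, lead_chrom, lead_pos):
    """Return {(chrom, pos): r2} for pairs whose first variant is the lead."""
    header = None
    result = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields  # first non-blank line names the columns
            continue
        row = dict(zip(header, fields))
        if row["CHR_A"] == lead_chrom and int(row["BP_A"]) == lead_pos:
            result[(row["CHR_B"], int(row["BP_B"]))] = float(row["R2"])
    return result


result = ld_for_lead(io.StringIO(SAMPLE), "22", 100)
print(result)
```

A linear scan like this only suits small files; for files of the size described above, an indexed lookup such as tabix is what makes region queries workable.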

Anyway, I hope this helps! We always wanted to have more LD options, but in a reusable internet tool, there aren't a lot of good LD panels that people are allowed to share publicly. Every now and then I ask, in hopes that something has changed. :)

-Andy Boughton
Applications Programmer/Analyst, Lead
Center for Statistical Genetics, University of Michigan


ttbek commented 1 year ago

Sorry for taking so long to get back to this. I'm attempting to set up the server and I'm getting some output from the Raremetal one, but the LD server is just giving: "The connection was reset"

Output is looking like this:

sudo docker-compose down && sudo docker-compose up
Removing ldserver_ldserver_1 ... done
Removing ldserver_raremetal_1 ... done
Removing ldserver_redis_1 ... done
Creating ldserver_redis_1 ... done
Creating network "ldserver_default" with the default driver
Creating ldserver_redis_1 ...
Creating ldserver_ldserver_1 ... done
Creating ldserver_raremetal_1 ... done
Attaching to ldserver_redis_1, ldserver_raremetal_1, ldserver_ldserver_1
ldserver_1 | Running startup flask add commands...
raremetal_1 | [2023-02-27 15:42:11,078] INFO in model: Added genotype file: var/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.bcf
raremetal_1 | [2023-02-27 15:42:11 +0000] [1] [INFO] Starting gunicorn 20.1.0
raremetal_1 | [2023-02-27 15:42:11 +0000] [1] [INFO] Listening at: http://0.0.0.0:4545 (1)
raremetal_1 | [2023-02-27 15:42:11 +0000] [1] [INFO] Using worker: gthread
raremetal_1 | [2023-02-27 15:42:11 +0000] [12] [INFO] Booting worker with pid: 12
raremetal_1 | [2023-02-27 15:42:12 +0000] [13] [INFO] Booting worker with pid: 13

So the only thing coming from the LD container is the message that it is running the startup commands. I'm not seeing anything problematic in the gunicorn or Redis logs (I think, at least). Do we expect the startup commands to take a long time? That is, am I just being too impatient and I'll get different output when it is ready, or is there probably a problem?

ttbek commented 1 year ago

Is gunicorn supposed to be running on 8000? The .env I have says 4546 and docker is mapping 4546, but the gunicorn log has this:

[2023-02-27 15:42:24 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2023-02-27 15:42:24 +0000] [1] [INFO] Using worker: gevent
[2023-02-27 15:42:24 +0000] [32] [INFO] Booting worker with pid: 32
[2023-02-27 15:42:24 +0000] [33] [INFO] Booting worker with pid: 33

My .env file:

LDSERVER_PORT=4546
LDSERVER_CONFIG_SCRIPT=/home/ldserver/startup.sh
LDSERVER_WORKERS=2
RAREMETAL_CONFIG_DATA=var/config.yaml
RAREMETAL_WORKERS=2
RAREMETAL_PORT=4545
OMP_NUM_THREADS=2
OPENBLAS_NUM_THREADS=2
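One possible explanation for the mismatch, sketched in Python: gunicorn falls back to port 8000 when the `-b` address carries no port, so a bind of `0.0.0.0` would not pick up `LDSERVER_PORT`. The helpers below (`parse_env`, `bind_port`) are hypothetical and only mimic that fallback for simple IPv4 `host[:port]` strings; they are not gunicorn's own parsing code.

```python
DEFAULT_GUNICORN_PORT = 8000  # gunicorn's default when the bind has no port


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def bind_port(bind):
    """Return the port a `host[:port]` bind string would listen on.

    Only handles plain IPv4/hostname binds; IPv6 and unix sockets are ignored.
    """
    _host, sep, port = bind.rpartition(":")
    if sep and port.isdigit():
        return int(port)
    return DEFAULT_GUNICORN_PORT


env = parse_env("LDSERVER_PORT=4546\nRAREMETAL_PORT=4545\n")
expected = int(env["LDSERVER_PORT"])
for bind in ("0.0.0.0", "0.0.0.0:4546"):
    actual = bind_port(bind)
    status = "ok" if actual == expected else "MISMATCH"
    print(f"-b {bind}: listens on {actual}, .env expects {expected} ({status})")
```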

ttbek commented 1 year ago

Ah... the example docker override file didn't set the port, even though it showed how to modify the command, so changing from this:

  gunicorn -b 0.0.0.0 -w $$LDSERVER_WORKERS -k gevent \
    --access-logfile /data/logs/gunicorn.access.log \
    --error-logfile /data/logs/gunicorn.error.log \
    --pythonpath rest 'ldserver:create_app()'"

to this:

  gunicorn -b 0.0.0.0:4546 -w $$LDSERVER_WORKERS -k gevent \
    --access-logfile /data/logs/gunicorn.access.log \
    --error-logfile /data/logs/gunicorn.error.log \
    --pythonpath rest 'ldserver:create_app()'"

Allows me to reach the endpoints http://localhost:8084/correlations and http://localhost:8084/genome_builds with the expected results (8084 is the local SSH-forwarded port; on the server side it is localhost:4546). However, a region query such as http://localhost:8084/genome_builds/GRCh37/references/1000G/populations/AFR/regions?correlation=r&chrom=20&start=60343&stop=65000 gives me "Internal Server Error", and the gunicorn log on the server shows:

[2023-02-28 10:07:15,376] ERROR in app: Exception on /genome_builds/GRCh37/references/1000G/populations/AFR/regions [GET]
Traceback (most recent call last):
  File "/home/ldserver/.local/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/ldserver/.local/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/ldserver/.local/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/ldserver/.local/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/ldserver/.local/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/ldserver/.local/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/ldserver/rest/ldserver/api.py", line 165, in get_region_ld
    ldserver.compute_region_ld(str(args['chrom']), args['start'], args['stop'], correlation_type(args['correlation']), result, str(population_name))
RuntimeError: Error while reading a cell from Redis cache

The Redis log shows:

1:C 28 Feb 2023 10:19:52.343 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 28 Feb 2023 10:19:52.343 # Redis version=5.0.14, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 28 Feb 2023 10:19:52.343 # Configuration loaded
1:M 28 Feb 2023 10:19:52.346 Running mode=standalone, port=6379.
1:M 28 Feb 2023 10:19:52.346 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 28 Feb 2023 10:19:52.346 # Server initialized
1:M 28 Feb 2023 10:19:52.347 DB loaded from disk: 0.000 seconds
1:M 28 Feb 2023 10:19:52.347 * Ready to accept connections

I did take a shot at fixing that warning earlier and currently if I:

cat /proc/sys/net/core/somaxconn
512

Well, it looks fixed there. Maybe this parameter needs to be done inside the container? I thought it was a kernel parameter and would be outside though. It was originally 128 as the message suggests, but it has been changed to 512 and Redis has been restarted several times since then. I don't think that would be the issue, but it's all the Redis log is complaining about.

ttbek commented 1 year ago

Turns out the container can have a more restrictive value than the host kernel, but it can be set in the docker-compose file, so I added:

sysctls:
  net.core.somaxconn: 512

To the section for the alpine Redis image. It fixes that last warning in the Redis log... but no dice on the error in the gunicorn log, still get it.

ttbek commented 1 year ago

Ah, ok, ok,

Warning: For docker, you must change CACHE_REDIS_HOSTNAME to redis.

For some reason I read this as changing the left side, that is, replacing the text 'CACHE_REDIS_HOSTNAME' with the text 'redis'. I know, makes no sense. Fixed now. So I guess the next step is that I need to move my PheWeb over to the production server and... point it to this new LD server somehow.
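A quick way to check this kind of misconfiguration is a plain TCP connection test, sketched below in Python. The helper name `can_connect` is hypothetical; the hostname `redis` comes from the warning quoted above and port 6379 from the Redis log earlier in this thread. Run it from inside the LD server container, since that is where the `redis` hostname would be expected to resolve.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and hostname lookup failures.
        return False


# Example call (assumed hostname and port, see the note above):
# can_connect("redis", 6379)
```

A `False` result for the configured hostname would be consistent with the "Error while reading a cell from Redis cache" failure, though it does not prove that is the cause.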

Regarding available LD panels, true indeed. Unfortunately we aren't putting out a new data set with this. The population in our PheWeb is Arab and not well represented in 1KG, so even though it is small we wanted to also show the LD from the 108 Qatari genomes published here: https://genome.cshlp.org/content/26/2/151.full.html They're publicly available on the Sequence Read Archive; you just need to use the toolkit to download them due to the size.