statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.
MIT License

Feature request: multithreaded `pheweb matrix` #115

Open jgrundstad opened 5 years ago

jgrundstad commented 5 years ago

Would it be possible to perform the matrix processing stage on a per-chromosome basis to allow distribution of the work?

Thanks!

pjvandehaar commented 5 years ago

There are two ways to deal with the single-threaded time taken by `pheweb matrix`:

(a) Parallelize `pheweb matrix`.

(b) Instead of running `pheweb process`, run

    pheweb phenolist verify &&
    pheweb parse-input-files &&
    pheweb sites &&
    pheweb make-gene-aliases-trie &&
    pheweb add-rsids &&
    pheweb add-genes &&
    pheweb make-tries &&
    pheweb augment-phenos

and then, in parallel, run both pheweb matrix and

    pheweb manhattan &&
    pheweb qq &&
    pheweb bgzip-phenos &&
    pheweb top-hits &&
    pheweb phenotypes &&
    pheweb gather-pvalues-for-each-gene

If the single-threaded `pheweb matrix` finishes before the other commands do, its runtime is fully hidden, so this approach is likely about as fast as option (a) would be.

I'm not sure how to implement this option well. If pheweb used multiprocessing to run `pheweb matrix` in parallel with the other commands, debugging and terminal output would be harder. Perhaps I could add an option `pheweb process --matrix-in-parallel`, where the `pheweb matrix` output would be redirected to a log file?
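
To illustrate, a minimal sketch of that option (the log path and the exact list of steps here are illustrative, not an existing pheweb interface):

    import subprocess

    # Run `pheweb matrix` in the background, with its output redirected to a
    # log file, while the remaining steps run in the foreground.
    with open('generated-by-pheweb/tmp/matrix.log', 'w') as log:
        matrix = subprocess.Popen(['pheweb', 'matrix'],
                                  stdout=log, stderr=subprocess.STDOUT)
        for step in ['manhattan', 'qq', 'bgzip-phenos', 'top-hits',
                     'phenotypes', 'gather-pvalues-for-each-gene']:
            subprocess.run(['pheweb', step], check=True)
        if matrix.wait() != 0:
            raise RuntimeError('`pheweb matrix` failed; see its log file')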

Because of that debugging complexity alone, I prefer option (a), as you proposed.


Currently, pheweb makes the matrix from the uncompressed per-phenotype annotated files (`pheno/*` in https://github.com/statgen/pheweb/blob/master/etc/detailed-internal-dataflow.md). Since these are uncompressed, they're not indexed.

To read a single chromosome from an individual phenotype file for use in `pheweb matrix`, either:

- the per-pheno files would first need to be bgzipped and tabix-indexed, so that any chromosome can be fetched directly (see the sketch below), or
- each reader would need to find the byte offset at which its chromosome starts in the uncompressed file, e.g. with a bisect search.

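For example, the first option could look like this (a sketch only; the path is hypothetical, since pheweb's `pheno/*` files are not currently compressed at this stage, but pheweb already depends on pysam):

    import pysam

    # Stream only chromosome 12 from a bgzipped, tabix-indexed pheno file.
    tbx = pysam.TabixFile('generated-by-pheweb/pheno/some-pheno.tsv.gz')
    for row in tbx.fetch('12'):
        fields = row.split('\t')  # chrom, pos, ref, alt, ...
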
What do you think is the right approach? Are you interested in implementing this feature?

jgrundstad commented 5 years ago

I'm not sure what the best route is - I'm still getting a feel for how this whole process works :) Unfortunately, bandwidth limits will keep me in a user role for the time being.

I'm willing to try option (b) in the near future - I didn't realize those stages could run concurrently.

I'm currently processing a set of 12M SNPs and 2000 phenotypes; early calculations indicate about 100 hours of processing for the `pheweb matrix` stage. It's running on one core at ~35% load and should be done in 3.5 days. How long would you anticipate the remaining steps taking?

pjvandehaar commented 5 years ago

How many cores on the machine?

`pheweb process -h` will show all the steps it runs. The only heavy one remaining is `pheweb gather-pvalues-for-each-gene`, which runs about 4x faster than `pheweb matrix` when given 40 cores. I've never run a large dataset with fewer cores than that, but hopefully it works quickly. Once it starts, it should give a good estimate.

Do you predict runtime with something like `gzip -cd .generated-by-pheweb/tmp/matrix.tsv.gz | wc -l`?

jgrundstad commented 5 years ago

I'm working with 32 cores. Good to know about gather-pvalues-for-each-gene - I assume that one is multithreaded?

Yeah, I estimated runtime based on time to generate output in matrix.tsv.gz.

pjvandehaar commented 5 years ago

Yep, it's multithreaded.

hanayik commented 4 years ago

Hi, does anyone in this thread know of a branch or fork that has implemented a parallelized pheweb matrix?

pjvandehaar commented 4 years ago

@hanayik I'm not aware of such a fork. I'd love to have a parallel matrix feature and I can help if you'd like to work on it. Changes would be needed in https://github.com/statgen/pheweb/blob/master/pheweb/load/matrix.py and https://github.com/statgen/pheweb/blob/master/pheweb/load/cffi/x.cpp .

It seems like most of the time goes to gzipping the output. I just changed the gzip compression level from 6 to 2, which will hopefully be much (3x?) faster. But there's still a lot to gain from parallelization.

Some considerations:

  1. Parallelization will have to happen by genomic region, but which regions should be used? To start with, I think each process should handle one chromosome; that would give roughly a 10x speedup. Later, this could be changed to ranges each covering num_variants / config.num_cpus variants, computed from generated-by-pheweb/sites/sites.tsv. Even when just parallelizing by chromosome, it's probably worth having the master process scan through generated-by-pheweb/sites/sites.tsv to check which chromosomes occur.
  2. Parallelization should happen in Python, and each multiprocessing-spawned worker process should call into the C++ through cffi.
  3. Each chromosome-handling child process will need to find the byte offset in generated-by-pheweb/sites/sites.tsv, and in each generated-by-pheweb/pheno/* file, at which its chromosome starts. To start, I think Python should just do a bisect search over byte offsets in each file; that's easy since every row begins with chrom, pos, ref, alt. (A rough sketch of points 1, 3, and 4 follows this list.)
  4. Each process will write a file like generated-by-pheweb/tmp/matrix-chr12.tsv.gz, omitting the empty BGZF block that marks end-of-file; the master process will then concatenate these parts and append a single empty BGZF end-of-file block to produce matrix.tsv.gz.
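
For concreteness, here's a rough Python sketch of points 1, 3, and 4. Everything in it is hypothetical (the helper names, the stubbed-out worker body); per point 2, a real worker would hand its byte offsets to the matrix-building C++ in load/cffi/x.cpp through cffi:

    import os
    import shutil
    from multiprocessing import Pool

    SITES = 'generated-by-pheweb/sites/sites.tsv'
    TMP = 'generated-by-pheweb/tmp'

    # The standard 28-byte empty BGZF block that marks end-of-file.
    BGZF_EOF = bytes.fromhex(
        '1f8b08040000000000ff0600424302001b0003000000000000000000')

    def chroms_in_sites():
        # Point 1: one sequential scan of sites.tsv to learn which chromosomes
        # occur, in order. Rows are sorted by chromosome, so deduplication is
        # just "did the chromosome change since the previous row".
        chroms = []
        with open(SITES, 'rb') as f:
            for line in f:
                chrom = line.split(b'\t', 1)[0].decode()
                if not chroms or chroms[-1] != chrom:
                    chroms.append(chrom)
        return chroms

    def chrom_start_offset(path, chrom, rank):
        # Point 3: bisect over byte offsets in a chrom/pos/ref/alt-sorted,
        # header-less file to find where `chrom`'s rows begin. `rank` maps
        # chromosome name -> its position in chromosome order.
        target = rank[chrom]
        with open(path, 'rb') as f:
            lo, hi = 0, os.path.getsize(path)
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()  # skip the (possibly partial) line holding byte `mid`
                line = f.readline()
                if not line or rank[line.split(b'\t', 1)[0].decode()] >= target:
                    hi = mid
                else:
                    lo = mid + 1
            # `lo` is now a line start: either the first row at-or-after
            # `chrom`, or the last row before it.
            f.seek(lo)
            line = f.readline()
            if not line or rank[line.split(b'\t', 1)[0].decode()] >= target:
                return lo
            return f.tell()

    def write_chrom_part(chrom):
        # Points 2 & 4: each worker writes one BGZF part WITHOUT the trailing
        # empty EOF block. The real work is stubbed out; it would use
        # chrom_start_offset() on sites.tsv and on each pheno/* file, then
        # pass those offsets into the C++ via cffi.
        part = os.path.join(TMP, 'matrix-chr{}.tsv.gz'.format(chrom))
        ...  # cffi call goes here
        return part

    if __name__ == '__main__':
        chroms = chroms_in_sites()
        with Pool() as pool:
            parts = pool.map(write_chrom_part, chroms)
        # Point 4: BGZF parts can simply be concatenated; append one empty
        # BGZF block at the end so htslib/tabix see a proper end-of-file.
        with open(os.path.join(TMP, 'matrix.tsv.gz'), 'wb') as out:
            for part in parts:
                with open(part, 'rb') as src:
                    shutil.copyfileobj(src, out)
            out.write(BGZF_EOF)
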
Shicheng-Guo commented 3 years ago

Not so bad, actually: I have 1262 phenotypes, and single-threaded `pheweb matrix` takes only 62880 seconds (about 17.5 hours). One question: is there any way to speed up the `pheweb bgzip-phenos` step?

    ==> Starting `pheweb phenolist verify`
    The 1262 phenotypes in '/projects/ps-janssen4/dsci-csb/user/sguo2/pheweb/pheno-list.json' look good.
    ==> Completed in 0 seconds

    ==> Starting `pheweb sites`
    The list of sites is up-to-date!
    ==> Completed in 1 seconds

    ==> Starting `pheweb make-gene-aliases-trie`
    gene aliases are at '/home/sguo2/.pheweb/cache/gene_aliases-v29-hg19.marisa_trie'
    ==> Completed in 0 seconds

    ==> Starting `pheweb add-rsids`
    rsid annotation is up-to-date!
    ==> Completed in 0 seconds

    ==> Starting `pheweb add-genes`
    gene annotation is up-to-date!
    ==> Completed in 0 seconds

    ==> Starting `pheweb make-tries`
    tries are up-to-date!
    ==> Completed in 0 seconds

    ==> Starting `pheweb augment-phenos`
    Output files are all newer than input files, so there's nothing to do.
    ==> Completed in 1 seconds

    ==> Starting `pheweb manhattan`
    Output files are all newer than input files, so there's nothing to do.
    ==> Completed in 1 seconds

    ==> Starting `pheweb qq`
    Output files are all newer than input files, so there's nothing to do.
    ==> Completed in 2 seconds

    ==> Starting `pheweb matrix`
    ==> Completed in 62880 seconds

    ==> Starting `pheweb bgzip-phenos`
    Processing 1262 phenos

pjvandehaar commented 3 years ago

@Shicheng-Guo Yeah, if your system is CPU-limited (rather than IO-limited), then reducing the gzip compression level will probably make `pheweb bgzip-phenos` 2x-3x faster. Unfortunately, I don't think pysam lets you choose the gzip compression level. It looks easy to switch from `pysam.tabix_compress` to the code in `load/cffi/x.cpp`, but I personally am not going to do it, because I only run pheweb on machines with plenty of CPU. If you'd like to work on this, please open a new GitHub issue and I'll be happy to give advice.
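
For anyone who picks this up, one low-effort sketch (not pheweb code) would be to shell out to htslib's bgzip binary, which has a -l/--compress-level option in reasonably recent releases; `bgzip_fast` below is a hypothetical helper:

    import subprocess

    def bgzip_fast(path, level=2):
        # Compress `path` to `path + '.gz'` in BGZF format at a low
        # compression level, trading file size for speed (mirroring the
        # 6 -> 2 change mentioned earlier in this thread). Requires a
        # bgzip new enough to support -l/--compress-level.
        with open(path + '.gz', 'wb') as out:
            subprocess.run(['bgzip', '-l', str(level), '-c', path],
                           stdout=out, check=True)

Tabix indexing is unaffected by the compression level, so pheweb's own indexing step would still run afterward.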