iDrDex / star_api

API access to STARGEO: stargeo.org

Optimize permutations #11

Open · Suor opened this issue 8 years ago

Suor commented 8 years ago

Analysis with permutations became extremely slow, and Dexter gave me the task of optimizing it. The most visible issue was the slow fold-change calculation.

I optimized fold changes here and here. I believe meta-analysis should be the bottleneck now, @idrdex please confirm or deny.
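For reference, the speed-up in fold changes comes mostly from replacing per-gene Python loops with vectorized numpy/pandas operations. A minimal sketch of that idea (not the actual code in the linked commits; the column layout, pseudocount, and log2 transform are assumptions):

```python
import numpy as np
import pandas as pd

def fold_changes(expression, case_samples, control_samples):
    """Vectorized log2 fold change per gene.

    expression: DataFrame indexed by gene, columns are sample ids.
    case_samples / control_samples: lists of column names.
    """
    case_mean = expression[case_samples].mean(axis=1)
    control_mean = expression[control_samples].mean(axis=1)
    # One vectorized pass over all genes instead of a per-gene loop;
    # a pseudocount of 1 avoids log of zero.
    return np.log2(case_mean + 1) - np.log2(control_mean + 1)
```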

The other issue is that I can't really test this with mygene_filter, as I don't have the CSV file you are using in the analysis (dengue_perm_analysis.csv). Please supply it or give me the code to generate it.

Suor commented 8 years ago

I also explored meta-analysis optimization opportunities. Here are the options I see:

  1. Reduce the number of genes. E.g. analyse all genes for a small number of permutations, then select interesting genes and do more permutations; repeat until the required precision is reached. This is what you were talking about. Variant: do a first crude filtering pass based on normal meta-analysis without permutations.
  2. Only calculate TE_fixed and TE_random for permutation analysis; the other fields, like confidence intervals, are never used (see the sketch below). This is easy to implement, but cuts time only by a third.
  3. Reimplement all of the meta-analysis in Cython. This will take a few days and can speed things up 10x, but the code will become significantly less readable, so future modifications will be hard.
  4. Parallelize. Relatively simple to implement; will potentially speed things up by the number of cores, i.e. 4-8x. Hard to say how more cores would do, though. Also moderately inhibits future modifications.

I would like to put off 3 and 4 for as long as possible, as both make future modifications harder.
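To make option 2 concrete, here is a minimal sketch of computing only the two estimates the permutation analysis needs: standard inverse-variance weighting for the fixed effect and a DerSimonian-Laird tau² for the random effect. The function and array names are assumptions, not the actual analysis code.

```python
import numpy as np

def te_fixed_random(te, se):
    """Return (TE_fixed, TE_random) for per-study effects `te` with standard errors `se`.

    Skips confidence intervals, z-scores, etc. that the permutation analysis never uses.
    """
    w = 1.0 / se ** 2                          # inverse-variance weights
    te_fixed = np.sum(w * te) / np.sum(w)

    # DerSimonian-Laird estimate of between-study variance tau^2
    k = len(te)
    q = np.sum(w * (te - te_fixed) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    w_star = 1.0 / (se ** 2 + tau2)            # random-effects weights
    te_random = np.sum(w_star * te) / np.sum(w_star)
    return te_fixed, te_random
```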

dhimmel commented 8 years ago

Also consider the numba just-in-time compiler. It is super easy to use: just add a single decorator. It may not be able to JIT all the code, though. There is also a nogil=True option for numba.jit which releases the GIL, so you can use threading for concurrency.
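A small illustration of that usage on a toy numeric kernel (the function itself is hypothetical; only the decorator is the point):

```python
import numpy as np
from numba import jit

# nopython mode is required for nogil=True; the compiled function can then
# run in plain threads without holding the GIL.
@jit(nopython=True, nogil=True)
def weighted_mean(values, weights):
    total = 0.0
    wsum = 0.0
    for i in range(values.shape[0]):
        total += values[i] * weights[i]
        wsum += weights[i]
    return total / wsum

values = np.random.rand(1_000_000)
weights = np.random.rand(1_000_000)
print(weighted_mean(values, weights))
```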

Another option is to reduce the number of permutations and then fit an extreme value distribution to calculate a p-value. See this paper.
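A rough sketch of that idea, assuming scipy's generalized extreme value distribution is fit to the null statistics from a reduced set of permutations (the exact method in the paper may differ):

```python
import numpy as np
from scipy import stats

def evd_pvalue(observed_stat, null_stats):
    """Approximate a small p-value from relatively few permutation statistics."""
    # Fit a generalized extreme value distribution to the permutation nulls
    shape, loc, scale = stats.genextreme.fit(null_stats)
    # Tail probability of a statistic at least as extreme as the observed one
    return stats.genextreme.sf(observed_stat, shape, loc=loc, scale=scale)
```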

Suor commented 8 years ago

I already tried numba; so far it only makes things slower. nogil=True doesn't work with high-level code using pandas and numpy. I tried it, and the code becomes even uglier than in Cython. Also, keeping the GIL actually makes things faster, not slower, so there is no point in releasing it. The GIL only prevents using threads for the calculation, and processes can be used just fine.
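For the process-based route, a minimal sketch of fanning permutations out over cores with the standard library (the permutation function here is a placeholder, not the real analysis code):

```python
from multiprocessing import Pool

import numpy as np

def run_permutation(seed):
    # Placeholder: reshuffle labels with this seed and return the statistic
    rng = np.random.RandomState(seed)
    return rng.rand()

if __name__ == "__main__":
    with Pool() as pool:  # one worker per CPU core by default
        null_stats = pool.map(run_permutation, range(1000))
    print(len(null_stats), max(null_stats))
```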