nrnb / GoogleSummerOfCode

Main documentation site for NRNB GSoC project ideas and resources

netDx: Implement GeneMANIA in Julia to optimize for high-performance computing #70

Closed shraddhapai closed 7 years ago

shraddhapai commented 7 years ago

Background

netDx is a versatile patient classifier algorithm that integrates heterogeneous input data types (e.g. clinical and genomic) and uses machine learning to identify features predictive of patient class [1]. netDx models input data as patient networks and uses the GeneMANIA machine learning algorithm for network integration and feature selection [2]. Patient classes can be any category of biomedical interest, including disease subtypes, responsiveness to medication, or disease outcome.

Project Goal

Implement GeneMANIA in Julia to optimize netDx for high-performance computing

The current Java-based implementation of GeneMANIA scales poorly on compute clusters because of the way Java's memory management interacts with the architecture of these systems. Removing this bottleneck would allow netDx to handle datasets on the order of 10K-100K patients, 100-1000X larger than current datasets. This project proposes reimplementing the GeneMANIA algorithm in the Julia language [3].

Julia is a high-level programming language with syntax similar to Matlab and Python [3]. It provides efficient matrix representation and built-in parallel execution capabilities, making it better suited to high-performance computing (HPC). In addition to optimizing netDx for HPC, the new implementation will be tuned for problems specific to netDx, such as having relatively few nodes but many networks, and using lasso-based regression for feature selection.

GeneMANIA algorithm

The details of the GeneMANIA algorithm are in Ref 2. Briefly, GeneMANIA represents the entire set of input networks as a single matrix that can be thought of as a composite adjacency matrix.

  1. Input networks are integrated by ridge regression on the composite matrix. The cost function promotes networks with the highest same-class similarity, so that such networks have higher weights, and regularization eliminates redundant networks.
  2. Patients of unknown label are ranked by similarity to user-provided query nodes (“find patients like these known examples”). GeneMANIA performs label propagation on the integrated network from step 1, starting from query nodes and walking down nearest neighbours, until all connected patients have been encountered. This “guilt by association” approach is used to rank patients by query similarity.
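To make the two steps above concrete, here is a minimal sketch in plain Python (chosen only for illustration; it is neither the Java GeneMANIA code nor the proposed Julia port). It makes several simplifying assumptions: the network weights are hand-picked placeholders rather than learned by ridge regression, query patients are labelled 1 and all others 0, and propagation solves (I + L) f = y with an unnormalized Laplacian L = D - W by simple fixed-point iteration. All function names are hypothetical.

```python
def make_network(n, edges):
    """Build a symmetric n x n adjacency matrix from (i, j, weight) triples."""
    net = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        net[i][j] = w
        net[j][i] = w
    return net


def combine_networks(networks, weights):
    """Step 1 (simplified): weighted sum of the input patient networks.

    In GeneMANIA the weights come from ridge regression; here they are
    supplied by the caller as placeholders.
    """
    n = len(networks[0])
    combined = [[0.0] * n for _ in range(n)]
    for net, w in zip(networks, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * net[i][j]
    return combined


def propagate_labels(adj, query, n_iter=200):
    """Step 2 (simplified): solve (I + D - W) f = y by Jacobi-style iteration.

    Each update is f_i = (y_i + sum_j W_ij * f_j) / (1 + degree_i), which
    converges because the system is strictly diagonally dominant.
    """
    n = len(adj)
    y = [1.0 if i in query else 0.0 for i in range(n)]
    degree = [sum(row) for row in adj]
    f = y[:]
    for _ in range(n_iter):
        f = [
            (y[i] + sum(adj[i][j] * f[j] for j in range(n))) / (1.0 + degree[i])
            for i in range(n)
        ]
    return f


def rank_patients(scores, query):
    """Rank non-query patients from most to least similar to the query."""
    candidates = [i for i in range(len(scores)) if i not in query]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)


# Toy example: five patients, two networks, patient 0 as the only query node.
net_a = make_network(5, [(0, 1, 1.0), (1, 2, 1.0)])  # chain 0 - 1 - 2
net_b = make_network(5, [(3, 4, 1.0)])               # isolated pair 3 - 4
combined = combine_networks([net_a, net_b], [0.8, 0.2])
scores = propagate_labels(combined, query={0})
ranking = rank_patients(scores, query={0})
print(ranking)
```

In this toy example, patients 1 and 2 receive positive scores because they are reachable from the query, with the direct neighbour scoring higher, while patients 3 and 4 are disconnected from the query and keep a score of zero. A real implementation would operate on sparse matrices and use a faster linear solver.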

Technologies

The candidate must have sufficient programming knowledge to understand some details of the current Java-based implementation of GeneMANIA. Familiarity with Julia is an asset but not required. Julia combines features of Matlab and Python, so familiarity with these languages would be an asset.

Difficulty Level: 3

Familiarity with matrix-based operations and regression is required.

Potential mentors

Quaid Morris, Shraddha Pai, and Gary Bader

References

[1] Pai et al. (2016). Preprint. http://biorxiv.org/content/early/2016/10/31/084418

[2] Mostafavi et al. (2010). Bioinformatics. 26:1759. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2894508/

[3] http://julialang.org/

memoiry commented 7 years ago

Hi, I'm really interested in this project. Can you guide me on how to get started and learn more about it? I'm a junior majoring in automation.

memoiry commented 7 years ago

and ..

I have experience with Java and I have known about Julia for quite a long time. It's a language I find magical, but I have not yet had the chance to use it for a project. While I have not written a lot of Java code, I don't think I will have much trouble reading through the Java code.

I'm also quite familiar with Matlab, R, and C++, and know a little Python. Can I apply for this project?

Here is my blog: memoiry.me

Thanks!

shraddhapai commented 7 years ago

Hi,

Thanks for getting in touch; it's great that you're interested in the GeneMANIA implementation project!

The GSoC 2017 selection process is still underway and mentor groups will be announced on 27 February 2017. Our group, NRNB, has been selected several years in the past, so the chances of being selected this year are very good. After selection, we would be happy to work with interested applicants such as yourself, to develop a student proposal.

NRNB's GSoC page (http://nrnb.org/gsoc.html) has the timeline of key dates. Be sure to check the student guides under the "How to Apply" tab as well.

Your background and experience seem relevant; this project requires moderate expertise in Matlab, Python, and/or Julia, familiarity with Java, and comfort with the math involved. I looked at your blog but did not find a CV; it would be great if you could send me one in English, but that can also wait until the GSoC announcement is made at the end of February. Where are you based?

I encourage you to write back if you have any other questions. It's an exciting project, so we're looking forward to a chance to try it out!

Regards, Shraddha

Shraddha Pai Post-doctoral Fellow, http://baderlab.org The Donnelly Center, University of Toronto

memoiry commented 7 years ago

Hi

I have sent a draft proposal to your email.

I have also finished a label propagation implementation in Julia for our project: https://github.com/memoiry/labelPropagation.jl

Thanks for the detailed explanation.

shraddhapai commented 7 years ago

Hi Guodong, I got your message yesterday - great progress! I will get back to you on it separately.

Some context for the NRNB group: Guodong and I have been communicating roughly once every 1-2 weeks. He has expressed an interest in learning concepts related to the GeneMANIA implementation project, and I've suggested background reading. Together we have discussed the main steps the implementation would require and the technical considerations for each. Guodong has been quite diligent in following up, as evidenced by his work above.

Guodong is aware that organization selection for GSoC this year will occur at the end of Feb.

Regards, Shraddha

khanspers commented 7 years ago

GSoC 2017 selected project