nrnb / GoogleSummerOfCode

Main documentation site for NRNB GSoC project ideas and resources

Using Foundation Models to Infer Gene Regulatory Networks #238

Closed: n-tasnina closed this issue 5 months ago

n-tasnina commented 6 months ago

Background

In recent years, there has been a growing interest in developing generative pretrained models for single-cell biology, such as scGPT, scBERT, and Geneformer. While foundation models have demonstrated remarkable success in language and computer vision domains, their general applicability in biology remains uncertain. This project seeks to establish a framework for assessing the performance of foundation models in generating embeddings that are suitable for gene regulatory network (GRN) inference. Given the project's timeline, emphasis will be placed on evaluating scGPT, the most recent foundation model. The evaluation process will leverage BEELINE, a well-established benchmark for GRN inference models. By doing so, this project aims to establish the basis for a standardized approach for gauging the effectiveness of both current and future foundation models in GRN inference.

References:

  1. Cui, H., Wang, C., Maan, H., Pang, K., Luo, F., Duan, N. and Wang, B., 2024. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods, pp.1-11.
  2. Yang, F., Wang, W., Wang, F., Fang, Y., Tang, D., Huang, J., Lu, H. and Yao, J., 2022. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nature Machine Intelligence, 4(10), pp.852-866.
  3. Theodoris, C.V., Xiao, L., Chopra, A., Chaffin, M.D., Al Sayed, Z.R., Hill, M.C., Mantineo, H., Brydon, E.M., Zeng, Z., Liu, X.S. and Ellinor, P.T., 2023. Transfer learning enables predictions in network biology. Nature, 618(7965), pp.616-624.
  4. Pratapa, A., Jalihal, A.P., Law, J.N., Bharadwaj, A. and Murali, T.M., 2020. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nature Methods, 17(2), pp.147-154.

Goal

Steps:

  1. Given a cell-by-gene matrix containing the read counts of RNA molecules in each cell, compute gene embeddings using the pretrained scGPT model.
  2. Pick GRN inference methods included in BEELINE that do not require pseudotime information (e.g., GENIE3, GRNBoost2, and PIDC).
    • Infer a GRN with each method from the gene embeddings generated by scGPT.
    • Infer a GRN with each method from the original gene expression data.
  3. Compare these GRNs to ground-truth networks to gain insight into the utility of scGPT-derived gene embeddings relative to gene expression data (a minimal code sketch of this pipeline follows the list).
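
A minimal sketch of this pipeline is shown below (Python). The `get_scgpt_gene_embeddings` helper and the input file path are placeholders, not part of the scGPT release; GRNBoost2 is called here through the arboreto package.

```python
import anndata as ad
import pandas as pd
from arboreto.algo import grnboost2


def get_scgpt_gene_embeddings(adata: ad.AnnData) -> pd.DataFrame:
    """Placeholder: wrap the pretrained scGPT embedding-extraction code here.

    Expected to return a (genes x embedding_dim) DataFrame indexed by gene name.
    """
    raise NotImplementedError


# 1. Cell-by-gene read-count matrix (cells as rows, genes as columns).
adata = ad.read_h5ad("expression.h5ad")  # example path
counts = adata.X.toarray() if hasattr(adata.X, "toarray") else adata.X
expr = pd.DataFrame(counts, columns=adata.var_names)

# 2a. Infer a GRN from the raw expression data (GRNBoost2 via arboreto).
grn_expr = grnboost2(expression_data=expr)  # columns: TF, target, importance

# 2b. Infer a GRN from scGPT gene embeddings. One simple way to reuse the same
#     method is to treat embedding dimensions as pseudo-samples, i.e. pass an
#     (embedding_dim x genes) matrix in place of the (cells x genes) matrix.
gene_emb = get_scgpt_gene_embeddings(adata)
grn_emb = grnboost2(expression_data=gene_emb.T)

# 3. Rank edges by importance; these ranked lists are what get compared
#    against the reference network.
grn_expr = grn_expr.sort_values("importance", ascending=False)
grn_emb = grn_emb.sort_values("importance", ascending=False)
```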

Extension: If time allows, the project can evaluate additional foundation models for single-cell data, such as scBERT and Geneformer, thereby enabling a more comprehensive comparative analysis across single-cell foundation models.

Difficulty Level: Hard

Prospective contributors must possess or be willing to acquire a robust comprehension of gene regulatory networks. Additionally, they should demonstrate proficiency in effectively applying large language models (LLMs) in downstream tasks. Understanding the BEELINE framework is necessary for integrating their contributions into the existing framework.

Size and Length of Project

Large: ~350 hours

Skills

Essential skills:

Nice to have skills:

Public Repository

BEELINE

Potential Mentors

Nure Tasnina (tasnina@vt.edu), Blessy Antony (blessyantony9@gmail.com)

blessyantony9 commented 6 months ago

Hi @khanspers, thank you for reviewing the project. Could you please add me as an additional mentor to this project? Thank you!

khanspers commented 6 months ago

> Hi @khanspers, thank you for reviewing the project. Could you please add me as an additional mentor to this project? Thank you!

Hi @blessyantony9, I will respond by email.

blessyantony9 commented 6 months ago

Thank you very much!

FaizalJnu commented 6 months ago

Hello Mentors @blessyantony9 and @n-tasnina, I am a Computer Science undergraduate student from Jawaharlal Nehru University, Delhi. I have been going through the material provided on the subject extensively. I took a Biology course in my second semester and a Bioinformatics course in my previous semester, so I have a pretty decent undergraduate-level understanding of genes, single-cell biology, and genetics in general.

I am also proficient in using the Transformers library, Hugging Face, and the wide range of models available for various use cases such as semantic search, text generation, zero-shot classification, and more.

I also have plenty of experience with the PyTorch library, demonstrated through two of my previous projects:

  1. Self Driving Car
  2. Lunar Landing AI

I'll be devoting my time and effort to learning more about BEELINE and possibly coming up with a mini project of my own that solves the tasks given in the description.

FaizalJnu commented 6 months ago

Hello Mentors @blessyantony9 and @n-tasnina, so far I've been trying to implement the tasks provided and have run into a bit of a snafu; hopefully you'll be able to help me proceed.

Ques1: Since we have to compare the GRNs with the ground-truth networks, do we generate synthetic GRNs for the particular processes we're interested in, or are the files available?

Ques2: While trying to familiarize myself with scGPT, I realized we will need a significant amount of GPU compute for loading and training, particularly NVIDIA GPUs as noted in the scGPT documentation. So far I've been using Kaggle's free resources to work through the tutorials and better understand scGPT. Will we eventually require more than the available free resources?

Update: I've gone through all the tutorials in the scGPT GitHub documentation to better understand scGPT, .h5ad data files, embeddings as defined in our context, and GRNs in general. If there's anything else you'd like me to add to my personal training dataset (pun intended), please let me know!

n-tasnina commented 6 months ago

> Ques1: Since we have to compare the GRNs with the ground-truth networks, do we generate synthetic GRNs for the particular processes we're interested in, or are the files available?

> Ques2: While trying to familiarize myself with scGPT, I realized we will need a significant amount of GPU compute for loading and training, particularly NVIDIA GPUs as noted in the scGPT documentation. So far I've been using Kaggle's free resources to work through the tutorials and better understand scGPT. Will we eventually require more than the available free resources?

@FaizalJnu Thank you for showing interest in this project and all the work you have done so far.

Ideally, you should be able to use the simulated gene expression data (with reference GRNs) available in BEELINE. See the Input Datasets section here: https://murali-group.github.io/Beeline/BEELINE.html. Let us know if you have any further questions.

@khanspers Can you help me answer this question: will the contributors be provided with access to GPUs by NRNB or Google if required?

FaizalJnu commented 6 months ago

Hello @n-tasnina, Thank you for your input!

So I'll be using ExpressionData.csv as the cell-by-gene matrix dataset and sampleNetwork.csv as the ground-truth network dataset to draw the final comparisons from in this case, I presume? (Sorry if the doubts seem too rudimentary.)

I was also reading through a number of papers on LLM-genome relations to gain a stronger understanding of the concept. I came across this paper, which expanded my understanding considerably: https://arxiv.org/pdf/2311.07621.pdf. Do you have any similar works you can recommend?

n-tasnina commented 6 months ago

> Hello @n-tasnina, Thank you for your input!

> So I'll be using ExpressionData.csv as the cell-by-gene matrix dataset and sampleNetwork.csv as the ground-truth network dataset to draw the final comparisons from in this case, I presume? (Sorry if the doubts seem too rudimentary.)

@FaizalJnu You are right except for the reference file: it should be refNetwork.csv.
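
For a quick standalone sanity check (BEELINE's own evaluation code is what should ultimately be used and extended), something along the following lines works. It assumes the simulated-data layout where ExpressionData.csv is genes x cells and refNetwork.csv lists reference edges in Gene1/Gene2 columns, and that `grn_expr` and `grn_emb` are ranked edge lists with TF/target/importance columns, as in the pipeline sketch in the project description.

```python
import itertools

import pandas as pd
from sklearn.metrics import average_precision_score

expr = pd.read_csv("ExpressionData.csv", index_col=0)  # genes x cells
ref = pd.read_csv("refNetwork.csv")                    # reference GRN

genes = list(expr.index)
true_edges = set(zip(ref["Gene1"], ref["Gene2"]))


def auprc(pred: pd.DataFrame) -> float:
    """Area under the precision-recall curve for a ranked edge list
    with columns TF, target, importance."""
    scores = {(tf, tgt): imp
              for tf, tgt, imp in
              pred[["TF", "target", "importance"]].itertuples(index=False)}
    y_true, y_score = [], []
    for g1, g2 in itertools.permutations(genes, 2):  # all directed gene pairs
        y_true.append(1 if (g1, g2) in true_edges else 0)
        y_score.append(scores.get((g1, g2), 0.0))    # unpredicted pairs score 0
    return average_precision_score(y_true, y_score)


print("AUPRC, expression-based GRN:", auprc(grn_expr))
print("AUPRC, embedding-based GRN: ", auprc(grn_emb))
```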

n-tasnina commented 6 months ago

> I was also reading through a number of papers on LLM-genome relations to gain a stronger understanding of the concept. I came across this paper, which expanded my understanding considerably: https://arxiv.org/pdf/2311.07621.pdf. Do you have any similar works you can recommend?

@FaizalJnu You can go over the papers mentioned in the attached file (LLM_single_cell.pdf). You may want to prioritize the papers regarding LLMs on single-cell data.

Also, for future use, https://www.connectedpapers.com/ is a good site where you can find relevant papers given a certain paper.

FaizalJnu commented 6 months ago

Hey @n-tasnina this is great. Thank you so much!

khanspers commented 6 months ago

> Ques2: While trying to familiarize myself with scGPT, I realized we will need a significant amount of GPU compute for loading and training, particularly NVIDIA GPUs as noted in the scGPT documentation. So far I've been using Kaggle's free resources to work through the tutorials and better understand scGPT. Will we eventually require more than the available free resources?

> @khanspers Can you help me answer this question: will the contributors be provided with access to GPUs by NRNB or Google if required?

@FaizalJnu : Sorry for the late response. Regarding GPUs for computing, I'm afraid neither Google nor NRNB provides this.

FaizalJnu commented 6 months ago

@khanspers No worries then; there are numerous workarounds that are less convenient but can be managed.
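
For example, one simple habit is to keep the code device-agnostic so it runs on whatever free GPU, or the CPU, happens to be available:

```python
import torch

# Use a GPU if one is available (e.g., on Kaggle or Colab); otherwise fall
# back to the CPU. Embedding extraction with a pretrained checkpoint is
# inference-only, so it is typically much lighter than training.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```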

Also, @n-tasnina, I was working on the final draft of my proposal. Since it is advised to have your mentors go through it once, shall I forward the link here or email it to you?

n-tasnina commented 6 months ago

> Also, @n-tasnina, I was working on the final draft of my proposal. Since it is advised to have your mentors go through it once, shall I forward the link here or email it to you?

Please email your draft proposal to both mentors. We can communicate over email.

FaizalJnu commented 6 months ago

Hey @n-tasnina, I wanted to ask whether evaluating only scGPT will give us a suitable benchmark for developing the framework we seek. While I was working for a security-focused organisation, I was instructed to use as many models as possible to develop a similar evaluation framework for the security of different domains. As I understand it, the data entry points for generating GRNs for the final comparison are far more numerous than in my previous work, but it still seems like I'm missing something important that differentiates the two pieces of work.

n-tasnina commented 6 months ago

@FaizalJnu That's a good point. The first goal of this project is to build a framework/pipeline where anyone can plug gene embeddings generated by a foundation model into BEELINE and infer GRNs. For this, we will start with the most recent foundation model, scGPT. If time allows, a natural extension would be to incorporate other foundation models such as scBERT and Geneformer, as mentioned in the 'Goal: Extension' section of the project. This way, we can do a comparative analysis across foundation models in terms of GRN inference. I hope this answers your question.
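
To make the plug-in idea concrete, one possible shape for that interface is sketched below. The class and function names are illustrative only, not existing BEELINE or scGPT APIs.

```python
from abc import ABC, abstractmethod

import anndata as ad
import pandas as pd


class GeneEmbedder(ABC):
    """Illustrative plug-in point: any foundation model that can map a
    cell-by-gene AnnData object to per-gene embeddings implements this."""

    @abstractmethod
    def embed_genes(self, adata: ad.AnnData) -> pd.DataFrame:
        """Return a (genes x embedding_dim) DataFrame indexed by gene name."""


class ScGPTEmbedder(GeneEmbedder):
    def __init__(self, checkpoint_dir: str):
        # Path to a pretrained scGPT checkpoint; loading is model-specific.
        self.checkpoint_dir = checkpoint_dir

    def embed_genes(self, adata: ad.AnnData) -> pd.DataFrame:
        # Wrap the scGPT-specific embedding-extraction code here.
        raise NotImplementedError


def infer_grn_from_embeddings(embedder: GeneEmbedder, adata: ad.AnnData) -> pd.DataFrame:
    """scBERT or Geneformer would simply be additional GeneEmbedder
    subclasses; this BEELINE-facing code never needs to change."""
    from arboreto.algo import grnboost2  # treating embedding dims as pseudo-samples
    gene_emb = embedder.embed_genes(adata)
    return grnboost2(expression_data=gene_emb.T)
```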

FaizalJnu commented 6 months ago

Hey @n-tasnina, thanks! I'll add the extension points to my proposal then. Thank you!

khanspers commented 6 months ago

@FaizalJnu : Please note that you can update a draft proposal in the GSoC interface up until the deadline (April 2, 18:00 UTC), so it is encouraged that you submit your proposal well before the deadline. All proposals must be submitted via the GSoC interface; we cannot consider proposals sent to us by other means.

FaizalJnu commented 6 months ago

Yes ma'am (@khanspers), the proposal is almost finished and will be uploaded to the GSoC portal well before the deadline. Thank you for your reminder!

yeodynasty commented 6 months ago

@.*** & all,

This area of work is exactly what I intended to recruit students for. Since each of us is resource-stretched, may I suggest we coordinate to work on different models and pool our findings together for the benchmark?

Sincerely,
Hock Chuan
Senior Scientist, Bioinformatics Institute, Singapore


khanspers commented 5 months ago

This is an active GSoC 2024 project. Closing this project idea as it is no longer available to other contributors.