Closed n-tasnina closed 5 months ago
Hi @khanspers, thank you for reviewing the project. Could you please add me as an additional mentor to this project? Thank you!
Hi @blessyantony9, I will respond by email.
Thank you very much!
Hello Mentors @blessyantony9 and @n-tasnina, I am a Computer Science undergraduate student from Jawaharlal Nehru University, Delhi. I have been going through the material provided on the subject extensively. I took a Biology course in my second semester and a Bioinformatics course in my previous semester, so I have a fairly solid undergraduate-level understanding of genes, single-celled biology, and genetics in general.
I am also proficient with the Hugging Face transformers library and the extensive range of models available for use cases such as semantic search, text generation, zero-shot classification, and more.
I also have plenty of experience with the PyTorch library, demonstrated through two of my previous projects:
I'll be devoting my time and effort to learning more about BEELINE, and possibly to coming up with a mini project of my own that solves the tasks given in the description.
Hello Mentors @blessyantony9 and @n-tasnina, so far I've been trying to implement the tasks provided and have run into a bit of a snag; hopefully you'll be able to help me proceed.
Ques 1: Since we have to compare the inferred GRNs with the ground-truth networks, do we generate synthetic ground-truth networks for the particular processes we're interested in, or are the files already available?
Ques 2: While trying to familiarize myself with scGPT, I realized we will need a significant amount of GPU compute for training and loading, particularly NVIDIA GPUs as specified in the scGPT documentation. So far I've been using Kaggle's free resources to work through the tutorials and better understand scGPT. Will we eventually require more than the available free resources?
Update: I've gone through all the tutorials in the scGPT GitHub documentation to better understand scGPT, .h5ad data files, embeddings as defined in our context, and GRNs in general. If there's anything else you'd like me to add to my personal training dataset (pun intended), please let me know!
@FaizalJnu Thank you for showing interest in this project and all the work you have done so far.
Ideally, you should be able to use the simulated gene expression data (with reference GRNs) available in BEELINE. See the Input Datasets section here: https://murali-group.github.io/Beeline/BEELINE.html. Let us know if you have any further questions.
@khanspers Can you help me answer this question: will the contributors be provided with access to GPUs by NRNB or Google if required?
Hello @n-tasnina, Thank you for your input!
So in this case I presume I'll be using ExpressionData.csv as the cell-by-gene matrix and sampleNetwork.csv as the ground-truth network to draw the final comparisons from? (Sorry if the doubts seem too rudimentary.)
I was also reading through a number of papers on LLM–genome relations to gain a stronger understanding of the concept. I came across this paper, which helped expand my understanding considerably: https://arxiv.org/pdf/2311.07621.pdf. Do you have any more similar works you can recommend?
@FaizalJnu You are right except for the reference file: it should be refNetwork.csv.
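To make the data-loading step concrete, here is a minimal sketch of reading the two BEELINE files with pandas. The column names are assumed from the BEELINE documentation, and the in-memory CSVs below are made-up stand-ins for the real files that ship with the simulated datasets:

```python
import io
import pandas as pd

# Toy stand-ins for the real BEELINE files; values are made up
# purely for illustration.
expression_csv = io.StringIO(
    "Gene,Cell1,Cell2,Cell3\n"
    "g1,0.1,2.3,0.0\n"
    "g2,1.5,0.2,0.9\n"
)
ref_network_csv = io.StringIO(
    "Gene1,Gene2,Type\n"
    "g1,g2,+\n"
)

# ExpressionData.csv: genes as rows, cells as columns.
expr = pd.read_csv(expression_csv, index_col=0)
# refNetwork.csv: one (regulator, target) ground-truth edge per row.
ref = pd.read_csv(ref_network_csv)

print(expr.shape)  # (n_genes, n_cells)
print(len(ref))    # number of ground-truth edges
```

With the real files you would pass the file paths instead of the `StringIO` objects; the shapes and column layouts are what matter here.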
@FaizalJnu You can go over the papers mentioned in the attached file. You may want to prioritize the papers on LLMs for single-cell data. LLM_single_cell.pdf
Also, for future use, https://www.connectedpapers.com/ is a good site where you can find relevant papers given a certain paper.
Hey @n-tasnina this is great. Thank you so much!
@FaizalJnu : Sorry for the late response. Regarding GPUs for computing, I'm afraid neither Google nor NRNB provides this.
@khanspers no worries then, there are numerous workarounds that are less convenient but can be managed.
Also, @n-tasnina, I was working on the final draft of my proposal. Since it is advised to have your mentors go through it once, shall I forward the link here, or perhaps mail it to you?
Please email your draft proposal to both the mentors. We can communicate over email.
Hey @n-tasnina, I wanted to ask whether evaluating only scGPT will give us a suitable benchmark for the framework we seek to develop. While I was working for a security-based organisation, they instructed me to use as many models as possible to build a similar framework for evaluating the security of different domains. As I understand it, the data entry points for generating GRNs for the final comparison are far more numerous than in my previous work, but it still seems like I'm missing something important that differentiates the two pieces of work.
@FaizalJnu That's a good point. The first goal of this project is to build a framework/pipeline where anyone can plug in gene embeddings generated by a foundation model into BEELINE and infer GRNs. For this, we will start with the most recent foundation model, scGPT. If time allows, a natural extension of this project would be to incorporate other foundation models such as scBERT and Geneformer, as mentioned in the 'Goal: Extension' section of the project. This way we can do a comparative analysis across foundation models in terms of GRN inference. I hope this answers your question.
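As a toy sketch of the "pluggable embeddings" idea (all names and values here are hypothetical; in the real pipeline the embeddings would come from scGPT or another foundation model): score every directed gene pair by the cosine similarity of the genes' embeddings, then rank candidate regulatory edges by that score.

```python
import numpy as np

rng = np.random.default_rng(0)

genes = ["g1", "g2", "g3", "g4"]
# Hypothetical gene embeddings (genes x dimensions); a foundation
# model such as scGPT would supply these in the real pipeline.
emb = rng.normal(size=(len(genes), 8))

# Cosine similarity between every pair of gene embeddings.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T

# Rank directed candidate edges (i -> j, i != j) by similarity score.
edges = [
    (genes[i], genes[j], sim[i, j])
    for i in range(len(genes))
    for j in range(len(genes))
    if i != j
]
edges.sort(key=lambda e: e[2], reverse=True)
print(edges[:3])  # top-ranked candidate edges
```

The ranked edge list is exactly the format BEELINE's evaluators consume, which is what makes the embedding source swappable.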
Hey @n-tasnina, thanks! I'll add the extended points to my proposal then. Thank you!
@FaizalJnu : Please note that you can update a draft proposal in the GSoC interface up until the deadline (April 2, 18:00 UTC), so it is encouraged that you submit your proposal well before the deadline. All proposals must be submitted via the GSoC interface; we cannot consider proposals sent to us by other means.
Yes ma'am (@khanspers), the proposal is almost finished and will be uploaded to the GSoC portal well before the deadline. Thank you for the reminder!
@.*** & all,
This area of work is exactly what I intended to recruit students for. Since each of us is resource-stretched, may I suggest we coordinate to work on different models and pool our findings together for the benchmark?
Sincerely,
Hock Chuan
Senior Scientist, Bioinformatics Institute, Singapore
This is an active GSoC 2024 project. Closing this project idea as it is no longer available to other contributors.
Background
In recent years, there has been a growing interest in developing generative pretrained models for single-cell biology, such as scGPT, scBERT, and Geneformer. While foundation models have demonstrated remarkable success in language and computer vision domains, their general applicability in biology remains uncertain. This project seeks to establish a framework for assessing the performance of foundation models in generating embeddings that are suitable for gene regulatory network (GRN) inference. Given the project's timeline, emphasis will be placed on evaluating scGPT, the most recent foundation model. The evaluation process will leverage BEELINE, a well-established benchmark for GRN inference models. By doing so, this project aims to establish the basis for a standardized approach for gauging the effectiveness of both current and future foundation models in GRN inference.
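To make the evaluation step concrete, here is a minimal sketch (with made-up edges) of the kind of metric BEELINE reports: early precision, i.e. the fraction of the top-k predicted edges that appear in the reference network, with k equal to the number of ground-truth edges.

```python
# Predicted edges, ranked by inference score (made-up example).
predicted = [("g1", "g2"), ("g3", "g1"), ("g2", "g4"), ("g4", "g3")]
# Ground-truth edges from the reference network.
truth = {("g1", "g2"), ("g2", "g4")}

# Early precision: precision among the top-k predictions,
# where k = number of ground-truth edges.
k = len(truth)
hits = sum(1 for edge in predicted[:k] if edge in truth)
early_precision = hits / k
print(early_precision)  # 0.5 here: one of the top-2 edges is correct
```

A full evaluation would also report threshold-free metrics such as AUPRC over the whole ranked list, but early precision captures the core idea of comparing an inferred ranking against a reference GRN.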
References:
Goal
Steps:
Extension: If time allows, the project can evaluate additional foundation models on single-cell data, such as scBERT and Geneformer, thus performing a more comprehensive comparative analysis among single-cell foundation models.
Difficulty Level: Hard
Prospective contributors must possess or be willing to acquire a robust comprehension of gene regulatory networks. Additionally, they should demonstrate proficiency in effectively applying large language models (LLMs) in downstream tasks. Understanding the BEELINE framework is necessary for integrating their contributions into the existing framework.
Size and Length of Project
Large: ~350 hours
Skills
Essential skills:
Nice to have skills:
Public Repository
BEELINE
Potential Mentors
Nure Tasnina (tasnina@vt.edu), Blessy Antony (blessyantony9@gmail.com)