nrnb / GoogleSummerOfCode

Main documentation site for NRNB GSoC project ideas and resources

Generate Example Dataset for PyTorch Geometric Based on Pathway Commons and Prototype #217

Closed cannin closed 1 year ago

cannin commented 1 year ago

Background

Pathway Commons

Pathway Commons (http://pathwaycommons.org/) is an aggregated database of millions of molecular interactions. Data in Pathway Commons is stored in the XML-based BioPAX (http://biopax.org/) format and is aggregated from a collection of approximately 20 databases. Data from Pathway Commons is accessible here.

PyTorch & PyTorch Geometric

PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.

Goal

The two main goals are to format the Pathway Commons dataset for use with PyG and to develop a prototype example use case for the Pathway Commons dataset with PyG. Students will be evaluated more on code and documentation quality than on producing a very accurate model.
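As a rough illustration of the first goal, the sketch below turns a list of gene-gene interactions into the node and edge arrays that PyG expects. The function name `build_graph`, the single per-gene feature, and the choice to treat each interaction as undirected are assumptions made for illustration; they are not part of the project description.

```python
def build_graph(edges, node_scores):
    """Convert gene pairs into a node list, feature matrix, and COO edge index.

    edges: iterable of (gene_a, gene_b) pairs.
    node_scores: dict mapping gene -> float feature (hypothetical, e.g. derived
    from mutation annotations); genes without a score get 0.0.
    """
    edges = list(edges)
    genes = sorted({gene for pair in edges for gene in pair})
    index = {gene: i for i, gene in enumerate(genes)}

    sources, targets = [], []
    for a, b in edges:
        # Assumption: interactions are undirected, so store both directions.
        sources += [index[a], index[b]]
        targets += [index[b], index[a]]

    edge_index = [sources, targets]  # shape [2, num_edges], as PyG expects
    x = [[float(node_scores.get(gene, 0.0))] for gene in genes]
    return genes, x, edge_index


genes, x, edge_index = build_graph([("A1CF", "APOBEC1")], {"A1CF": 1.0})
```

With PyTorch and PyG installed, these lists could then be wrapped as `Data(x=torch.tensor(x, dtype=torch.float), edge_index=torch.tensor(edge_index, dtype=torch.long))` using `torch_geometric.data.Data`.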

Possible Prototype Example

The following are possibilities that would involve extending the Pathway Commons graph with genetic alteration information from cBioPortal.

How to Start

Interested applicants should:

  1. Download and install PyTorch and the PyG module: https://github.com/pyg-team/pytorch_geometric
  2. Download example PyG datasets from: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html
  3. Download Pathway Commons datasets from: https://www.pathwaycommons.org/archives/PC2/v12/
  4. Download alteration data for genes from: https://github.com/cBioPortal/datahub/tree/master/public; specifically, the folders labeled *_tcga_pan_can_atlas_2018 and the files named data_mutations.txt. For purposes of preparing a proposal, applicants can look at entries in the "SIFT" column; entries that are "deleterious" are bad, while "tolerated" is okay. Applicants are welcome to use other data in datahub if they have more expertise in genomic analysis.
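One possible way to look at the "SIFT" column from step 4 is sketched below. It assumes that data_mutations.txt is tab-separated, that the gene column is named `Hugo_Symbol`, that comment lines start with `#`, and that SIFT entries may carry a score suffix such as `deleterious(0.01)`. The sample rows and gene names are invented for illustration and should be checked against a real file.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented sample rows; real files contain many more columns.
SAMPLE_MAF = (
    "Hugo_Symbol\tTumor_Sample_Barcode\tSIFT\n"
    "GENE_A\tSAMPLE_1\tdeleterious(0.01)\n"
    "GENE_A\tSAMPLE_2\ttolerated(0.35)\n"
    "GENE_B\tSAMPLE_1\ttolerated(0.2)\n"
    "GENE_B\tSAMPLE_3\t\n"
)


def sift_label(value):
    """Map a raw SIFT entry to 'deleterious', 'tolerated', or None."""
    value = value.strip().lower()
    if value.startswith("deleterious"):
        return "deleterious"
    if value.startswith("tolerated"):
        return "tolerated"
    return None


def count_sift_by_gene(handle, gene_col="Hugo_Symbol", sift_col="SIFT"):
    """Count deleterious/tolerated calls per gene in a tab-separated file."""
    # Assumption: lines starting with '#' are comments and can be skipped.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = defaultdict(Counter)
    for row in reader:
        label = sift_label(row.get(sift_col) or "")
        if label is not None:
            counts[row[gene_col]][label] += 1
    return counts


counts = count_sift_by_gene(io.StringIO(SAMPLE_MAF))
```

For a downloaded file, the same function could be called on `open(path, encoding="utf-8")`; the per-gene counts are one candidate for a node feature or label in a classification prototype.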

Difficulty Level: Medium

Size and Length of Project

175 hours 12 weeks

Skills

Python (essential)

Public Repository

Potential Mentors

Augustin Luna ({firstname}{last_name} AT hms.harvard.edu)

arijitde92 commented 1 year ago

Hi @cannin, I would like to contribute to this project. I am working as a PhD student in the Medical Imaging Computer Vision domain. I am familiar with genomic data and have built deep learning solutions to classify brain tumors using PyTorch and PyG (published in ICPR 2022, check here). Also check my website here: https://arijitde92.github.io/. I think I am capable of solving this task and would like to discuss this project further. Please guide me on how to proceed. Thanks.

glunkad commented 1 year ago

Hi @cannin, I’m very interested in the GSoC 2023 Project Idea: Generate Example Dataset for PyTorch Geometric Based on Pathway Commons and Prototype. Can you provide more details on the current state of the project? Also, how will the project’s success be measured, and what skills or experience are required for a successful contribution to this project? I’m eager to learn more and potentially contribute to this exciting project!

cannin commented 1 year ago

@glunkad Thanks. You would be starting this project with PyTorch, built around existing resources from Pathway Commons and cBioPortal.

khanspers commented 1 year ago

NRNB has been accepted as a mentoring organization for GSoC 2023! Contributor applications open on March 20. Here are some useful links:

GSoC contributor guide, NRNB project proposal template, eligibility requirements, full program timeline

RishithaR-388 commented 1 year ago

Hello @cannin, I am Rishitha Reddy, a 3rd-year UG student at IIT Bhilai, India, in the DSAI discipline. I am proficient in Python, machine learning, TensorFlow, NumPy, pandas, and other deep learning frameworks. I have done many projects using deep learning neural networks and transfer learning. One of my projects is facial expression recognition using deep learning and OpenCV. I came across the project Generate Example Dataset for PyTorch Geometric Based on Pathway Commons and Prototype and would like to contribute to this project in GSoC 2023. Could you please guide me?

Looking forward to contributing.

Favourj-bit commented 1 year ago

Hello @cannin, I hope this message finds you well. My name is Favour James, and I recently came across your project on the list of ideas for GSoC. I am particularly interested in this issue and I would like to contribute to this project during the GSoC program.

Before I submit my application, I have two questions that I hope you could clarify:

  1. May I begin working on the Jupyter issue before the official application period begins?
  2. How can I contribute to the NRNB organization during this interim period while I await the opening of the official application period?

I would be grateful if you could provide guidance on these matters, as well as any additional information that could aid me in my contribution to your project.

Thank you for your time and attention, and I look forward to hearing back from you soon.

cannin commented 1 year ago

@Favourj-bit if you would like to apply, first look at the "How to Start" section bullet points and start working on a proposal; see links from @khanspers.

Favourj-bit commented 1 year ago

@cannin Thanks so much for the response. I have done everything in the "How to Start" section and I'm trying out the tutorial in PyG to get a better understanding of torch-geometric, as I am not too conversant with GNNs. However, I also wanted to inform you that I did not see the "SIFT" column. Another question: for the datasets to be downloaded from Pathway Commons, can I download any one of them, or is there a recommended one? Thanks once again.

cannin commented 1 year ago

@Favourj-bit Which dataset did not have the SIFT column? You can use the Reactome dataset if you want something smaller (they should all have the same data format).

Favourj-bit commented 1 year ago

@cannin I downloaded this for all the *_tcga_pan_can_atlas_2018 datasets. Am I supposed to check through them all for the SIFT column? I don't really understand what I am to do with the SIFT column. Also, I did not see the Reactome dataset in the datahub.

daniel-unyi-42 commented 1 year ago

Hello! I'd be happy to contribute to this project. I'm a PhD student doing research in graph-based deep learning. I regularly use PyG and contributed a few times to it.

I built a PyG dataset using Pathway Commons, with additional node information from the SIFT column of the acc_tcga_pan_can_atlas_2018 mutations file. I'm not sure yet how the graph information could help in cancer classification**, but it is certainly a direction worth thinking about!

If you're interested in accepting me, I can send you a draft proposal tomorrow.

**Edit: after some searching and finding papers like this, now it's clear :)

Best, Daniel

Favourj-bit commented 1 year ago

@cannin Hi, I tried figuring out a way to format the Pathway Commons dataset. I wanted to confirm something: is the data in the BioPAX format the combination of all the other types of data? For example, PathwayCommons12.reactome.BIOPAX.owl.gz has 4 other formats. I downloaded the one in txt format, PathwayCommons12.reactome.hgnc.txt.gz; however, I noticed that the pathway_names column was fully empty when I was going through it in my notebook. I just need to confirm whether this is right. Thank you.

cannin commented 1 year ago

@Favourj-bit

1) Reactome is a pathway data set; it will not appear in the datahub for TCGA data.

2) The project description states: 'entries that are "deleterious" are bad, while "tolerated" is okay' and that should be enough to think of SIFT columns as variable for a classification analysis. Google searches like: "sift" tcga column reveal documentation pages like: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Exploration/ with additional information about SIFT.

3) BioPAX is the main dataset from which the others are generated.

4) For the first line in the PathwayCommons12.reactome.hgnc.txt.gz file:

A1CF in-complex-with APOBEC1 Reactome Formation of the Editosome;mRNA Editing: C to U Conversion http://pathwaycommons.org/pc12/Complex_5987964ecf942175a932619f46670bb9;http://pathwaycommons.org/pc12/Complex_e45b2db87badb1968a732e508e6fe5d8

There are two pathways: 1) Formation of the Editosome and 2) mRNA Editing: C to U Conversion. The first 8 interactions have the same value for pathway name; you are likely reading the file incorrectly.

Unless you already understand how to parse OWL files, I would not work with that file for your proposal. Start with the tabular hgnc.txt.gz file.
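For reference, a minimal sketch of reading the tabular file is shown below. The column names in the header, the empty PubMed field in the sample row, and the idea that the edge table ends at the first blank line are assumptions about the file layout and should be verified against the downloaded file; `read_edges` and `read_edges_from_gz` are hypothetical helper names.

```python
import gzip
import io

# Assumed header plus the first data line quoted above, joined with tabs.
SAMPLE = (
    "PARTICIPANT_A\tINTERACTION_TYPE\tPARTICIPANT_B\tINTERACTION_DATA_SOURCE\t"
    "INTERACTION_PUBMED_ID\tPATHWAY_NAMES\tMEDIATOR_IDS\n"
    "A1CF\tin-complex-with\tAPOBEC1\tReactome\t\t"
    "Formation of the Editosome;mRNA Editing: C to U Conversion\t"
    "http://pathwaycommons.org/pc12/Complex_5987964ecf942175a932619f46670bb9;"
    "http://pathwaycommons.org/pc12/Complex_e45b2db87badb1968a732e508e6fe5d8\n"
)


def read_edges(handle):
    """Read tab-separated interaction rows into a list of dicts."""
    header = handle.readline().rstrip("\n").split("\t")
    edges = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumption: a blank line marks the end of the edge table.
            break
        values = line.split("\t")
        # Pad short rows so that every header column gets a value.
        values += [""] * (len(header) - len(values))
        row = dict(zip(header, values))
        # Pathway names are separated by semicolons within one field.
        names = row.get("PATHWAY_NAMES", "")
        row["PATHWAY_NAMES"] = [name for name in names.split(";") if name]
        edges.append(row)
    return edges


def read_edges_from_gz(path):
    """Same as read_edges, but for a gzip-compressed file on disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return read_edges(handle)


edges = read_edges(io.StringIO(SAMPLE))
```

Splitting on tabs only, rather than on any whitespace, keeps multi-word pathway names and empty fields in their own columns, which is one way a pathway name column can end up looking empty if it is missed.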

cannin commented 1 year ago

@daniel-unyi-42 If you want comments on your proposal, you can send it to me by email. If you are done, you can submit it to GSoC. Proposals will be reviewed by several people; GSoC contributors are not accepted by a single person.

Favourj-bit commented 1 year ago

Hi @cannin, I have sent my written proposal to your email and I humbly request a review from you as the mentor. Thank you in advance.

khanspers commented 1 year ago

This project is an active GSoC 2023 project. Closing this issue because it is no longer available for other contributors/students.