Closed cannin closed 1 year ago
Hi @cannin, I would like to contribute to this project. I am working as a PhD student in the Medical Imaging Computer Vision domain. I am familiar with genomic data and have built Deep Learning solutions to classify brain tumors using PyTorch and PyG (published in ICPR 2022, check here). Also check my website here (https://arijitde92.github.io/). I think I am capable of solving this task and would like to discuss more about this project. Please guide me on how to proceed. Thanks.
Hi @cannin, I’m very interested in the GSoC 2023 Project Idea: Generate Example Dataset for PyTorch Geometric Based on Pathway Commons and Prototype. Can you provide more details on the current state of the project? Also, how will the project’s success be measured, and what skills or experience are required for a successful contribution to this project? I’m eager to learn more and potentially contribute to this exciting project!
@glunkad Thanks. You would be starting this project with PyTorch, building around existing resources from Pathway Commons and cBioPortal.
NRNB has been accepted as a mentoring organization for GSoC 2023! Contributor applications open on March 20. Here are some useful links:
GSoC contributor guide
NRNB project proposal template
Eligibility requirements
Full program timeline
Hello @cannin, I am Rishitha Reddy, a 3rd-year UG student in the DSAI discipline at IIT Bhilai, India. I am proficient in Python, machine learning, TensorFlow, NumPy, pandas, and other deep learning frameworks. I have done many projects using deep learning neural networks and transfer learning. One of my projects is facial expression recognition using deep learning and OpenCV. I came across the project Generate Example Dataset for PyTorch Geometric Based on Pathway Commons and Prototype and would like to contribute to this project in GSoC 2023. Could you please guide me?
Looking forward to contributing.
Hello @cannin, I hope this message finds you well. My name is Favour James, and I recently came across your project on the list of ideas for GSoC. I am particularly interested in this issue and I would like to contribute to this project during the GSoC program.
Before I submit my application, I have two questions that I hope you could clarify:
Thank you for your time and attention, and I look forward to hearing back from you soon.
@Favourj-bit if you would like to apply, first look at the "How to Start" section bullet points and start working on a proposal; see links from @khanspers.
@cannin thanks so much for the response. I have done everything in the How to Start section and I'm trying out the tutorial in PyG to get a better understanding of torch-geometric, as I am not too conversant with GNNs. However, I also wanted to inform you that I did not see the 'SIFT' column. Another question: for the datasets to be downloaded from Pathway Commons, can I download any random one, or is there a recommended one? Thanks once again.
@Favourj-bit which dataset did not have the SIFT column? You can use the Reactome dataset if you want something smaller (they should all have the same data format).
@cannin I downloaded all the *_tcga_pan_can_atlas_2018 datasets. Am I supposed to check through them all for the SIFT column? I don't really understand what I am to do with the SIFT column. Also, I did not see the Reactome dataset in the datahub.
Hello! I'd be happy to contribute to this project. I'm a PhD student doing research in graph-based deep learning. I regularly use PyG and contributed a few times to it.
I built a PyG dataset using Pathway Commons, with additional node information from the SIFT column of the acc_tcga_pan_can_atlas_2018 mutations file. I'm not sure yet how the graph information could help in cancer classification**, but it is certainly a direction worth thinking about!
If you're interested in accepting me, I can send you a draft proposal tomorrow.
**Edit: after some searching and finding papers like this, now it's clear :)
Best, Daniel
@cannin hi, I tried figuring out a way to format the Pathway Commons dataset and wanted to confirm a few things. Is the data in the BioPAX format the combination of all the other types of data? For example, [PathwayCommons12.reactome.BIOPAX.owl.gz] has 4 other formats. I downloaded the one in txt format, [PathwayCommons12.reactome.hgnc.txt.gz]. However, I noticed that the pathway_names column was completely empty when I was going through it in my notebook. I just need to confirm whether this is right. Thank you.
@Favourj-bit
1) Reactome is a pathway data set; it will not appear in the datahub for TCGA data.
2) The project description states: 'entries that are "deleterious" are bad, while "tolerated" is okay' and that should be enough to think of the SIFT column as a variable for a classification analysis. Google searches like: "sift" tcga column reveal documentation pages like: https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Exploration/ with additional information about SIFT.
3) BioPAX is the main dataset from which the others are generated.
4) For the first line in the PathwayCommons12.reactome.hgnc.txt.gz file:
A1CF in-complex-with APOBEC1 Reactome Formation of the Editosome;mRNA Editing: C to U Conversion http://pathwaycommons.org/pc12/Complex_5987964ecf942175a932619f46670bb9;http://pathwaycommons.org/pc12/Complex_e45b2db87badb1968a732e508e6fe5d8
There are two pathways: 1) Formation of the Editosome and 2) mRNA Editing: C to U Conversion. The first 8 interactions have the same value for pathway name; you are likely reading the file incorrectly.
Unless you already understand how to parse OWL files, I would not work with that file for your proposal. Start with the tabular hgnc.txt.gz file.
@daniel-unyi-42 if you want comments on your proposal, you can send it to me by email. If you are done, you can submit it to GSoC. Proposals will be reviewed by several people; GSoC contributors are not accepted by a single person.
Hi @cannin, I have sent my written proposal to your email and I humbly request a review from you as the mentor. Thank you in advance.
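A minimal sketch of reading the tabular file, for anyone hitting the same empty pathway column. It assumes the file is tab-separated with the seven columns named below (the names follow the Pathway Commons extended SIF layout as I understand it and should be checked against the file header), that the PubMed ID field may be empty, and that a blank line ends the interaction section. The function names are made up for illustration. Splitting on arbitrary whitespace instead of tabs shifts the fields, which is one way to end up with a pathway column that looks empty.

```python
import csv
import gzip
import io

# Assumed column order; verify against the header line of the actual file.
COLUMNS = [
    "PARTICIPANT_A", "INTERACTION_TYPE", "PARTICIPANT_B",
    "INTERACTION_DATA_SOURCE", "INTERACTION_PUBMED_ID",
    "PATHWAY_NAMES", "MEDIATOR_IDS",
]


def parse_interactions(lines):
    """Parse tab-separated interaction rows into a list of dicts."""
    rows = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or not any(fields):
            break  # assumption: a blank line ends the interaction section
        if fields[0] == "PARTICIPANT_A":
            continue  # skip the header line if present
        fields += [""] * (len(COLUMNS) - len(fields))  # pad short rows
        row = dict(zip(COLUMNS, fields))
        # Multiple pathway names are separated by semicolons.
        row["PATHWAY_NAMES"] = [p for p in row["PATHWAY_NAMES"].split(";") if p]
        rows.append(row)
    return rows


def read_hgnc_file(path):
    """Read a gzipped hgnc.txt.gz file from a local path."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_interactions(handle)


# The first line quoted above, with tabs restored and an empty PubMed field.
sample = (
    "A1CF\tin-complex-with\tAPOBEC1\tReactome\t\t"
    "Formation of the Editosome;mRNA Editing: C to U Conversion\t"
    "http://pathwaycommons.org/pc12/Complex_5987964ecf942175a932619f46670bb9;"
    "http://pathwaycommons.org/pc12/Complex_e45b2db87badb1968a732e508e6fe5d8\n"
)
rows = parse_interactions(io.StringIO(sample))
print(rows[0]["PATHWAY_NAMES"])
```

For the real file, `read_hgnc_file("PathwayCommons12.reactome.hgnc.txt.gz")` would return the same structure, assuming the file has been downloaded to the working directory.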
This project is an active GSoC 2023 project. Closing this issue because it is no longer available for other contributors/students.
Background
Pathway Commons
Pathway Commons (http://pathwaycommons.org/) is an aggregated database of millions of molecular interactions. Data in Pathway Commons is stored in the XML-based BioPAX (http://biopax.org/) format and is aggregated from a collection of approximately 20 databases. Data from Pathway Commons is accessible here.
PyTorch & PyTorch Geometric
PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.
Goal
The two main goals are to format the Pathway Commons dataset for use with PyG and to develop a prototype example use case for the Pathway Commons dataset with PyG. Students will be evaluated more on code and documentation quality than on producing a very accurate model.
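As a rough illustration of the formatting goal, the sketch below turns gene-gene interaction pairs into the COO-style edge index that PyG expects: two parallel lists of source and target node indices. It is a plain-Python sketch under assumptions, not the project's implementation: the function name is hypothetical, edges are treated as undirected, and `GENE_C` is a made-up gene used only for the example.

```python
def build_edge_index(interactions):
    """Map gene names to integer node ids and build a COO edge index.

    interactions: iterable of (gene_a, gene_b) pairs.
    Returns (node_index, edge_index), where edge_index is
    [source_ids, target_ids].
    """
    node_index = {}
    sources, targets = [], []
    for gene_a, gene_b in interactions:
        for gene in (gene_a, gene_b):
            # Assign the next free integer id the first time a gene is seen.
            node_index.setdefault(gene, len(node_index))
        a, b = node_index[gene_a], node_index[gene_b]
        # Assumption: interactions are undirected, so add both directions.
        sources += [a, b]
        targets += [b, a]
    return node_index, [sources, targets]


pairs = [("A1CF", "APOBEC1"), ("A1CF", "GENE_C")]  # GENE_C is illustrative
node_index, edge_index = build_edge_index(pairs)
print(node_index)
print(edge_index)

# With torch and torch_geometric installed, the result could be wrapped as:
#   import torch
#   from torch_geometric.data import Data
#   data = Data(edge_index=torch.tensor(edge_index, dtype=torch.long),
#               num_nodes=len(node_index))
```

Keeping the name-to-index mapping alongside the edge index makes it possible to attach per-gene features or labels to the right nodes later.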
Possible Prototype Example
The following are possibilities that would involve extending the Pathway Commons graph with genetic alteration information from cBioPortal.
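One such extension could be sketched as follows: deriving a per-gene label from SIFT predictions in a cBioPortal mutations file, which could then be attached to graph nodes. The sketch assumes SIFT values look like `deleterious(0.01)` or `tolerated(0.3)` (a prediction followed by an optional score in parentheses); that format, the function names, and the choice to treat low-confidence predictions the same as confident ones are all assumptions to check against the actual data.

```python
import re

# Assumed value format: "<prediction>(<score>)", with the score optional.
SIFT_PATTERN = re.compile(r"^\s*([a-z_]+)\s*(?:\(([\d.eE+-]+)\))?\s*$")


def sift_label(value):
    """Return 1 for deleterious, 0 for tolerated, None if missing or unknown."""
    if not value:
        return None
    match = SIFT_PATTERN.match(value.lower())
    if match is None:
        return None
    prediction = match.group(1)
    # Design choice: "*_low_confidence" variants fall into the same class.
    if prediction.startswith("deleterious"):
        return 1
    if prediction.startswith("tolerated"):
        return 0
    return None


def gene_labels(mutations):
    """Collapse (gene, sift_value) pairs to one label per gene.

    A gene is labeled 1 if any of its mutations is deleterious, else 0.
    Mutations without a usable SIFT value are skipped.
    """
    labels = {}
    for gene, value in mutations:
        label = sift_label(value)
        if label is None:
            continue
        labels[gene] = max(labels.get(gene, 0), label)
    return labels


# Illustrative records, not taken from a real study.
example = [
    ("A1CF", "tolerated(0.3)"),
    ("A1CF", "deleterious(0.01)"),
    ("APOBEC1", "tolerated(0.5)"),
    ("GENE_C", ""),
]
print(gene_labels(example))
```

Genes that never receive a usable SIFT value are left out of the result, so a downstream model needs an explicit policy for unlabeled nodes.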
How to Start
Interested applicants should:
Difficulty Level: Medium
Size and Length of Project
175 hours, 12 weeks
Skills
Python (essential)
Public Repository
Potential Mentors
Augustin Luna ({firstname}{last_name} AT hms.harvard.edu)