nrnb / GoogleSummerOfCode

Main documentation site for NRNB GSoC project ideas and resources

Develop a Generic Converter for Importing Data into Pathway Commons #88

Closed (cannin closed this issue 4 years ago)

cannin commented 7 years ago

Background

Pathway Commons is an aggregated database of molecular interactions with a web service that receives millions of hits per year. Data in Pathway Commons is stored in BioPAX, an XML-based format. The data is currently aggregated from approximately 20 databases, and we would like to keep expanding this collection. At present, adding a database involves creating a custom converter for it in Java, which may be difficult for data providers who do not work with Java.

The converters make use of the Java Paxtools library to create BioPAX output.

Goal

The goal of this project is to make it easier to aggregate additional databases into Pathway Commons by building a generic converter that can be used more widely. The project is to generalize an existing converter, written in a previous GSoC, so that it supports an input format that is easier to understand. Below is an example of the data format that would be the input to the converter.

Example Input Dataset

Data from this file: http://www.pathwaycommons.org/archives/PC2/current/PathwayCommons.8.transfac.EXTENDED_BINARY_SIF.hgnc.txt.gz

NOTE: Not all columns have data.

Interaction Information

PARTICIPANT_A  INTERACTION_TYPE        PARTICIPANT_B  INTERACTION_DATA_SOURCE  INTERACTION_PUBMED_ID  PATHWAY_NAMES
AHR            controls-expression-of  ABCB6          TRANSFAC                                        V$AHR_Q5
AHR            controls-expression-of  ABI2           TRANSFAC                                        V$AHR_Q5
AHR            controls-expression-of  ABTB2          TRANSFAC                                        V$AHRARNT_01

Entity Information

PARTICIPANT  PARTICIPANT_TYPE  PARTICIPANT_NAME  UNIFICATION_XREF  RELATIONSHIP_XREF
HTR4         RnaReference      HTR4              ncbi gene:3360    uniprot:Q13639
CRHBP        RnaReference      CRHBP             ncbi gene:1393    uniprot:P24387
HTR7         RnaReference      HTR7              ncbi gene:3363    uniprot:P34969
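
To make the shape of this input concrete, here is a minimal Java sketch of how the two sections in the example above could be read into rows keyed by column name. It is only an illustration, not code from the existing converter: the class name `ExtendedSifSketch` is hypothetical, and it assumes that columns are separated by tab characters and that each section starts with a header line whose first column is `PARTICIPANT_A` (interactions) or `PARTICIPANT` (entities).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of Paxtools or the existing converter. */
class ExtendedSifSketch {
    final List<Map<String, String>> interactions = new ArrayList<>();
    final List<Map<String, String>> entities = new ArrayList<>();

    /** Parses tab-delimited text containing an interaction section and an entity section. */
    static ExtendedSifSketch parse(String text) {
        ExtendedSifSketch result = new ExtendedSifSketch();
        String[] header = null;
        List<Map<String, String>> target = null;
        for (String line : text.split("\r?\n")) {
            if (line.trim().isEmpty()) {
                continue; // blank lines carry no data
            }
            // Limit -1 keeps empty trailing columns ("Not all columns have data").
            String[] cells = line.split("\t", -1);
            // Assumption: a header row is recognised by its first column name.
            if (cells[0].equals("PARTICIPANT_A")) {
                header = cells;
                target = result.interactions;
                continue;
            }
            if (cells[0].equals("PARTICIPANT")) {
                header = cells;
                target = result.entities;
                continue;
            }
            if (header == null) {
                throw new IllegalArgumentException("Data row before any header: " + line);
            }
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < header.length; i++) {
                // Missing cells are stored as empty strings.
                row.put(header[i], i < cells.length ? cells[i] : "");
            }
            target.add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "PARTICIPANT_A\tINTERACTION_TYPE\tPARTICIPANT_B\tINTERACTION_DATA_SOURCE\tINTERACTION_PUBMED_ID\tPATHWAY_NAMES",
            "AHR\tcontrols-expression-of\tABCB6\tTRANSFAC\t\tV$AHR_Q5",
            "",
            "PARTICIPANT\tPARTICIPANT_TYPE\tPARTICIPANT_NAME\tUNIFICATION_XREF\tRELATIONSHIP_XREF",
            "HTR4\tRnaReference\tHTR4\tncbi gene:3360\tuniprot:Q13639");
        ExtendedSifSketch parsed = parse(sample);
        System.out.println(parsed.interactions);
        System.out.println(parsed.entities);
    }
}
```

Keying each row by its header name, rather than by column position, is one way to keep the reader independent of any particular database's column order.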

Description

  1. Get familiar with the existing project from the references; in particular, check out Goal 5, the MSigDB converter, which handles a tab-delimited format.
  2. Get familiar with the basic BioPAX concepts that you see in the MSigDB converter.
  3. Work on refactoring the code so that the hard-coded BioPAX concepts are instead extracted from the input data.
  4. Test the code on similar datasets.
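
As a rough illustration of step 3, the sketch below shows one way a type name read from the input (for example the `PARTICIPANT_TYPE` column) could select what gets created, instead of a type fixed in the code. Everything here is hypothetical: `TypeLookupSketch` and its `Element` class are stand-ins so that the example runs without Paxtools, and the three registered names are BioPAX Level 3 class names chosen only as examples. A real converter would create Paxtools objects at the marked point.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; the existing converter may be structured differently. */
class TypeLookupSketch {

    /** Stand-in for a BioPAX element; a real converter would use Paxtools classes. */
    static class Element {
        final String type;
        final String id;

        Element(String type, String id) {
            this.type = type;
            this.id = id;
        }
    }

    // Registry from a type name found in the input to a constructor for that type.
    private static final Map<String, Function<String, Element>> REGISTRY = new HashMap<>();

    static {
        String[] supported = {"RnaReference", "ProteinReference", "SmallMoleculeReference"};
        for (String type : supported) {
            // In a real converter this is where the Paxtools object would be created.
            REGISTRY.put(type, id -> new Element(type, id));
        }
    }

    /** Creates an element whose type comes from the data rather than from the code. */
    static Element create(String participantType, String id) {
        Function<String, Element> constructor = REGISTRY.get(participantType);
        if (constructor == null) {
            throw new IllegalArgumentException("Unsupported PARTICIPANT_TYPE: " + participantType);
        }
        return constructor.apply(id);
    }
}
```

Rejecting unknown type names with an explicit error, rather than falling back to a default, makes it easier to spot datasets that need a new registry entry.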

Skills

Java, XML

Difficulty level 2

This project makes use of pre-existing code, so the participant would not be starting from scratch.

Public Repository

Project Code for Work on the Pathway Database Converters for the Expansion of Pathway Commons

Potential mentors

Augustin Luna (aluna@jimmy.harvard.edu)

References

msc-jinal commented 7 years ago

Hi, I looked at a number of pathway databases that provide biological pathways. I also looked at http://www.pathwaycommons.org/pc2/datasources and http://cpdb.molgen.mpg.de/, where users can find all the integrated pathways.

My question is about the Example Input Dataset provided above. Is the input from all pathway databases similar to this example input dataset? If not, why should we consider this example input file?

Regards, Jinal

AdrianBZG commented 7 years ago

Hi @cannin ,

I'm a senior Computer Science university student from Spain, and I am interested in this project for GSoC'17.

I have been reading through the code at [1], as well as the given references, but I still have some questions:

  1. With the "Example Input Dataset" given in the project description, what should the output be?

  2. If I understand correctly, the main idea for making this a generic converter is the 3rd point of the description (refactoring the code so that hard-coded BioPAX concepts are extracted from the input data), so the bulk of the work will be there. Am I right?

[1] https://bitbucket.org/armish/gsoc14/src/e4a1f83527c2d55d34c2b4bda9d6e864741b93e6/Goal5-MSigDB/msigdb2biopax/src/com/google/gsoc14/msigdb2biopax/converter/MSigDB2BioPAXConverter.java?at=default&fileviewer=file-view-default

Thank you!