nrnb / GoogleSummerOfCode

Main documentation site for NRNB GSoC project ideas and resources

Conversion between GenBank and SBOL3 #183

Closed jakebeal closed 2 years ago

jakebeal commented 2 years ago

Background

SBOL3 can currently be converted to GenBank, and GenBank to SBOL3, only by going through SBOL2 as an intermediate step. We would like to be able to convert directly between the two formats. This would be implemented as part of sbol-utilities using BioPython and pySBOL3.

Goal

Direct conversion of a set of test GenBank files that is equivalent to the existing two-step conversion.

Difficulty Level: Easy

There is a well-defined and existing two-step conversion, and the project just needs to build an equivalent direct conversion.

Size and Length of Project

Skills

Essential skills: Python

Will be learned if not known: SBOL, BioPython

Public Repository

https://github.com/SynBioDex/SBOL-utilities

Potential Mentors

jakebeal@ieee.org, tom.mitchell@raytheon.com, Bryan.A.Bartley@raytheon.com, Chris.Myers@colorado.edu

Gonza10V commented 2 years ago

Hi @jakebeal, I'm Gonzalo Vidal, a PhD candidate in biological and medical engineering from Chile. I have 3 years of experience in Python and 1 in SBOL. I would like to contribute to this project for GSoC 2022. Any guidance on where to begin and where I can learn Biopython would be encouraging and helpful.

jakebeal commented 2 years ago

Hi @Gonza10V: I'd be happy to supervise you on this project. If you want to get started playing with BioPython, I would suggest: 1) looking at how it's already used in SBOL-utilities, and 2) spending some time with the BioPython Cookbook.

tcmitchell commented 2 years ago

@ArchitJain1201 also expressed interest in this project. I sent the following background information in response to an email from @ArchitJain1201 requesting suggestions for where to begin. I am posting it here so others can clarify, elaborate, or correct this response, as well as for the benefit of others who might be interested in working on this task.

My reply:

See https://github.com/SynBioDex/SBOL-utilities

That repository is a collection of utility programs for SBOL, particularly SBOL3.

In the file sbol_utilities/conversion.py you will find two functions: convert_from_genbank and convert_to_genbank.

convert_to_genbank currently works by converting SBOL3 files to SBOL2 files, then uploading the files to an online SBOL2-to-GenBank converter. convert_from_genbank goes the opposite way, converting GenBank to SBOL2 and then SBOL2 to SBOL3. It's a lossy process in both directions.

What is desired in a conversion between GenBank and SBOL3 is a more direct conversion, and one entirely written in Python so that it can be run locally, without the need for an online converter, and without the need to convert to/from SBOL2.
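One detail a direct converter along these lines would likely need to handle is coordinates. BioPython reports feature locations as 0-based, half-open positions, while an SBOL3 Range uses 1-based, inclusive positions. The following is a minimal sketch of that mapping, not the SBOL-utilities implementation. `ParsedFeature`, `RangeSpec`, and `to_range` are hypothetical stand-ins for BioPython's feature objects and pySBOL3's `Range`, and the orientation labels are placeholders rather than real SBOL3 terms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedFeature:
    """Hypothetical stand-in for a feature as a parser such as BioPython reports it."""
    kind: str               # GenBank feature key, e.g. "CDS"
    start: int              # 0-based, inclusive
    end: int                # 0-based, exclusive
    strand: Optional[int]   # +1, -1, or None when unstranded


@dataclass
class RangeSpec:
    """Hypothetical stand-in for the fields of an SBOL3 Range."""
    start: int                   # 1-based, inclusive
    end: int                     # 1-based, inclusive
    orientation: Optional[str]   # placeholder label, not a real SBOL3 IRI


def to_range(feature: ParsedFeature) -> RangeSpec:
    """Translate a 0-based half-open location into a 1-based inclusive range."""
    if not 0 <= feature.start < feature.end:
        raise ValueError(f"invalid location for {feature.kind}: "
                         f"[{feature.start}, {feature.end})")
    # Strand maps onto an orientation; unstranded features get no orientation.
    orientation = {1: "forward", -1: "reverse"}.get(feature.strand)
    # Only the start shifts: an exclusive 0-based end equals an inclusive 1-based end.
    return RangeSpec(start=feature.start + 1, end=feature.end, orientation=orientation)


cds = to_range(ParsedFeature(kind="CDS", start=0, end=9, strand=1))
print(cds)
```

In a real converter the same arithmetic would be applied to every feature of every record, and the reverse direction (SBOL3 to GenBank) would subtract one from the start instead.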

As I understand it, GenBank is a very loose format. I don't think there is a specification, or if there is, it is minimal. I might be wrong about that.

There are sample SBOL files, for both SBOL2 and SBOL3, in https://github.com/SynBioDex/SBOLTestSuite. You could try those out. The online converter can be found at https://validator.sbolstandard.org/validate/

If you plan to work on this it would be a good idea to open an issue on SBOL-utilities for it so that you can ask questions, get answers, and so forth. That will also prevent duplication of effort.

Please let us know via a GitHub issue if you need additional assistance. https://github.com/SynBioDex/SBOL-utilities/issues

I'm not the best person to answer all the questions for this task. There are others who monitor the issues there that will have additional information.

ahmedtarek26 commented 2 years ago

Hi @jakebeal @tcmitchell @cjmyers @bbartley, I am Ahmed Tarek, a third-year undergraduate student in medical informatics. I have two years of experience with Python. I am interested in machine learning and deep learning, so I joined Neuromatch Academy as an interactive student, where we used PyTorch. I am working as a research assistant on an NLP research paper, which we expect to publish soon.

I took a genetics course at college and did a project using some ML libraries, Biopython, Py3Dmol, and nglview, which you can find here. In this project I used Biopython to read FASTA files, transcribe and translate the sequences, and then analyze the protein sequences and compare the genes. I used the PDB ID for each gene to visualize it with Py3Dmol and nglview.

I'll start by studying the SBOL resources you attached above (the SBOL tutorial material on the data model and Python library that was presented at IWBDA 2021) so that I can begin working on this project for GSoC 2022.

Thanks for your time

khanspers commented 2 years ago

NRNB has officially been accepted as a mentoring organization for GSoC 2022! Here are some useful links:

ahmedtarek26 commented 2 years ago

Hi @tcmitchell @jakebeal @bbartley @cjmyers,

I have read the SBOL tutorial material on the data model and Python library that was presented at IWBDA 2021, and I now have a good understanding of SBOL, the SBOL data model, SBOL composition, and the differences between SBOL, FASTA, and GenBank.

I have also watched some videos from the IWBDA 2021 playlist.

I went through the repo and now understand the code in the important files.

Finally, it's great that NRNB has officially been accepted. I'll start working on my proposal for this project as soon as possible.

Could you tell me what the next step is?

Thanks for your time.

tcmitchell commented 2 years ago

Hi @ahmedtarek26, thanks for your interest! Here are some links that should help you with next steps:

We are happy to answer any questions that you might have while you develop your proposal/application. Please post those here so we can maintain a level playing field for all potential contributors.

Thanks!

ahmedtarek26 commented 2 years ago

Hi @tcmitchell, I'm working on the proposal and still have some details to add, but I have a lot of college work these days, so I'll continue with the proposal soon. I hope to share a draft via email next Thursday, if possible. Thanks for your time and help.

tcmitchell commented 2 years ago

Here are some links from the GSoC Mentors mailing list that might be generally helpful to all who are interested in this project:

khanspers commented 2 years ago

A reminder that the application period opens on Monday April 4. Proposals to NRNB must be submitted on the official GSoC Site (https://summerofcode.withgoogle.com/) before April 19, 18:00 UTC to be considered, and contributors are encouraged to submit proposals in draft format early, so that mentors can give feedback directly at the GSoC site.

AlexanderPico commented 2 years ago

IMPORTANT REMINDER: GSoC 2022 is for new “beginners” to open source.

Applicants are expected to review eligibility requirements prior to applying. We cannot accept applications from contributors with prior open source development experience. From the GSoC FAQ https://developers.google.com/open-source/gsoc/faq:

Can someone already participating in open source be a GSoC Contributor?

The goal of GSoC is to bring new contributors into open source organizations. GSoC can also help beginner contributors learn the ins and outs of open source while being mentored by experienced community members. GSoC is for new and beginner contributors to open source, it is not for experienced contributors to open source.

khanspers commented 2 years ago

Closing because this is an active project for GSoC 2022.