nrnb / GoogleSummerOfCode

Main documentation site for NRNB GSoC project ideas and resources

SnapGene / SBOL3 integration #187

Closed jakebeal closed 5 months ago

jakebeal commented 2 years ago

Background

SnapGene is a popular DNA design tool, but it uses a custom .dna file format. Two open software tools, Biopython and SnapGeneReader, can read .dna files into GenBank format. The onward conversion from GenBank format to SBOL has not been tested for lossiness, however, and there is no open tool for writing .dna files.

Goal

This project will add the ability to convert from SnapGene .dna files to SBOL3 files and from any of GenBank, FASTA, or SBOL to SnapGene .dna format. This will be implemented as a writing extension for SnapGeneReader or Biopython and as an extension to the sbol-converter utility in SBOL-utilities.

Correctness will be validated by round-tripping (import, then export) at least the Component and Sequence objects in SBOL3 files from the SBOL test suite and by checking that imported materials can be sensibly viewed in the free SnapGene viewer.
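The round-trip check described above could be sketched roughly as follows. This is only an illustration: `Design`, `Feature`, and `round_trip_differences` are hypothetical stand-ins for a Component with its Sequence, not part of pySBOL3 or SBOL-utilities, and the two converter callables would be supplied by whatever export and import code the project ends up with.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True, order=True)
class Feature:
    """Hypothetical stand-in for one annotated region of a design."""
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: int = 1


@dataclass
class Design:
    """Hypothetical stand-in for a Component plus its Sequence."""
    name: str
    sequence: str
    features: List[Feature] = field(default_factory=list)


def round_trip_differences(original: Design,
                           export: Callable[[Design], object],
                           reimport: Callable[[object], Design]) -> List[str]:
    """Convert out and back, then list everything that did not survive."""
    restored = reimport(export(original))
    problems = []
    if restored.name != original.name:
        problems.append(f"name changed: {original.name!r} -> {restored.name!r}")
    # Compare case-insensitively: letter case is not meaningful for DNA here.
    if restored.sequence.upper() != original.sequence.upper():
        problems.append("sequence changed")
    before, after = set(original.features), set(restored.features)
    for feature in sorted(before - after):
        problems.append(f"lost feature: {feature.name} {feature.start}..{feature.end}")
    for feature in sorted(after - before):
        problems.append(f"unexpected feature: {feature.name} {feature.start}..{feature.end}")
    return problems
```

An empty list means the conversion pair was lossless for the fields checked. A real test would need to compare more than this (topology, feature roles, qualifiers), and the viewer check in SnapGene would still have to be done by hand.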

Difficulty Level: Medium

While the overall goals of the project are relatively straightforward, it will require figuring out how to work with SnapGene's poorly documented .dna format.

Size and Length of Project

Skills

Essential skills: Python

Will be learned if not known: SBOL

Public Repository

https://github.com/Edinburgh-Genome-Foundry/SnapGeneReader or https://github.com/biopython/biopython

https://github.com/SynBioDex/SBOL-utilities

Potential Mentors

Vishwesh V Kulkarni vvk215@gmail.com, @tcmitchell

Yash-g17 commented 2 years ago

Hi @jakebeal @tcmitchell @VishweshGitHub, I am Yash Gupta, a third-year undergraduate. I am looking forward to contributing to this project for GSoC '22. I have a year's worth of experience in Python. As I am new to this project, any guidance on where and how to start would be very helpful.

Kartikkp07 commented 2 years ago

Hi @jakebeal, I am Kartik Kumar Pawar, a CSE sophomore at BITS Pilani. I have been using Python for about 6 years. I am also adept in Java, with knowledge of OOP and basic design patterns, and I have worked with both SQL and NoSQL database systems. I am familiar with JavaScript and have worked with React and Node.js as well. I am really excited to learn more about this project and contribute to it, with the aim of becoming a GSoC '22 contributor. I kindly request your guidance so that I can start as soon as possible.

jakebeal commented 2 years ago

@Yash-g17 @Kartikkp07 If you'd like to learn more about the project and start familiarizing yourself with material, a good starting point is the SBOL tutorial material on the data model and Python library that was presented at IWBDA 2021.

Aakash-02 commented 2 years ago

Hi @jakebeal, I am Aakash, currently pursuing a dual degree in biological sciences at the Indian Institute of Technology Madras (IITM). I have been coding in Python for about a year now. I know the basics of SnapGene as well. I'd like to work on this project. Kindly guide me on how to get started.

jakebeal commented 2 years ago

@Aakash-02 Application for support on the project goes through the standard Google Summer of Code process. If you'd like to learn more about the project and start familiarizing yourself with material, please see the comment above yours.

ahmedtarek26 commented 2 years ago

Hi @jakebeal @tcmitchell @VishweshGitHub, I am Ahmed Tarek, a third-year undergraduate student in medical informatics. I have two years of experience using Python. I am interested in machine learning and deep learning, so I joined Neuromatch Academy as an interactive student, where we used PyTorch. I am working as a research assistant on an NLP research paper, and we are about to publish our work.

I took a genetics course at college and did a project using some ML libraries, Biopython, Py3Dmol, and nglview, which you can find here. In this project I used Biopython to read FASTA files, transcribe and translate the sequences, and then analyze the protein sequences and compare the genes. I used the PDB ID for each gene to visualize it with Py3Dmol and nglview.

I'll start studying from the resources you attached above about SBOL (the SBOL tutorial material on the data model and Python library that was presented at IWBDA 2021) to start working on this project for GSOC 22.

Thanks for your time

khanspers commented 2 years ago

NRNB has officially been accepted as a mentoring organization for GSoC 2022! Here are some useful links:

tcmitchell commented 2 years ago

Here are some links from the GSoC Mentors mailing list that might be generally helpful to all who are interested in this project:

khanspers commented 2 years ago

A reminder that the application period opens on Monday April 4. Proposals to NRNB must be submitted on the official GSoC Site (https://summerofcode.withgoogle.com/) before April 19, 18:00 UTC to be considered, and contributors are encouraged to submit proposals in draft format early, so that mentors can give feedback directly at the GSoC site.

AlexanderPico commented 2 years ago

IMPORTANT REMINDER: GSoC 2022 is for new “beginners” to open source.

Applicants are expected to review eligibility requirements prior to applying. We cannot accept applications from contributors with prior open source development experience. From the GSoC FAQ https://developers.google.com/open-source/gsoc/faq:

Can someone already participating in open source be a GSoC Contributor?

The goal of GSoC is to bring new contributors into open source organizations. GSoC can also help beginner contributors learn the ins and outs of open source while being mentored by experienced community members. GSoC is for new and beginner contributors to open source, it is not for experienced contributors to open source.

khanspers commented 1 year ago

Closing in preparation for GSoC 2023.

jakebeal commented 1 year ago

Project is still valid and needed: reopening for 2023

HarshRathi2511 commented 1 year ago

Hello there @jakebeal, I'm Harsh Rathi, a CSE sophomore, writing to express my interest in contributing to this SBOL project as part of the Google Summer of Code program. Apart from the good first issues in SBOL-utilities, please guide me on how to familiarize myself with the requirements of this project.

jakebeal commented 1 year ago

@HarshRathi2511 I would suggest starting by testing out the .dna to GenBank converters in the open source tools linked above. Once you've been able to run them, the next step would be to find the code in those tools that converts .dna to GenBank and get to know it, as a key part of this project will be to make a converter that goes in the other direction.
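For anyone trying this first step, a minimal sketch of the Biopython route might look like the following. It assumes a Biopython version that includes the `"snapgene"` format in `Bio.SeqIO`; the helper names `default_genbank_path` and `dna_to_genbank` are made up for this example and are not part of either tool.

```python
from pathlib import Path


def default_genbank_path(dna_path):
    """Place the GenBank output next to the input, with a .gb suffix."""
    return Path(dna_path).with_suffix(".gb")


def dna_to_genbank(dna_path, gbk_path=None):
    """Read one SnapGene .dna file and write it back out as GenBank."""
    # Imported here so the path helper above works without Biopython installed.
    from Bio import SeqIO

    out_path = Path(gbk_path) if gbk_path else default_genbank_path(dna_path)
    # .dna is a binary format, so the file is opened in binary mode.
    with open(dna_path, "rb") as handle:
        record = SeqIO.read(handle, "snapgene")
    # GenBank output needs a molecule type; set it only if the parser did not.
    record.annotations.setdefault("molecule_type", "DNA")
    SeqIO.write(record, str(out_path), "genbank")
    return out_path
```

Calling `dna_to_genbank("some_plasmid.dna")` on one of the sample files and opening the result in a text editor is a quick way to see which annotations survive the conversion before reading the parser code itself.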

HarshRathi2511 commented 1 year ago

@jakebeal Sure, I'll test out the .dna to GenBank converters and have a look at their code.

Foxtrot-14 commented 7 months ago

Hello, I am trying to work on this but I am unable to find any .dna files. Where can I find them? Having sample files would make it easier to parse the format and then figure out how to convert it into a different one.

tcmitchell commented 6 months ago

@Foxtrot-14 thanks for your interest. Please start by reading the description and testing out the open source .dna to GenBank converters linked therein. At least one contains sample .dna files. Once you've been able to run them, the next step would be to find the code in those tools that converts .dna to GenBank and get to know it, as a key part of this project will be to make a converter that goes in the other direction.

Foxtrot-14 commented 6 months ago

OK, I have cloned the SnapGeneReader project and tested it with the sample .dna files. It provides three functions:

  1. snapgene_file_to_dict() returns an object of type <class 'dict'>
  2. snapgene_file_to_seqrecord() returns an object of type <class 'Bio.SeqRecord.SeqRecord'>
  3. snapgene_file_to_gbk() converts the file contents into the GenBank format.

Should the next step be to figure out a way to convert this file into SBOL?

tcmitchell commented 6 months ago

Hi @Foxtrot-14, sorry for the delay in responding. The goal is two-way conversion: from SnapGene to SBOL, and from several formats to SnapGene. See the first paragraph under the heading "Goal" in the description of this issue, where the goal is spelled out in more detail.

You'll have to figure out which of the 3 functions you list provide the necessary details for the conversion to SBOL. I am not familiar with snapgene so I cannot comment on which provide the necessary data and which would be the best to work from. That's part of this project.

Thanks for your continued interest!

Atishaysjain commented 5 months ago

Hi @tcmitchell @jakebeal @VishweshGitHub

I am interested in this problem. I have utilized the SnapGeneReader repo to convert .dna files to a dictionary and GenBank file after making the required changes to the repo. My contributions, including the necessary modifications, are available for review in the pull requests I submitted to the SnapGeneReader repository:

For converting .dna files into SBOL format, I've conceptualized two strategies:

i) Convert the .dna file to a Python dictionary (using SnapGeneReader), then convert that to SBOL.

ii) Convert the .dna file to a GenBank file (using SnapGeneReader), then convert that to SBOL (using https://github.com/nrnb/GoogleSummerOfCode/issues/183 by @mohitdmak).
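A rough sketch of the second strategy, chaining the two stages through a temporary GenBank file, is given here. Several things are assumptions rather than confirmed API: that `snapgene_file_to_gbk` takes an open binary input and an open text output, and that SBOL-utilities exposes `convert_from_genbank(path, namespace)` in `sbol_utilities.conversion`. The wrapper names and `dna_to_sbol3` are hypothetical.

```python
import tempfile
from pathlib import Path


def snapgene_reader_to_genbank(dna_path, gbk_path):
    """Stage 1: .dna -> GenBank via SnapGeneReader (signature is an assumption)."""
    from snapgene_reader import snapgene_file_to_gbk  # third-party, imported lazily

    with open(dna_path, "rb") as source, open(gbk_path, "w") as target:
        snapgene_file_to_gbk(source, target)


def sbol_utilities_from_genbank(gbk_path, namespace):
    """Stage 2: GenBank -> SBOL3 via SBOL-utilities (signature is an assumption)."""
    from sbol_utilities.conversion import convert_from_genbank  # third-party

    return convert_from_genbank(str(gbk_path), namespace)


def dna_to_sbol3(dna_path, namespace,
                 to_genbank=snapgene_reader_to_genbank,
                 from_genbank=sbol_utilities_from_genbank):
    """Run both stages; either one can be swapped out by passing a callable."""
    with tempfile.TemporaryDirectory() as workdir:
        gbk_path = Path(workdir) / (Path(dna_path).stem + ".gb")
        to_genbank(dna_path, gbk_path)
        # The intermediate file is read before the temporary directory is removed.
        return from_genbank(gbk_path, namespace)
```

Keeping the stages as replaceable callables makes it straightforward to substitute the Biopython reader for stage 1, or a direct dictionary-to-SBOL converter for the first strategy, and to compare what each path loses.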

I am currently working on the second approach.

I am keen on taking up this project for GSoC 2024. I am currently a master's student in Computer Science at San Jose State University. Additionally, my prior experience at MeDAL (the Medical Deep Learning and Artificial Intelligence Lab) at IIT Bombay, India's leading research institute, has equipped me with relevant skills. At MeDAL I worked on developing a module to perform instance segmentation and classification of nuclei in multi-tissue histology WSIs by scaling a Python-based codebase.

I eagerly anticipate your feedback and am looking forward to the opportunity of working under your mentorship. Thank you for considering my application.

jakebeal commented 5 months ago

While this project still needs to be done, we have decided that we are not in a good position to supervise a GSoC student on it this summer.