nrnb / GoogleSummerOfCode

Main documentation site for NRNB GSoC project ideas and resources

Expand Virasig, a dockerized viral mutational signatures framework pipeline. #191

Closed marcoxa closed 1 year ago

marcoxa commented 2 years ago

Background

Pathogens' vaccine escape and drug resistance are examples of evolutionary processes. A natural way to study these processes is to analyze the genomic sequences of pathogens. Accordingly, many studies have used consensus sequences to build phylogenetic models, as well as computational and epidemiological models of pathogen diffusion.

However, the importance of considering low-frequency mutations (i.e., the i-SNVs) has been recognized for several applications, such as the detection of drug-resistant mutations, the characterization of contagion chains in a community, the estimation of the bottleneck size, and the analysis of host-related processes.

Although studying intra-host i-SNVs has several advantages, the complexity of the analysis and its potential pitfalls require careful data processing: reverse transcription, PCR amplification, and NGS are all error-prone processes. Several NGS analysis pipelines have been developed and used for analyzing the i-SNV profiles of viruses. The main problem with low-frequency variants is distinguishing between rare variants and sequencing errors. For this reason, existing methods apply different error models and filtering techniques to ensure that only high-quality SNVs are reported. Some return only SNVs that are deeply covered by high-quality reads or that pass a codon-based filter, while others estimate sequencing errors from error models.
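To make the coverage-based filtering idea concrete, here is a minimal sketch in Python. It is not taken from Virasig or from any existing pipeline: the `VariantCall` record, the `filter_isnvs` function, and all threshold values are hypothetical and chosen only for illustration. Real pipelines typically also use base and mapping quality, strand bias, or explicit error models.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class VariantCall:
    """Hypothetical record for one candidate i-SNV (illustration only)."""
    position: int   # 1-based genome coordinate
    ref: str        # reference base
    alt: str        # alternative base
    depth: int      # total reads covering the position
    alt_reads: int  # reads supporting the alternative base

    @property
    def frequency(self) -> float:
        # Variant frequency; 0.0 when the position is not covered.
        return self.alt_reads / self.depth if self.depth > 0 else 0.0


def filter_isnvs(
    calls: Iterable[VariantCall],
    min_depth: int = 100,          # assumed threshold, not a recommendation
    min_alt_reads: int = 5,        # assumed threshold, not a recommendation
    min_frequency: float = 0.01,   # assumed threshold, not a recommendation
) -> List[VariantCall]:
    """Keep only calls that are deeply covered and sufficiently supported."""
    return [
        c for c in calls
        if c.depth >= min_depth
        and c.alt_reads >= min_alt_reads
        and c.frequency >= min_frequency
    ]


# Made-up example calls (positions and counts are not real data).
calls = [
    VariantCall(position=100, ref="C", alt="T", depth=1000, alt_reads=50),
    VariantCall(position=200, ref="A", alt="G", depth=40, alt_reads=20),
    VariantCall(position=300, ref="G", alt="A", depth=2000, alt_reads=3),
]
kept = filter_isnvs(calls)
print([c.position for c in kept])
```

With these assumed thresholds, the second call is dropped for low depth and the third for too few supporting reads, which is the kind of trade-off between sensitivity and false positives that each method has to tune.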

This impressive proliferation of methods for automatically analyzing viral quasi-species has made it possible to track the genomic evolution of viruses through early adaptation to the human host, long-term evolution, and intra-host dynamics in longitudinal samples. However, the heterogeneity of the studied organisms and the lack of standards require a careful implementation for each pathogen. As a result, each virus needs its own pipeline and analysis, which demands a high level of bioinformatics skill and reduces the usability of these methods for the scientific community.

Goal

Expand Virasig, a user-friendly, "dockerized" pipeline framework that integrates high-confidence SNV calling for different viruses (e.g., HIV, SARS-CoV-2, HPV, ...), in order to use the resulting SNVs to characterize Mutational Signatures and related evolutionary processes. Providing encapsulating code layers for existing "single-virus" pipelines will be a major goal of this project. Producing SBOL output is also a goal of this project.

Difficulty Level: Medium

All the algorithmic methodologies are already developed; however, integrating these software tools into a single pipeline framework could be difficult.

Size and Length of Project

Size: 350 hours
Length: Flexible, 22 weeks

Skills

Essential skills: Knowledge of Variant Calling pipelines, Nextflow, Docker, R.
Nice to have skills: UNIX Shells, Python, Julia.

Public Repository

Please link to a public, open-source repository for your project. All code from accepted projects must be open source and public throughout the coding period and beyond.

Potential Mentors

Marco Antoniotti, marco.antoniotti@unimib.it
Fabrizio Angaroni, fabrizio.angaroni@unimib.it
Daniele Ramazzotti, daniele.ramazzotti@unimib.it

khanspers commented 1 year ago

Closing in preparation for GSoC 2023.