Background
Pathogens' vaccine escape and drug resistance are examples of
evolutionary processes. The natural way to study these processes is to
analyze the genomic sequences of pathogens. Accordingly, many works
based on consensus sequences have been published, delivering
phylogenetic, computational, and epidemiological models of pathogen
diffusion.
However, the importance of considering low-frequency mutations (i.e.,
intra-host single-nucleotide variants, or i-SNVs) has been recognized
for different applications, such as the detection of drug-resistant
mutations, the characterization of the contagion chains in a
community, the estimation of the bottleneck size, and the analysis of
host-related processes.
Although studying i-SNVs has several advantages, the complexity of the
analysis and its potential pitfalls require careful data processing:
reverse transcription, PCR amplification, and NGS are all error-prone
processes. Accordingly, several NGS analysis pipelines have been
developed and used for analyzing the i-SNV profiles of viruses. The
main problem with low-frequency variants is distinguishing true rare
variants from sequencing errors. For this reason, existing methods
apply different error models and filtering techniques to ensure that
only high-quality SNVs are reported. Some return only SNVs that are
deeply covered by high-quality reads or that pass a codon-based
filtration, and others estimate sequencing errors from error models.
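As an illustration of the two filtering styles mentioned above, the following is a minimal Python sketch, not the method of any specific pipeline. It combines hard thresholds on depth and base quality with a simple binomial error model. The `Candidate` record, the function names, and all threshold values (`min_depth`, `min_quality`, `error_rate`, `alpha`) are assumptions chosen only for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate i-SNV (hypothetical record layout for this sketch)."""
    position: int
    ref: str
    alt: str
    depth: int                # total reads covering the position
    alt_reads: int            # reads supporting the alternative allele
    mean_base_quality: float  # mean Phred score of alt-supporting bases


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def passes_filters(c: Candidate, min_depth: int = 100,
                   min_quality: float = 30.0,
                   error_rate: float = 0.001,
                   alpha: float = 0.01) -> bool:
    """Keep a candidate only if it is well covered, well supported, and
    unlikely to be explained by sequencing error alone.

    All default thresholds are illustrative assumptions.
    """
    # Hard filters: coverage and base quality.
    if c.depth < min_depth or c.mean_base_quality < min_quality:
        return False
    # Error model: probability of seeing at least alt_reads errors
    # among depth reads if every alt read were a sequencing error.
    return binomial_tail(c.alt_reads, c.depth, error_rate) < alpha


candidates = [
    Candidate(241, "C", "T", depth=1000, alt_reads=50, mean_base_quality=35.0),
    Candidate(3037, "C", "T", depth=1000, alt_reads=2, mean_base_quality=35.0),
    Candidate(14408, "C", "T", depth=50, alt_reads=10, mean_base_quality=35.0),
]
kept = [c.position for c in candidates if passes_filters(c)]
```

With these assumed thresholds, only the first candidate is kept: the second is compatible with the assumed error rate, and the third is too shallowly covered.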
This proliferation of methods for automatically analyzing viral
quasi-species has made it possible to track the genomic evolution of
viruses through early adaptation to the human host, long-term
evolution, and intra-host dynamics in longitudinal samples. However,
the heterogeneity of the studied organisms and the lack of standards
require a careful implementation for each pathogen. As a result, each
virus needs its own pipeline and analysis, which demands a high level
of bioinformatics skill and reduces the usability of such methods for
the scientific community.
Goal
Expand Virasig, a user-friendly, "dockerized" pipeline framework, to
integrate high-confidence SNV calling for different viruses (e.g.,
HIV, SARS-CoV-2, HPV, ...), in order to use the called SNVs to
characterize Mutational Signatures and related evolutionary processes.
Providing encapsulating code layers for existing "single-virus"
pipelines will be a major goal of this project. Producing SBOL output
will also be a goal of this project.
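One possible shape for such an encapsulating layer is a small registry that maps each virus to its existing pipeline and exposes a single entry point. The Python sketch below is only an illustration and does not describe Virasig's actual design: the workflow script paths, the reference paths, and the `--reads` and `--reference` pipeline parameters are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PipelineSpec:
    """Description of one wrapped single-virus pipeline (hypothetical)."""
    virus: str
    workflow: str   # hypothetical path to a Nextflow script
    reference: str  # hypothetical path to a reference FASTA


REGISTRY: Dict[str, PipelineSpec] = {}


def register(spec: PipelineSpec) -> None:
    """Make a pipeline available under a case-insensitive virus name."""
    REGISTRY[spec.virus.lower()] = spec


def build_command(virus: str, reads: str) -> List[str]:
    """Build the command line for the pipeline registered for `virus`.

    `--reads` and `--reference` are assumed pipeline parameters,
    not options of any existing workflow.
    """
    try:
        spec = REGISTRY[virus.lower()]
    except KeyError:
        raise ValueError(f"no pipeline registered for {virus!r}") from None
    return ["nextflow", "run", spec.workflow,
            "--reads", reads, "--reference", spec.reference]


# Hypothetical registrations, for illustration only.
register(PipelineSpec("HIV", "workflows/hiv.nf", "refs/hiv.fasta"))
register(PipelineSpec("SARS-CoV-2", "workflows/sars_cov_2.nf",
                      "refs/sars_cov_2.fasta"))

command = build_command("hiv", "sample.fastq")
```

The returned list could be passed to `subprocess.run` as is. Keeping the command construction separate from its execution makes the layer easy to test without running any pipeline.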
Difficulty Level: Medium
All the algorithmic methodologies are already developed; however,
integrating these software tools into a single pipeline framework could be difficult.
Size and Length of Project
Size: 350 hours
Length: Flexible, 22 weeks.
Skills
Essential skills: Knowledge of Variant Calling pipelines, Nextflow, Docker, R.
Nice to have skills: UNIX Shells, Python, Julia.
Public Repository
Please link to a public, open-source repository for your project. All code from accepted projects must be open source and public throughout the coding period and beyond.
Potential Mentors
Marco Antoniotti, marco.antoniotti@unimib.it
Fabrizio Angaroni, fabrizio.angaroni@unimib.it
Daniele Ramazzotti, daniele.ramazzotti@unimib.it