This page describes how to download and use VIRULIGN for viral sequence data, and explains all optional parameters. This repository also contains a PDF version of this tutorial. The document also describes three example datasets (Dengue virus, Zika virus and HIV-1) for which VIRULIGN was used to generate research-relevant output formats from the constructed codon-correct multiple sequence alignments.
VIRULIGN has been published in Bioinformatics. If you use the software for your analysis, please cite VIRULIGN as follows: "P. Libin, K. Deforche, A.B. Abecasis and K. Theys; VIRULIGN: fast codon-correct alignment and annotation of viral genomes; 2018; Bioinformatics; doi: 10.1093/bioinformatics/bty851"
Virus sequence data are an essential resource for reconstructing spatiotemporal dynamics of viral spread as well as treatment and prevention strategies. However, the potential benefit of using sequence data for these applications critically depends on the accuracy and the correct annotation of these alignments of genetically diverse data. In particular, coding sequences of viral pathogens should be analyzed in their corresponding open reading frame (ORF) to fully utilize their biological information. Therefore, while the construction of multiple sequence alignments (MSAs) can be done with a range of sequence alignment software, MSAs of virus coding sequences in the correct reading frame and annotated with respect to the proteins encoded in the genome are more difficult to achieve.
VIRULIGN is an easy-to-use command line application to construct codon-correct alignments of large virus sequence datasets. Additionally, VIRULIGN has support for standardized genome annotation and implements various alignment export formats that are useful for various research applications. VIRULIGN is an open-source project written in the C++ programming language and available under the GPLv2 license.
VIRULIGN operates by aligning each target sequence (i.e., t in T) of the input file codon-correctly against the reference sequence (r). Subsequently a multiple sequence alignment MSA(r,T) is constructed based on all codon-correct (cc) pairwise aligned target sequences A_{cc}(r,t) (Figure below).
The most recent version and executable of VIRULIGN can be downloaded from the GitHub project web page:
https://github.com/rega-cev/virulign
Instructions are provided to build and install the software. VIRULIGN is a cross-platform application and has been tested on GNU/Linux, macOS and Windows.
VIRULIGN minimally requires a FASTA file with target sequences and a reference sequence in order to generate a codon-correct alignment in a predefined output format (see below). The reference sequence can be either provided in FASTA format or embedded in an XML file (see below). The default command for VIRULIGN is as follows:
$ virulign [reference.fasta orf-description.xml] sequences.fasta
Additional parameters can be specified to configure the alignment construction and to export the alignment to a variety of output formats. If a parameter is not explicitly specified, the first value listed for that parameter is used as the default. The following parameters can be used to configure the alignment and its representation:
$ virulign [reference.fasta orf-description.xml]
sequences.fasta
--exportKind [Mutations PairwiseAlignments
GlobalAlignment PositionTable
MutationTable]
--exportAlphabet [AminoAcids Nucleotides]
--exportWithInsertions [yes no]
--exportReferenceSequence [no yes]
--gapExtensionPenalty doubleValue=>3.3
--gapOpenPenalty doubleValue=>10.0
--maxFrameShifts intValue=>3
--progress [no yes]
--nt-debug [dir]
Output: The alignment will be printed to standard out and any progress
or error messages will be printed to the standard error.
This output can be redirected to files, e.g.: virulign ref.xml
sequence.fasta > alignment.mutations 2> alignment.err
To print these options, invoke the virulign command without any arguments.
The alignment is printed to standard output, and any progress or error messages are printed to standard error. Both streams can be redirected to files. For example, the following command writes the alignment to a file:
$ virulign [reference.fasta orf-description.xml] sequences.fasta \
    > output.file
To keep progress and error messages out of the terminal, additionally redirect standard error to a separate file:
$ virulign [reference.fasta orf-description.xml] sequences.fasta \
    > output.file 2> error.file
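For scripted pipelines, the same redirection can be done from Python. The sketch below is only an illustration and not part of VIRULIGN: the helper names `build_virulign_command` and `run_virulign` are hypothetical, and it assumes a `virulign` executable is available on the PATH. It passes through only the long options listed in the usage text above.

```python
import subprocess


def build_virulign_command(reference, sequences, **options):
    """Build the argument list for a virulign call.

    Keyword arguments map to the documented long options, e.g.
    exportKind="GlobalAlignment" becomes "--exportKind GlobalAlignment".
    """
    command = ["virulign", reference, sequences]
    for name, value in options.items():
        command += ["--" + name, str(value)]
    return command


def run_virulign(reference, sequences, out_path, err_path, **options):
    """Run virulign, writing stdout to out_path and stderr to err_path."""
    command = build_virulign_command(reference, sequences, **options)
    with open(out_path, "w") as out, open(err_path, "w") as err:
        return subprocess.run(command, stdout=out, stderr=err).returncode


# Only builds and prints the command; nothing is executed here.
print(build_virulign_command("ref.xml", "sequences.fasta",
                             exportKind="GlobalAlignment",
                             exportAlphabet="Nucleotides"))
```

Because `--nt-debug` contains a hyphen, it would have to be passed as `**{"nt-debug": "Failed"}` with this helper.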
The parameter --exportKind defines the output type of the alignment, either a FASTA sequence file or a CSV mutation file. To display the different options, consider this example FASTA input file including sequences of different length and a full-length reference sequence for comparison.
>Ref
CCCATTAGCCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAA
>Seq1
ATTGACACTGTACCAGTAACATTAAAGCCAGGAATGGATGGACCAAAG
>Seq2
CCTATGGAAACTGTGCCAGTAAAATTAAAGCCAGGAATGGAT
>Seq3
CTCATTAGTCCTATTAGTGTAAAATTAAAACCAGGAATGGATGGCCCAAGG
>Seq4
AGTCCCATTGAAACTGTACCAGTAAAAGGAGATGGCCCAAAG
The option GlobalAlignment will generate a FASTA file of the target sequences aligned against a single reference sequence and formatted as an MSA.
>Ref
CCCATTAGCCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAA
>Seq1
------------ATTGACACTGTACCAGTAACATTAAAGCCAGGAATGGATGGACCAAAG
>Seq2
CCTATG---------GAAACTGTGCCAGTAAAATTAAAGCCAGGAATGGAT---------
>Seq3
CTCATTAGTCCTATTAGT---------GTAAAATTAAAACCAGGAATGGATGGCCCAAGG
>Seq4
------AGTCCCATTGAAACTGTACCAGTAAAA---------GGA---GATGGCCCAAAG
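One property visible in this example is that gaps always cover whole codons. The sketch below is a simple sanity check on an exported alignment row, not VIRULIGN's own validation: the function name is hypothetical, and it assumes the row starts in frame at its first column, as in the example above.

```python
import re


def gaps_are_codon_aligned(aligned_sequence):
    """Return True if every run of '-' starts on a codon boundary
    and spans a whole number of codons."""
    for match in re.finditer(r"-+", aligned_sequence):
        if match.start() % 3 != 0 or len(match.group()) % 3 != 0:
            return False
    return True


# Seq4 from the GlobalAlignment example above.
seq4 = "------AGTCCCATTGAAACTGTACCAGTAAAA---------GGA---GATGGCCCAAAG"
print(gaps_are_codon_aligned(seq4))       # True
print(gaps_are_codon_aligned("AT-GAAA"))  # False: the gap breaks a codon
```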
The option PairwiseAlignments will generate a FASTA file of the target sequences, with each sequence aligned separately against the reference sequence.
>Ref
CCCATTAGCCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAA
>Seq1
------------ATTGACACTGTACCAGTAACATTAAAGCCAGGAATGGATGGACCAAAG
>Ref
CCCATTAGCCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAA
>Seq2
CCTATG---------GAAACTGTGCCAGTAAAATTAAAGCCAGGAATGGAT---------
The option PositionTable will create a comma-separated value (CSV) file where each position of the alignment is given as a separate column. The CSV file is annotated according to the numerical positions in the protein (e.g., Table below).
ID | Pos1 | Pos2 | Pos3 | … |
---|---|---|---|---|
Ref | P | I | S | … |
Seq1 | - | - | - | … |
Seq2 | P | M | - | … |
Seq3 | L | I | S | … |
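The relation between the nucleotide alignment and this table can be reproduced with a few lines of Python. This is a sketch using the standard genetic code, not VIRULIGN's implementation: the function name is hypothetical, and reporting 'X' for codons that are not in the table (partial gaps, ambiguity codes) is an assumption made here for illustration.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, indexed by the three bases in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_aligned(nucleotides):
    """Translate an in-frame aligned sequence codon by codon.

    A codon made only of gaps becomes '-'; any other codon that is not
    in the table becomes 'X' (an assumption of this sketch).
    """
    residues = []
    for start in range(0, len(nucleotides) - 2, 3):
        codon = nucleotides[start:start + 3].upper()
        residues.append("-" if codon == "---" else CODON_TABLE.get(codon, "X"))
    return residues


# First three codons of rows from the GlobalAlignment example.
rows = {"Ref": "CCCATTAGC", "Seq2": "CCTATG---", "Seq3": "CTCATTAGT"}
print("ID,Pos1,Pos2,Pos3")
for name, sequence in rows.items():
    print(",".join([name] + translate_aligned(sequence)))
```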
The option MutationTable will create a CSV file where each mutation present at a specific position is given as a separate column in Boolean representation. The CSV file is annotated according to the numerical position in the protein (e.g., Table below).
ID | Mut1P | Mut1L | Mut2I | Mut2M | … |
---|---|---|---|---|---|
Ref | y | n | y | n | … |
Seq1 | n | n | n | n | … |
Seq2 | y | n | n | y | … |
Seq3 | n | y | y | n | … |
The option Mutations will output, for each sequence, a list of amino acid changes compared to the reference sequence (e.g., Table below).
ID | Mutations |
---|---|
Seq1 | - |
Seq2 | 2M |
Seq3 | 1L |
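The mutation labels in this table can be derived from an amino acid alignment by comparing each row with the reference. The sketch below illustrates that idea under the assumption that gap positions are not reported as mutations (consistent with the empty entry for Seq1). The function name is hypothetical and this is not VIRULIGN's own code.

```python
def list_mutations(reference, target):
    """Compare two aligned amino acid sequences of equal length.

    Returns labels such as '2M' (position 2, amino acid M) for every
    position where the target differs from the reference. Gap positions
    in the target are skipped.
    """
    mutations = []
    for position, (ref_aa, target_aa) in enumerate(zip(reference, target), start=1):
        if target_aa != "-" and target_aa != ref_aa:
            mutations.append(f"{position}{target_aa}")
    return mutations


# First three positions of the example sequences above.
reference = "PIS"
for name, row in {"Seq1": "---", "Seq2": "PM-", "Seq3": "LIS"}.items():
    print(name, " ".join(list_mutations(reference, row)) or "-")
```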
The parameter --exportAlphabet defines the alphabet in which the alignment is generated.
The option AminoAcids will export an alignment with the translation of the nucleotide codons.
The option Nucleotides will export an alignment with nucleotides.
The parameter --exportWithInsertions determines whether insertions can be added to the reference sequence.
The option yes will insert gaps into the reference sequence to accommodate codon insertions identified in the target sequences.
The option no removes the codon insertions in the reference sequence that were generated during the alignment procedure.
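The effect of exporting without insertions can be pictured on a single pairwise alignment: every column in which the reference has a gap is dropped, so all target sequences keep the reference coordinates. The sketch below only illustrates that idea with a hypothetical helper and toy sequences; it is not how VIRULIGN implements the option.

```python
def drop_insertions(aligned_reference, aligned_target):
    """Keep only the alignment columns where the reference has a residue."""
    return "".join(
        target_char
        for ref_char, target_char in zip(aligned_reference, aligned_target)
        if ref_char != "-"
    )


# Toy pairwise alignment: the target carries one inserted codon (CCC).
aligned_reference = "ATG---AAA"
aligned_target = "ATGCCCAAA"
print(drop_insertions(aligned_reference, aligned_target))  # ATGAAA
```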
The parameter --exportReferenceSequence controls whether the reference sequence is to be added to the alignment (yes/no).
The parameter --gapExtensionPenalty defines the value of the penalty to extend an existing gap.
The parameter --gapOpenPenalty defines the value of the penalty to start a new gap.
The parameter --maxFrameShifts defines the maximum number of frame-shifts allowed.
The parameter --progress allows monitoring of the estimated time until completion of the alignment. When the option yes is used, a progress message is shown stating the number and percentage of sequences aligned, as well as the estimated time left to finalize the pending alignment.
The parameter --nt-debug allows inspection of sequences that could not be aligned by VIRULIGN. When a directory name (dir) is given, pairwise alignments of each failed target sequence against the reference sequence are stored in that directory. This directory needs to be created before the VIRULIGN command is executed. This feature makes it possible to inspect each failed target sequence individually and to understand why it did not pass the quality control of the alignment. Subsequently, errors in the target sequence can be corrected.
A reference sequence can be provided to VIRULIGN either in FASTA format or embedded in an XML file. This XML file can also annotate the different proteins, regions or other structures by their positions relative to the reference genome.
For illustrative purposes, we present a toy XML file to align virus sequence data against a reference sequence, with an annotation of the proteins A, B and C in the genome of VIRUS. Later in this document, we will use more realistic annotations (i.e., ZIKV, HIV-1).
<?xml version="1.0" encoding="UTF-8"?>
<orf name="VIRUS"
referenceSequence="atgaaaaacccaaaaaagaaatccgga" >
<protein abbreviation="A"
startPosition="1" stopPosition="7" />
<protein abbreviation="B"
startPosition="7" stopPosition="13" />
<protein abbreviation="C"
startPosition="13" stopPosition="17" />
</orf>
The table below shows an example of an alignment that is constructed with this XML file as reference and exported using the tabular format.
ID | A_1 | A_2 | … | B_1 | … |
---|---|---|---|---|---|
Seq1 | X | Y | … | C | … |
Seq2 | X | T | … | F | … |
Seq3 | X | Y | … | D | … |
Seq4 | R | M | … | D | … |
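To inspect such an annotation file programmatically, the toy XML shown earlier can be read with Python's standard library. This is a sketch for illustration only: the function name `read_orf` is hypothetical, and the code relies only on the element and attribute names visible in the toy file. For a file on disk, `xml.etree.ElementTree.parse(path).getroot()` could be used instead of `fromstring`.

```python
import xml.etree.ElementTree as ET

# The toy annotation from this section (XML declaration omitted).
TOY_XML = """<orf name="VIRUS" referenceSequence="atgaaaaacccaaaaaagaaatccgga">
  <protein abbreviation="A" startPosition="1" stopPosition="7" />
  <protein abbreviation="B" startPosition="7" stopPosition="13" />
  <protein abbreviation="C" startPosition="13" stopPosition="17" />
</orf>"""


def read_orf(xml_text):
    """Return the ORF name, reference sequence and protein regions."""
    orf = ET.fromstring(xml_text)
    proteins = [
        (p.get("abbreviation"), int(p.get("startPosition")), int(p.get("stopPosition")))
        for p in orf.findall("protein")
    ]
    return orf.get("name"), orf.get("referenceSequence"), proteins


name, reference, proteins = read_orf(TOY_XML)
print(name, len(reference), proteins)
```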
Currently, the direct use of a Genbank XML file is not supported, as we aim for well-curated annotations of reference genomes for all virus pathogens. However, an XML file can be created from the information provided in the Genbank description of the respective (reference) genome. To facilitate this for the user community, a Python (v2) script is available that extracts the relevant information from a Genbank description file into the format of a VIRULIGN annotation file. It can be found in the GitHub repository of VIRULIGN tools:
https://github.com/rega-cev/virulign-tools
The command to run the script file for the conversion of the Genbank XML file is:
$ python genbank_to_virulign.py genbank_insdseq.xml \
    orf-name seq-start seq-end
The parameter orf-name sets the name of the defined ORF, while the parameters seq-start and seq-end define the start and stop nucleotide positions of the respective ORF in the reference genome.
To demonstrate this script with a relevant example, we downloaded the Genbank INSDSEQ XML file for the reference genome NC_001477 of Dengue Serotype 1. The Genbank XML file is available at the tutorial web page. As the Dengue virus has only one ORF, we will generate one VIRULIGN XML file. When you open the Genbank file and look for the CDS, you find that the ORF is located in the genome from position 95 to position 10273.
Consequently, the command needed to convert the CDS to a virulign XML is:
$ python genbank_to_virulign.py NC_001477.gbc.xml DENV1 95 10273
The output of this command may need to be improved into a correct annotation file for VIRULIGN. For example, the presence of annotations for the precursor or polyprotein together with the separate proteins can be conflicting as they overlap in genome positioning.
< ... "membrane glycoprotein precursor M" startPosition="343" stopPosition="841" />
< ... "protein pr" startPosition="343" stopPosition="616" />
< ... "membrane glycoprotein M" startPosition="616" stopPosition="841" />
< ... "envelope protein E" startPosition="841" stopPosition="2326" />
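Such conflicts can be detected automatically before the file is curated by hand. The sketch below lists all pairs of annotated regions whose ranges overlap, using the four entries shown above. It treats each range as half-open, i.e. stopPosition is taken to be the first position after the region; this is an assumption inferred from adjacent proteins sharing a boundary value, and the function name is hypothetical.

```python
from itertools import combinations


def find_overlaps(regions):
    """Return pairs of region names whose [start, stop) ranges overlap."""
    overlaps = []
    for (name_a, start_a, stop_a), (name_b, start_b, stop_b) in combinations(regions, 2):
        if start_a < stop_b and start_b < stop_a:
            overlaps.append((name_a, name_b))
    return overlaps


regions = [
    ("membrane glycoprotein precursor M", 343, 841),
    ("protein pr", 343, 616),
    ("membrane glycoprotein M", 616, 841),
    ("envelope protein E", 841, 2326),
]
for pair in find_overlaps(regions):
    print(" overlaps ".join(pair))
```

Here the precursor entry overlaps both of its cleavage products, which suggests removing the precursor annotation and keeping the separate proteins.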
To verify that your VIRULIGN XML file is correct, you can invoke virulign with the XML file and align the full sequence you are trying to describe. Using the tabular export format in combination with the amino acid representation lets you inspect the annotated amino acid alignment. To do this, use this command:
$ virulign NC_001477.virulign.xml NC_001477.fasta --exportKind PositionTable --exportAlphabet AminoAcids
Please note that the ability of our script to extract the CDS with all its features depends on how accurately the CDS is described in GenBank. If you experience any problems with this procedure for your virus of interest, please let us know, and we will try to help you out.
We demonstrate the use of VIRULIGN by constructing sequence alignments for three viral pathogens that are the causative agents of major epidemics: HIV-1, Dengue virus serotype 1 (DENV-1) and Zika virus (ZIKV). For each pathogen, a different feature of VIRULIGN is demonstrated.
Virus sequence datasets were collected from public databases (i.e., Genbank and the Stanford HIV Drug Resistance Database) and passed to different alignment software applications for evaluation. We used as few processing steps as possible between dataset retrieval and alignment input to clearly illustrate the strength of VIRULIGN. All data files of these examples can be found on the tutorial web page:
https://github.com/rega-cev/virulign-tutorial
We compare the output of VIRULIGN against three popular alignment tools (MAFFT, MUSCLE and Clustal Omega) in their ability to construct an accurate codon-correct alignment from a genome sequence dataset.
Genome sequence data of DENV-1 (i.e., Dengue Serotype 1) were collected from the Dengue Virus Variation Database [Hatcher et al, 2017]. Only full-length nucleotide sequences originating from a human host were retained and identical sequences were collapsed. From a total of 3539 genome sequences, the corresponding serotype information was used to select a subset of 1432 DENV-1 isolates.
The input FASTA file ’denv-1.fasta’ can be found in the tutorial Dengue folder:
https://github.com/rega-cev/virulign-tutorial/examples-alignments/DENV/
The NCBI Reference Sequence for DENV-1 is NC_001477. A FASTA file ’NC_001477.fasta’ that contains this reference sequence can be found in the tutorial Dengue folder:
https://github.com/rega-cev/virulign-tutorial/examples-alignments/DENV/
Alignments were constructed using the default or recommended parameters for each tool. The following versions were downloaded: VIRULIGN (v1.0), MAFFT (v7.313) [Katoh et al, 2014], MUSCLE (v3.8.31) [Edgar et al, 2004] and Clustal Omega (v1.2.3) [Sievers et al, 2011]. MUSCLE was used with the additional option -diags, which is intended for alignments of highly similar sequences. No additional parameters were used for the other programs, although for individual cases, the use of specific parameters could affect the speed or accuracy of the alignment construction process.
The genome sequence of NC_001477 was added to the target dataset to facilitate the trimming of constructed alignments to the boundaries of the reference CDS, in order to remove the alignment of the 5’/3’ untranslated regions. The codon-correctness of the alignment was then evaluated by visually inspecting the amino acid translation of the respective alignments.
The following commands were used to obtain alignments:
$ mafft --auto denv-1.fasta > denv-1-mafft.fasta
$ muscle -maxiters 1 -diags -in denv-1.fasta -out denv-1-muscle.fasta
$ clustalo --auto -i denv-1.fasta -o denv-1-clustalo.fasta
$ virulign NC_001477.fasta denv-1.fasta \
    --exportKind GlobalAlignment \
    --exportAlphabet Nucleotides > denv-1-virulign.fasta
Each alignment was then trimmed to the start and stop position of the coding sequence of the reference sequence; the size of this coding region is 10188 nucleotides. All trimmed alignments can be found in the tutorial Dengue folder:
https://github.com/rega-cev/virulign-tutorial/examples-alignments/DENV
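The trimming step can be sketched as follows: the ungapped positions of the reference CDS are mapped to alignment columns using the aligned reference row, and the same column slice is then applied to every row. This is a minimal illustration with a hypothetical function name and a toy alignment; it is not the procedure used to produce the files above. For DENV-1, the values 95 and 10273 from the Genbank CDS would be the natural arguments.

```python
def columns_for_reference_range(aligned_reference, first, last):
    """Map 1-based, inclusive reference positions to alignment columns.

    Returns a (start, end) pair usable as a Python slice over any row
    of the same alignment.
    """
    position = 0
    start = end = None
    for column, char in enumerate(aligned_reference):
        if char == "-":
            continue  # gap columns do not advance the reference position
        position += 1
        if position == first:
            start = column
        if position == last:
            end = column + 1
            break
    if start is None or end is None:
        raise ValueError("range lies outside the reference sequence")
    return start, end


# Toy alignment: trim every row to reference positions 2..5.
alignment = {"Ref": "--AC-GTAC-", "Seq1": "TTACAGTACG"}
start, end = columns_for_reference_range(alignment["Ref"], 2, 5)
trimmed = {name: row[start:end] for name, row in alignment.items()}
print(trimmed)
```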
From this evaluation, it can be observed that VIRULIGN handles insertions and deletions without disrupting the reading frame, so that no stop codons appear within the alignment, while maintaining the quality of the alignment. The figure below shows a selected window from the constructed alignments to illustrate the codon-correctness of VIRULIGN. We recorded the time needed for each alignment construction (Table below) on a 3.6 GHz Intel Core i7 CPU with 12 GB of RAM, where each application had access to 1 CPU core. This evaluation shows that VIRULIGN obtains these results while remaining computationally competitive: it is slower than MAFFT, but considerably faster than MUSCLE and Clustal Omega.
Command | Alignment | Run time |
---|---|---|
mafft | denv-1-mafft.fasta | 2m33s |
muscle | denv-1-muscle.fasta | 58m |
clustalo | denv-1-clustalo.fasta | 760m |
virulign | denv-1-virulign.fasta | 19m46s |
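To compare these run times on a common scale, the values in the table can be converted to seconds and expressed relative to VIRULIGN. The snippet below is plain arithmetic on the numbers reported above; the helper name is hypothetical.

```python
import re


def to_seconds(duration):
    """Convert strings such as '19m46s' or '58m' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", duration)
    minutes, seconds = (int(group or 0) for group in match.groups())
    return 60 * minutes + seconds


run_times = {"mafft": "2m33s", "muscle": "58m", "clustalo": "760m", "virulign": "19m46s"}
baseline = to_seconds(run_times["virulign"])
for tool, duration in run_times.items():
    print(f"{tool}: {to_seconds(duration) / baseline:.2f}x the VIRULIGN run time")
```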
While the reference sequence for DENV-1 was provided by means of a simple FASTA file, VIRULIGN can also be used with an XML file containing both the reference sequence and the protein annotation of the reference genome. To construct the same alignment as above, with an annotated reference sequence, use this command:
$ virulign DENV1-NC001477.xml denv-1.fasta \
    --exportKind GlobalAlignment \
    --exportAlphabet Nucleotides > denv-1-virulign.fasta
XML annotation files for each of the four DENV serotypes are available in the VIRULIGN references folder:
https://github.com/rega-cev/virulign/references
More information on this XML feature is available in the section Use and Features. In the next section, we demonstrate how this XML annotation file can be used to obtain alignments directed towards specific research applications.
In 2015, ZIKV caused a worldwide public health emergency, resulting in an intensive community effort to identify genomic correlates of disease manifestations of microcephaly and other neurological complications. As we have shown recently [Theys et al, 2017], the rapid advance in ZIKV genomics resulted in inconsistencies that complicate the interpretation, reproducibility and comparison of findings from and across studies, particularly due to the lack of consensus on a standardized and representative reference annotation. ZIKV reference genomes did not match virus strains sampled from the global epidemic or showed a high level of heterogeneity in reported peptide lengths across their genome annotations.
To mitigate these concerns, we provided a correction with respect to the NCBI reference sequence NC_012532 for ZIKV (Figure below). More information on the corrected reference sequence can be found at the Rega ZIKV reference sequence website.
This example shows how the XML configuration file, which describes the genome annotation for all proteins, can greatly simplify the search for associations of genomic features with clinical, epidemiological or evolutionary parameters. We show that VIRULIGN makes this possible while keeping manual processing to a minimum.
In particular, we replicate the evidence that indicated the necessity to correct the reference genome that was proposed by the NCBI. Extensive variation of an N-glycosylation motif in the Envelope (E) protein can be observed between different hosts, different viral lineages and even within a set of virus genomes derived from the historical MR766 strain.
A specific subset of 19 (near-)complete ZIKV genomes was collected from Genbank. The resulting FASTA file ’zikv.fasta’ can be found in the tutorial Zika folder:
https://github.com/rega-cev/virulign-tutorial/examples-alignments/ZIKV/
The reference genome sequence and the corresponding protein annotation for ZIKV can be found in the XML configuration file. This XML file ’ZIKV-rega.xml’ can be found in the VIRULIGN references folder:
https://github.com/rega-cev/virulign/references
Previous literature has shown variability of the glycosylation motif around positions 150 - 165 in the E protein, and this has been suggested to result from excessive in vitro passaging [Theys et al, 2017]. We used VIRULIGN to create a position table of the amino acids in the alignment, annotated according to the respective protein. The following command was used:
virulign ZIKV-rega.xml zikv.fasta \
    --exportKind PositionTable \
    --exportReferenceSequence yes > zikv-aligned.csv
The resulting CSV file contains a column for each position in the genome, and can be found in the tutorial Zika folder:
https://github.com/rega-cev/virulign-tutorial/examples-alignments/ZIKV/
A simple R script shows the variability of the glycosylation motif across the different virus variants.
# import the alignment file
data<-read.csv('zikv-aligned.csv',header=TRUE)
# determine the positions of the motif
reg<-match('E_151',names(data)):match('E_163',names(data))
# show the motif sequences
tidyr::unite(data[,c(1,reg)],Motif,-1,sep='')
The relevant region in the CSV file looks like:
id,E_151,E_152,E_153,E_154,E_155,E_156,E_157,E_158,E_159,E_160,E_161,E_162,E_163
Ref,M,I,V,N,D,T,G,H,E,T,D,E,N
KF268948,M,I,V,N,D,I,G,H,E,T,D,E,N
KF268949,M,I,V,N,-,-,-,-,-,-,D,E,N
KF268950,M,I,V,N,D,I,G,H,E,T,D,E,N
KU955595,M,I,V,N,D,T,G,H,E,T,D,E,N
KY014323,M,I,V,N,D,T,G,H,E,T,D,E,N
KU963574,M,I,V,N,-,-,-,-,-,-,D,E,N
HQ234500,M,I,V,N,-,-,-,-,-,-,D,E,N
KX369547,M,I,V,N,D,T,G,H,E,T,D,E,N
....
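For readers who prefer Python over R, the same motif extraction can be sketched with the standard `csv` module. The sample below embeds two of the rows shown above so that the snippet is self-contained; the function name is hypothetical, and the column names E_151 and E_163 are taken from the header of the position table.

```python
import csv
import io

SAMPLE = """id,E_151,E_152,E_153,E_154,E_155,E_156,E_157,E_158,E_159,E_160,E_161,E_162,E_163
Ref,M,I,V,N,D,T,G,H,E,T,D,E,N
KF268949,M,I,V,N,-,-,-,-,-,-,D,E,N
"""


def motif_per_sequence(csv_text, first="E_151", last="E_163"):
    """Join the columns from `first` to `last` into one string per row."""
    rows = csv.reader(io.StringIO(csv_text))
    header = next(rows)
    start, end = header.index(first), header.index(last) + 1
    return {row[0]: "".join(row[start:end]) for row in rows if row}


for seq_id, motif in motif_per_sequence(SAMPLE).items():
    print(seq_id, motif)
```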
When additional meta-data is given as well, this analysis clearly illustrates the presence of a VNDT motif in viruses sampled from the recent epidemic, and an independence of the deletion regarding the host, year of collection and viral lineage (Figure below).
An alignment CSV table, that includes protein and position data in the header, can be easily processed by external tools. One could easily develop their own scripts to operate on this format, but many interesting manipulations can be done using default command line tools as well.
As an example, based on the whole-genome ZIKV alignment above, we can easily select a particular protein (e.g., the NS3 protein) using csvkit.
# define index of the NS3 headers
ns3_headers=`csvcut -n ZIKA-pos.csv | grep NS3 | cut -d ":" -f 2`
# use comma as separator between header columns
ns3_headers=`echo $ns3_headers | tr ' ' ','`
# extract the NS3 region from the whole-genome alignment
csvcut -c "seqid,${ns3_headers}" ZIKA-pos.csv > ZIKA-NS3-pos.csv
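Where csvkit is not available, a comparable selection can be written in a few lines of Python. The sketch below assumes that column headers follow the `<protein>_<position>` pattern seen in the position tables of this tutorial and that the first column holds the sequence identifier. The function name and the toy table are made up for illustration.

```python
import csv
import io


def select_protein(csv_text, protein):
    """Keep the first (identifier) column plus every column of one protein."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    keep = [0] + [i for i, name in enumerate(header) if name.startswith(protein + "_")]
    return [[row[i] for i in keep] for row in rows]


# Toy position table with made-up residues.
TOY_TABLE = "seqid,NS2B_130,NS3_1,NS3_2,NS4A_1\nRef,R,S,G,G\nSeq1,R,S,A,G\n"
for row in select_protein(TOY_TABLE, "NS3"):
    print(",".join(row))
```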
The genome structure of HIV-1 is characterized by three ORFs, where each frame determines different genes encoding the viral proteins. The gag/pol/env gene organization, as in other retroviruses, encodes important structural proteins and enzymes, which are first translated as large polyproteins. Gag and pol have overlapping ORFs, requiring a ribosomal frameshift to reveal the pol ORF. HIV-1 gag encodes several structural proteins and is considered a potential target for antiretroviral treatment [Tedbury et al, 2015].
HIV sequences were obtained from a large-scale analysis of HIV-1 diversity [Li et al, 2015], in which 2966 sequences from different HIV-1 subtypes were analyzed. The FASTA input file ’hiv.fasta’ can be found in the tutorial HIV-1 folder:
https://github.com/rega-cev/virulign-tutorial/examples-alignments/HIV-1/
The HXB2 sequence NC_001802 was used as the reference genome. An XML file with the corresponding coding sequence and annotation is available for each of the relevant polyproteins. The respective files ’HIV-HXB2-env.xml’, ’HIV-HXB2-gag.xml’ and ’HIV-HXB2-pol.xml’ are available in the VIRULIGN references folder:
https://github.com/rega-cev/virulign/references
For this example, we used the file ’HIV-HXB2-gag.xml’.
To align the gag sequences, we used the following command:
virulign HIV-HXB2-gag.xml HIV.fasta \
    --exportKind GlobalAlignment \
    --exportAlphabet Nucleotides \
    --exportReferenceSequence yes > HIVgag.fasta
The option --exportWithInsertions no prevents insertions relative to the reference sequence from being exported. This can be used to inspect the quality of the sequence dataset when insertions are sparse throughout the dataset or when insertions are not expected.
virulign HIV-HXB2-gag.xml HIV.fasta \
    --exportKind GlobalAlignment \
    --exportAlphabet Nucleotides \
    --exportWithInsertions no \
    --exportReferenceSequence yes > HIVgag-NoInsertions.fasta
A second example of HIV-1 alignment is directed towards drug resistance detection, which remains a major need for successful treatment, in particular in developing countries as a result of the scale-up of antiretroviral treatment. Therefore, the identification, understanding and interpretation of resistance mutations remain an important research topic. The pol polyprotein is cleaved into three viral enzymes (protease, reverse transcriptase and integrase), each of which is an important drug target.
We downloaded a large set of reverse transcriptase sequences (N=111223) from the Stanford University HIV Drug Resistance Database [Rhee et al, 2003]. These sequences are heterogeneous in length and in their coverage of the reverse transcriptase region. The resulting FASTA file ’HIVdb.fasta’ can be found in the tutorial HIV-1 folder:
https://github.com/rega-cev/virulign-tutorial/examples-alignments/HIV-1/
The HXB2 sequence NC_001802 was used as the reference genome. An XML file with the corresponding coding sequence and annotation is available for each specific ORF. The respective files ’HIV-HXB2-env.xml’, ’HIV-HXB2-gag.xml’ and ’HIV-HXB2-pol.xml’ are available in the VIRULIGN references folder:
https://github.com/rega-cev/virulign/references
For this example, we use the file ’HIV-HXB2-pol.xml’.
virulign HIV-HXB2-pol.xml HIVdb.fasta \
    --exportKind GlobalAlignment \
    --exportAlphabet Nucleotides \
    --exportWithInsertions no \
    --exportReferenceSequence yes > HIVrt.fasta
VIRULIGN aligned 111189 sequences, and we visually inspected the quality of the alignment. The constructed alignment can be used as input for different applications that identify and interpret drug resistance mutations. VIRULIGN scales well to large datasets, which is reflected in favorable run times for this analysis: VIRULIGN performed this alignment in 49 minutes, while MAFFT took 10 hours and 50 minutes (on the same hardware configuration as described earlier).
The remaining 34 sequences could not be aligned by VIRULIGN. Their identifiers and sequences are stored in the files ’HIVdb-errorsequences.txt’ and ’HIVdb-errorsequences.fasta’, respectively, in the tutorial HIV-1 folder. The identifiers of these sequences can be obtained by using the Unix command ’diff’ on the headers of the input and output files, and the sequences can subsequently be extracted from the input file using these identifiers.
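The header comparison can also be scripted. The sketch below collects the identifiers from the input and output FASTA text and reports those missing from the output. It assumes that VIRULIGN keeps the sequence names unchanged in the exported alignment; the function names and the toy records are hypothetical.

```python
def fasta_ids(fasta_text):
    """Return the identifier (first word of each header line), in order."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def missing_ids(input_fasta, output_fasta):
    """Identifiers present in the input but absent from the output."""
    aligned = set(fasta_ids(output_fasta))
    return [seq_id for seq_id in fasta_ids(input_fasta) if seq_id not in aligned]


# Toy input and output: Seq1 did not make it into the alignment.
input_fasta = ">Ref\nAAA\n>Seq1\nCCC\n>Seq2\nGGG\n"
output_fasta = ">Ref\nAAA\n>Seq2\nGGG\n"
print(missing_ids(input_fasta, output_fasta))  # ['Seq1']
```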
Alternatively, the VIRULIGN parameter --nt-debug can be used to automatically write the failed sequences to a folder, which should be created prior to command execution. Adding this parameter to the command used above gives the following:
virulign HIV-HXB2-pol.xml HIVdb.fasta \
    --exportKind GlobalAlignment \
    --exportAlphabet Nucleotides \
    --exportWithInsertions no \
    --exportReferenceSequence yes \
    --nt-debug Failed > HIVrt.fasta
The directory ’Failed’ in the tutorial HIV-1 folder contains pairwise sequence alignments of each failed target sequence with the reference sequence. To demonstrate the debug feature of VIRULIGN without having to re-align the entire dataset, we aligned the subset of 34 sequences with VIRULIGN using the following command:
virulign HIV-HXB2-pol.xml HIVdb-errorsequences.fasta \
    --exportKind GlobalAlignment \
    --exportAlphabet Nucleotides \
    --nt-debug Failed
Each alignment in the folder ’Failed’ can be inspected more closely to investigate why the sequence was excluded from the alignment. The figure below shows the example of sequence 64344, which failed to be included in the final MSA.
To further demonstrate that VIRULIGN provides accurate and fast codon-aware sequence alignments, we downloaded from Genbank the set of sequences described in a recent publication [Chaplin et al, 2018]. This study analyzed the distinct patterns of thymidine analogue mutations with K65R in HIV-1 patients failing tenofovir-based antiretroviral therapy. We have added the sequence file ’chaplin2018-sequences.fasta’ to the tutorial HIV-1 folder.
We aligned the set of sequences with VIRULIGN using the following command:
virulign HIV-HXB2-pol.xml chaplin2018-sequences.fasta \
    > chaplin2018-mutations.csv
We counted the occurrences of the reverse transcriptase resistance mutations K65R (NRTI) and K103N (NNRTI) in this dataset: K65R was detected 92 times and K103N 145 times. These counts exactly match the mutation frequencies obtained using the Stanford HIVdb pipeline that was used in the study of Chaplin et al., which supports confidence in the accuracy of VIRULIGN alignments.
For comparison purposes, we also constructed an MSA of this set of sequences with VIRULIGN and with MAFFT, and subsequently trimmed both alignments to the first position of RT. The figure below provides a visual comparison of the two constructed alignments and illustrates the codon-awareness of VIRULIGN. Insertions cause frameshifts and the inclusion of stop codons in the MAFFT alignment, while VIRULIGN accommodates these insertions without disturbing the reading frame of the alignment.
With MAFFT:
mafft --auto chaplin2018-sequences.fasta \
    > chaplin2018-sequences-mafft.fasta
With VIRULIGN:
virulign HIV-HXB2-pol.xml chaplin2018-sequences.fasta \
    --exportKind GlobalAlignment \
    --exportAlphabet Nucleotides \
    --exportWithInsertions no \
    --exportReferenceSequence yes > chaplin2018-sequences-virulign.fasta
VIRULIGN has been used for a large number of analyses with respect to virus genomics. We provide a non-exhaustive list of examples:
Identification of HIV-1 drug resistance mutations and the pathways emerging under drug selective pressure, as well as modeling the different factors leading to treatment failure and trends over time [Ngcapu et al, 2017; Theys et al, 2013; Vercauteren et al, 2003; Vercauteren et al, 2008; Deforche et al, 2008a; Deforche et al, 2008b].
Large-scale analysis of HIV-1 and HCV sequence datasets to explore genetic diversity at population level and to map structural and functional factors that shape viral evolution [Abecasis et al, 2013; Li et al, 2015; Cuypers et al, 2015; Cuypers et al, 2016].
Evaluation of the annotation and representativeness of current reference genomes, and support of correcting the NCBI reference sequence for the Zika virus [Theys et al, 2017].
Web-application to support surveillance and tracing of viral outbreaks (e.g., HIV-1 and DENV), necessitating efficient analysis of large sequence databases and phylogenetic trees [Libin et al, 2017].
VIRULIGN was recently integrated into the RegaDB data management and analysis platform for the clinical follow-up of HIV-1 patients [Vercauteren et al, 2013; Libin et al, 2013].
Evaluation of an automated framework for the virus typing (HIV-1, Dengue, Zika and other viruses) and resistance interpretation algorithms [Pineda et al, 2013; Snoeck et al, 2006; Theys et al, 2015].
[1] A. B. Abecasis, A. M. Wensing, D. Paraskevis, J. Vercauteren, K. Theys, D. A. Van de Vijver, J. Albert, B. Asjo, C. Balotta, D. Beshkov, R. J. Camacho, B. Clotet, C. De Gascun, A. Griskevicius, Z. Gross- man, O. Hamouda, A. Horban, T. Kolupajeva, K. Korn, L. G. Kostrikis, C. Kucherer, K. Liitsola, M. Linka, C. Nielsen, D. Otelea, R. Pare- des, M. Poljak, E. Puchhammer-Stockl, J. C. Schmit, A. Sonnerborg, D. Stanekova, M. Stanojevic, D. Struck, C. A. Boucher, and A. M. Van- damme. HIV-1 subtype distribution and its demographic determinants in newly diagnosed patients in Europe suggest highly compartmentalized epi- demics. Retrovirology, 10:7, Jan 2013.
[2] B. Chaplin, G. Imade, C. Onwuamah, G. Odaibo, R. Audu, J. Okpokwu, D. Olaleye, S. Meloni, H. Rawizza, M. Muazu, A. Z. Musa, J. Samuel, O. Agbaji, O. Ezechi, E. Idigbe, and P. J. Kanki. Distinct Pattern of Thymidine Analogue Mutations with K65R in Patients Failing Tenofovir-Based Antiretroviral Therapy. AIDS Res. Hum. Retroviruses, 34(2):228–233, Feb 2018.
[3] L. Cuypers, G. Li, P. Libin, S. Piampongsant, A. M. Vandamme, and K. Theys. Genetic Diversity and Selective Pressure in Hepatitis C Virus Genotypes 1-6: Significance for Direct-Acting Antiviral Treatment and Drug Resistance. Viruses, 7(9):5018–5039, Sep 2015.
[4] L. Cuypers, G. Li, C. Neumann-Haefelin, S. Piampongsant, P. Libin, K. Van Laethem, A. M. Vandamme, and K. Theys. Mapping the genomic diversity of HCV subtypes 1a and 1b: Implications of structural and immunological constraints for vaccine and drug development. Virus Evol, 2(2):vew024, Jul 2016.
[5] K. Deforche, R. J. Camacho, Z. Grossman, M. A. Soares, K. Van Laethem, D. A. Katzenstein, P. R. Harrigan, R. Kantor, R. Shafer, A. M. Vandamme, R. Kantor, D. A. Katzenstein, R. W. Shafer, R. J. Camacho, A. P. Carvalho, B. Wynhoven, P. R. Harrigan, P. Cane, J. Clarke, J. Weber, S. Sirivichayakul, P. Phanuphak, M. A. Soares, A. Tanuri, J. Snoeck, A. M. Vandamme, L. Morris, H. Rudich, Z. Grossman, J. M. Schapiro, R. Rodrigues, L. F. Brigido, A. Holguin, V. Soriano, K. Ariyoshi, W. Sugiura, M. B. Bouzas, P. Cahn, D. Pillay, T. L. Katzenstein, and L. B. Jørgensen. Bayesian network analyses of resistance pathways against efavirenz and nevirapine. AIDS, 22(16):2107–2115, Oct 2008.
[6] K. Deforche, A. Cozzi-Lepri, K. Theys, B. Clotet, R. J. Camacho, J. Kjaer, K. Van Laethem, A. Phillips, Y. Moreau, J. D. Lundgren, and A. M. Vandamme. Modelled in vivo HIV fitness under drug selective pressure and estimated genetic barrier towards resistance are predictive for virological response. Antivir. Ther. (Lond.), 13(3):399–407, 2008.
[7] R. C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32(5):1792–1797, 2004.
[8] E. L. Hatcher, S. A. Zhdanov, Y. Bao, O. Blinkova, E. P. Nawrocki, Y. Ostapchuck, A. A. Schaffer, and J. R. Brister. Virus Variation Resource - improved response to emergent viral outbreaks. Nucleic Acids Res., 45(D1):D482–D490, Jan 2017.
[9] K. Katoh and D. M. Standley. MAFFT: iterative refinement and additional methods. Methods Mol. Biol., 1079:131–146, 2014.
[10] G. Li, S. Piampongsant, N. R. Faria, A. Voet, A. C. Pineda-Pena, R. Khouri, P. Lemey, A. M. Vandamme, and K. Theys. An integrated map of HIV genome-wide variation from a population perspective. Retrovirology, 12:18, Feb 2015.
[11] P. Libin, G. Beheydt, K. Deforche, S. Imbrechts, F. Ferreira, K. Van Laethem, K. Theys, A. P. Carvalho, J. Cavaco-Silva, G. Lapadula, C. Torti, M. Assel, S. Wesner, J. Snoeck, J. Ruelle, A. De Bel, P. Lacor, P. De Munter, E. Van Wijngaerden, M. Zazzi, R. Kaiser, A. Ayouba, M. Peeters, T. de Oliveira, L. C. Alcantara, Z. Grossman, P. Sloot, D. Otelea, S. Paraschiv, C. Boucher, R. J. Camacho, and A. M. Vandamme. RegaDB: community-driven data management and analysis for infectious diseases. Bioinformatics, 29(11):1477–1480, Jun 2013.
[12] P. Libin, E. Vanden Eynden, F. Incardona, A. Nowe, A. Bezenchek, A. Sonnerborg, A. M. Vandamme, K. Theys, and G. Baele. PhyloGeoTool: interactively exploring large phylogenies in an epidemiological context. Bioinformatics, 33(24):3993–3995, Dec 2017.
[13] S. Ngcapu, K. Theys, P. Libin, V. C. Marconi, H. Sunpath, T. Ndung’u, and M. L. Gordon. Characterization of Nucleoside Reverse Transcriptase Inhibitor-Associated Mutations in the RNase H Region of HIV-1 Subtype C Infected Individuals. Viruses, 9(11), Nov 2017.
[14] A. C. Pineda-Pena, N. R. Faria, S. Imbrechts, P. Libin, A. B. Abecasis, K. Deforche, A. Gomez-Lopez, R. J. Camacho, T. de Oliveira, and A. M. Vandamme. Automated subtyping of HIV-1 genetic sequences for clinical and surveillance purposes: performance evaluation of the new REGA version 3 and seven other tools. Infect. Genet. Evol., 19:337–348, Oct 2013.
[15] S. Y. Rhee, M. J. Gonzales, R. Kantor, B. J. Betts, J. Ravela, and R. W. Shafer. Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res., 31(1):298–303, 2003.
[16] F. Sievers, A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Soding, J. D. Thompson, and D. G. Higgins. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol., 7:539, Oct 2011.
[17] J. Snoeck, R. Kantor, R. W. Shafer, K. Van Laethem, K. Deforche, A. P. Carvalho, B. Wynhoven, M. A. Soares, P. Cane, J. Clarke, C. Pillay, S. Sirivichayakul, K. Ariyoshi, A. Holguin, H. Rudich, R. Rodrigues, M. B. Bouzas, F. Brun-Vezinet, C. Reid, P. Cahn, L. F. Brigido, Z. Grossman, V. Soriano, W. Sugiura, P. Phanuphak, L. Morris, J. Weber, D. Pillay, A. Tanuri, R. P. Harrigan, R. Camacho, J. M. Schapiro, D. Katzenstein, and A. M. Vandamme. Discordances between interpretation algorithms for genotypic resistance to protease and reverse transcriptase inhibitors of human immunodeficiency virus are subtype dependent. Antimicrob. Agents Chemother., 50(2):694–701, Feb 2006.
[18] P. R. Tedbury and E. O. Freed. HIV-1 Gag: an emerging target for antiretroviral therapy. In The Future of HIV-1 Therapeutics, pages 171–201. Springer, 2015.
[19] K. Theys, A. Abecasis, P. Libin, P. T. Gomes, J. Cabanas, R. J. Camacho, and K. Van Laethem. Discordant predictions of residual activity could impact dolutegravir prescription upon raltegravir failure. J. Clin. Virol., 70:120–127, Sep 2015.
[20] K. Theys, P. Libin, K. Dallmeier, A. C. Pineda-Pena, A. M. Vandamme, L. Cuypers, and A. B. Abecasis. Zika genomics urgently need standardized and curated reference sequences. PLoS Pathog., 13(9):e1006528, Sep 2017.
[21] K. Theys, J. Vercauteren, J. Snoeck, M. Zazzi, R. J. Camacho, C. Torti, E. Schulter, B. Clotet, A. Sonnerborg, A. De Luca, Z. Grossman, D. Struck, A. M. Vandamme, and A. B. Abecasis. HIV-1 subtype is an independent predictor of reverse transcriptase mutation K65R in HIV-1 patients treated with combination antiretroviral therapy including tenofovir. Antimicrob. Agents Chemother., 57(2):1053–1056, Feb 2013.
[22] J. Vercauteren, I. Derdelinckx, A. Sasse, M. Bogaert, H. Ceunen, A. De Roo, S. De Wit, K. Deforche, F. Echahidi, K. Fransen, J. C. Goffard, P. Goubau, E. Goudeseune, J. C. Yombi, P. Lacor, C. Liesnard, M. Moutschen, D. Pierard, R. Rens, Y. Schrooten, D. Vaira, A. van den Heuvel, B. van der Gucht, M. van Ranst, E. van Wijngaerden, B. Vandercam, M. Vekemans, C. Verhofstede, N. Clumeck, A. M. Vandamme, and K. van Laethem. Prevalence and epidemiology of HIV type 1 drug resistance among newly diagnosed therapy-naive patients in Belgium from 2003 to 2006. AIDS Res. Hum. Retroviruses, 24(3):355–362, Mar 2008.
[23] J. Vercauteren, K. Theys, A. P. Carvalho, E. Valadas, L. M. Duque, E. Teofilo, T. Faria, D. Faria, J. Vera, M. J. Aguas, S. Peres, K. Mansinho, A. M. Vandamme, R. J. Camacho, K. Mansinho, A. Claudia Miranda, I. Aldir, F. Ventura, J. Nina, F. Borges, E. Valadas, M. Doroana, F. Antunes, M. Joao Aleixo, M. Joao Aguas, J. Botas, T. Branco, J. Vera, I. Vaz Pinto, J. Pocas, J. Sa, L. Duque, A. Diniz, A. Mineiro, F. Gomes, C. Santos, D. Faria, P. Fonseca, P. Proenca, L. Tavares, C. Guerreiro, J. Narciso, T. Faria, E. Teofilo, S. Pinheiro, I. Germano, U. Caixas, N. Faria, A. Paula Reis, M. Bentes Jesus, G. Amaro, F. Roxo, R. Abreu, and I. Neves. The demise of multidrug-resistant HIV-1: the national time trend in Portugal. J. Antimicrob. Chemother., 68(4):911–914, Apr 2013.