The process of going from an annotated genome to a valid NCBI submission is somewhat cumbersome. "Boutique" genome projects typically produce a scaffolded assembly in FASTA format, as produced by any of a variety of de novo assemblers, and predicted genes in GFF3 tabular format, e.g. as produced by the maker pipeline. No convenient tools appear to exist to turn these results into a format and standard that NCBI accepts.
NCBI requires that "whole genome shotgun" (WGS) genomes are submitted as .sqn files. A .sqn file is a file in ASN.1 syntax that contains the sequences, their features, and the metadata about the submission, i.e. the authors, the publication title, the organism, etc. Such files are normally produced by the sequin program, which has a graphical user interface. Sequin works fine for a single gene or for a small genome (e.g. a mitochondrial genome), but for large genomes with thousands of genes spread out over potentially thousands of scaffolds, submitting in this way is unworkable.
The alternative is to use the tbl2asn command line program, which takes a directory with FASTA files (.fsa), corresponding files with the gene features in tabular format (.tbl), and a submission template (template.sbt), to produce .sqn files. The trick thus becomes to convert the assembly FASTA file and the annotation GFF3 file into a collection of FASTA chunks with corresponding feature tables. This is doable in principle, as several toolkits provide generic converters, but NCBI places quite a few restrictions on what is permissible in the FASTA headers, what coordinate ranges are credible as gene features, and what gene and gene product names are acceptable.
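To make the target format concrete, here is a minimal sketch of how one gene could be rendered as a feature table entry. It assumes the tab-delimited layout described in NCBI's feature table documentation (a `>Feature` header line, a coordinate line with the feature key, then qualifier lines indented by three tabs). The function name `gene_to_tbl` and the example identifiers are hypothetical, and this is not necessarily how WGS2NCBI writes its tables.

```python
def gene_to_tbl(seq_id, start, end, strand, locus_tag):
    """Render one gene as feature table text (illustrative only)."""
    # GFF3 always lists start <= end; in the feature table a minus-strand
    # feature is written with the higher coordinate first.
    first, second = (start, end) if strand == "+" else (end, start)
    lines = [
        f">Feature {seq_id}",
        f"{first}\t{second}\tgene",
        f"\t\t\tlocus_tag\t{locus_tag}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical gene on the minus strand of a hypothetical scaffold.
print(gene_to_tbl("scaffold1", 100, 900, "-", "ABC_0001"), end="")
```

A real table also needs mRNA and CDS features with product names, which is where most of NCBI's naming restrictions apply.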
This project addresses these challenges by providing a command-line utility (with no third-party dependencies except URI::Escape) that does the required data re-formatting and cleaning. Also included is a shell script that chains the Perl scripts together and runs NCBI's tbl2asn on the result. This shell script is intended as an example and should be edited or copied to provide the right values.
The WGS2NCBI release is organized in a way that is standard for software releases written in the Perl5 programming language. This means that it can be installed using a series of commands that either you yourself, or your systems administrator, are likely already familiar with. The first step is to install a required dependency using the Perl5 package manager (cpan), as follows:
$ sudo cpan -i URI::Escape
The next steps assume that you have downloaded the WGS2NCBI release (for example from the git repository), have unzipped it, and have moved into the root folder of the release in your terminal. The commands are then as follows:
$ perl Makefile.PL
$ make test
$ sudo make install
The second command (make test) performs a number of basic tests of the software on your system. These should all pass without problems. If you do encounter issues, it is best not to proceed to the following step for the actual installation, but rather to try to resolve the outstanding problems, for example by submitting an issue report, so that the authors can help you out.
In addition to the preceding steps, you also need to install the tbl2asn program. The instructions for this are here.
WGS2NCBI is used by following a number of steps, which are detailed below:
prepare - pre-process the annotation file for rapid access in the following steps
process - convert the genome file and annotations to FASTA chunks and feature tables
convert - run tbl2asn to convert the FASTA chunks and feature tables to Sequin (.sqn) files
compress - collate the Sequin files into a single archive for upload to NCBI
Before issuing any commands, the following steps need to be taken:
.ini files correctly. Using the linked files as examples, the following need to be prepared:
prepare
Once the preparation is done, run the prepare subcommand as follows:
$ wgs2ncbi prepare -conf <wgs2ncbi.ini>
The value of the -conf argument specifies the location of the wgs2ncbi.ini configuration file. In all following steps you will also need to provide the location of this same file.
What happens during this step is that the GFF3 file is pre-processed so that the following steps will have quicker access to the relevant contents than they would have if they had to scan through the entire file every time. To be precise, the following happens:
maker pipeline, are retained.
As such, this step does the initial filtering and pre-processing. Typically you will only need to run this step once.
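The kind of pre-processing described above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: it assumes that "retained" means filtering on the GFF3 source column (column 2) and that quick access is achieved by grouping features per sequence ID. The function name `index_gff3` and the example records are hypothetical.

```python
from collections import defaultdict


def index_gff3(lines, source="maker"):
    """Group GFF3 feature lines by sequence ID, keeping a single source."""
    by_seq = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed nine-column feature line
        if cols[1] != source:
            continue  # assumption: filter on the source column
        by_seq[cols[0]].append(cols)
    return dict(by_seq)


demo = [
    "##gff-version 3",
    "scaffold1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "scaffold1\trepeatmasker\tmatch\t50\t80\t.\t+\t.\tID=rep1",
    "scaffold2\tmaker\tgene\t10\t400\t.\t-\t.\tID=gene2",
]
index = index_gff3(demo)
print(sorted(index))  # ['scaffold1', 'scaffold2']
```

With such an index, later steps can look up the features of one scaffold directly instead of re-reading the whole annotation file.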
process
This subcommand is issued as follows:
$ wgs2ncbi process -conf <wgs2ncbi.ini>
i.e. by providing the location of the wgs2ncbi.ini configuration file to the -conf argument.
The process subcommand contains most of the "intelligence" (such as it is). In this step the following happens:
chunksize, rounded up to the nearest integer. However, contigs smaller than minlength, if you have them, will be omitted, as NCBI won't accept these.
NNNs. These will be sequence fragments that NCBI will specify as inadmissible because they might be sequence adaptors (i.e. vendor-specific synthetic DNA) or contaminants.
.fsa file extension, as required by tbl2asn.
(.tbl). There will be as many .tbl files as there are .fsa files.
.fsa and the .tbl files will be written in the same directory, specified by datadir
Since the results of this step depend on the settings in the products.ini and adaptors.ini files, and since you will hear from NCBI what needs to go in these files, you will probably run this step and the following ones multiple times until NCBI is happy.
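Two of the operations in this step, masking inadmissible fragments with Ns and splitting the assembly into chunks, can be sketched as below. This is a hedged illustration: it assumes that chunksize is the number of sequences per output file, that the minlength filter is applied before chunking, and that masked ranges are 1-based and inclusive. The function names `mask_ranges` and `chunk_scaffolds` are hypothetical.

```python
import math


def mask_ranges(seq, ranges):
    """Replace 1-based, inclusive coordinate ranges with N characters."""
    chars = list(seq)
    for start, end in ranges:
        for i in range(max(start - 1, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def chunk_scaffolds(scaffolds, chunksize, minlength):
    """Drop sequences shorter than minlength, then group the rest."""
    kept = [(name, seq) for name, seq in scaffolds if len(seq) >= minlength]
    # Assumption: number of files = number of kept sequences / chunksize,
    # rounded up to the nearest integer.
    n_files = math.ceil(len(kept) / chunksize)
    return [kept[i * chunksize:(i + 1) * chunksize] for i in range(n_files)]


scaffolds = [
    ("s1", "ACGTACGTAC"),
    ("s2", "ACG"),  # shorter than minlength, so it is omitted
    ("s3", "ACGTACGT"),
    ("s4", "ACGTAC"),
]
chunks = chunk_scaffolds(scaffolds, chunksize=2, minlength=5)
print([[name for name, _ in chunk] for chunk in chunks])  # [['s1', 's3'], ['s4']]
print(mask_ranges("ACGTACGTAC", [(3, 5)]))  # ACNNNCGTAC
```

Masking keeps the sequence length unchanged, so feature coordinates elsewhere on the same scaffold stay valid.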
convert
This subcommand will run tbl2asn. As such, it is essential that this program is installed successfully according to NCBI's instructions, which are here. If the program is installed such that it is available on the PATH, you can proceed with this step without making any changes. If you've had to install it in a location where it is not on the PATH, you can specify an alternative location under tbl2asn in the main configuration file. Check that the executable is ready to run, e.g. by issuing which tbl2asn if it is on the PATH or by running it from its alternative location.
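The same check can be scripted. The sketch below uses Python's standard `shutil.which`, which mirrors the shell's `which` for bare program names and also accepts an explicit path. The function name `find_executable` and its `alternative` parameter are hypothetical and not part of WGS2NCBI.

```python
import shutil


def find_executable(name="tbl2asn", alternative=None):
    """Return a usable path to the program, or None if it is not found.

    `alternative` is an explicit path for installations that are not on
    the PATH; when given, it is checked instead of the bare name.
    """
    return shutil.which(alternative or name)


location = find_executable()
print(location if location else "tbl2asn was not found on the PATH")
```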
Once you are all set, issue the subcommand as follows:
$ wgs2ncbi convert -conf <wgs2ncbi.ini>
i.e. by providing the location of the wgs2ncbi.ini configuration file to the -conf argument. The following will then happen:
.sqn extension, will be written to the directory outdir
tbl2asn diagnosed, will be written to discrep
Note that this discrepancy report will give you the first suggestions for problematic gene product names (which you can deal with in products.ini), but this will not be exhaustive: NCBI will likely point out additional problems, and any checks for contamination or spurious adaptors will only be performed by NCBI. In other words, citing their website:
The Discrepancy Report is an evaluation of a single or multiple ASN.1 files, looking for suspicious annotation or annotation discrepancies that NCBI staff has noticed commonly occur in genome submissions, both complete and incomplete (WGS). A few of the problems that this function was written to find include inconsistent locus_tag prefixes, missing protein_id's, missing gene features, and suspect product names. The function is available in specially configured Sequin, as an argument for tbl2asn, or with the command-line program asndisc.
If you have questions about the Discrepancy Report, please contact us by email at genomes@ncbi.nlm.nih.gov prior to sending us your submission. Source: https://www.ncbi.nlm.nih.gov/genbank/asndisc/
compress
The final step simply takes the .sqn files from the previous step and combines them in a single .tar.gz archive for upload to the NCBI submission portal. No data processing of any kind takes place; this is purely for convenience and is executed as follows:
$ wgs2ncbi compress -conf <wgs2ncbi.ini>
i.e. by providing the location of the wgs2ncbi.ini configuration file to the -conf argument. The following will then happen:
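Because this step only bundles files, its effect is easy to picture. Here is a minimal sketch using Python's standard `tarfile` module. It assumes that all .sqn files sit directly in one output directory; the function name `archive_sqn` and both of its parameters are hypothetical, and the real subcommand takes its locations from the configuration file.

```python
import tarfile
from pathlib import Path


def archive_sqn(outdir, archive_path):
    """Bundle every .sqn file in outdir into one gzip-compressed tar file."""
    sqn_files = sorted(Path(outdir).glob("*.sqn"))
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sqn_files:
            # Store each file under its bare name, without directory prefix.
            tar.add(str(path), arcname=path.name)
    return [path.name for path in sqn_files]
```

Calling `archive_sqn("out", "submission.tar.gz")` would return the names of the archived files, which is a quick way to confirm that nothing was missed.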
You will then upload the produced archive to the submission portal. Once you upload the archive, you will get a verdict from whoever is handling this submission at NCBI. Depending on their feedback, you will likely have to update the configuration files a few more times to correct for spurious sequence data and gene product names, after which you will re-run the process subcommand (and onwards to convert and compress).
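Since this cycle tends to repeat, it can help to script it. The sketch below builds the three command lines exactly as they are shown in this document and runs them in order, stopping at the first failure. The function names `build_commands` and `rerun` are hypothetical, and the sketch assumes that the wgs2ncbi executable is on the PATH.

```python
import subprocess

# The subcommands that must be repeated after editing the configuration.
STEPS_AFTER_EDIT = ("process", "convert", "compress")


def build_commands(conf, steps=STEPS_AFTER_EDIT):
    """Return one argument list per subcommand, in execution order."""
    return [["wgs2ncbi", step, "-conf", conf] for step in steps]


def rerun(conf, steps=STEPS_AFTER_EDIT, runner=subprocess.run):
    """Run the steps in order; check=True aborts on a non-zero exit code."""
    for command in build_commands(conf, steps):
        runner(command, check=True)
```

After each round of feedback you would edit the configuration files and then call `rerun("wgs2ncbi.ini")`. The `runner` parameter exists only so that the function can be tried out without actually executing anything.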
WGS2NCBI is implemented as a Perl5 package. It is open source software made available under the BSD3 license.
If you experience any difficulties with this software, or you have suggestions, or want to contribute directly, you have the following options: