Extract alleles from genome assemblies
GetAlleles is a simple program to extract alleles of target genes from genome assemblies.
It aligns reference genes to assembly contigs with minimap2 (nucleotide references) or
miniprot (amino acid references), and extracts the alleles that pass the identity and
coverage thresholds. The extracted sequences are then translated to determine whether the
protein is truncated. The nucleotide and amino acid sequences are hashed for easy allele
identification in tabular format and for submission to typing scheme databases such as
PubMLST.
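As a rough illustration of the translation step, the sketch below translates a DNA sequence and flags a stop codon that appears before the final codon. It is not GetAlleles' own code: the function names `translate` and `has_premature_stop` are hypothetical, and the rule used to call a protein "truncated" is an assumption. The codon assignments are those of the standard genetic code, which translation table 11 shares.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in NCBI codon order (TTT, TTC, TTA, TTG, TCT, ...).
# Table 11 uses the same codon-to-amino-acid assignments as the standard code.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1; unknown codons become 'X', trailing bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def has_premature_stop(protein):
    """Assumed rule: any stop ('*') before the last residue marks a truncation."""
    return "*" in protein[:-1]


if __name__ == "__main__":
    for allele in ("ATGAAATAA", "ATGTAAAAATAA"):
        protein = translate(allele)
        print(allele, protein, has_premature_stop(protein))
```

A real implementation would probably also compare the translated length against the reference, which this sketch does not attempt.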
pip install git+https://github.com/tomdstanton/GetAlleles.git
If using nucleotide references, minimap2 needs to be installed in your $PATH.
If using amino acid references, miniprot needs to be installed in your $PATH.
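Before a long run it can be worth confirming that the required aligner is actually discoverable. The snippet below is a small sketch using Python's standard `shutil.which`; the helper names `aligner_for` and `find_tool` and the `"nucleotide"`/`"protein"` labels are hypothetical, chosen here for illustration only.

```python
import shutil

# Which aligner each reference type needs, as described above.
ALIGNERS = {"nucleotide": "minimap2", "protein": "miniprot"}


def aligner_for(reference_type):
    """Return the aligner name needed for a reference type (hypothetical helper)."""
    return ALIGNERS[reference_type]


def find_tool(name):
    """Return the full path of an executable on $PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    for ref_type in ALIGNERS:
        tool = aligner_for(ref_type)
        path = find_tool(tool)
        status = path if path else "NOT FOUND in $PATH"
        print(f"{ref_type} references -> {tool}: {status}")
```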
usage: getalleles <reference> <assembly> [<assembly> ...] [options]
Extract alleles from genome assemblies
Input:
Input files / stdin can be compressed
reference Reference genes in ffn/fna format, use - for stdin
assembly Assembly file(s) in fna format
Alignment options:
-i 80, --min-id 80 Minimum identity percentage for alignment
-c 80, --min-cov 80 Minimum coverage percentage for alignment
--best-n 0 Best N hits per reference or 0 to report all (that pass filters)
--cull Culls overlapping references so only the best is kept (that pass filters)
--args Extra arguments to pass to mini{map2,prot}; MUST BE WRAPPED
Allele options:
--table 11 Codon table to use for translation
--dna-hash sha1 Algorithm for hashing the allele DNA sequence
--aa-hash md5 Algorithm for hashing the allele AA sequence
Output options:
-o , --tsv Write/append tsv report to file (default: stdout)
--ffn [alleles.ffn] Output allele DNA sequences (single file or directory)
--faa [alleles.faa] Output allele AA sequences (single file or directory)
--alt-header Sample-specific fasta headers
--no-header Suppress header in TSV
Other options:
-t 0, --threads 0 Number of indexing threads or 0 for all available
-h, --help Print help and exit
-v, --verbose Verbose messages
--version Print version and exit
getalleles v0.0.2b0
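To give a feel for what the `--min-id` and `--min-cov` thresholds mean, here is a sketch that computes both percentages from a single PAF record, the tabular format minimap2 emits. How GetAlleles itself derives these values is not documented here, so the formulas are assumptions: identity as matching bases over alignment block length, and coverage as the aligned fraction of the query, with the reference gene assumed to be the query. The function names and the example record are made up for illustration.

```python
def paf_identity_coverage(paf_line):
    """Return (identity %, query coverage %) for one PAF record.

    PAF columns used (1-based): 2 query length, 3 query start, 4 query end,
    10 number of matching bases, 11 alignment block length.
    """
    fields = paf_line.rstrip("\n").split("\t")
    query_len, query_start, query_end = int(fields[1]), int(fields[2]), int(fields[3])
    matches, block_len = int(fields[9]), int(fields[10])
    identity = 100 * matches / block_len
    coverage = 100 * (query_end - query_start) / query_len
    return identity, coverage


def passes(paf_line, min_id=80.0, min_cov=80.0):
    """Apply thresholds in the style of --min-id / --min-cov (assumed semantics)."""
    identity, coverage = paf_identity_coverage(paf_line)
    return identity >= min_id and coverage >= min_cov


if __name__ == "__main__":
    # Invented record: a 1000 bp gene aligned over 950 bp with 900 matching bases.
    record = "geneA\t1000\t0\t950\t+\tcontig1\t50000\t100\t1050\t900\t950\t60"
    print(paf_identity_coverage(record), passes(record))
```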
By default, the program will output a BED-style TSV to <stdout> with the following columns:
The report can be written to a file with the -o/--tsv flag, or with > and >> in bash.
The --no-header option will suppress the header line. If appending to a file, the header will only be written once.
To discard the TSV report, write it to /dev/null, or NUL on Windows.
You can output the extracted DNA and protein sequences in Fasta format with the --ffn and --faa flags respectively, with the defaults being alleles.{ffn,faa}.
These arguments take a file or directory. If the file already exists, it will be appended to; if the argument is a directory, one file per assembly will be generated.
The ID (header) of each sequence is the hash digest of the sequence.
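The idea of a hash-based identifier can be sketched with Python's standard `hashlib`, using the default algorithms from the help text (sha1 for DNA, md5 for protein). The function name `allele_id` is hypothetical, and the exact bytes GetAlleles hashes are not stated here, so the uppercasing step below is an assumption; digests from this sketch may not match the tool's output.

```python
import hashlib


def allele_id(sequence, algorithm="sha1"):
    """Hex digest of a sequence; uppercasing first is an assumption of this sketch."""
    data = sequence.upper().encode("ascii")
    return hashlib.new(algorithm, data).hexdigest()


if __name__ == "__main__":
    dna = "ATGAAATAA"
    protein = "MK*"
    # Identical sequences always produce identical IDs, so the same allele found
    # in different assemblies collapses to one identifier.
    print(f">{allele_id(dna, 'sha1')}\n{dna}")
    print(f">{allele_id(protein, 'md5')}\n{protein}")
```

Because the ID depends only on the sequence, two assemblies carrying the same allele yield the same header, which is what makes the TSV easy to compare across samples.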
Extract reference genes from assemblies and save results to a file:
getalleles genes.fasta assemblies/*.fna > alleles.tsv
OR
cat genes.fasta | getalleles - assemblies/*.fna -o alleles.tsv
To do the same but also write the nucleotide sequences to a fasta file:
getalleles genes.fasta assemblies/*.fna -o alleles.tsv --ffn alleles.ffn
OR
getalleles genes.fasta assemblies/*.fna -o alleles.tsv --ffn - > alleles.ffn
To pipe the allele protein sequences into an MSA program:
getalleles genes.fasta assemblies/*.fna -o alleles.tsv --ffn --faa - | muscle
To output one sequence file per assembly:
getalleles genes.fasta assemblies/*.fna -o alleles.tsv --ffn dna_seqs/ --faa protein_seqs/
v0.0.1b0
.