This package is an integrated tool for microbial sequence analyses. The methods are refactored from those of published research. We evaluated the reproducibility of the results using the same raw data.
This repository includes:
Check the prerequisites
Download this package
git clone --recurse-submodules https://github.com/hzi-bifo/seq2geno.git
cd seq2geno
git submodule update --init --recursive
The command uses --recurse-submodules to download the submodules. The flag is available only in git versions later than 2.13; earlier versions may provide an equivalent option. After the package is downloaded, main/seq2geno and main/seq2geno_gui are the executable scripts for Seq2Geno.
cd install/
SETENV.sh snakemake_env
conda activate snakemake_env
TESTING.sh
Once the environment is properly set up, Seq2Geno can be launched using the graphical user interface (GUI) or the command line.
The commands
S2G
or
seq2geno_gui
will launch the graphical user interface. Use the tool to read, edit, or save the arguments in a yaml file. Once the arguments are ready, the analyses can be launched from this interface. For large-scale studies, however, generating the yaml file and launching the analyses from the command line (described below) may be more convenient, because the processes can then run in the background. To learn more, please read the manual doc/GUI_manual.pdf.
The input for seq2geno is a single yaml file describing all arguments:
S2G -d -f [options_yaml] -z [zip_input] -l [log_file] --outzip [output_zip_type]
Both options_yaml and zip_input specify the materials to use, and at least one of them must be given. When options_yaml is properly set, zip_input is ignored. The options_yaml file describes all the options and the paths to the input data for Seq2Geno. The zip_input file packs all the materials in a structure that Seq2Geno can recognize (see input_zip_structure.md for more details).
The log_file should be a non-existing filename to store the log information; if not set, the messages will be directed to stdout and stderr.
The output_zip_type should be one of 'none' (default), 'all', 'main', or 'g2p'. The choice specifies whether and how the output results are packed into a zip file.
The input file is a yaml file where all options are described. The file includes two parts:
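For scripted runs, the command line above can be assembled programmatically. The following is a minimal sketch, assuming Python 3 and that S2G is on the PATH. The helper name build_s2g_command is hypothetical and not part of Seq2Geno. It uses only the flags shown above and enforces the stated rules: at least one of options_yaml or zip_input, a log file that does not yet exist, and an output_zip_type taken from the four listed values.

```python
import os

# Values listed above for --outzip; 'none' is the documented default.
OUTZIP_CHOICES = ("none", "all", "main", "g2p")


def build_s2g_command(options_yaml=None, zip_input=None,
                      log_file=None, outzip="none"):
    """Return the argument list for one S2G run (hypothetical helper)."""
    if options_yaml is None and zip_input is None:
        raise ValueError("give at least one of options_yaml or zip_input")
    if outzip not in OUTZIP_CHOICES:
        raise ValueError("outzip must be one of %s" % ", ".join(OUTZIP_CHOICES))
    if log_file is not None and os.path.exists(log_file):
        raise ValueError("log_file must not exist yet: %s" % log_file)

    cmd = ["S2G"]
    if options_yaml is not None:
        cmd += ["-f", options_yaml]
    if zip_input is not None:
        cmd += ["-z", zip_input]
    if log_file is not None:
        cmd += ["-l", log_file]
    cmd += ["--outzip", outzip]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True); building it separately makes it easy to inspect the command before anything is launched.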
option | action | values ([default]) |
---|---|---|
dryrun | display the processes and exit | [Y]/N |
snps | SNPs calling | Y/[N] |
denovo | creating de novo assemblies | Y/[N] |
expr | counting expression levels | Y/[N] |
phylo | inferring the phylogeny | Y/[N] |
de | differential expression | Y/[N] |
ar | ancestral reconstruction of expression levels | Y/[N] |
To only create the folder and config files, please turn off the last six options.
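As an illustration of the defaults in the table, the sketch below fills in missing feature options and checks whether a configuration would only create the folder and config files. It assumes the options have already been loaded into a plain dict keyed by the option names in the table. The function names are hypothetical and not part of Seq2Geno.

```python
# Defaults copied from the table above: only dryrun is on by default.
FEATURE_DEFAULTS = {
    "dryrun": "Y",
    "snps": "N",
    "denovo": "N",
    "expr": "N",
    "phylo": "N",
    "de": "N",
    "ar": "N",
}

# The "last six options" in the table, i.e. everything except dryrun.
ANALYSIS_OPTIONS = ("snps", "denovo", "expr", "phylo", "de", "ar")


def with_defaults(features):
    """Return a copy of `features` with missing options set to their defaults."""
    unknown = set(features) - set(FEATURE_DEFAULTS)
    if unknown:
        raise KeyError("unknown options: %s" % ", ".join(sorted(unknown)))
    merged = dict(FEATURE_DEFAULTS)
    merged.update(features)
    for name, value in merged.items():
        if value not in ("Y", "N"):
            raise ValueError("%s must be 'Y' or 'N', got %r" % (name, value))
    return merged


def only_creates_config(features):
    """True if all six analysis options are switched off."""
    merged = with_defaults(features)
    return all(merged[name] == "N" for name in ANALYSIS_OPTIONS)
```

For example, only_creates_config({}) is true because every analysis option defaults to N, while only_creates_config({"snps": "Y"}) is false.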
general (* mandatory):
cores: number of cpus (integer; automatically adjusted if larger than the available cpu number)
mem_mb: memory size to use (integer in mb; automatically adjusted if larger than the free memory). Note: some processes may crash because of insufficiently allocated memory
*wd: the working directory. The intermediate and final files will be stored under this folder. The final outcomes will be symlinked to the sub-directory RESULTS/.
*dna_reads: the list of DNA-seq data
It should be a two-column list, where the first column includes all samples and the second column lists the paired-end read files. The two read files are separated by a comma. There is no header line: the first line describes the first sample.
sample01 /paired/end/reads/sample01_1.fastq.gz,/paired/end/reads/sample01_2.fastq.gz
sample02 /paired/end/reads/sample02_1.fastq.gz,/paired/end/reads/sample02_2.fastq.gz
sample03 /paired/end/reads/sample03_1.fastq.gz,/paired/end/reads/sample03_2.fastq.gz
The fasta, gff, and genbank files of a reference genome. They should use the same sequence IDs.
old_config: if recognizable, the config files that were previously stored in the working directory will be reused. ('Y': on; 'N': off)
rna_reads: the list of RNA-seq data. (string of filename)
It should be a two-column list, where the first column includes all samples and the second column lists the short reads files. The first line is the first sample.
sample01 /transcription/reads/sample01.rna.fastq.gz
sample02 /transcription/reads/sample02.rna.fastq.gz
sample03 /transcription/reads/sample03.rna.fastq.gz
The table is tab-separated. For n samples with m phenotypes, the table is (n+1)-by-(m+1) as shown below. The first column should contain the sample names. The header line should include the names of the phenotypes. Missing values are acceptable.
strains virulence
sample01 high
sample02 mediate
sample03 low
The fasta file of adaptors of DNA-seq. It is used to process the DNA-seq reads.
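Before launching a run, the DNA reads list and the phenotype table described above can be checked with a few lines of Python. This is a minimal sketch, not Seq2Geno's own parser: the function names are hypothetical, the two columns of the reads list are assumed to be separated by whitespace (so paths must not contain spaces), and an empty phenotype cell is assumed to denote a missing value.

```python
import csv
import io


def parse_dna_reads(text):
    """Map each sample to its pair of read files.

    Assumes whitespace between the two columns and a comma between
    the two paired-end files, as in the dna_reads example.
    """
    reads = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError("line %d: expected 2 columns" % line_no)
        sample, files = fields[0], fields[1].split(",")
        if len(files) != 2:
            raise ValueError("line %d: expected 2 read files" % line_no)
        if sample in reads:
            raise ValueError("line %d: duplicated sample %s" % (line_no, sample))
        reads[sample] = tuple(files)
    return reads


def parse_phenotypes(text):
    """Read the tab-separated phenotype table.

    Returns (phenotype_names, {sample: {phenotype: value or None}}).
    Empty cells are treated as missing values (an assumption).
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    names = header[1:]  # the first header cell labels the sample column
    table = {}
    for row in body:
        if len(row) != len(header):
            raise ValueError("row for %s has %d columns, expected %d"
                             % (row[0], len(row), len(header)))
        table[row[0]] = {name: (value if value != "" else None)
                         for name, value in zip(names, row[1:])}
    return names, table
```

Comparing the keys of the two results is then a quick way to spot samples that have reads but no phenotype, or the reverse.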
The folder examples/ includes a structured zip file and a yaml file, the two input formats that Seq2Geno can recognize. The zip file can be used as the input with this command:
S2G -z examples/example_input.zip \
  -l examples/example_input_zip.log \
  --outzip g2p
To use the configuration yaml file, please unpack the example data (that is, the zip file) and edit the yaml file so that its paths point to the unpacked data. After they are ready, please run this command:
S2G -f examples/seq2geno_input.yml \
  -l examples/seq2geno_input_yml.log \
  --outzip g2p
To include automatic submission to the Geno2Pheno server, add the flag --to_gp:
S2G -f examples/seq2geno_input.yml \
  -l examples/seq2geno_input_yml.log \
  --outzip g2p --to_gp
Please directly visit Geno2Pheno
Why did the analyses crash?
Please check the log file or STDOUT and STDERR and determine the exact error.
Will every procedure be rerun if I want to add one sample?
No, you just need to add one more line to your reads list (i.e., the DNA or the RNA reads list; see the arguments section for more details) and then run the same workflow again. Seq2Geno uses Snakemake to determine whether certain intermediate data need to be recomputed.
Will every procedure be rerun if I want to exclude one sample?
No; however, besides excluding that sample from the reads list, you will need to remove all the subsequent results that were previously computed. That could be risky.
Will every procedure be rerun if I accidentally delete some data?
No, only the deleted one and the subsequent data will be recomputed.
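The answers above follow from how make-like workflow tools decide what to recompute. The sketch below is a deliberately simplified illustration, not Snakemake's actual implementation: it assumes a step is rerun when its output file is missing or older than any of its inputs, and the function name needs_rerun is hypothetical.

```python
import os


def needs_rerun(output, inputs):
    """Simplified make-style check (an illustration, not Snakemake's code).

    A step is rerun if its output is missing or older than any input.
    """
    if not os.path.exists(output):
        return True  # e.g. the file was deleted or belongs to a new sample
    out_time = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_time for path in inputs)
```

Under this model, outputs of existing samples stay up to date when a new sample is added, whereas a deleted intermediate file triggers its own step and, once regenerated with a newer timestamp, the steps that depend on it.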
Where is the final data?
In the working directory, the main results are collected in the subfolder RESULTS/. You can also find the other intermediate data in the corresponding folders (e.g., mapping results).
What is the current status?
If the log file was specified, you could check the log file to determine the current status. Otherwise, the status should be directed to your STDOUT or STDERR.
GPLv3 (please refer to LICENSE)
Please contact Tzu-Hao Kuo (Tzu-Hao.Kuo@helmholtz-hzi.de) and specify:
We will be publishing the paper for the joint work of Seq2Geno and Geno2Pheno. Before that, please use
Kuo, T.-H., Weimann, A., Bremges, A., & McHardy, A. C. (2021). Seq2Geno (v1.00001) [A reproducibility-aware, integrated package for microbial sequence analyses].
or
@software{thkuo2021seq2geno,
author = {Tzu-Hao Kuo and Aaron Weimann and Andreas Bremges and Alice C. McHardy},
title = {Seq2Geno: a reproducibility-aware, integrated package for microbial sequence analyses},
version = {v1.00001},
date = {2021-06-20},
}