Seq2Geno

This package is an integrated tool for microbial sequence analyses. The methods are refactored from methods of the published research. We evaluated the results reproducibility with the same raw data.

Repository structure
Available functions
Download Seq2Geno and ensure the environment
Usage and Input
Example data and usages
Train the phenotypic predictor with the Seq2Geno results
- Start from Seq2Geno
- With precomputed data
FAQ
License
Contact
Citation

Repository structure

This repository includes:

install: information and scripts for installation
examples: the example data (the tutorial )
main: the scripts for user interface and the calling methods of core workflows
snps: the scripts for generating SNPs table
denovo: the scripts for computing de novo assemblies and the gene presence/absence table
expr: the scripts for calculating expression levels
phylo: the scripts for phylogenetic tree inference
difexpr: the methods for identifying differentially expressed genes with the expression levels matrix
cont_anc_rcn: ancestral reconstruction for continuous data such as expression levels

Available functions

detect single nucleotide variants
create de novo assemblies
compute gene presence/absence and the indels
count gene expression levels
infer the phylogenetic tree
find differentially expressed genes (additional data that won't be used by Geno2Pheno)
reconstruct ancestral values of expression level (additional data that won't be used by Geno2Pheno)

Download Seq2Geno and ensure the environment

Check the prerequisites
- conda (tested version: 4.10.0)
- file .condarc that includes these channels and is detectable by your conda
  - hzi-bifo
  - conda-forge/label/broken
  - bioconda
  - conda-forge
  - defaults
- python (tested verson: 3.7)
- Linux (tested version: Debian GNU/Linux 8.8 jessie)
- git (tested version: 2.21)
Download this package

git clone --recurse-submodules https://github.com/hzi-bifo/seq2geno.git
cd seq2geno
git submodule update --init --recursive

The command uses --recurse-submodules to download the submodules. The flag is available only in git version >2.13. Earlier git versions might have the substitute. After the package is downloaded, main/seq2geno and main/seq2geno_gui are the executable scripts for Seq2Geno.

To ensure the environment and test the package, please either follow the steps in install/README.md or use the automatic tools:

cd install/
SETENV.sh snakemake_env
conda activate snakemake_env
TESTING.sh

Usage and Input

Once the environment is properly set up, Seq2Geno can be launched using the graphical user interface (GUI) or command line

The commands

S2G

seq2geno_gui

will launch the graphic user interface. Use the tool to read, edit, or save the arguments in a yaml file. Once the arguments are ready, the analyses can be launched with this interface; for large-scale researches, however, generating the yaml file and launching the analyses with the command line method (described below) might be more convenient, as having processes running in the background should be more convenient. To learn more, please read the manual doc/GUI_manual.pdf.

command line

The input for seq2geno is a single yaml file describing all arguments:

S2G -d -f [options_yaml] -z [zip_input] -l [log_file] --outzip [output_zip_type]

Both options_yaml and zip_input specify the materials to use. At least one of them should be used. When options_yaml is properly set, zip_input will be neglected. The options_yaml describes all the options and paths to input data for Seq2Geno. The zip_input packs all the materials and has a structure that Seq2Geno can recognize (see input_zip_structure.md for more details).

The log_file should be a non-existing filename to store the log information; if not set, the messages will be directed to stdout and stderr.

The output_zip_type should be one of 'none' (default), 'all', 'main', or 'g2p'. The choice specifies whether or how the output results should be packed into an zip file.

arguments

The input file is an yaml file where all options are described. The file includes two parts:

features:

option	action	values ([default])
dryrun	display the processes and exit	[Y]/N
snps	SNPs calling	Y/[N]
denovo	creating de novo assemblies	Y/[N]
expr	counting expression levels	Y/[N]
phylo	inferring the phylogeny	Y/[N]
de	differential expression	Y/[N]
ar	ancestral reconstruction of expression levels	Y/[N]

To only create the folder and config files, please turn off the last six options.

general (* mandatory):
- cores: number of cpus (integer; automatically adjusted if larger than the available cpu number)
- mem_mb: memory size to use (integer in mb; automatically adjusted if larger than the free memory). Note: some processes may crush because of insufficiently allocated memory
- *wd: the working directory. The intermediate and final files will be stored under this folder. The final outcomes will be symlinked to the sub-directory RESULTS/.
- *dna_reads: the list of DNA-seq data
It should be a two-column list, where the first column includes all samples and the second column lists the paired-end reads files. The two reads file are separated by a comma. The first line is the first sample.
```
sample01    /paired/end/reads/sample01_1.fastq.gz,/paired/end/reads/sample01_2.fastq.gz
sample02    /paired/end/reads/sample02_1.fastq.gz,/paired/end/reads/sample02_2.fastq.gz
sample03    /paired/end/reads/sample03_1.fastq.gz,/paired/end/reads/sample03_2.fastq.gz
```
- *ref_fa, ref_gff, ref_gbk: the data of reference genome
The fasta, gff, and genbank files of a reference genome. They should have same sequence ids.
- old_config: if recognizable, the config files that were previously stored in the working directory will be reused. ('Y': on; 'N': off)
- rna_reads: the list of RNA-seq data. (string of filename)
It should be a two-column list, where the first column includes all samples and the second column lists the short reads files. The first line is the first sample.
```
sample01    /transcription/reads/sample01.rna.fastq.gz
sample02    /transcription/reads/sample02.rna.fastq.gz
sample03    /transcription/reads/sample03.rna.fastq.gz
```
- phe_table: the phenotype table (string of filename)
The table is tab-separated. For n samples with m phenotypes, the table is (n+1)-by-(m+1) as shown below. The first column should be sample names. The header line should includes names of phenotypes. Missing values are acceptable.
```
strains virulence
sample01    high
sample02    mediate
sample03    low
```
- adaptor: the adaptor file (string of filename)
The fasta file of adaptors of DNA-seq. It is used to process the DNA-seq reads.

Example data and usages

The folder examples/ includes a structured zip file and a yaml file--the two input formats that Seq2Geno can recognize. The zip file can be used as the input with this command:

S2G -z examples/example_input.zip\
 -l examples/example_input_zip.log\
 --outzip g2p

To use the configuration yaml file, please ensure unpacked example data (that is, the zip file) and edit the yaml file to ensure the right paths to those example data. After they are ready, please run with this command:

S2G -f examples/seq2geno_input.yml\
 -l exapmles/seq2geno_input_yml.log\
 --outzip g2p

Train the phenotypic predictor with the Seq2Geno results

Start from Seq2Geno

To include automatic submission to the Geno2Pheno server, just use the flag --to_gp:

S2G -f examples/seq2geno_input.yml\
 -l exapmles/seq2geno_input_yml.log\
 --outzip g2p --to_gp

With precomputed data

Please directly visit Geno2Pheno

FAQ

Why the analyses crushed?

Please check the log file or STDOUT and STDERR and determine the exact error.

Will every procedure be rerun if I want to add one sample?

No, you just need to add one more line in your reads list (i.e., the dna or the rna reads. See section arguments for more details.) and then run the same workflow again. Seq2Geno uses Snakemake to determine whether certain intermediate data need to be recomputed or not.

Will every procedure be rerun if I want to exclude one sample?

No; however, besides excluding that sample from the reads list, you will need to remove all the subsequent results that were previously computed. That could be risky.

Will every procedure be rerun if I accidentally delete some data?

No, only the deleted one and the subsequent data will be recomputed.

Where is the final data?

In the working directory, the main results are collected in the subfolder RESULTS/. You can also find the other intermediate data in the corresponding folders (e.g., mapping results)

What is the current status?

If the log file was specified, you could check the log file to determine the current status. Otherwise, the status should be directed to your STDOUT or STDERR.

License

GPLv3 (please refer to LICENSE)

Contact

Please contact Tzu-Hao Kuo (Tzu-Hao.Kuo@helmholtz-hzi.de) and specify:

the error message or unexpected situation
how to reproduce the problem

Citation

We will be publishing the paper for the joint work of Seq2Geno and Geno2Pheno. Before that, please use

Kuo, T.-H., Weimann, A., Bremges, A., & McHardy, A. C. (2021). Seq2Geno (v1.00001) [A reproducibility-aware, integrated package for microbial sequence analyses].

@software{thkuo2021seq2geno,
  author = {Tzu-Hao Kuo, Aaron Weimann, Andreas Bremges, Alice C. McHardy},
  title = {Seq2Geno: a reproducibility-aware, integrated package for microbial sequence analyses},
  version = {v1.00001},
  date = {2021-06-20},
}

thkuo / seq2geno

readme