nicolazzie / AffyPipe

an open-source pipeline for Affymetrix Axiom genotyping workflow on livestock species

AffyPipe: an open-source pipeline for Affymetrix Axiom genotyping workflow

ref: E.L. Nicolazzi (Fondazione Parco Tecnologico Padano) - Via Einstein, Loc. Cascina Codazza (26900) Lodi (Italy). email: ezequielluis [dot] nicolazzi [at] gmail [dot] com

IMPORTANT WARNING FOR AXIOM apt2 USERS

Please note that a new series of library files is being released for many species. Most of these files carry the extension "apt2.xml". Please note that AffyPipe will not run with these files. I have tried to contact Affymetrix's DevNet several times now, but their support has not been helpful. At all. I will keep on trying to understand why on earth they keep changing their software, inputs and outputs, and how to make this new software work. Please be patient, as this issue is not due to AffyPipe but to a sudden (and poorly documented) change in Affymetrix software. Please know that the Windows GUI software works with these library files, so I'm going to write something I never thought I would: 'If you have a Windows computer at hand, please use it. It'll take you less time and mental energy to use the Windows GUI rather than trying to understand how to make the Linux/Mac versions work'.

I am truly sorry, but my hands are tied here.

Hope to get back to you with good news, but for the moment AffyPipe is in the garage.

Ezequiel L. Nicolazzi

What is AffyPipe?

The goal of this pipeline is to automate Affymetrix's standard and "best practice" genotyping workflows for Linux and Mac users: from Power Tools (APTools) to the SNPolisher R package. This is a one-step tool that combines all Affymetrix software and produces edited output files in user-friendly formats. In fact, AffyPipe allows you to edit SNP probe classes directly while exporting genotypes in PLINK format (Purcell et al., 2007). It was originally built for the International Buffalo Genome Consortium (Iamartino, 2013), but is now able to handle all species (e.g. human, cow, chicken, fish). Users are strongly advised to carefully read Affymetrix's "Axiom genotyping solution data analysis guide" and "Best practice supplement to Axiom genotyping solution data analysis user guide" before using this tool.

0) AffyPipe publication & how to cite

The AffyPipe publication can be found in: http://www.ncbi.nlm.nih.gov/pubmed/25028724

If you used this pipeline for your analysis, please cite: Nicolazzi EL, Iamartino D, Williams JL (2014). AffyPipe: an open-source pipeline for Affymetrix Axiom genotyping workflow. Bioinformatics, DOI: 10.1093/bioinformatics/btu486

Thanks in advance!

1) Getting the pipeline, and requirements

The fastest and smartest way of getting this pipeline and all accessory files is to install git and clone this repository. Further information on how to install git on Linux and Mac can be found at: http://git-scm.com/book/en/Getting-Started-Installing-Git . An example clone command, run from the command line, is:

% git clone --recursive https://github.com/nicolazzie/AffyPipe.git

The AffyPipe pipeline is for users running Linux/Unix and Mac operating systems, and only runs on 64-bit processors. Windows users should use Genotyping Console (TM) Software, which already covers all of these functionalities!!! You should have Python (2.x) and R (any version?) already installed on your computer (Mac users have Python installed by default). The whole pipeline was thoroughly tested under Python 2.7.6 and R 3.0.

IMPORTANT: Since Cygwin uses a twisted way of building Linux-like (?) paths, AffyPipe may not work properly. We strongly suggest using a virtual machine (e.g. VirtualBox) with Ubuntu (or similar) instead of Cygwin. A tip: if you really want to use Cygwin (why would you?!?!?), please know that you should use relative paths for all the folders and files involved. Absolute paths will not work.
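The requirements above (64-bit processor, no Cygwin) can be checked before launching the pipeline. The following is only a minimal sketch and is not part of AffyPipe: the function names and the list of 64-bit machine strings are assumptions made for illustration, and the list is not exhaustive.

```python
import platform
import sys

# Machine strings commonly reported for 64-bit CPUs (illustrative, not exhaustive).
KNOWN_64BIT = ("x86_64", "amd64", "arm64", "aarch64")


def is_64bit_machine(machine):
    """Return True if the machine string names a known 64-bit architecture."""
    return machine.lower() in KNOWN_64BIT


def is_cygwin(platform_name):
    """Return True if Python reports that it is running under Cygwin."""
    return platform_name.startswith("cygwin")


def environment_warnings(machine, platform_name):
    """Collect human-readable warnings for the requirements listed above."""
    warnings = []
    if not is_64bit_machine(machine):
        warnings.append("64-bit processor not detected: %s" % machine)
    if is_cygwin(platform_name):
        warnings.append("Cygwin detected: use relative paths or a Linux virtual machine")
    return warnings


if __name__ == "__main__":
    # Check the machine this script is running on.
    for message in environment_warnings(platform.machine(), sys.platform):
        print(message)
```

An empty list of warnings only means that none of the two conditions above was detected; it does not check that Python 2.x and R are installed.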

2) Folders and files required

The Affymetrix genotyping workflow requires several Affymetrix files to run. For simplicity, all these files are expected to be placed in one folder. The default folder names and values specified below are provided as examples. However, please note that these names and values are also the defaults in AffyPipe (see the "Options" paragraph in Section 3).

All Affymetrix files can be downloaded from their website (http://www.affymetrix.com). Please remember that you need to register to be able to download all the files below! NOTE: If you cloned or downloaded all the folders in this repository, you'll see example names of the files you need for the Buffalo species. All files are empty: i) to avoid copyright issues with Affymetrix; and ii) to force you to download the latest version of all the files and software.

Once you have finished, if you kept the default folder names, you should have:

3) Running AffyPipe.py

AffyPipe is very versatile, thanks to a number of options you can set. The default behavior runs Affymetrix's Standard workflow, but you can choose to perform the "Best Practice" workflow (see the "-b" option), which includes an extra PlateQC step (please see "Best practice supplement to Axiom genotyping solution data analysis user guide" for further details). AffyPipe needs two compulsory pieces of information: 1) the name (and path) of the CEL list file (e.g. the one you created with createcelfile.sh); and 2) the parameter file (PARAM_species.inp). You can find a long explanation of AffyPipe options below in the "Options" section, or a short and handy version by typing:

% python AffyPipe.py -h

or

% python AffyPipe.py --help

The general usage for the pipeline is:

% python AffyPipe.py [options] [cel-list-file]

For example, if:

You can run the pipeline like this:

% python AffyPipe.py /home/Affydata/mycellistfile.txt
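The CEL list file passed on the command line can also be generated programmatically. The sketch below is not part of AffyPipe (the createcelfile.sh script mentioned above remains the reference); it assumes the usual Affymetrix Power Tools layout for such lists, namely a header line reading cel_files followed by one CEL path per line. The function name write_cel_list is hypothetical.

```python
import os


def write_cel_list(cel_dir, list_path):
    """Write a CEL list file: a 'cel_files' header, then one CEL path per line.

    Assumption: the list follows the APT convention of a 'cel_files' header.
    """
    cel_paths = sorted(
        os.path.join(cel_dir, name)
        for name in os.listdir(cel_dir)
        if name.endswith(".CEL")
    )
    with open(list_path, "w") as handle:
        handle.write("cel_files\n")
        for path in cel_paths:
            handle.write(path + "\n")
    return cel_paths
```

For example, write_cel_list("/home/Affydata/cels", "/home/Affydata/mycellistfile.txt") would list every .CEL file found in the (hypothetical) cels folder.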

NOTE: The first time you run AffyPipe, you need administrator permissions to allow AffyPipe to install the SNPolisher package. If you do not want an automatic installation (or simply don't feel like giving "sudo" permissions to a script coded by someone else), please install SNPolisher before running AffyPipe with the following code (in your terminal, with admin permissions, type "R" and press enter, then type):

% install.packages('[your path to file: SNPolisher[[version]].tar.gz]',repos=NULL,type='source')

This command will install the package in your R library, so R will recognize it automatically!

Options

The AffyPipe pipeline is very flexible and user-friendly. You can choose your own parameters, filenames and folders with very little effort. A bit more detailed info on each of the options available is provided, including default values. Please be aware that these are true options, thus are absolutely optional and you can place them in any order you like.

-h, --help This displays usage and options available.

-t [PATH] or --tooldir=[PATH] [DEFAULT: ./AFFYTOOLS] This option can be used to change the path and name of the folder where the Axiom® Buffalo Analysis Files.r[X].zip files (or the corresponding files for the bovine species) were uncompressed (see section 2.a for further information).

-a [PATH] or --aptdir=[PATH] [DEFAULT: ./AFFYTOOLS/apt-[folder_name]] If you downloaded the Affymetrix Power Tools (APTools) version 1.15.2 for your operating system and uncompressed it in the AFFYTOOLS directory, you can skip this option, since AffyPipe will recognize your system automatically. For newer versions, or if the APTools folder is not in the default path, you can use this option to change the path to the uncompressed APTools folder. See section 2.a.2.1 for further information.

-s [PATH] or --SNPolisher=[PATH] [DEFAULT: ./AFFYTOOLS/SNPolisher_package] This option can be used to change the path and name of the folder where SNPolisher_package.zip files were uncompressed (see section 2.a.2.2 for further information).

-o [PATH] or --outdir=[PATH] [DEFAULT: ./OUTPUT] This option can be used to choose the path and name of the output folder where all output files will be written. If the folder does not exist, a new folder will be created with the given name.

-d VALUE or --dqc=VALUE (VALUE between 0 and 1) [DEFAULT: 0.82] This option can be used to set a user-defined Dish_QC threshold. The default here is Affymetrix's best practice default. See their "Axiom genotyping solution data analysis guide" and "Best practice supplement to Axiom genotyping solution data analysis user guide" for further details.

-c VALUE or --crate=VALUE (VALUE between 0 and 1) [DEFAULT: 0.97] This option can be used to set a user-defined call rate threshold. The default here is Affymetrix's best practice default. See their "Axiom genotyping solution data analysis guide" and "Best practice supplement to Axiom genotyping solution data analysis user guide" for further details.
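To make the effect of the two sample QC thresholds concrete, here is a simplified sketch. It is not AffyPipe's code: it assumes that a sample passes when its value is greater than or equal to the threshold, that call rates are expressed as fractions between 0 and 1, and it applies both filters at once, whereas the real workflow applies them in separate steps. The sample names and values are made up.

```python
DQC_THRESHOLD = 0.82        # default of -d / --dqc
CALL_RATE_THRESHOLD = 0.97  # default of -c / --crate


def passing_samples(samples, dqc_min=DQC_THRESHOLD, call_rate_min=CALL_RATE_THRESHOLD):
    """Return names of samples whose DQC and call rate both reach the thresholds.

    `samples` is an iterable of (name, dqc, call_rate) tuples.
    """
    return [
        name
        for name, dqc, call_rate in samples
        if dqc >= dqc_min and call_rate >= call_rate_min
    ]


# Hypothetical QC values for three samples.
example = [
    ("sample1.CEL", 0.95, 0.99),
    ("sample2.CEL", 0.80, 0.99),  # fails the DQC threshold
    ("sample3.CEL", 0.90, 0.96),  # fails the call rate threshold
]
print(passing_samples(example))  # → ['sample1.CEL']
```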

-y or --summary [DEFAULT: no summary files] This option outputs the summary information of the genotyping process. Please note that these files are VERY large. Since the general user does not usually use these files, the default is not to write them out. However, there are several occasions where analysing these files could be useful, so an option to output them was included.

-b or --bestpractice [DEFAULT: STANDARD workflow] This option enables the Best Practice workflow, adding an extra step between the two apt-genotype steps, for plate QC. Please be sure to read "Best practice supplement to Axiom genotyping solution data analysis user guide" before choosing this option. If this option is chosen, please note that a PLATE INFORMATION FILE IS REQUIRED. When running the "Best Practice" workflow, AffyPipe will require a file linking samples to plates. See the -f or --platefile option (required in this case) for more information.

-f or --platefile [DEFAULT: NONE] This option is required if the -b option is present. This file has to contain 2 columns, comma or tab separated (or a combination of the two if you want to be extra tricky!): the first field has to be the name of the sample (e.g. exactly as it is specified in the CEL list file, with or without the path, with or without the ".CEL" specification) and the second must be the plate ID. Please note that you can simply copy the CEL list file and add a field naming plates. You can name plates any way you want, just be aware that names are case-sensitive, thus PLATE and plate are considered different plates!! For example, the following are all acceptable specifications for "animalnumber1":
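The plate file rules described for -f (two fields, comma or tab separated, optional path and ".CEL" suffix on the sample name, case-sensitive plate IDs) can be sketched in Python. This is only an illustration under those stated rules, not the parser AffyPipe uses; the function names and the example paths are hypothetical.

```python
import os
import re


def normalize_sample(name):
    """Drop any leading path and a trailing '.CEL' so that spellings compare equal."""
    base = os.path.basename(name.strip())
    if base.endswith(".CEL"):
        base = base[:-4]
    return base


def parse_plate_lines(lines):
    """Map sample name -> plate ID from comma- or tab-separated lines."""
    plates = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = re.split(r"[,\t]", line)
        if len(fields) != 2:
            raise ValueError("expected 2 fields, got %d: %r" % (len(fields), line))
        # Plate IDs are kept as written: PLATE and plate stay different plates.
        plates[normalize_sample(fields[0])] = fields[1].strip()
    return plates


example_lines = [
    "/home/Affydata/animalnumber1.CEL,PLATE1",  # with path and .CEL, comma separated
    "animalnumber2\tplate1",                    # bare name, tab separated
]
print(parse_plate_lines(example_lines))
```

With this normalization, "animalnumber1", "animalnumber1.CEL" and "/home/Affydata/animalnumber1.CEL" all refer to the same sample.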

-l or --plateqc [DEFAULT: 0.95,0.99] This option is considered only if the -b option is present. It allows you to change the Plate QC thresholds for PlatePassRate and AverageCallRate, respectively. Note that these values MUST BE comma separated, and both must be provided.
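As a small illustration of the format expected by -l, the following sketch splits and validates such a value. It is not taken from AffyPipe; the function name is hypothetical, and the check that both thresholds lie between 0 and 1 is an assumption based on the default values.

```python
def parse_plate_qc(value):
    """Parse 'PlatePassRate,AverageCallRate' into a pair of floats in [0, 1]."""
    parts = value.split(",")
    if len(parts) != 2:
        raise ValueError("two comma-separated values are required: %r" % value)
    pass_rate, call_rate = (float(part) for part in parts)
    for threshold in (pass_rate, call_rate):
        # Assumed range: thresholds are fractions, like the defaults.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold out of range 0-1: %r" % threshold)
    return pass_rate, call_rate


print(parse_plate_qc("0.95,0.99"))  # → (0.95, 0.99)
```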

-p or --plink [DEFAULT: no plink output] This option outputs (all) BestProbeset SNPs in PLINK format, coding alleles as A B. The pipeline just goes through the Ps.performance.txt (output) file, keeping genotypes of all probes classified as "1" in the "BestProbeset" field. The map file is created using SNP names (please read "Axiom genotyping solution data analysis guide" for further information).
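For readers unfamiliar with PLINK text files, this is a rough sketch of what A/B coding looks like. It is not AffyPipe's export code and rests on several assumptions: genotype calls are coded 0 = AA, 1 = AB, 2 = BB and -1 = no call, the sample name is reused as family ID, and chromosome and position are written as 0 placeholders because only SNP names are available.

```python
# Assumed call codes: 0 = AA, 1 = AB, 2 = BB, -1 = no call ("0 0" in PLINK).
AB_ALLELES = {0: ("A", "A"), 1: ("A", "B"), 2: ("B", "B"), -1: ("0", "0")}


def ped_line(sample_id, calls):
    """Build one PED line: six leading columns, then two alleles per SNP."""
    # Family ID, individual ID, father, mother, sex (0 = unknown), phenotype (-9 = missing).
    fields = [sample_id, sample_id, "0", "0", "0", "-9"]
    for call in calls:
        fields.extend(AB_ALLELES[call])
    return " ".join(fields)


def map_line(snp_name):
    """Build one MAP line: chromosome, SNP name, genetic distance, position."""
    return " ".join(["0", snp_name, "0", "0"])


print(ped_line("animalnumber1", [0, 1, 2, -1]))
# → animalnumber1 animalnumber1 0 0 0 -9 A A A B B B 0 0
print(map_line("AX-00000001"))  # hypothetical SNP name
# → 0 AX-00000001 0 0
```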

--plinkACGT [DEFAULT: no plink output] This option is an alternative to -p or --plink option, with the only difference that it codes alleles in ACGT instead of AB. This option was suggested (and code provided) by GitHub user Hyunmin (@hmkim). Thank you!

-e or --editplink [DEFAULT: PMN] This option allows the user to edit the SNP probe classes. The Affymetrix SNPolisher R package currently classifies SNP probes into 6 classes: "PolyHighResolution" (P), "MonoHighResolution" (M), "NoMinorHomozygote" (N), "OTV" (O), "CallRateBelowThreshold" (C) and "Other" (T). Any of the 6 SNP probe classes added after this option will be retained. If both probes of the same SNP carry the retained class(es), then only the one classified as "BestProbeset" will be retained. The default option retains all SNP probesets that are classified as PolyHighResolution, MonoHighResolution and NoMinorHomozygote.
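The class selection performed by -e can be pictured with the sketch below. It is a simplification and not AffyPipe's implementation: it keeps only rows flagged as BestProbeset whose class is among the requested letters. The column names probeset_id and ConversionType are assumptions about the layout of Ps.performance.txt, and the example rows are made up.

```python
# One-letter codes for the six SNPolisher classes, as listed for -e.
CLASS_CODES = {
    "P": "PolyHighResolution",
    "M": "MonoHighResolution",
    "N": "NoMinorHomozygote",
    "O": "OTV",
    "C": "CallRateBelowThreshold",
    "T": "Other",
}


def select_probesets(rows, codes="PMN"):
    """Return IDs of best probesets whose class is among the requested codes."""
    wanted = set(CLASS_CODES[code] for code in codes)
    return [
        row["probeset_id"]
        for row in rows
        if row["BestProbeset"] == "1" and row["ConversionType"] in wanted
    ]


# Hypothetical rows, as they might be read with csv.DictReader.
rows = [
    {"probeset_id": "AX-1", "ConversionType": "PolyHighResolution", "BestProbeset": "1"},
    {"probeset_id": "AX-2", "ConversionType": "PolyHighResolution", "BestProbeset": "0"},
    {"probeset_id": "AX-3", "ConversionType": "OTV", "BestProbeset": "1"},
    {"probeset_id": "AX-4", "ConversionType": "NoMinorHomozygote", "BestProbeset": "1"},
]
print(select_probesets(rows))        # → ['AX-1', 'AX-4']
print(select_probesets(rows, "PM"))  # → ['AX-1']
```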

--debug [DEFAULT: OFF] This option prints a full report on each step of the APT process. This option is useful if the program stops or gives any error message. Please run the program with this flag before reporting anything to the author, to help him identify the problem as soon as possible!

-q or --quiet [DEFAULT: loud (it's an Italian software! :) )] This option suppresses runtime messages on stdout.

Examples

The following are just illustrative examples of commands to run AffyPipe in typical situations. Please note that file names and paths are arbitrary (i.e. you should provide your own names/paths).

1) Run a standard workflow, using default QC values, and get genotypes in Affymetrix's standard format.

% python AffyPipe.py mycellistfile.txt

2) Run a standard workflow, use own QC values and get genotypes in PLINK format (default probe QC extraction).

% python AffyPipe.py mycellistfile.txt -d 0.90 -c 0.99 -p

3) Run a "best practice" workflow, use own sample and plate QC values and get best probes for "PolyHighRes" and "MonoHighRes" classes in PLINK format, coding alleles as AB (use --plinkACGT to code alleles in ACGT format).

% python AffyPipe.py mycellistfile.txt -d 0.90 -c 0.99 -b -f myplatefile.txt -l 0.99,0.99 --plink -e PM
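If you launch AffyPipe from another script, the commands above can be assembled programmatically. The helper below is a hypothetical sketch, not part of AffyPipe: it uses only the flags documented in the "Options" section, enforces the rule that -b needs a plate file (-f), and assumes that -l only makes sense with -b and -e only with -p.

```python
def build_command(cel_list, dqc=None, crate=None, best_practice=False,
                  platefile=None, plateqc=None, plink=False, classes=None):
    """Assemble an AffyPipe command line (as a list) from documented options."""
    if best_practice and platefile is None:
        raise ValueError("-b requires a plate file (-f)")
    command = ["python", "AffyPipe.py", cel_list]
    if dqc is not None:
        command += ["-d", str(dqc)]
    if crate is not None:
        command += ["-c", str(crate)]
    if best_practice:
        command += ["-b", "-f", platefile]
        if plateqc is not None:
            # plateqc is a (PlatePassRate, AverageCallRate) pair.
            command += ["-l", "%s,%s" % plateqc]
    if plink:
        command.append("-p")
        if classes is not None:
            command += ["-e", classes]
    return command


print(" ".join(build_command("mycellistfile.txt", dqc=0.90, crate=0.99, plink=True)))
# → python AffyPipe.py mycellistfile.txt -d 0.9 -c 0.99 -p
```

The returned list can be handed to subprocess.call(), which avoids quoting problems with paths that contain spaces.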

Output files and folders

Unless differently specified by the user, all output files will be written in a directory named OUTPUT, placed in the same directory where AffyPipe is run. A number of files will be present in the OUTPUT folder, and most of them will be gzipped:

4) Different species

AffyPipe is intended for all species genotyped with the Axiom technology, although it was originally built for the specific needs of the International Buffalo Genome Consortium (Iamartino et al., 2013). Please note that testing has been carried out only on Buffalo + Human Exome 319 and EUR Axiom datasets (GEO platforms: GPL18760 and GPL52691). Just by setting up the parameter file, you should be able to use this tool on any other non-tested species. In case of problems, please contact the author of this pipeline at: ezequielluis [dot] nicolazzi [at] gmail [dot] com, and he'll be very happy to help you (and integrate the necessary changes in this tool!).

5) References

6) Acknowledgments

This work was supported by the Italian Ministry of Education, University and Research, project GenHome [D.M. 505/Ric]; and the European Union's Seventh Framework Programme, project Gene2Farm [G.A. 289592]. I personally thank Hernan Morales Durand (IGEVET, Argentina) and GitHub user Hyunmin (@hmkim) for suggestions (and code) provided to improve this tool.

Disclaimer

AffyPipe is a free tool that uses proprietary software that is publicly available online: you can redistribute this pipeline and/or modify this program, but at your own risk. AffyPipe is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details: http://www.gnu.org/licenses/. This pipeline is for research and has no commercial intent, but it can be used freely by any organization. The only goal is to help people streamline their work. Affymetrix is not responsible for any aspect of this pipeline. The author of this pipeline is not responsible for ANY output, modification or result obtained from it. For bug reports, feedback and questions (PLEASE read this README file carefully before sending your question) contact ezequielluis [dot] nicolazzi [at] gmail [dot] com.