krasileva-group / plant_rgenes

plant_rgenes

A set of scripts to annotate Pfam domains and to extract NLR plant immune receptors and their domain architectures, as published in Sarris et al., BMC Biology 2016: https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-016-0228-7

Our basic pipeline

0) Obtain protein sequences of species of interest and organise them into a directory.

We follow the Phytozome organisation of master_dir/species/annotation/species_version_proteins.fa, where each species is denoted by the first letter of the genus name followed by the full species name, for example Athaliana for Arabidopsis thaliana.
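The layout above can be set up with a small helper. This is a minimal sketch: the function name stage_proteome and the version label v1 are hypothetical and only illustrate the naming scheme.

```shell
# Hypothetical helper: copy a protein FASTA file into the Phytozome-style layout
#   master_dir/species/annotation/species_version_proteins.fa
# Arguments: <source fasta> <master_dir> <species> <version>
stage_proteome() {
    src=$1
    master_dir=$2
    species=$3
    version=$4
    dest_dir="$master_dir/$species/annotation"
    # Create the species/annotation directory if it does not exist yet.
    mkdir -p "$dest_dir"
    cp "$src" "$dest_dir/${species}_${version}_proteins.fa"
}
```

For example, stage_proteome downloads/arabidopsis.fa master_dir Athaliana v1 would create master_dir/Athaliana/annotation/Athaliana_v1_proteins.fa.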

1) Pfam-based annotation of domains

usage: bash run_pfam_scan.sh dir

Dependencies:

2) Parsing the pfamscan output with K-parse_Pfam_domains_v3.1.pl

usage: perl K-parse_Pfam_domains_v3.1.pl <options>

-p|--pfam <pfamscan.out>

-e|--evalue <evalue cutoff>

-o|--output

-v|--verbose <T/F> display more information about each domain (start, stop, evalue) [default F]

We usually parse all Pfam outputs of interest in parallel using xargs.
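One way the parallel parsing could look is sketched below. Several details are assumptions rather than part of the original scripts: the function name parse_all_pfam, the input file pattern *pfamscan*.out, the output name (input name plus .parsed.verbose, chosen to match the files expected in step 3), and that -o takes an output file path. The PARSER_CMD variable is a hypothetical override that allows a dry run. The -print0, -0 and -P options are supported by GNU and BSD find/xargs but are not strictly POSIX.

```shell
# Hypothetical helper: parse every pfamscan output under a directory in parallel.
# Arguments: <input dir> <evalue cutoff> [number of parallel jobs, default 4]
parse_all_pfam() {
    in_dir=$1
    evalue=$2
    jobs=${3:-4}
    # PARSER_CMD is left unquoted on purpose so that "perl script.pl"
    # splits into two words; override it (e.g. PARSER_CMD=echo) for a dry run.
    # ASSUMPTION: inputs match *pfamscan*.out and -o names the output file.
    find "$in_dir" -type f -name '*pfamscan*.out' -print0 |
        xargs -0 -P "$jobs" -I{} \
            ${PARSER_CMD:-perl K-parse_Pfam_domains_v3.1.pl} \
            -p {} -e "$evalue" -v T -o {}.parsed.verbose
}
```

Running PARSER_CMD=echo parse_all_pfam master_dir 1e-3 prints the commands that would be issued without running the parser.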

3) Identification of non-canonical NLR-ID domain combinations with K-parse_Pfam_domains_NLR-fusions-v2.2.pl

usage: perl K-parse_Pfam_domains_NLR-fusions-v2.2.pl <options>

-i|--indir directory for batch retrieval of input *pfamscan*.parsed.verbose files

-e|--evalue evalue cutoff for determining domain fusions [default 1e-3]

-o|--output output directory

-d|--db_description description of the datasets used in the analyses [Organism Species_ID NCBI_taxon_ID Family Database Date_aquired Restrictions Version Common_Name Source Reference]; for an example of this file, see Additional file 1 in Sarris et al., BMC Biology 2016
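Putting the options above together, a call might be wrapped as follows. This is a sketch only: the function name run_nlr_fusions and the NLR_CMD override are hypothetical, the mkdir step is a precaution (the script may create the output directory itself), and it assumes -d takes the path to the db_description file.

```shell
# Hypothetical wrapper around the NLR-fusion script, using only the flags listed above.
# Arguments: <dir with *.parsed.verbose files> <output dir> <db_description file> [evalue]
run_nlr_fusions() {
    in_dir=$1
    out_dir=$2
    db_desc=$3
    evalue=${4:-1e-3}   # 1e-3 is the documented default cutoff
    # Precaution: make sure the output directory exists.
    mkdir -p "$out_dir"
    # NLR_CMD is unquoted on purpose so "perl script.pl" splits into two words;
    # set NLR_CMD=echo to print the arguments instead of running the script.
    ${NLR_CMD:-perl K-parse_Pfam_domains_NLR-fusions-v2.2.pl} \
        -i "$in_dir" -e "$evalue" -o "$out_dir" -d "$db_desc"
}
```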

Outputs:

Example datasets:

The example dataset directory contains the input Arabidopsis data as well as the corresponding db_description file. It also contains the outputs from each stage of the analysis, so you can check your pipeline against them or test individual scripts.
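A simple way to check your own results against the example outputs is a recursive diff. The sketch below assumes the scripts produce deterministic output with the same file names as the example directory; the function name and the directory names in the usage note are hypothetical.

```shell
# Hypothetical helper: report files that differ between two output directories.
# Returns 0 when the directories are identical, non-zero otherwise.
compare_with_example() {
    my_dir=$1
    example_dir=$2
    # -r recurses into subdirectories, -q only reports which files differ.
    diff -rq "$my_dir" "$example_dir"
}
```

For example, compare_with_example my_outputs example_outputs prints one line per differing or missing file and nothing when everything matches.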