
BSD 3-Clause "New" or "Revised" License

maize-code

Manuscript

For the manuscript "MaizeCODE reveals bi-directionally expressed enhancers that harbor molecular signatures of maize domestication", the pre-release MaizeCODE v0.1.0-manuscript contains the code used to analyze the data and generate the figures, complemented with the script "MaizeCode_extra_manuscript_figures.sh" in the "scripts/manuscript" folder.

MaizeCode Pipeline Help

Step-by-Step pipeline

1) Clone the git repository anywhere you want, for example in a new folder called projects:
    git clone https://github.com/martienssenlab/maize-code.git ./maize-code
To clone a specific branch, for example 'devel':
    git clone --branch devel https://github.com/martienssenlab/maize-code.git ./maize-code
You will be prompted to input your GitHub username and password.
If you only want to update the scripts, use git pull. To update from a specific branch, for example 'devel', use git pull origin devel.

2) cd into the maize-code folder that has been created, so following the same example:
    cd ./maize-code/

3) Check that the following required packages are installed and in your $PATH. The versions listed here are known to work; other versions have not been tested. Installation with conda is recommended, except for kentUtils, which needs to be installed from source.

For all types of data:
bedtools 2.29.2
bowtie2 64-bit 2.4.1; Compiler: gcc version 7.5.0 (crosstool-NG 1.24.0.131_87df0e6_dirty)
cutadapt 2.10
deeptools 3.5.0
fastqc 0.11.9
homer 4.11
IDR 2.0.4.2
kentUtils (bedSort, bedGraphToBigWig)
macs2 2.2.7.1
meme 5.3.0
multiqc 1.11
sra-tools 2.11.0 (if downloading data from SRA)
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
seqkit 0.13.2
shortstack 3.8.5
STAR 2.7.5c
wget 1.20.1
R 4.0.3 + R packages: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2, AnnotationForge 1.32.0, rrvgo 1.5.3, topGO 2.42.0, purrr 0.3.4, limma 3.46.0, edgeR 3.32.1, stringr 1.4.0, ComplexUpset 1.2.1, wesanderson 0.3.6

Histone ChIPseq:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2, purrr 0.3.4, ComplexUpset 1.2.1
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
bowtie2 64-bit 2.4.1; Compiler: gcc version 7.5.0 (crosstool-NG 1.24.0.131_87df0e6_dirty)
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
macs2 2.2.7.1
IDR 2.0.4.2
deeptools 3.5.0

RNAseq samples:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2, AnnotationForge 1.32.0, rrvgo 1.5.3, topGO 2.42.0, purrr 0.3.4, limma 3.46.0, edgeR 3.32.1, stringr 1.4.0
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
STAR 2.7.5c
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
kentUtils (bedSort, bedGraphToBigWig)
deeptools 3.5.0

RAMPAGE samples:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
STAR 2.7.5c
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
kentUtils (bedSort, bedGraphToBigWig)
macs2 2.2.7.1
IDR 2.0.4.2
deeptools 3.5.0

TF ChIPseq samples:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, cowplot 1.1.1, RColorBrewer 1.1-2, purrr 0.3.4, ComplexUpset 1.2.1, stringr 1.4.0
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
bowtie2 64-bit 2.4.1; Compiler: gcc version 7.5.0 (crosstool-NG 1.24.0.131_87df0e6_dirty)
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
macs2 2.2.7.1
IDR 2.0.4.2
deeptools 3.5.0
meme 5.3.0
homer 4.11
wget 1.20.1

shRNA samples:
R 4.0.3 + R libraries: dplyr 1.0.6, tidyr 1.1.3, ggplot2 3.3.5, wesanderson 0.3.6
multiqc 1.11
pigz 2.3.4
samtools 1.10 (Using htslib 1.10.2)
bedtools 2.29.2
bowtie2 64-bit 2.4.1; Compiler: gcc version 7.5.0 (crosstool-NG 1.24.0.131_87df0e6_dirty)
sra-tools 2.11.0 (if downloading data from SRA)
fastqc 0.11.9
cutadapt 2.10
shortstack 3.8.5
deeptools 3.5.0
seqkit 0.13.2
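
The check in step 3 can be scripted. The following is a minimal sketch, not part of the MaizeCode scripts: the function name `missing_tools` is hypothetical, and the executable names in the `tools` list are assumptions that may differ on your installation (for example, some packages install several executables under other names). It only tests that each command is found in `$PATH`; it does not check versions.

```shell
#!/bin/sh
# missing_tools NAME...: print every command name that is not found in $PATH.
# (Hypothetical helper, not provided by the MaizeCode repository.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names for a subset of the packages listed above;
# adjust this list to the data types you plan to process.
tools="bedtools bowtie2 cutadapt fastqc samtools pigz multiqc macs2 STAR seqkit wget Rscript bedSort bedGraphToBigWig"

# $tools is left unquoted on purpose so that it splits into separate names.
missing=$(missing_tools $tools)
if [ -n "$missing" ]; then
    printf 'Not found in PATH:\n%s\n' "$missing" >&2
else
    printf 'All listed tools were found in PATH.\n' >&2
fi
```

Reporting the full list of missing commands at once, rather than stopping at the first one, makes it easier to fix a conda environment in a single pass.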

4) Organize your reference genome directories so that they are all in the same main folder and each contains ONE fasta file (.fa extension), ONE GFF file (.gff or .gff* extension) and ONE GTF file (.gtf extension).
For example, a genomes/ folder contains the genomes/B73_NAM/ directory, which holds the genomes/B73_NAM/B73_NAM.fa, genomes/B73_NAM/B73_NAM.gff and genomes/B73_NAM/B73_NAM.gtf files.
Other references should be in the same genomes/ folder and follow the same pattern, i.e. genomes/W22_v2/W22.fa, genomes/W22_v2/W22.gff and genomes/W22_v2/W22.gtf.
The GTF file can be created from a GFF file with cufflinks gffread -T <gff_file> -o <gtf_file>; afterwards, check that 'transcript_id' and 'gene_id' look good in the 9th column.
The GFF file should have 'gene' in the 3rd column.
All files can be gzipped (.gz extension).

5) Make the samplefile you want, following the pattern and examples below. A complete example of a samplefile is in the data folder (Example_samplefile.txt). For cleaner naming purposes, use "_samplefile.txt" as a suffix. The columns are the following:

| Type of data | Line | Tissue | Type of sample | Replicate ID | Sequencing ID | Path to fastq | Paired-end or single-end data | Genome reference |
|---|---|---|---|---|---|---|---|---|
| ChIP | B73 | roots | H3K27ac | Rep1 | SRRxxxxxx | SRA | SE | B73_NAM |
| ChIP | B73 | roots | Input | Rep1 | SRRxxxxxx | SRA | SE | B73_NAM |
| RNAseq | W22 | ears | RNAseq | Rep1 | S01 | /home/maize-code/RNAseq/fastqs | PE | W22_v2 |
| RAMPAGE | W22 | ears | RAMPAGE | Rep1 | rampage_exp1 | /home/maize-code/RAMPAGE/fastqs | PE | W22_v2 |
| TF_TB1 | B73 | leaf | IP | Rep1 | SRRxxxxxx | SRA | PE | B73_v4 |
| TF_TB1 | B73 | leaf | Input | Rep1 | SRRxxxxxx | SRA | PE | B73_v4 |
| shRNA | NC350 | cn | shRNA | Rep1 | cn | /home/maize-code/shRNA/fastqs | SE | NC350_NAM |
| mC | B73 | roots | mC | Rep1 | SRRxxxxxxx | SRA | PE | B73_NAM |
| mC | B73 | roots | Pico | Rep1 | SRRxxxxxxx | SRA | PE | B73_NAM |
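
A samplefile with a missing or extra column is easy to produce by hand, so a quick sanity check before submitting can save a failed run. The sketch below is not part of the MaizeCode scripts, and the function name `check_samplefile` is hypothetical. It assumes that the samplefile is tab-separated and has no header row, neither of which is stated above, and it only checks the column count and the PE/SE value in column 8.

```shell
#!/bin/sh
# check_samplefile FILE: report rows that do not match the nine-column layout.
# Assumptions (not confirmed by the README): columns are tab-separated and
# the file has no header row. Returns non-zero if any row fails.
check_samplefile() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NF != 9 {
            # Wrong number of columns: report and skip the other checks.
            printf "line %d: expected 9 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $8 != "PE" && $8 != "SE" {
            # Column 8 is "Paired-end or single-end data".
            printf "line %d: column 8 should be PE or SE, found \"%s\"\n", NR, $8
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage: check_samplefile my_samplefile.txt
```

The function prints one message per offending row and is silent when every row passes, so it can be chained in front of the submission command with `&&`.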

6) Submit the scripts/MaizeCode.sh script, giving as arguments -f <samplefile.txt>, the samplefile of your choice, and -p <path>, the path to the folder that contains the different genome directories, i.e. the genomes folder mentioned above:
    qsub scripts/MaizeCode.sh -f example_samplefile.txt -r /path/to/genomes

7) By default, the pipeline proceeds with the analysis. Set -s so that it does not proceed with the analysis at all, or set -c to perform only the single-sample analysis, without the combined analysis per line or between lines.

8) If the analysis has not proceeded, or if you want to analyze different samples together, make the new samplefile of your choice and submit the scripts/MaizeCode.sh script again:
    qsub scripts/MaizeCode.sh -f new_samplefile.txt -r /path/to/genomes
The samples that have already been processed will not be repeated but will still be included in the analysis.

9) Have a look at the results! (see Output below).
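
Before submitting, it can help to confirm that every genome reference named in the samplefile follows the directory layout from step 4. The sketch below is an illustration only and is not part of the MaizeCode scripts: the function names `count_existing`, `check_genome_dir` and `check_genomes` are hypothetical. It assumes a tab-separated samplefile with the genome reference in column 9 and reference names without spaces, and it reads ".gff*" as any file name containing ".gff".

```shell
#!/bin/sh
# count_existing PATH...: count the arguments that exist on disk.
# Glob patterns that match nothing stay literal and are therefore not counted.
count_existing() {
    n=0
    for f in "$@"; do
        if [ -e "$f" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# check_genome_dir DIR: expect exactly one fasta, one GFF and one GTF file,
# each optionally gzipped. Prints one message per problem found.
check_genome_dir() {
    dir=$1
    ok=0
    # *.fa and *.fa.gz only, so that index files such as .fa.fai are ignored.
    n_fa=$(count_existing "$dir"/*.fa "$dir"/*.fa.gz)
    n_gff=$(count_existing "$dir"/*.gff*)
    n_gtf=$(count_existing "$dir"/*.gtf "$dir"/*.gtf.gz)
    for pair in "fasta:$n_fa" "GFF:$n_gff" "GTF:$n_gtf"; do
        if [ "${pair#*:}" -ne 1 ]; then
            echo "$dir: expected 1 ${pair%%:*} file, found ${pair#*:}"
            ok=1
        fi
    done
    return "$ok"
}

# check_genomes SAMPLEFILE GENOMES_DIR: check each reference in column 9.
check_genomes() {
    rc=0
    for ref in $(cut -f 9 "$1" | sort -u); do
        check_genome_dir "$2/$ref" || rc=1
    done
    return "$rc"
}

# Usage: check_genomes example_samplefile.txt /path/to/genomes
```

A missing reference directory is reported the same way as an empty one, with a count of 0 for each file type.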


Comments


Scripts description NOT FULLY UPDATED FROM THIS POINT ONWARD


Output

Directories: From the main folder <maizecode> where the MaizeCode.sh is run

Statistics:

Plots: (examples are in the GitHub data folder)