ggCaller_manuscript

Repository of scripts used in ggCaller manuscript.

Generating a simulated pangenome

Data is available in data/simulation_pangenome.

Population Simulation

To generate a simulated population for pangenome analysis, run:

python scripts/simulate_full_pangenome.py --gff data/simulated_pangenome/SP_ATCC700669.gff3 --nisolates 100 --n_sim_genes 1000 --pop_size 10e6 --out sim_pangenome --mutation_rate 1e-14 --gain_rate 1e-12 --loss_rate 1e-12

This will generate a list of fasta files in the directory sim_pangenome.

Adding fragmentation

To fragment assemblies, first copy sim_pangenome.

cp sim_pangenome sim_pangenome_fragmented

Then choose a directory --ref_dir to get empirical read length distributions from (should contain .fasta files). Finally, run:

python scripts/fragment_fasta.py --ref_dir reference_dir --sample_dir sim_pangenome_fragmented --min_length 10

Adding contaminants

To fragment assemblies, first copy sim_pangenome.

cp sim_pangenome sim_pangenome_contaminated

Then choose a infile --infile to generate fragments from, and --outfile to append to (this is done IN PLACE). Finally, run:

python scripts/insert_random_genome_fragments.py --infile contaminant.fasta --outfile sim_pangenome_fragmented/genome1.fasta --frag_size 10000

Simulating assemblies

Now simulate paired-end reads from your assemblies with ART:

art_illumina -ss HS25 -sam -i sim_pangenome/genome1.fasta -p -l 150 -f 20 -m 200 -s 10 -o genome1_reads

Assemble reads with SPADES:

spades.py --phred-offset 33 --isolate --threads 1 -1 genome1_reads1.fq -2 genome1_reads2.fq -o genome1_assembly

Gene identification and pangenome analysis

Now run Prokka on SPADES assemblies, using the gene annotation from simulate_full_pangenome.py:

prokka --cpus 8 --proteins sim_pangenome_prokka_DB.fa --force --outdir prokka_genome1 --notrna --norrna --prefix genome1 genome1_assembly/scaffolds.fasta

Run Panaroo on Prokka gene-calls (using moderate clean-mode):

panaroo -i prokka_gffs/*.gff -o panaroo_moderate_out --clean-mode moderate -t 8

Run Roary on Prokka gene-calls:

roary -p 8 -f roary prokka_gffs/*.gff

Run PEPPAN on Prokka gene-calls (use PEPPAN_parser to generate gene presence/absence matrix):

PEPPAN -t 8 -p PEPPAN_out prokka_gffs/*.gff
PEPPAN_parser -g PEPPAN_out.PEPPAN.gff -s PEPPAN_out

Run ggCaller on all assemblies using the gene annotation from simulate_full_pangenome.py:

ggcaller --refs sim_pangenome_list.txt --threads 8 --annotation sensitive --diamonddb sim_pangenome_prokka_DB.fa --out ggCaller_out --clean-mode moderate

Analysing simulated pangenome results

Copy gene presence/absence matrices from respective tools (gene_presence_absence_roary.csv for Roary, Panaroo and ggCaller, *.PEPPAN.gene_content.Rtab for PEPPAN). Also copy Prokka gff files for each simulation.

Use scripts/compare_simulated_gene_pa.Rmd to generate prokka mapping files from gffs, and to compare different pangenome analysis tools. Data is available in data/simulated_pangenome.

This will summary tables detailing false positives, false negatives and correctly-called COGs containing set numbers of errors.

Comparing real bacterial pangenomes

Gene presence/absence matrices for M. tuberculosis, S. pneumoniae and E. coli used in the ggCaller paper are available in data/real_pangenome

For a chosen dataset, run Prokka, Roary, PEPPAN, Panaroo and ggcaller using the parameters in Gene identification and pangenome analysis.

Respective gene annotations were provided from:

Gene presence/absence matrices for all tools are available in data/real_pangenome/gene_pa

For analysing COG consistency to determine COG size and coefficient of variation (CV), two scripts are required.

When analysing ggCaller or Panaroo, run the command below with the respective .gml file. Analysis in the manuscript ignored refound/pseudogenes.

python parse_gml.py --infile data/COG_consistency/Spneumoniae_panaroo_str.gml --ignore_refound --outpref output

To analyse PEPPAN, run:

python parse_gff.py --peppan data/COG_consistency/Pneumo.PEPPAN.gff --outpref output --ignore_pseudogenes

Roary does not provide provide gene lengths in output files, so generate gene lengths from gff files first:

python get_gene_lengths.py --indir <path to gff directory> --outpref length_dict --threads 8

Then you can run:

python parse_gff.py --roary data/COG_consistency/Spneumoniae_roary.txt --len_dict length_dict.json --outpref output --ignore_pseudogenes

These commands output a number of statistics:

"IQR" interquartile range of genes lengths within COGs
"MAD" median average deviation of genes lengths within COGs
"STDDEV" standard deviation and coefficient of variation (CV) of genes lengths within COGs
"sizes" number of genes in each COG

Results used in the manuscript are available in data/real_pangenome/COG_consistency/results

Contig break analysis

Fragment your chosen sequence:

python fragment_at_gene.py --CDS data/contig_break/fasta/CDS/CR931662_Streptococcus_pneumoniae_strain_34359_serotype_14_CDS.fa --infile data/contig_break/fasta/CDS/original/CR931662_Streptococcus_pneumoniae_strain_34359_serotype_14.fa

Call genes using GeneMarkS-2, or using Prokka. Then use ggCaller or Panaroo as detailed Gene identification and pangenome analysis.

Analyse gene recall using combined_DNA_CDS.fasta from Panaroo and gene_calls.fasta from ggCaller.

To specificy calls from Panaroo, use --caller pan, for ggCaller use --caller ggc

python scripts/gene_recall.py --caller ggc --query data/contig_break/ggc/ggc_group3_fragmented/gene_calls.ffn --seq data/contig_break/ggc/fasta/all_seqs.fasta --CDS data/contig_break/ggc/fasta/all_CDS.fasta --exact --outpref output

This will print precision and recall statistics to the console, and generate len_prop.txt which describes the proportions of real genes covered by the respective ORF call.

Additionally, FASTA and summary files will be generated, containing all errors. These will start with you chosen prefix, and end with:

"FN" for false negatives
"FP" for false positives
"DUP" for duplicated calls that align to the same position in the sequence
"ART" for artificial calls not present in any reference

To compare calls from multiple tools, run the below command. This will identify shared and exclusive errors for each tool.

python scripts/gene_recall.py --seq data/contig_break/ggc/fasta/all_seqs.fasta --CDS data/contig_break/ggc/fasta/all_CDS.fasta --exact --caller pan --query data/contig_break/prokka/panaroo_group3_fragmented/combined_DNA_CDS.fasta --caller2 ggc --query2 data/contig_break/ggc/ggc_group3_fragmented/gene_calls.ffn --outpref output

The output summary files match those from analysing single tool outputs.

All results generated from prediction comparisons used in the manuscript are available in data/contig_break/results

Gene end comparison

Fasta, alignment and summary files from previous analysis are available in data/gene_end_comparison.

For a chosen dataset, run Prokka, Panaroo and ggCaller using the parameters in Gene identification and pangenome analysis.

For panaroo, generate a prokka mapping file out a given gene whilst in a directory containing all prokka gffs:

grep "<target>" *.gff > prokkamap.txt

Pull out all genes of interest from ggCaller, reference (example here) or panaroo output directories:

python scripts/parse_genes.py --fastafile ggCaller_out/gene_calls.faa --targets CLS02009,CLS02800,CLS00182 --outpref aggc_pspA
python scripts/parse_genes.py --fastafile reference/proteins.faa --targets CLS02009,CLS02800,CLS00182 --outpref ref_pspA
python scripts/parse_genes.py --prokkamap prokkamap.txt --panaroodata panaroo_out/gene_data.csv --outpref panaroo_pspA

Append a reference protein sequence to the start of each file. Run mafft on files to generate MSA for all genes and tools

cat pspA.faa ggc_pspA.faa > ggc_pspA_cat.faa
mafft ggc_pspA_cat.faa > GGC_pbp1a.aln

Analyse alignments of gene ends. All .aln and .faa/.fasta files should be in the same directory

python scripts/gene_end_comparison.py --indir all_alignments --outpref results

This will generate a series of summary graphs describing gene start position comparisons and percent identity based on MSAs.

PanGenome wide association study (PGWAS)

Generate a presence/absence matrix for each isolate in the study. Examples are available in data/PGWAS/tetracycline/tetracycline_resistance.txt and data/PGWAS/erythromycin/erythromycin_resistance.txt

For this study, S. pneumoniae gene annotations were supplied from here.

Generate a .nwk tree using ggCaller (use strict cut-offs for graph cleaning and tree generation) and root at midpoint.

ggcaller --refs genome_list.txt --out pyseer_results --clean-mode strict --ignore-pseduogenes --alignment core --aligner def --annotation sensitive --diamonddb CDS_protein_sequences.dmnd --save --core-threshold 1
python scripts/midpoint_root.py --infile pyseer_results/core_tree.nwk --outfile core_tree_NJ_midpoint.nwk

Convert the nwk tree to distance matrix using pyseer

python pyseer/phylogeny_distance.py --lmm core_tree_NJ_midpoint.nwk > phylogeny_K.tsv

Generate unitigs using unitig-caller

unitig-caller --call --refs refs.txt --out PGWAS_unitigs

Calculate unitig associations with phenotype and determine adjusted signficance cut-off threshold.

pyseer --lmm --phenotypes erythromycin_resistance.txt --kmers PGWAS_unitigs.pyseer.gz --similarity phylogeny_K.tsv --output-patterns unitigs_patterns.txt --cpu 8 > unitigs_hits.txt
python pyseer/count_patterns.py unitigs_patterns.txt
cat <(head -1 tet_unitigs.txt) <(awk '$4<THRESHOLD {print $0}' unitigs_hits.txt) > significant_unitigs.txt

Map unitigs to a single reference with phandango and annotate

phandango_mapper significant_kmers.txt ref.fa phandango.plot
annotate_hits_pyseer significant_unitigs.txt references.txt annotated_unitigs.txt

Generate query in correct format, query singificant hits in ggCaller graph

awk -F '\t' 'NR>1 && NF=1' significant_unitigs.txt > significant_unitigs_query.txt
ggcaller --graph genome_list.gfa --colours genome_list.bfg_colors --threads 16 --out pyseer_query --data pyseer_results/ggc_data --query significant_unitigs_query.txt --query-id 1.0

Analyse hits. Annotation files should match that used here

python scripts/count_annotations.py --fasta pyseer_query/matched_queries.fasta --outpref annotations --annotations annotations_file.xlsx

This will generate a series of graphs describing the gene annotations covered by significant unitigs.

All singificant unitigs and annotated unitigs by pyseer are available in respective folders in data/PGWAS

Computational benchmarking

Data is available in data/computational_benchmarking.

To generate random samples, use the sample_genome_list.py script:

python sample_genome_list.py --infile input.txt --outpref out_prefix --genome_outdir save/genome/files/here --sample_sizes 10,50,100,500

Assemblies containing Ns are automatically removed, as they lead to fragmented DBGs and lead to poor clustering with ggCaller.

Tools were run as detailed in Gene identification and pangenome analysis.

Gene call consistency files were generated using parse_gml.py and parse_gff.py as described above.

samhorsfield96 / ggCaller_manuscript

readme