Add functionality to batch gene alignments

revinici commented 2 months ago

The gene alignment step can take a long time to complete. Is it possible to add functionality to output the unaligned gene sequences to later align batches of them across a computing cluster? The pirate pangenome tool, e.g., has a helper script that will align gene families listed in one of its output files, making it easy to subset the file and distribute the gene alignment step to thousands of computers.

revinici commented 2 months ago

I was able to hack a solution together by breaking apart the logical steps in the post_run_alignment_gen.py script into separate python scripts. The unaligned genes can be output to a folder and batched for alignment via a computing cluster. The gene alignments can then be aggregated into one folder to concatenate the core gene alignments.

For my purposes I broke apart the functionality for pan genome alignment linked below, ignoring the functionality for codon alignments. Thankfully, my solution does not require any edits to the panaroo code base! https://github.com/gtonkinhill/panaroo/blob/e928a7acc8340389201545052d23e0495484126a/panaroo/post_run_alignment_gen.py#L107-L116 https://github.com/gtonkinhill/panaroo/blob/e928a7acc8340389201545052d23e0495484126a/panaroo/generate_output.py#L318-L333

Here is how I got access to the panaroo modules and methods for extracting the gene fastas to a folder. Note, you have to be careful about the assumptions the panaroo code makes like expecting '/' at the end of paths or deleting gene fasta files after alignment.

#!/usr/bin/env python
import networkx as nx
from joblib import Parallel, delayed
from tqdm import tqdm
import shutil

# get the path to the panaroo package files
panaroo_path = subprocess.run('pip show panaroo | grep Location | cut -f2 -d" "',
                              shell=True, check=True, capture_output=True, text=True).stdout
panaroo_path = os.path.join(panaroo_path, 'panaroo')

# add path to python path
sys.path.append(panaroo_path)

# import required panaroo modules
from panaroo.generate_alignments import output_sequence

# load the final pangenome graph
G = nx.read_gml("/path/to/final/graph_gml")

# Load isolate names
isolate_names = G.graph['isolateNames']

# make dummy dir
os.makedirs('panaroo_foo')

# create expected folder
os.makedirs(os.path.join('panaroo_foo', 'aligned_gene_sequences'), exist_ok=True)

# copy combined_DNA_CDS to dummy dir
subprocess.run(f'cp /path/to/combined_DNA_CDS.fasta panaroo_foo/', shell=True, check=True)

#Multithread writing gene sequences to disk (temp directory) so aligners can find them
#Note, if you have lots of genes to extract you can also batch the nodes in the graph for multiple computers 
os.makedirs('extracted_genes')
num_cpu = 24
unaligned_sequence_files = Parallel(n_jobs=num_cpu)(
    delayed(output_sequence)(G.nodes[x], isolate_names, 'extracted_genes/', 'panaroo_foo/')
    for x in tqdm(list(G.nodes()))
)

#remove single sequence files
unaligned_sequence_files = list(filter(None, unaligned_sequence_files))

# clean up
shutil.rmtree('panaroo_foo')

# create batches of extracted gene fastas for downstream alignment

gtonkinhill commented 2 months ago

Hi,

Thanks for posting this solution! I'll look at whether we can add this into Panaroo directly when I next get a chance.

revinici commented 1 month ago

Okay, I got a much simpler solution. Creating the pangenome sequence alignments and sequence files can take a long time to create, but you can 'scale out' the task to multiple computers by batching the pangenome graph nodes and creating the pangenome alignments for each batch in parallel. My earlier solution just produced the unaligned gene sequences but the new simpler approach produces the all pangenome alignment output files in one go.

Assuming you have the pangenome graph nodes batched, then you can you can generate pangenome alignments for each batch with the code below:

#!/usr/bin/env python3.9
import os
from subprocess import run
import sys

import networkx as nx

# get the path to the panaroo package files
panaroo_path = run('pip show panaroo | grep Location | cut -f2 -d" "',
                              shell=True, check=True, capture_output=True, text=True).stdout
panaroo_path = os.path.join(panaroo_path, 'panaroo')

# add path to python path
sys.path.append(panaroo_path)

# import required panaroo modules
from panaroo.generate_output import generate_pan_genome_alignment

# set threads
threads = 10

# load the final pangenome graph
G = nx.read_gml("/path/to/final/graph_gml")

# load node batch
node_batch = "/path/to/node_batch_1.txt"
with open(node_batch, 'r') as IN:
    selected_nodes = IN.read().splitlines()

# generate a pangenome subgraph conditioned on the nodes in the batch
selected_nodes = [n for n in selected_nodes if n in G]
subgraph = G.subgraph(selected_nodes).copy()

# Load isolate names
isolate_names = G.graph['isolateNames']

temp_dir = 'temp_panalign'
os.makedirs(temp_dir, exist_ok=True)

pangenome_algn_output = f'pan_algn_{os.path.basename(node_batch)}'
os.makedirs(pangenome_algn_output, exist_ok=True)

# decompress the combined DNA and protein fastas
for archive, dest_file in [("combined_DNA_CDS.fasta.gz", "combined_DNA_CDS.fasta"),
                           ("combined_protein_CDS.fasta.gz", "combined_protein_CDS.fasta")]:

    run(f'pigz -d -c {archive} > {pangenome_algn_output}/{dest_file}', shell=True, check=True)

# generate gene and protein MSAs, and gene family multi-FASTAs for the gene nodes in the subgraph
do_codon_alignments = True
generate_pan_genome_alignment(G=subgraph,
                              temp_dir=os.path.join(temp_dir, ""),
                              output_dir=os.path.join(pangenome_algn_output, ""),
                              threads=threads,
                              aligner='mafft',
                              codons=do_codon_alignments,
                              isolates=isolate_names)

# clean up files
run(f'rm {pangenome_algn_output}/combined_protein_CDS.fasta', shell=True, check=True)
run(f'rm {pangenome_algn_output}/combined_DNA_CDS.fasta', shell=True, check=True)
run(f'rm -r {temp_dir}', shell=True, check=True)

# compress pangenome sequences
run(f'tar -cf - {pangenome_algn_output} | pigz -p {threads} > {pangenome_algn_output}.tar.gz', shell=True, check=True)
run(f'rm -r {pangenome_algn_output}', shell=True, check=True)

Then once all batches are done processing you can merge the results like this:

# uncompress the pangenome alignments
mkdir pangenome_alignments

for result_archive in pan_algn_*.tar.gz; do
    tar --use-compress-program 'pigz' -xf $result_archive --strip-components 1 -C pangenome_alignments
done

# make gene alignment archive
mv pangenome_alignments/aligned_gene_sequences gene_alignments
tar -cf - gene_alignments | pigz -p 10 > gene_alignments.tar.gz

# make protein alignment archive
mv pangenome_alignments/aligned_protein_sequences protein_alignments
tar -cf - protein_alignments | pigz -p 10 > protein_alignments.tar.gz

# make gene family sequences archive
mv pangenome_alignments/unaligned_dna_sequences gene_sequences
tar -cf - gene_sequences | pigz -p 10 > gene_sequences.tar.gz

# clean up
rm -r pangenome_alignments gene_alignments protein_alignments gene_sequences

gtonkinhill commented 3 weeks ago

Thanks very much for this. I will update the docs to point to this as a 'how to'. In terms of the main Panaroo code, we've now added the option to specify 'none' as the aligner which will just generate the unaligned gene files.

gtonkinhill / panaroo

Add functionality to batch gene alignments #306