High RAM consumption in plant genome

enriquepola1996 commented 3 months ago

Hello dear developers,

I am using Hapo-G for the first time to polish a 2.6Gb plant genome with Illumina, however, I am running out of RAM, the consumption is above 700GB of RAM and I do not have more of those, will there be a strategy that Can I minimize RAM consumption?

The process was going well:

[epola@server Hapo-G]$ ls
assembly.fasta  bam  chunks  chunks_bam  cmds  hapog_chunks  logs

[epola@server Hapo-G]$ ls hapog_chunks/
chunks_1.changes   chunks_12.changes  chunks_15.changes  chunks_18.changes  chunks_20.changes  chunks_5.changes  chunks_8.changes
chunks_1.fasta     chunks_12.fasta    chunks_15.fasta    chunks_18.fasta    chunks_20.fasta    chunks_5.fasta    chunks_8.fasta
chunks_10.changes  chunks_13.changes  chunks_16.changes  chunks_19.changes  chunks_3.changes   chunks_6.changes  chunks_9.changes
chunks_10.fasta    chunks_13.fasta    chunks_16.fasta    chunks_19.fasta    chunks_3.fasta     chunks_6.fasta    chunks_9.fasta
chunks_11.changes  chunks_14.changes  chunks_17.changes  chunks_2.changes   chunks_4.changes   chunks_7.changes
chunks_11.fasta    chunks_14.fasta    chunks_17.fasta    chunks_2.fasta     chunks_4.fasta     chunks_7.fasta

However, it died due to lack of RAM.

I would like to know if it is possible to polish little by little with Illumina? What I did was concatenate all my data and put it in Hapo-G but I don't know if putting individual data would help. The script I am using is the following:

#!/bin/bash
#PBS -N Hapo-all_fastq
#PBS -l walltime=700:00:00
#PBS -l nodes=1:ppn=20,vmem=700gb
#PBS -o hapo_output.log
#PBS -e hapo_error.log
#PBS -q ensam

# Load the Hapo-G module
module load hapog/1.3.7
module load samtools/1.9
module load bwa/0.7.17

# Change to the working directory
cd $PBS_O_WORKDIR

# Run Hapo-G
hapog.py \
 --genome ../plant_ref.fasta \
 --pe1 AT_R1_P.fastq.gz \
 --pe2 AT_R2_P.fastq.gz \
 -o Hapo-G \
 -t 20 \
 -u

bistace commented 3 months ago

Hello,

the hapog.py script is just a big wrapper around the hapog binary that is written in C. It seems that the mapping step went as expected so it is possible to launch the hapog binary on each chunk individually and concatenate the results at the end.

If you installed HAPO-G via conda, you should find a binary called hapog_bin that you can launch on each chunk. If you compiled it from source, it is located in hapog_build/hapog. Here is something that should work (not tested so there may be some syntax errors):

set -euo pipefail

# Path to the hapog binary. For conda it is hapog_bin, 
# for manual installation it is the full path to hapog_build/hapog
HAPOG=REPLACE_ME

# Look in the hapog/chunks_bam output directory
# Indicate how many chunks there are
CHUNKS=REPLACE_ME

# The output directory of the run that failed due to RAM
# We want to use the mappings that were generated
cd YOUR_PREVIOUS_HAPOG_OUTPUT_DIRECTORY

for chunk in {0..${CHUNKS}}
do
    ${HAPOG} -b chunks_bam/${chunk}.bam \
        -f chunks/${chunk}.fasta \
        -o hapog_chunks/${chunk}.fasta \
        -c hapog_chunks/${chunk}.changes \
        1> logs/hapog_${chunk}.o 2> logs/hapog_${chunk}.e
done

# Everything went as expected, concatenate the results
mkdir -p hapog_results
cat hapog_chunks/*.fasta > hapog_results/hapog.fasta
cat hapog_chunks/*.changes > hapog_results/hapog.changes

Do not hesitate to come back to me if you encounter any problem or to say if that worked!

enriquepola1996 commented 3 months ago

Hello, thanks for your response, I found something like this:

[epola@server hapog]$ ls
LICENSE.md README.md bin build.sh conda_files hapog hapog.py hapog_build requirements.txt setup.cfg setup.py src

[epola@server hapog]$ cd hapog_build/
[epola@server hapog_build]$ ls
CMakeCache.txt CMakeFiles Makefile cmake_install.cmake hapog

This correct?

My data:

[epola@server Hapo-G]$ ls
assembly.fasta  bam  chunks  chunks_bam  cmds  hapog_chunks  logs
[epola@server Hapo-G]$ ls chunks_bam/
chunks_1.bam   chunks_12.bam  chunks_15.bam  chunks_18.bam  chunks_20.bam  chunks_5.bam  chunks_8.bam
chunks_10.bam  chunks_13.bam  chunks_16.bam  chunks_19.bam  chunks_3.bam   chunks_6.bam  chunks_9.bam
chunks_11.bam  chunks_14.bam  chunks_17.bam  chunks_2.bam   chunks_4.bam   chunks_7.bam

I was thinking of launching something like this:

#!/bin/bash
#PBS -N HapoG_Chunks
#PBS -l walltime=700:00:00
#PBS -l nodes=1:ppn=20,vmem=700gb
#PBS -o hapog_chunks_output.log
#PBS -e hapog_chunks_error.log
#PBS -q ensam

# Load the necessary modules
module load hapog/1.3.7
module load samtools/1.9
module load bwa/0.7.17

# Change to the working directory
cd $PBS_O_WORKDIR

set -euo pipefail

# Path to the hapog binary
HAPOG=/data/software/hapog/hapog_build/hapog

#Indicate how many chunks there are
CHUNKS=20

# The output directory of the run that failed due to RAM
# We want to use the mappings that were generated
cd  user/Hapo-G

for chunk in $(seq 1 ${CHUNKS})
do
 ${HAPOG} -b chunks_bam/chunks_${chunk}.bam \
 -f chunks/chunks_${chunk}.fasta \
 -o hapog_chunks/chunks_${chunk}.fasta \
 -c hapog_chunks/chunks_${chunk}.changes \
 1> logs/hapog_${chunk}.o 2> logs/hapog_${chunk}.e
donated

# Everything went as expected, concatenate the results
mkdir -p hapog_results
cat hapog_chunks/*.fasta > hapog_results/hapog.fasta
cat hapog_chunks/*.changes > hapog_results/hapog.changes

What do you think? Thank so much for your help.

enriquepola1996 commented 3 months ago

Should I delete the hapog_chunks folder before running it?

bistace commented 3 months ago

Everything looks good, you can keep the folder as it will not create it by itself.

enriquepola1996 commented 3 months ago

Hello,

Thank you very much for the support, my process has ended successfully. I just have a couple of questions, is there a problem if I do the polishing on a previously masked genome? I'm seeing that the hapog.fasta result is an unmasked genome, it changed. I have masked the genome again, but I just want to make sure there is no problem putting a masked genome in for polishing. I saw that the number of scaffolds decreased by 10% and the N50 increased slightly. I would also like to see what is the best way to see if the polishing was successful, what metrics could give me a signal.

I greatly appreciated your responses and again thank you very much for providing solutions to users.

enriquepola1996 commented 3 months ago

Also the other question I have, is it advisable to do several rounds of polishing with Hapo-G?

bistace commented 3 months ago

Hello,

by masked genome, do you mean that the masked bases are in lower-case or Ns? If it is the former then yes, HAPO-G outputs everything as upper-case. In the case of Ns, then it should not have changed anything.

Regarding the number of scaffolds decreasing, as you are not using the wrapper hapog.py script, the unpolished sequence are not written to the resulting fasta file. These unpolished sequences are the ones that had no alignments in the BAM file. To get them back, you would need to see which ones are missing from your original fasta file and copy them to the results. This is done like this in the wrapper (the genome parameter is your original assembly fasta file path):

def include_unpolished(genome):
    initial_contig_names = set()
    if os.path.exists("correspondance.txt"):
        with open("correspondance.txt") as corr:
            for line in corr:
                _, initial = line.strip("\n").split("\t")
                initial_contig_names.add(initial)
    else:
        with open(genome) as genome_file:
            for line in genome_file:
                if line.startswith(">"):
                    initial_contig_names.add(line[1:].strip("\n"))

    polished_contig_names = set()
    with open("hapog_results/hapog.fasta") as fasta:
        for line in fasta:
            if line.startswith(">"):
                contig_name = line[1:].strip("\n").replace("_polished", "")
                polished_contig_names.add(contig_name)

    with open("hapog_results/hapog.fasta", "a") as out:
        genome_file = open(genome)
        for record in SeqIO.parse(genome_file, "fasta"):
            if record.description.replace("_polished", "") not in polished_contig_names:
                out.write(record.format("fasta"))
        genome_file.close()

If we deal with Nanopore data, we routinely do one round of medaka polishing with Nanopore reads and two rounds of HAPO-G with Illumina reads.

I am a bit curious about your genome assembly stats as we never reach these high RAM levels, even on very large genomes. Could you please tell me what are the N50 and maximum size of your sequences?

enriquepola1996 commented 3 months ago

Hello, thank you again very much for your support and yes, I polished a softmask genome. I will recover the missing scaffolds as indicated.

My assembly statistics are as follows:

Non polish genome
Statistics without reference    
# contigs   11948
# contigs (>= 0 bp) 11948
# contigs (>= 1000 bp)  11815
Largest contig  6559239
Total length    2675680810
Total length (>= 0 bp)  2675680810
Total length (>= 1000 bp)   2675578986
N50 581467
N90 109645
auN 885971
L50 1210
L90 5264
GC (%)  38.18
Mismatches  
# N's per 100 kbp   0
# N's   0

Polish genome (one round with Hapo-G)
Statistics without reference    
# contigs   10303
# contigs (>= 0 bp) 10303
# contigs (>= 1000 bp)  10197
Largest contig  6560227
Total length    2604764496
Total length (>= 0 bp)  2604764496
Total length (>= 1000 bp)   2604683834
N50 609613
N90 130055
auN 908215
L50 1151
L90 4748
GC (%)  38.35
Mismatches  
# N's per 100 kbp   0
# N's   65

It is certainly quite fragmented.

enriquepola1996 commented 3 months ago

Thank you very much, the program ran well, you can close this issue.

bistace commented 3 months ago

Thank you for reporting back!

institut-de-genomique / HAPO-G

High RAM consumption in plant genome #38