Closed enriquepola1996 closed 3 months ago
Hello,
the hapog.py
script is just a big wrapper around the hapog
binary that is written in C. It seems that the mapping step went as expected so it is possible to launch the hapog
binary on each chunk individually and concatenate the results at the end.
If you installed HAPO-G via conda, you should find a binary called hapog_bin
that you can launch on each chunk. If you compiled it from source, it is located in hapog_build/hapog
. Here is something that should work (not tested so there may be some syntax errors):
set -euo pipefail
# Path to the hapog binary. For conda it is hapog_bin,
# for manual installation it is the full path to hapog_build/hapog
HAPOG=REPLACE_ME
# Look in the hapog/chunks_bam output directory
# Indicate how many chunks there are
CHUNKS=REPLACE_ME
# The output directory of the run that failed due to RAM
# We want to use the mappings that were generated
cd YOUR_PREVIOUS_HAPOG_OUTPUT_DIRECTORY
for chunk in {0..${CHUNKS}}
do
${HAPOG} -b chunks_bam/${chunk}.bam \
-f chunks/${chunk}.fasta \
-o hapog_chunks/${chunk}.fasta \
-c hapog_chunks/${chunk}.changes \
1> logs/hapog_${chunk}.o 2> logs/hapog_${chunk}.e
done
# Everything went as expected, concatenate the results
mkdir -p hapog_results
cat hapog_chunks/*.fasta > hapog_results/hapog.fasta
cat hapog_chunks/*.changes > hapog_results/hapog.changes
Do not hesitate to come back to me if you encounter any problem or to say if that worked!
Hello, thanks for your response, I found something like this:
[epola@server hapog]$ ls
LICENSE.md README.md bin build.sh conda_files hapog hapog.py hapog_build requirements.txt setup.cfg setup.py src
[epola@server hapog]$ cd hapog_build/
[epola@server hapog_build]$ ls
CMakeCache.txt CMakeFiles Makefile cmake_install.cmake hapog
This correct?
My data:
[epola@server Hapo-G]$ ls
assembly.fasta bam chunks chunks_bam cmds hapog_chunks logs
[epola@server Hapo-G]$ ls chunks_bam/
chunks_1.bam chunks_12.bam chunks_15.bam chunks_18.bam chunks_20.bam chunks_5.bam chunks_8.bam
chunks_10.bam chunks_13.bam chunks_16.bam chunks_19.bam chunks_3.bam chunks_6.bam chunks_9.bam
chunks_11.bam chunks_14.bam chunks_17.bam chunks_2.bam chunks_4.bam chunks_7.bam
I was thinking of launching something like this:
#!/bin/bash
#PBS -N HapoG_Chunks
#PBS -l walltime=700:00:00
#PBS -l nodes=1:ppn=20,vmem=700gb
#PBS -o hapog_chunks_output.log
#PBS -e hapog_chunks_error.log
#PBS -q ensam
# Load the necessary modules
module load hapog/1.3.7
module load samtools/1.9
module load bwa/0.7.17
# Change to the working directory
cd $PBS_O_WORKDIR
set -euo pipefail
# Path to the hapog binary
HAPOG=/data/software/hapog/hapog_build/hapog
#Indicate how many chunks there are
CHUNKS=20
# The output directory of the run that failed due to RAM
# We want to use the mappings that were generated
cd user/Hapo-G
for chunk in $(seq 1 ${CHUNKS})
do
${HAPOG} -b chunks_bam/chunks_${chunk}.bam \
-f chunks/chunks_${chunk}.fasta \
-o hapog_chunks/chunks_${chunk}.fasta \
-c hapog_chunks/chunks_${chunk}.changes \
1> logs/hapog_${chunk}.o 2> logs/hapog_${chunk}.e
donated
# Everything went as expected, concatenate the results
mkdir -p hapog_results
cat hapog_chunks/*.fasta > hapog_results/hapog.fasta
cat hapog_chunks/*.changes > hapog_results/hapog.changes
What do you think? Thank so much for your help.
Should I delete the hapog_chunks folder before running it?
Everything looks good, you can keep the folder as it will not create it by itself.
Hello,
Thank you very much for the support, my process has ended successfully. I just have a couple of questions, is there a problem if I do the polishing on a previously masked genome? I'm seeing that the hapog.fasta result is an unmasked genome, it changed. I have masked the genome again, but I just want to make sure there is no problem putting a masked genome in for polishing. I saw that the number of scaffolds decreased by 10% and the N50 increased slightly. I would also like to see what is the best way to see if the polishing was successful, what metrics could give me a signal.
I greatly appreciated your responses and again thank you very much for providing solutions to users.
Also the other question I have, is it advisable to do several rounds of polishing with Hapo-G?
Hello,
by masked genome, do you mean that the masked bases are in lower-case or Ns? If it is the former then yes, HAPO-G outputs everything as upper-case. In the case of Ns, then it should not have changed anything.
Regarding the number of scaffolds decreasing, as you are not using the wrapper hapog.py
script, the unpolished sequence are not written to the resulting fasta file. These unpolished sequences are the ones that had no alignments in the BAM file. To get them back, you would need to see which ones are missing from your original fasta file and copy them to the results. This is done like this in the wrapper (the genome
parameter is your original assembly fasta file path):
def include_unpolished(genome):
initial_contig_names = set()
if os.path.exists("correspondance.txt"):
with open("correspondance.txt") as corr:
for line in corr:
_, initial = line.strip("\n").split("\t")
initial_contig_names.add(initial)
else:
with open(genome) as genome_file:
for line in genome_file:
if line.startswith(">"):
initial_contig_names.add(line[1:].strip("\n"))
polished_contig_names = set()
with open("hapog_results/hapog.fasta") as fasta:
for line in fasta:
if line.startswith(">"):
contig_name = line[1:].strip("\n").replace("_polished", "")
polished_contig_names.add(contig_name)
with open("hapog_results/hapog.fasta", "a") as out:
genome_file = open(genome)
for record in SeqIO.parse(genome_file, "fasta"):
if record.description.replace("_polished", "") not in polished_contig_names:
out.write(record.format("fasta"))
genome_file.close()
If we deal with Nanopore data, we routinely do one round of medaka polishing with Nanopore reads and two rounds of HAPO-G with Illumina reads.
I am a bit curious about your genome assembly stats as we never reach these high RAM levels, even on very large genomes. Could you please tell me what are the N50 and maximum size of your sequences?
Hello, thank you again very much for your support and yes, I polished a softmask genome. I will recover the missing scaffolds as indicated.
My assembly statistics are as follows:
Non polish genome
Statistics without reference
# contigs 11948
# contigs (>= 0 bp) 11948
# contigs (>= 1000 bp) 11815
Largest contig 6559239
Total length 2675680810
Total length (>= 0 bp) 2675680810
Total length (>= 1000 bp) 2675578986
N50 581467
N90 109645
auN 885971
L50 1210
L90 5264
GC (%) 38.18
Mismatches
# N's per 100 kbp 0
# N's 0
Polish genome (one round with Hapo-G)
Statistics without reference
# contigs 10303
# contigs (>= 0 bp) 10303
# contigs (>= 1000 bp) 10197
Largest contig 6560227
Total length 2604764496
Total length (>= 0 bp) 2604764496
Total length (>= 1000 bp) 2604683834
N50 609613
N90 130055
auN 908215
L50 1151
L90 4748
GC (%) 38.35
Mismatches
# N's per 100 kbp 0
# N's 65
It is certainly quite fragmented.
Thank you very much, the program ran well, you can close this issue.
Thank you for reporting back!
Hello dear developers,
I am using Hapo-G for the first time to polish a 2.6Gb plant genome with Illumina, however, I am running out of RAM, the consumption is above 700GB of RAM and I do not have more of those, will there be a strategy that Can I minimize RAM consumption?
The process was going well:
However, it died due to lack of RAM.
I would like to know if it is possible to polish little by little with Illumina? What I did was concatenate all my data and put it in Hapo-G but I don't know if putting individual data would help. The script I am using is the following: