marbl / SALSA

SALSA: A tool to scaffold long read assemblies with Hi-C data
MIT License

Similar number of scaffolds to contigs #182

Open dgs108 opened 4 months ago

dgs108 commented 4 months ago

I'm working on scaffolding a HiCanu assembly of a shark species; the assembly has had duplicate haplotigs purged (purge_dups). It was assembled with ~45x PacBio HiFi (median length: 13,710 bp; mean length: 13,590 bp; max length: 62,066 bp). Here are some assembly stats: 3,054 contigs, largest contig 37,942,190 bp, total length 4,155,925,466 bp, L50 186, L90 906.

I scaffolded with the most recent version of SALSA2 after following the Arima mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline/blob/master/Arima_Mapping_UserGuide_A160156_v03.pdf), using 559 million paired-end 150 bp Hi-C reads produced from a single library prep.

SALSA placed the 3,054 contigs into 1,873 scaffolds with the following stats: largest scaffold is 166,764,294 bp, total length is 4,156,643,966 bp, L50 is 19, L90 is 317.

BUSCO scores for the scaffolded assembly (without polishing) are decent (92.4% complete; 4.0% fragmented; 3.6% missing) but I would like to improve on these and (more importantly) get the assembly closer to chromosome level, if possible with the data I have.

Any advice would be much appreciated!

Here is my script:

#!/bin/bash
#SBATCH -J scaff
#SBATCH -o scaff.out
#SBATCH -e scaff.err
#SBATCH -N 1
#SBATCH -n 64
#SBATCH -p normal
#SBATCH --mail-user=dominic.swift@tamucc.edu
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --time=96:00:00

source /home/dswift/.bashrc
conda activate scaffold

# choose preliminary assembly to scaffold and assign

ASSEMBLY_PATH="$WORK/Workspace/sand_tiger/genome/assemblies/canu/purge_dups/purged.fa"

ASSEMBLY=$(basename "$ASSEMBLY_PATH")

# copy to scaffolding directory, rename, and assign

cp -v "$ASSEMBLY_PATH" .

mv "$ASSEMBLY" prelim_assembly.fa

PRELIM="prelim_assembly.fa"

# assign prelim prefix

PREFIX=$(basename "$PRELIM" | cut -d. -f1)

# index assembly; this step is also done when prepping files for SALSA, so it can be skipped if you've already done it

samtools faidx "$PRELIM"

bwa index -a bwtsw "$PRELIM"

# assign forward and reverse hi-c reads and trim
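# note: the awk below strips the first 5 bases from the sequence and quality
# lines of each FASTQ record (NR%2==0 matches lines 2 and 4 of every 4-line
# record); the 5 bp trim length is specific to this library prep, so verify it
# against your own protocol before reusing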

HIC_F="$WORK/Workspace/sand_tiger/genome/hic/20043-07-01_S0_L001_R1_001.fastq.gz"

HIC_R="$WORK/Workspace/sand_tiger/genome/hic/20043-07-01_S0_L001_R2_001.fastq.gz"

zcat "$HIC_F" | awk '{ if (NR%2==0) {print substr($1,6)} else {print} }' | gzip > ../hic/hic_R1_trim.fastq.gz

zcat "$HIC_R" | awk '{ if (NR%2==0) {print substr($1,6)} else {print} }' | gzip > ../hic/hic_R2_trim.fastq.gz

# align trimmed forward and reverse hi-c reads to preliminary assembly independently 

bwa mem -t 64 "$PRELIM" ../hic/hic_R1_trim.fastq.gz | samtools view -@ 64 -b - > "${PREFIX}_1.bam"

bwa mem -t 64 "$PRELIM" ../hic/hic_R2_trim.fastq.gz | samtools view -@ 64 -b - > "${PREFIX}_2.bam"

# filter bam files
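# filter_five_end.pl (from the Arima mapping pipeline) keeps only alignments
# whose 5' end mapped, handling reads made chimeric by the Hi-C ligation junction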

samtools view -h "${PREFIX}_1.bam" -@ 64 | perl ~/bin/filter_five_end.pl | samtools view -@ 64 -b - > "${PREFIX}_filt_1.bam"

samtools view -h "${PREFIX}_2.bam" -@ 64 | perl ~/bin/filter_five_end.pl | samtools view -@ 64 -b - > "${PREFIX}_filt_2.bam"

# pair filtered bam files and sort
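# two_read_bam_combiner.pl merges the two single-end BAMs into paired records;
# the trailing argument (10) is the minimum mapping quality to keep, per the
# Arima pipeline's example usage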

perl ~/bin/two_read_bam_combiner.pl "${PREFIX}_filt_1.bam" "${PREFIX}_filt_2.bam" samtools 10 | samtools view -b -t "${PRELIM}.fai" - | samtools sort -@ 64 -o temp.bam -

# add read groups to bam

java -Xmx4G -Djava.io.tmpdir=temp/ -jar /home/dswift/bin/miniconda3/envs/scaffold/share/picard-2.18.29-0/picard.jar AddOrReplaceReadGroups INPUT=temp.bam OUTPUT=paired.bam ID=$PRELIM LB=$PRELIM SM=$ASSEMBLY PL=ILLUMINA PU=none

# discard PCR duplicates
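# MarkDuplicates with REMOVE_DUPLICATES=TRUE drops duplicate pairs from the
# output BAM rather than just flagging them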

java -Xmx60G -XX:-UseGCOverheadLimit -Djava.io.tmpdir=temp/ -jar /home/dswift/bin/miniconda3/envs/scaffold/share/picard-2.18.29-0/picard.jar MarkDuplicates INPUT=paired.bam OUTPUT=align.bam METRICS_FILE=metrics.alignment.txt TMP_DIR=temp/ ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=TRUE

# index align.bam

samtools index align.bam -@ 64

# produce stats

perl ~/bin/get_stats.pl align.bam > bamstats.txt

samtools flagstat align.bam -@ 64 > flagstats.txt

# convert bam to bed format required by SALSA2 and sort by read name
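# bamToBed comes from bedtools; SALSA2 expects the BED sorted by read name
# (column 4), which is what the sort below does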

bamToBed -i align.bam > align.bed

sort -k 4 align.bed > tmp && mv tmp align.bed

# change conda env

conda deactivate

conda activate salsa2_new

# SALSA2
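# -e lists the Arima restriction sites; -m yes enables misassembly correction
# of the input contigs before scaffolding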

run_pipeline.py -a prelim_assembly.fa -l prelim_assembly.fa.fai -b align.bed -e GATC,GACTC,GAGTC,GATTC,GAATC -o ctau_canu_purged_scaffold_r3 -m yes
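
# quick look at the largest scaffolds afterwards; this assumes SALSA2's usual
# scaffolds_FINAL.fasta output naming inside the -o directory
samtools faidx ctau_canu_purged_scaffold_r3/scaffolds_FINAL.fasta
cut -f1,2 ctau_canu_purged_scaffold_r3/scaffolds_FINAL.fasta.fai | sort -k2,2nr | head
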
skoren commented 4 months ago

I think your assembly is close to chromosome scale, at least at the N50 level. According to this: https://www.genomesize.com/result_species.php?id=1665, the species has 43 diploid chromosomes, so having half the genome in 19 scaffolds is on par with that. An assembly of another shark species (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_037974335.1/) supports this. BUSCO wouldn't really be affected by scaffolding, since no new sequence is added to the assembly and any joins have gaps; a gene that is partial will still likely be partial.

I'd also suggest trying YAHS (https://github.com/c-zhou/yahs) and relying on manual curation (e.g. https://github.com/BGAcademy23/manual-curation) to finalize the assembly.
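For reference, a minimal YAHS invocation sketched against the files your script already produces (the -e sites carry over from your SALSA command, and the ctau_yahs output prefix is just a placeholder; YAHS also accepts the name-sorted align.bed in place of a BAM, which sidesteps any BAM sorting requirements):

samtools faidx prelim_assembly.fa
yahs -e GATC,GACTC,GAGTC,GATTC,GAATC -o ctau_yahs prelim_assembly.fa align.bed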

dgs108 commented 4 months ago

Thanks for the quick response! It is the sand tiger shark and its genome is a fair bit larger than that: https://www.genomesize.com/results.php?page=1

Based on the other genomes I have scaffolded, I expected <1500 scaffolds given the number of contigs. I will look into those other tools.