Open dgs108 opened 4 months ago
I think your assembly is close to chromosome scale, at least at the N50 level. According to https://www.genomesize.com/result_species.php?id=1665, the species has 43 diploid chromosomes, so having half the genome in 19 scaffolds is on par with that. An assembly from another shark species (https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_037974335.1/) supports this. The BUSCO scores wouldn't really be affected by scaffolding, since no new sequence is added to the assembly and any joins are filled with gaps, so a gene that is partial will likely remain partial.
I'd suggest also trying YaHS (https://github.com/c-zhou/yahs) and relying on curation (e.g. https://github.com/BGAcademy23/manual-curation) to finalize the assembly.
Thanks for the quick response! It is the sand tiger shark and its genome is a fair bit larger than that: https://www.genomesize.com/results.php?page=1
Based on the other genomes I have scaffolded, I expected <1500 scaffolds given the number of contigs. I will look into those other tools.
I'm working on scaffolding a duplicate-purged HiCanu assembly for a shark species. It was assembled from ~45x PacBio HiFi reads (median length: 13,710 bp; mean length: 13,590 bp; max. length: 62,066 bp). Here are some assembly stats: 3,054 contigs, largest contig is 37,942,190 bp, total length is 4,155,925,466 bp, L50 is 186, L90 is 906.
I have scaffolded using the most recent version of SALSA2 after following the Arima preparation pipeline (https://github.com/ArimaGenomics/mapping_pipeline/blob/master/Arima_Mapping_UserGuide_A160156_v03.pdf) with 559 million paired-end 150 bp Hi-C reads produced in a single library prep.
SALSA placed the 3,054 contigs into 1,873 scaffolds with the following stats: largest scaffold is 166,764,294 bp, total length is 4,156,643,966 bp, L50 is 19, L90 is 317.
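For anyone comparing these numbers against their own assemblies, here is a minimal Python sketch of how L50/L90 (and the matching N50/N90) can be computed from a list of sequence lengths, plus the simple arithmetic on the contig and scaffold totals reported above. The function name `nx_lx` and the toy lengths are hypothetical and only for illustration. The reading of the extra bases as gap padding is an assumption, not something confirmed from the SALSA output.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx): the length of the sequence at which the running
    total (longest first) reaches `fraction` of the assembly, and how
    many sequences were needed to get there."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    target = fraction * sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("fraction must be between 0 and 1")


# Toy lengths (hypothetical, not from the real assembly).
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = nx_lx(toy_lengths, 0.5)
n90, l90 = nx_lx(toy_lengths, 0.9)
print(f"N50={n50} L50={l50} N90={n90} L90={l90}")

# Arithmetic on the numbers reported in the post.
contigs, scaffolds = 3_054, 1_873
contig_total, scaffold_total = 4_155_925_466, 4_156_643_966
net_joins = contigs - scaffolds               # net reduction in sequence count
added_bases = scaffold_total - contig_total   # assumed to be gap padding
print(f"net joins: {net_joins}, bases added by scaffolding: {added_bases}")
```

In practice the lengths would come from the FASTA index (for example the second column of a `.fai` file) rather than a hard-coded list.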
BUSCO scores for the scaffolded assembly (without polishing) are decent (92.4% complete; 4.0% fragmented; 3.6% missing) but I would like to improve on these and (more importantly) get the assembly closer to chromosome level, if possible with the data I have.
Any advice would be much appreciated!
Here is my script: