pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License

[help] My reference genome of a diploid organism is a primary assembly #363

Closed Isoris closed 7 months ago

Isoris commented 7 months ago

Hello,

Thanks for making PGGB. I have a question: I am running an analysis in which I would like to align 18 species of catfish. They are diploid individuals; however, when downloading the GenBank (GCA) assemblies with curl, each one has only around 28+ chromosomes plus anywhere from 47 to 300+ scaffolds, depending on the assembly quality. I would like to know whether these genomes are haploid representations of the diploid genome, i.e., whether the two haplotypes are collapsed when an assembly is primary. If so, should PGGB be run with haplotype = 1?

Statistic | RefSeq | GenBank
-- | -- | --
Genome size | 969.6 Mb | 969.6 Mb
Total ungapped length | 969.6 Mb | 969.6 Mb
Number of chromosomes | 28 | 28
Number of organelles | 1 | 0
Number of scaffolds | 47 | 47
Scaffold N50 | 33.7 Mb | 33.7 Mb
Scaffold L50 | 12 | 12
Number of contigs | 47 | 47
Contig N50 | 33.7 Mb | 33.7 Mb
Contig L50 | 12 | 12
GC percent | 39 | 39
Genome coverage | 100.0x | 100.0x
Assembly level | Chromosome | Chromosome

Genome assembly CGAR_prim_01v2 (reference)

NCBI RefSeq assembly
    GCF_024256425.1

Submitted GenBank assembly
    GCA_024256425.2

Taxon
    [Clarias gariepinus](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/13013) (North African catfish)
Isolate
    MV-2021
WGS project
    [JAMBPB01](https://www.ncbi.nlm.nih.gov/nuccore/JAMBPB000000000.1)
Assembly type
    haploid
Submitter
    Leibniz Institute for Farm Animal Biology (FBN)
Date
    Jul 13, 2022

Sample details

BioSample ID
    [SAMN27021044](https://www.ncbi.nlm.nih.gov/biosample/SAMN27021044/)
Description
    Model organism or animal sample from Clarias gariepinus
Comment
    Genome assemblies of Clarias gariepinus, including diploid-collapsed (i.e., primary) and haplotype-resolved (i.e., phased) assemblies.
Owner name
    Leibniz Institute for Farm Animal Biology (FBN)
Isolate
    MV-2021
Ecotype
    Netherlands
Age
    1.2

Assembly methods

Sequencing technology
    Oxford Nanopore PromethION; PacBio Sequel; Illumina NovaSeq; Illumina HiSeq
Comment
    This genome assembly is the traditional collapsed haploid assembly of this particular individual organism's genome. The same reads were also separately assembled into two pseudohaplotype assemblies of the diploid genome. All three assemblies have this same BioSample.
Assembly method
    Hifiasm v. 0.16.1

Additional genomes
[Browse all Clarias gariepinus genomes (4)](https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=13013)
BioProject
[PRJNA818990](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA818990/)

Clarias gariepinus Genome sequencing and assembly
Annotation details
[See full annotation report](https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Clarias_gariepinus/GCF_024256425.1-RS_2023_03)

Field | RefSeq
-- | --
Provider | NCBI RefSeq
Name | GCF_024256425.1-RS_2023_03
Date | Mar 3, 2023
Genes | 40,491
Protein-coding | 24,297
Software version | 10.1

Quality analysis
BUSCO analysis (4.1.4), lineage actinopterygii_odb10 (3640)

Type | Value
-- | --
Single copy | 97.6%
Duplicated | 1.5%
Fragmented | 0.2%
Missing | 0.7%

C:99.1%[S:97.6%,D:1.5%],F:0.2%,M:0.7%,n:3640

Chromosomes

Chromosome | GenBank | RefSeq | Size (bp) | GC content (%) | Unlocalized count
-- | -- | -- | -- | -- | --
1 | [CM044232.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044232.1/) | [NC_071100.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071100.1/) | 52,237,485 | 39 | 0
2 | [CM044233.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044233.1/) | [NC_071101.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071101.1/) | 52,228,123 | 39 | 0
3 | [CM044234.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044234.1/) | [NC_071102.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071102.1/) | 48,147,763 | 39 | 0
4 | [CM044235.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044235.1/) | [NC_071103.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071103.1/) | 43,632,810 | 39 | 0
5 | [CM044236.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044236.1/) | [NC_071104.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071104.1/) | 41,276,641 | 39 | 0
6 | [CM044237.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044237.1/) | [NC_071105.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071105.1/) | 41,146,859 | 38.5 | 0
7 | [CM044238.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044238.1/) | [NC_071106.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071106.1/) | 40,980,544 | 39 | 0
8 | [CM044239.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044239.1/) | [NC_071107.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071107.1/) | 38,527,363 | 38.5 | 0
9 | [CM044240.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044240.1/) | [NC_071108.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071108.1/) | 37,984,889 | 39 | 0
10 | [CM044241.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044241.1/) | [NC_071109.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071109.1/) | 35,173,121 | 38.5 | 0
11 | [CM044242.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044242.1/) | [NC_071110.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071110.1/) | 34,464,041 | 39 | 0
12 | [CM044243.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044243.1/) | [NC_071111.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071111.1/) | 33,715,535 | 39.5 | 0
13 | [CM044244.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044244.1/) | [NC_071112.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071112.1/) | 33,675,256 | 39 | 0
14 | [CM044245.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044245.1/) | [NC_071113.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071113.1/) | 32,602,070 | 39 | 0
15 | [CM044246.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044246.1/) | [NC_071114.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071114.1/) | 32,143,731 | 39 | 0
16 | [CM044247.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044247.1/) | [NC_071115.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071115.1/) | 31,840,490 | 39 | 0
17 | [CM044248.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044248.1/) | [NC_071116.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071116.1/) | 30,626,297 | 39 | 0
18 | [CM044249.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044249.1/) | [NC_071117.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071117.1/) | 30,601,865 | 39 | 0
19 | [CM044250.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044250.1/) | [NC_071118.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071118.1/) | 30,248,171 | 38.5 | 0
20 | [CM044251.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044251.1/) | [NC_071119.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071119.1/) | 29,824,949 | 39 | 0
21 | [CM044252.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044252.1/) | [NC_071120.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071120.1/) | 29,591,905 | 39 | 0
22 | [CM044253.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044253.1/) | [NC_071121.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071121.1/) | 29,481,022 | 39 | 0
23 | [CM044254.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044254.1/) | [NC_071122.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071122.1/) | 26,592,883 | 39 | 0
24 | [CM044255.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044255.1/) | [NC_071123.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071123.1/) | 26,202,038 | 39 | 0
25 | [CM044256.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044256.1/) | [NC_071124.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071124.1/) | 25,952,957 | 39 | 0
26 | [CM044257.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044257.1/) | [NC_071125.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071125.1/) | 25,461,494 | 38.5 | 0
27 | [CM044258.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044258.1/) | [NC_071126.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071126.1/) | 24,522,578 | 39 | 0
28 | [CM044259.1](https://www.ncbi.nlm.nih.gov/nuccore/CM044259.1/) | [NC_071127.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_071127.1/) | 21,030,501 | 40 | 0
MT | [KT001082.1](https://www.ncbi.nlm.nih.gov/nuccore/KT001082.1/) | [NC_027661.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_027661.1/) | 16,508 | 43 | 0

Note: This genome assembly includes 19 unplaced scaffolds.
Revision history

GenBank | RefSeq | Name | Level | Date
-- | -- | -- | -- | --
GCA_024256425.2 | GCF_024256425.1 | CGAR_prim_01v2 | Chromosome | Jul 13, 2022
[GCA_024256425.1](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_024256425.1/) | n/a | CGAR_prim_01 | Chromosome | Jul 13, 2022

Thank you for your time.

Quentin

ekg commented 7 months ago

Set things up so that the number of haploid copies you expect in your summaries is the number that's given to PGGB. You'll need to investigate each assembly to see if it's haploid (collapsed) or diploid.
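
To make the counting concrete, here is a minimal Python sketch. It assumes the sequences already carry PanSN-style names (`sample#haplotype#contig`); the sample names and the helper `count_haplotypes` are hypothetical and not part of PGGB. It counts distinct (sample, haplotype) pairs, which is presumably the number you would hand to PGGB as its haplotype count (for example via the `-n` option, if your version takes it).

```python
def count_haplotypes(sequence_names, delim="#"):
    """Count distinct (sample, haplotype) pairs among PanSN-style names."""
    haplotypes = set()
    for name in sequence_names:
        # PanSN layout: sample, haplotype id, contig name.
        parts = name.split(delim, 2)
        if len(parts) < 3:
            raise ValueError(f"not a PanSN-style name: {name!r}")
        haplotypes.add((parts[0], parts[1]))
    return len(haplotypes)


# Hypothetical names: fish1 and fish2 are collapsed (one haplotype each),
# fish3 is a phased diploid assembly contributing two haplotypes.
names = [
    "fish1#1#CM044232.1",
    "fish1#1#CM044233.1",
    "fish2#1#chrA",
    "fish3#1#ctg1",
    "fish3#2#ctg1",
]

n_haplotypes = count_haplotypes(names)
print(n_haplotypes)
```

With 18 collapsed haploid assemblies, one per species, the same count would come out as 18.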

ekg commented 7 months ago

Also, I suggest renaming the FASTA sequences to follow the PanSN format. If the assemblies are haploid collapsed, then you'd have something like fish1#1#accession for each sequence.
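
A rough sketch of that renaming in Python, under the assumption that each assembly is a plain FASTA file whose header starts with the accession. The helper names (`to_pansn`, `rename_fasta`), the sample label `fish1`, and the toy sequences are illustrative only; the accessions are taken from the table above.

```python
def to_pansn(header, sample, haplotype=1, delim="#"):
    """Turn a FASTA header line into '>sample#haplotype#contig'."""
    # Keep only the first whitespace-delimited token after '>' (the accession).
    contig = header[1:].split()[0]
    return f">{sample}{delim}{haplotype}{delim}{contig}"


def rename_fasta(lines, sample, haplotype=1):
    """Yield FASTA lines with headers rewritten; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield to_pansn(line, sample, haplotype)
        else:
            yield line


# Toy records: real accessions from the issue, made-up sequence content.
fasta = [
    ">CM044232.1 chromosome 1\n",
    "ACGTACGT\n",
    ">CM044233.1 chromosome 2\n",
    "TTGACCA\n",
]

renamed = list(rename_fasta(fasta, sample="fish1"))
print("\n".join(renamed))
```

For a collapsed haploid assembly the haplotype field stays at 1; a phased assembly would be renamed in two passes, once per haplotype file, with `haplotype=1` and `haplotype=2`.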

Isoris commented 7 months ago

Thank you Eric.