phasegenomics / FALCON-Phase

FALCON-Phase integrates PacBio long-read assemblies with Phase Genomics Hi-C data to create phased, diploid, chromosome-scale scaffolds

A significant drop in BUSCO scores after FALCON-Phase #85

Open melop opened 2 years ago

melop commented 2 years ago

Hello. FALCON-Phase appears to produce contigs with much lower BUSCO scores than the input, 4-polish/cns-output/polished_p_ctgs.fasta.

Here are the benchmarks: 4-polish/cns-output/polished_p_ctgs.fasta

    --------------------------------------------------
    |Results from dataset actinopterygii_odb10        |
    --------------------------------------------------
    |C:98.8%[S:97.6%,D:1.2%],F:0.3%,M:0.9%,n:3640     |
    |3597   Complete BUSCOs (C)                       |
    |3553   Complete and single-copy BUSCOs (S)       |
    |44     Complete and duplicated BUSCOs (D)        |
    |10     Fragmented BUSCOs (F)                     |
    |33     Missing BUSCOs (M)                        |
    |3640   Total BUSCO groups searched               |
    --------------------------------------------------

phased.0.fasta

    --------------------------------------------------
    |Results from dataset actinopterygii_odb10        |
    --------------------------------------------------
    |C:94.4%[S:92.9%,D:1.5%],F:2.8%,M:2.8%,n:3640     |
    |3436   Complete BUSCOs (C)                       |
    |3383   Complete and single-copy BUSCOs (S)       |
    |53     Complete and duplicated BUSCOs (D)        |
    |101    Fragmented BUSCOs (F)                     |
    |103    Missing BUSCOs (M)                        |
    |3640   Total BUSCO groups searched               |
    --------------------------------------------------

phased.1.fasta

    --------------------------------------------------
    |Results from dataset actinopterygii_odb10        |
    --------------------------------------------------
    |C:94.4%[S:92.9%,D:1.5%],F:2.7%,M:2.9%,n:3640     |
    |3435   Complete BUSCOs (C)                       |
    |3381   Complete and single-copy BUSCOs (S)       |
    |54     Complete and duplicated BUSCOs (D)        |
    |100    Fragmented BUSCOs (F)                     |
    |105    Missing BUSCOs (M)                        |
    |3640   Total BUSCO groups searched               |
    --------------------------------------------------

At first I thought this was because some genes are genuinely broken on one haplotype or the other, but when I concatenated the two phased FASTA files, the BUSCO score did not improve.

Is there a way to figure out what caused FALCON-Phase to degrade the BUSCO scores? Is it because it breaks up contigs without scaffolding them back together?
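One way to narrow this down might be to compare the per-gene BUSCO tables from the two runs and list the genes that were complete in the input but not in the phased output. The sketch below is only an illustration, not part of FALCON-Phase. It assumes the tab-separated `full_table.tsv` layout that BUSCO writes (comment lines starting with `#`, then BUSCO id, status, and sequence name in the first three columns); the function names are made up for this example.

```python
import io


def read_busco_table(handle):
    """Map BUSCO id -> (status, sequence name) from a full_table.tsv-style file.

    Assumed layout: '#' comment lines, then tab-separated rows whose first
    three columns are BUSCO id, status, and sequence name. Rows for missing
    BUSCOs carry only the id and the status.
    """
    table = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        busco_id, status = fields[0], fields[1]
        sequence = fields[2] if len(fields) > 2 else None
        # Duplicated BUSCOs appear on several rows; keep the first one seen.
        table.setdefault(busco_id, (status, sequence))
    return table


def degraded_buscos(before, after):
    """List BUSCOs that were Complete/Duplicated before but are not after.

    Returns (busco_id, sequence name in the 'before' run, new status) tuples.
    """
    good = {"Complete", "Duplicated"}
    lost = []
    for busco_id, (old_status, old_seq) in sorted(before.items()):
        new_status = after.get(busco_id, ("Missing", None))[0]
        if old_status in good and new_status not in good:
            lost.append((busco_id, old_seq, new_status))
    return lost


if __name__ == "__main__":
    # Tiny in-memory demo with made-up BUSCO ids; real use would open the
    # two full_table.tsv files from the polished and phased BUSCO runs.
    before = read_busco_table(io.StringIO("g1\tComplete\t000000F\ng2\tComplete\t000001F\n"))
    after = read_busco_table(io.StringIO("g1\tFragmented\tx\ng2\tComplete\ty\n"))
    for busco_id, contig, status in degraded_buscos(before, after):
        print(busco_id, contig, status)
```

If the genes that are lost sit mostly at the minced segment boundaries of their input contigs, that would point to the breaking step rather than to real haplotype differences.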

Thanks!

melop commented 2 years ago

I think the problem is that the "pseudohap" option doesn't actually produce the expected concatenated pseudohaplotypes; it only shows the minced headers:

000000F::000000F_001:0-218343_0 000000F::000000F:218458-232413_0 000000F::000000F_002:0-84661_0 000000F::000000F:317041-368522_0 000000F::000000F:368522-415865_0 000000F::000000F:415865-430810_0 000000F::000000F_005:0-36059_0 000000F::000000F:467605-480889_0 000000F::000000F:480889-521113_0 000000F::000000F_007:0-38207_0 000000F::000000F:558969-573780_0 000000F::000000F:573780-588985_0 000000F::000000F:588985-620711_0 000000F::000000F_010:0-11055_0 000000F::000000F:631861-639454_0 000000F::000000F:639454-662763_0 000000F::000000F:662739-746233_0 000000F::000000F:746233-746786_0 000000F::000000F:746786-789812_0 000000F::000000F_015:0-76932_0 000000F::000000F_016:0-50601_0 000000F::000000F_017:0-13292_0 000000F::000000F_018:0-32842_0 000000F::000000F_019:0-102963_0 000000F::000000F:1063576-1127105_0 000000F::000000F:1127105-1129013_0 000000F::000000F_022:0-25403_0 000000F::000000F:1154172-1172642_0 000000F::000000F:1172642-1204265_0 000000F::000000F:1204265-1388710_0 000000F::000000F:1388710-1483357_0 000000F::000000F:1483357-1494887_0 000000F::000000F:1494887-1520229_0 000000F::000000F:1520229-1563713_0 000000F::000000F:1563713-1563716_0 000000F::000000F_028:0-147123_0 000000F::000000F:1710999-1720164_0 000000F::000000F_029:0-61134_0 000000F::000000F:1781421-1800081_0 000000F::000000F_030:0-40101_0 000000F::000000F:1840213-1852205_0 000000F::000000F:1852205-2104169_0 000000F::000000F:2104167-2142589_0 000000F::000000F:2142589-2285735_0 000000F::000000F:2285735-2287441_0 000000F::000000F_034:0-29057_0 000000F::000000F:2315758-2414671_0 000000F::000000F:2414671-2434977_0 000000F::000000F_037:0-45164_0

Is it possible to fix this issue?
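To inspect a header like the one above, a small parser can split it into its segments. This is a rough sketch based only on the pattern visible in that header, `PRIMARY::SOURCE:START-END_N`. What the trailing `_N` means, and whether the coordinates are half-open, are assumptions here, and the names `Segment` and `parse_header` are made up for illustration.

```python
import re
from typing import NamedTuple

# Pattern inferred from the header above: PRIMARY::SOURCE:START-END_N
TOKEN = re.compile(
    r"^(?P<primary>[^:\s]+)::(?P<source>[^:\s]+)"
    r":(?P<start>\d+)-(?P<end>\d+)_(?P<suffix>\d+)$"
)


class Segment(NamedTuple):
    primary: str   # primary contig the segment belongs to, e.g. 000000F
    source: str    # sequence the segment was cut from (primary or haplotig)
    start: int
    end: int
    suffix: str    # trailing _N; its meaning is an assumption (possibly the phase)

    @property
    def length(self):
        # Assumes half-open coordinates, so length is end - start.
        return self.end - self.start


def parse_header(header):
    """Split a whitespace-separated minced header into Segment records."""
    segments = []
    for token in header.lstrip(">").split():
        match = TOKEN.match(token)
        if match is None:
            raise ValueError(f"unrecognised segment: {token!r}")
        segments.append(
            Segment(
                match["primary"],
                match["source"],
                int(match["start"]),
                int(match["end"]),
                match["suffix"],
            )
        )
    return segments


if __name__ == "__main__":
    header = (
        "000000F::000000F_001:0-218343_0 "
        "000000F::000000F:218458-232413_0 "
        "000000F::000000F_002:0-84661_0"
    )
    for seg in parse_header(header):
        kind = "haplotig" if seg.source != seg.primary else "primary"
        print(seg.source, seg.start, seg.end, seg.length, kind)
```

Summing the segment lengths and comparing the total with the length of the sequence actually written under that header would show whether the pieces were concatenated or not.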