different results ppanggolin projection with gbff or fasta files

frdel1 commented 6 months ago

Hi, I am experiencing very different results when using ppanggolin projection with gbff or fasta files for the same genome.

Steps to reproduce:

# get a bunch of genomes to create the pangenome
datasets download genome accession GCF_001399095.1 GCF_002073795.1 GCF_021496845.1 GCF_000715295.1 GCF_000311145.1 GCF_000290175.1 GCF_000310785.1 GCF_001546785.1 GCF_006716245.1 GCF_023149175.1 GCF_001016545.1 GCF_008709865.1 GCF_002239305.1 GCF_011326915.1 GCF_001015555.1 GCF_001140565.1 GCF_011379325.1 GCF_011327225.1 GCF_900074445.1 GCF_000288475.1 GCF_009933575.1 GCF_004120405.1 GCF_000288175.1 GCF_000288895.1 GCF_008710995.1 GCF_003859515.1 GCF_009495905.1 GCF_011326825.1 GCF_000310685.1 GCF_000289875.1 --include gbff

# create the organism.gbff.list file
# create the pangenome with ppanggolin all
conda activate ppanggolin-2.0.4
ppanggolin all --anno /path/to/organism.gbff.list --cpu 1 --identity 0.8 --output /path/to/output_ppanggolin_all
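
The list-file step above is only described in a comment. A minimal sketch of one way to build it, assuming the `datasets` archive unpacks to `ncbi_dataset/data/<accession>/genomic.gbff` and that `--anno` expects a tab-separated file with a unique genome name in the first column and the annotation path in the second (the function name `make_gbff_list` is made up for this example):

```shell
# Hypothetical helper, not part of ppanggolin or datasets.
# Prints one "name<TAB>path" line per genomic.gbff found one level
# below the given data directory, using the directory name as genome name.
make_gbff_list() {
    data_dir=$1
    for gbff in "$data_dir"/*/genomic.gbff; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$gbff" ] || continue
        accession=$(basename "$(dirname "$gbff")")
        printf '%s\t%s\n' "$accession" "$gbff"
    done
}

# Example call (not run here):
#   make_gbff_list ncbi_dataset/data > organism.gbff.list
```

Using the accession directory as the genome name keeps the names unique without any extra bookkeeping.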

# get the query genome in both gbff and fasta format
datasets download genome accession GCF_000196055.1 --include gbff,genome
unzip ncbi_dataset.zip
# the genomic sequences look similar as far as I can tell

# run ppanggolin projection with gbff file
conda activate ppanggolin-2.0.4
ppanggolin projection --pangenome /path/to/pangenome.h5 --cpu 1 --anno /path/to/genomic.gbff --identity 0.8 --table --genome_name TEST_gbff --output output_ppanggolin_projection_gbff

# run ppanggolin projection with fasta file
conda activate ppanggolin-2.0.4
ppanggolin projection --pangenome /path/to/pangenome.h5 --cpu 1 --fasta /path/to/GCF_000196055.1_ASM19605v1_genomic.fna --identity 0.8 --table --genome_name TEST_fasta --output output_ppanggolin_projection_fasta

The two results are very different; all the CDSs from the gbff file are classified in the cloud partition:

# fasta:
wc -l output_ppanggolin_projection_fasta/TEST_fasta/TEST_fasta.tsv
# 2116
grep -c 'cloud' output_ppanggolin_projection_fasta/TEST_fasta/TEST_fasta.tsv
# 313
# gbff:
wc -l output_ppanggolin_projection_gbff/TEST_gbff/TEST_gbff.tsv
# 2076
grep -c 'cloud' output_ppanggolin_projection_gbff/TEST_gbff/TEST_gbff.tsv
# 2076
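
A per-partition tally shows the discrepancy more directly than a single count. The sketch below is only an illustration: `count_partitions` is a made-up helper, and like the `grep` calls above it matches the partition name anywhere on a line rather than in a specific column of the table.

```shell
# Hypothetical helper: count lines mentioning each partition name.
# Matches whole words anywhere on the line, not a specific column.
count_partitions() {
    table=$1
    for partition in persistent shell cloud; do
        # grep -c prints 0 when there is no match.
        printf '%s\t%s\n' "$partition" "$(grep -c -w "$partition" "$table")"
    done
}

# Example call (not run here):
#   count_partitions output_ppanggolin_projection_fasta/TEST_fasta/TEST_fasta.tsv
```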

This also shows in the ppanggolin projection console output:

# fasta:
# 17 RGPs have been predicted in the input genomes.
# gbff:
# 1 RGPs have been predicted in the input genomes.

The biological interpretation of the results with the fasta file makes sense, whereas the results with the gbff file are very surprising. Is this a bug or am I doing something wrong? Best wishes

JeanMainguy commented 6 months ago

Hello, this is surprising indeed. I've been able to replicate the issue.

It seems there's a problem with gene ID matching between ppanggolin's internal IDs and the original IDs from the GBFF file. Because of this mismatch, ppanggolin is unable to correctly associate gene families with input genes, so none of the genes is assigned to a pangenome family.
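
The failure mode described above can be pictured with a toy lookup. This is not PPanGGOLiN code; the function `assign_families`, the file layout, and the IDs in the example are all invented for illustration. A gene-to-family map keyed on one ID scheme is queried with IDs from another scheme, so every lookup misses:

```shell
# Toy illustration of an ID mismatch, not PPanGGOLiN's implementation.
# $1: TSV mapping gene ID -> family; $2: file with one gene ID per line.
# Prints "gene_id<TAB>family", with NA when the ID is not in the map.
assign_families() {
    awk -F '\t' '
        NR == FNR { family[$1] = $2; next }   # first file: load the map
        { print $1 "\t" (($1 in family) ? family[$1] : "NA") }
    ' "$1" "$2"
}
```

With a map keyed on IDs such as `internal_0`, querying with `internal_0` returns its family, while querying with a GBFF-style ID such as `LOCUS_0001` returns `NA` for every gene, which is consistent with no gene receiving a pangenome family.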

I'll work on fixing this promptly.

Thanks for reporting the bug! Best regards,

JeanMainguy commented 2 months ago

The fix for this issue has been released in v2.1.0.
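
To check whether an environment already has the fixed release, a version string can be compared against 2.1.0. The helper below is a small sketch, not something shipped with ppanggolin; `version_at_least` is a made-up name, and it relies on `sort -V`, which is available in GNU coreutils but may be missing elsewhere.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on version-aware sorting (sort -V, assumed to be available).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example (not run here), with the installed version taken from
# e.g. `conda list ppanggolin`:
#   version_at_least 2.0.4 2.1.0 || echo "upgrade needed for the gbff projection fix"
```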

labgem / PPanGGOLiN

different results ppanggolin projection with gbff or fasta files #207