labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io

Can't use external clustering: "Exception: Representative gene has not been set" #279

Closed ericolo closed 2 weeks ago

ericolo commented 3 weeks ago

Hello !

I'm trying to use my external clustering results with my dataset like this:

ppanggolin workflow --anno list_gff.tsv -c 15 -o real_test --clusters clusters.tsv --infer_singletons

But I get this error:

  File "/clusterfs/jgi/groups/science/homes/eolondela/.micromamba/envs/ppanggo/lib/python3.12/site-packages/ppanggolin/geneFamily.py", line 198, in representative
    raise Exception("Representative gene has not been set")

I'm using GFF3 files as an input and I chose to provide the three column clustering file as described in the documentation, so the representative genes are indicated in clusters.tsv. I get the same error if I try the two column file and let ppanggolin take the first gene of the cluster as the representative.

PS: In the documentation it is said that the representative should be the second column, but it is actually the last column as I found out from the cluster.py script. When I strictly followed the documentation, the error stated that protein IDs were duplicated.

I can provide more files if needed. The clustering file is attached, and the same command works if I don't input my clustering.

clusters.tsv.zip

The last lines from the log file were:

2024-09-03 14:11:02 annotate.py:l812 INFO       transl_table tag was not found for 1907 CDS in /clusterfs/jgi/scratch/science/mds/eolondela/GTDB_refs_annot/all_gff/GCF_019316745.1.gff. Provided translation_table argument value was used instead: 11.
2024-09-03 14:11:02 annotate.py:l1084 INFO      gene identifiers used in the provided annotation files were unique, PPanGGOLiN will use them.
2024-09-03 14:11:02 writeBinaries.py:l706 INFO  Writing genome annotations...
2024-09-03 14:11:03 writeBinaries.py:l717 INFO  writing the protein coding gene dna sequences in pangenome...
2024-09-03 14:11:05 writeBinaries.py:l762 INFO  Done writing the pangenome. It is in file : real_test/pangenome.h5
2024-09-03 14:11:05 cluster.py:l402 INFO        Reading test_clusters_bis.tsv the gene families file ...
2024-09-03 14:11:07 cluster.py:l357 INFO        Inferred 8 singleton families

Thanks in advance, Eric

ericolo commented 3 weeks ago

On my machine, the example in the git repository works perfectly like this:

ppanggolin workflow --anno genomes.gbff.list -c 15 -o mock_test --clusters clusters.tsv --infer_singletons

So the problem really does seem to come from my clustering file. I tried reformatting mine like the example (representative-tab-gene) and I still get the same error.

Thanks, Eric

JeanMainguy commented 2 weeks ago

Hi,

Thanks for raising this issue!

It seems like the problem might be due to how the genes are named in your clustering table, which doesn’t match the way PPanGGOLiN expects them.

PPanGGOLiN uses the gene ID from the CDS line of the GFF file. For example, with the following gene:

NZ_CALCQZ010000140.1 RefSeq gene 11218 11601 . - . ID=gene-QDP31_RS09520;Name=QDP31_RS09520
NZ_CALCQZ010000140.1 Protein Homology CDS 11218 11601 . - 0 ID=cds-WP_279012685.1;Parent=gene-QDP31_RS09520

The ID would be cds-WP_279012685.1.
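To list the IDs that a clustering table has to match, the CDS lines of a GFF3 file can be scanned directly. The following is a minimal sketch, not PPanGGOLiN's own parser: the function name `cds_ids_from_gff` is hypothetical, and it assumes tab-separated GFF3 lines with `key=value` attributes in the ninth column.

```python
def cds_ids_from_gff(lines):
    """Yield the ID attribute of every CDS feature found in GFF3 lines."""
    for line in lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # A feature line has 9 columns; column 3 holds the feature type.
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            yield attrs["ID"]


# The two lines from the example above, with tabs restored.
example = [
    "NZ_CALCQZ010000140.1\tRefSeq\tgene\t11218\t11601\t.\t-\t.\t"
    "ID=gene-QDP31_RS09520;Name=QDP31_RS09520",
    "NZ_CALCQZ010000140.1\tProtein Homology\tCDS\t11218\t11601\t.\t-\t0\t"
    "ID=cds-WP_279012685.1;Parent=gene-QDP31_RS09520",
]
print(list(cds_ids_from_gff(example)))  # ['cds-WP_279012685.1']
```

For a real file, pass an open file handle instead of the `example` list.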

At the end of the annotation step, PPanGGOLiN checks if all gene IDs are unique. If they aren’t, it uses internal IDs in the format <genome>_CDS_<id>, such as GCF_000173495.1_CDS_0759. In this case you get the following log: INFO: gene identifiers used in the provided annotation files were not unique, PPanGGOLiN will use self-generated identifiers.
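The same kind of uniqueness check can be run by hand before launching PPanGGOLiN. This is only an illustrative sketch: `duplicated_ids` is a hypothetical helper, the genome and gene names are made up, and it does not reproduce PPanGGOLiN's internal logic.

```python
from collections import Counter


def duplicated_ids(ids_by_genome):
    """Return the sorted gene IDs that occur more than once across all genomes."""
    counts = Counter(
        gene_id for ids in ids_by_genome.values() for gene_id in ids
    )
    return sorted(gene_id for gene_id, n in counts.items() if n > 1)


# Made-up example: two genomes that both contain a contig named "contig1".
ids_by_genome = {
    "genomeA": ["contig1_1", "contig1_2"],
    "genomeB": ["contig1_1", "contig2_1"],
}
print(duplicated_ids(ids_by_genome))  # ['contig1_1']
```

An empty result means the IDs are unique across the dataset.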

To check if PPanGGOLiN used the annotation file's IDs or generated its own, you can run this command:

ppanggolin info -p myannopang/pangenome.h5 --parameters

If you see # used_local_identifiers: False, it means PPanGGOLiN used internal IDs instead of those from the annotation file.

In your case, it looks like the genes in your clustering table follow the pattern <genomeID>:<contigID>_<id>, which PPanGGOLiN doesn’t recognize and can’t map back to the pangenome genes.

JeanMainguy commented 2 weeks ago

I understand that working with external clustering files can be tricky, especially when PPanGGOLiN uses its own internal IDs. A possible workaround is to run the clustering step with PPanGGOLiN and then generate the family_tsv file using the write_pangenome command.

This file will list the gene family ID, gene ID, and local ID (which corresponds to the ID in the GFF file). Essentially, the second and third columns will help you map the internal IDs to the CDS IDs from the annotation file.
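As a rough sketch of that mapping, the file can be read into a dictionary keyed by internal gene ID. This assumes the file is tab-separated with the three columns in the order described above (family ID, gene ID, local ID); the real file may contain further columns, which this code ignores. The function name `internal_to_local` and the family name in the sample row are hypothetical.

```python
import csv
import io


def internal_to_local(handle):
    """Map internal gene IDs (column 2) to local GFF IDs (column 3)."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip rows that do not have the three assumed columns.
        if len(row) < 3:
            continue
        gene_id, local_id = row[1], row[2]
        # Rows with an empty local ID are left out of the mapping.
        if local_id:
            mapping[gene_id] = local_id
    return mapping


# One sample row built from the IDs quoted earlier in this thread.
sample = io.StringIO("fam1\tGCF_000173495.1_CDS_0759\tcds-WP_279012685.1\n")
print(internal_to_local(sample))
# {'GCF_000173495.1_CDS_0759': 'cds-WP_279012685.1'}
```

With a real file, open it with `newline=""` and pass the handle to the function.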

To sum up, the commands would be:


ppanggolin annotate --anno list_gff.tsv -o ppanggolin_result
ppanggolin cluster -p ppanggolin_result/pangenome.h5

ppanggolin write_pangenome -p ppanggolin_result/pangenome.h5 --families_tsv -o ppanggolin_result -f

JeanMainguy commented 2 weeks ago

About the error you got: it is quite misleading. We've already identified some issues with how external clustering files are processed, and we have patched them and improved error handling in PR #278. So the error messages should be clearer in the next release.

Thank you as well for pointing out the inconsistencies in the documentation. I'll fix them. (I've also noticed that the documentation for the family_tsv file is not up to date, so I'll address that too.)

ericolo commented 2 weeks ago

Hi,

Thanks for your quick reply !

So I renamed my proteins like this, <genomeID>:<contigID>_<id>, because in my dataset some contigs coming from different genomes have redundant names. I also edited the GFF files to use the same names after the ID= flag.

Maybe something is wrong with the format of the names ? Because my IDs are recognized as unique by ppanggolin according to the log: 2024-09-03 14:11:02 annotate.py:l1084 INFO gene identifiers used in the provided annotation files were unique, PPanGGOLiN will use them.

I can try your workaround, and I'll let you know if that works. As a last resort I can just run it without providing my clustering results, but I'm trying to save time as I have a huge dataset.

Thanks a lot !

ericolo commented 2 weeks ago

I have found the problem. After using the workaround and generating new clusters from the GFF files, I compared ppanggolin_result/gene_families.tsv to my own clustering file, and there were indeed some proteins in my clustering file that were not in any of the GFF files.

So the problem was the way that I generated my GFFs which omitted some proteins, and not ppanggolin or the protein IDs.
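A comparison like this can be scripted as a set difference. The sketch below is only an illustration under stated assumptions: `genes_missing_from_annotations` is a hypothetical helper, the clustering table is assumed to be tab-separated, and `gene_col` (the 0-based column holding the gene ID) defaults to 1 as an assumption that should be adjusted to the actual file layout.

```python
def genes_missing_from_annotations(cluster_lines, annotated_ids, gene_col=1):
    """Return sorted gene IDs from the clustering table that are not annotated."""
    annotated = set(annotated_ids)
    missing = set()
    for line in cluster_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip lines too short to contain the gene column.
        if len(cols) <= gene_col:
            continue
        if cols[gene_col] not in annotated:
            missing.add(cols[gene_col])
    return sorted(missing)


# Made-up example: g2 is clustered but absent from the annotations.
clusters = ["fam1\tg1\n", "fam1\tg2\n", "fam2\tg3\n"]
print(genes_missing_from_annotations(clusters, {"g1", "g3"}))  # ['g2']
```

Any ID returned here appears in the clustering table but has no counterpart in the annotation files.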

Thanks, sorry for this mistake, I can add another comment whenever I succeed with new GFF files

ericolo commented 2 weeks ago

It ended up working with my new GFF files, thanks for the workaround that helped me debug !