Closed ericolo closed 2 months ago
On my machine, the example in the git repository works perfectly like this:
ppanggolin workflow --anno genomes.gbff.list -c 15 -o mock_test --clusters clusters.tsv --infer_singletons
So the problem really seems to come from my clustering file. I tried reformatting mine like the example (representative-tab-gene) and I still get the same error.
Thanks, Eric
Hi,
Thanks for raising this issue!
It seems like the problem might be due to how the genes are named in your clustering table, which doesn’t match the way PPanGGOLiN expects them.
PPanGGOLiN uses the gene ID from the CDS line of the GFF file. For example, with the following gene:
NZ_CALCQZ010000140.1 RefSeq gene 11218 11601 . - . ID=gene-QDP31_RS09520;Name=QDP31_RS09520
NZ_CALCQZ010000140.1 Protein Homology CDS 11218 11601 . - 0 ID=cds-WP_279012685.1;Parent=gene-QDP31_RS09520
The ID would be cds-WP_279012685.1.
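As an illustration of the rule above, here is a minimal Python sketch that pulls the ID attribute out of a CDS line. It is not PPanGGOLiN's own parser. The function name cds_id_from_gff_line is hypothetical, and the sketch assumes standard tab-separated GFF3 with nine columns.

```python
def cds_id_from_gff_line(line):
    """Return the ID attribute of a GFF3 CDS line, or None for any other line.

    Hypothetical helper for illustration; not part of PPanGGOLiN.
    """
    fields = line.rstrip("\n").split("\t")
    # A GFF3 feature line has exactly nine tab-separated columns;
    # the feature type is the third one.
    if len(fields) != 9 or fields[2] != "CDS":
        return None
    # The ninth column holds semicolon-separated key=value attributes.
    for attribute in fields[8].split(";"):
        key, _, value = attribute.partition("=")
        if key.strip() == "ID":
            return value.strip()
    return None


gene_line = (
    "NZ_CALCQZ010000140.1\tRefSeq\tgene\t11218\t11601\t.\t-\t.\t"
    "ID=gene-QDP31_RS09520;Name=QDP31_RS09520"
)
cds_line = (
    "NZ_CALCQZ010000140.1\tProtein Homology\tCDS\t11218\t11601\t.\t-\t0\t"
    "ID=cds-WP_279012685.1;Parent=gene-QDP31_RS09520"
)

print(cds_id_from_gff_line(gene_line))  # None: the gene line is ignored
print(cds_id_from_gff_line(cds_line))   # cds-WP_279012685.1
```

The gene names in an external clustering file would need to match the values this kind of lookup returns, not the ID of the parent gene line.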
At the end of the annotation step, PPanGGOLiN checks if all gene IDs are unique. If they aren’t, it uses internal IDs in the format <genome>_CDS_<id>, such as GCF_000173495.1_CDS_0759. In this case you get the following log: INFO: gene identifiers used in the provided annotation files were not unique, PPanGGOLiN will use self-generated identifiers.
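The behaviour described above can be sketched roughly as follows. This only illustrates the idea and is not PPanGGOLiN's actual code. The function name choose_gene_ids, the zero-based counter, and the four-digit zero padding are all assumptions, the padding being inferred from the GCF_000173495.1_CDS_0759 example.

```python
from collections import Counter


def choose_gene_ids(genes_by_genome):
    """Keep the annotation IDs if they are unique across all genomes,
    otherwise fall back to generated <genome>_CDS_<id> identifiers.

    genes_by_genome maps a genome name to its list of CDS IDs in file order.
    Returns (ids_by_genome, used_local_identifiers).
    Hypothetical sketch; counter start and padding width are assumptions.
    """
    counts = Counter(
        gene_id for ids in genes_by_genome.values() for gene_id in ids
    )
    if all(n == 1 for n in counts.values()):
        # Every ID appears once in the whole dataset: keep them as they are.
        return {g: list(ids) for g, ids in genes_by_genome.items()}, True
    # At least one ID is repeated: generate identifiers for every gene.
    generated = {
        genome: [f"{genome}_CDS_{i:04d}" for i in range(len(ids))]
        for genome, ids in genes_by_genome.items()
    }
    return generated, False


ids, used_local = choose_gene_ids({"genomeA": ["g1", "g2"], "genomeB": ["g1"]})
print(used_local)      # False, because "g1" occurs in both genomes
print(ids["genomeA"])  # ['genomeA_CDS_0000', 'genomeA_CDS_0001']
```

The point of the sketch is that a single repeated ID anywhere in the dataset switches every gene to a generated identifier, which is why an external clustering file keyed on the original IDs can stop matching.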
To check if PPanGGOLiN used the annotation file's IDs or generated its own, you can run this command:
ppanggolin info -p myannopang/pangenome.h5 --parameters
If you see # used_local_identifiers: False, it means PPanGGOLiN used internal IDs instead of those from the annotation file.
In your case, it looks like the genes in your clustering table follow the pattern <genomeID>:<contigID>_<id>, which PPanGGOLiN doesn’t recognize and can’t map back to the pangenome genes.
I understand that working with external clustering files can be tricky, especially when PPanGGOLiN uses its own internal IDs. A possible workaround is to run the clustering step with PPanGGOLiN and then generate the family_tsv file using the write_pangenome command.
This file will list the gene family ID, gene ID, and local ID (which corresponds to the ID in the GFF file). Essentially, the second and third columns will help you map the internal IDs to the CDS IDs from the annotation file.
To sum up, the commands would be:
ppanggolin annotate --anno list_gff.tsv -o ppanggolin_result
ppanggolin cluster -p ppanggolin_result/pangenome.h5
ppanggolin write_pangenome -p ppanggolin_result/pangenome.h5 --families_tsv -o ppanggolin_result -f
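Once the families file has been written, its second and third columns can be turned into a lookup table from internal IDs to the CDS IDs of the annotation files. The sketch below assumes the column order described above (family ID, gene ID, local ID) and a tab-separated file. The function name load_id_mapping and the example rows are made up for illustration.

```python
import csv


def load_id_mapping(lines):
    """Map PPanGGOLiN gene IDs (column 2) to local annotation IDs (column 3).

    `lines` is any iterable of tab-separated lines, e.g. an open file handle.
    Hypothetical helper; the column order is taken from the description above.
    """
    mapping = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip short rows and rows without a local ID.
        if len(row) < 3 or not row[2]:
            continue
        mapping[row[1]] = row[2]
    return mapping


# Made-up rows, only to show the expected shape of the file.
example_rows = [
    "family_1\tgenomeA_CDS_0001\tcds-example-1\n",
    "family_1\tgenomeB_CDS_0007\tcds-example-2\n",
    "family_2\tgenomeA_CDS_0002\tcds-example-3\n",
]
internal_to_local = load_id_mapping(example_rows)
print(internal_to_local["genomeB_CDS_0007"])  # cds-example-2
```

To use it on a real run, pass the open families file (for instance the gene_families.tsv written to ppanggolin_result) in place of example_rows.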
The error you got is quite misleading. We’ve already identified some issues with how external clustering files are processed, patched them, and improved error handling in PR #278, so the error messages should be clearer in the next release.
Thank you as well for pointing out the inconsistencies in the documentation; I'll fix them. (I've also noticed that the documentation for the family_tsv file is not up to date, so I’ll address that too.)
Hi,
Thanks for your quick reply!
So I renamed my proteins like this: <genomeID>:<contigID>_<id>, because in my dataset some contigs coming from different genomes have redundant names, and I edited the GFF files as well, after the ID= flag.
Maybe something is wrong with the format of the names? My IDs are recognized as unique by ppanggolin according to the log:
2024-09-03 14:11:02 annotate.py:l1084 INFO gene identifiers used in the provided annotation files were unique, PPanGGOLiN will use them.
I can try your workaround and I'll let you know if that works. As a last resort I can just run it without providing my clustering results, but I'm trying to save time as I have a huge dataset.
Thanks a lot!
I have found the problem. After using the workaround and generating new clusters from the GFF files, I compared ppanggolin_result/gene_families.tsv to my own clustering file, and there were indeed some proteins in my clustering file that were not in any of the GFF files.
So the problem was the way that I generated my GFFs which omitted some proteins, and not ppanggolin or the protein IDs.
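A quick way to catch this kind of mismatch before running the workflow is to check the clustering file directly against the CDS IDs in the GFF files. The sketch below is one possible way to do that, not something PPanGGOLiN provides. The function names are hypothetical, the example data is made up, and the gene column index is an assumption that should be adjusted to the layout of your clustering file.

```python
def cds_ids(gff_lines):
    """Collect the ID attribute of every CDS line from an iterable of GFF3 lines."""
    ids = set()
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        for attribute in fields[8].split(";"):
            key, _, value = attribute.partition("=")
            if key.strip() == "ID":
                ids.add(value.strip())
    return ids


def clustered_genes(cluster_lines, gene_column=1):
    """Collect gene names from a tab-separated clustering file.

    gene_column is an assumption (0-based); change it to match your file.
    """
    genes = set()
    for line in cluster_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > gene_column:
            genes.add(fields[gene_column])
    return genes


# Made-up data: the third clustered protein has no CDS line in the GFF.
gff = [
    "contig1\tsrc\tCDS\t1\t90\t.\t+\t0\tID=genomeA:contig1_1",
    "contig1\tsrc\tCDS\t100\t190\t.\t+\t0\tID=genomeA:contig1_2",
]
clusters = [
    "fam1\tgenomeA:contig1_1",
    "fam1\tgenomeA:contig1_2",
    "fam2\tgenomeA:contig1_3",
]
missing = sorted(clustered_genes(clusters) - cds_ids(gff))
print(missing)  # ['genomeA:contig1_3']
```

With real data, the GFF lines from all genomes would be fed to cds_ids (for example by chaining the open files), and any name left in missing points to a protein that was clustered but never annotated.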
Thanks, and sorry for this mistake. I can add another comment once I succeed with the new GFF files.
It ended up working with my new GFF files. Thanks for the workaround, which helped me debug!
Hello!
I'm trying to use my external clustering results with my dataset like this:
ppanggolin workflow --anno list_gff.tsv -c 15 -o real_test --clusters clusters.tsv --infer_singletons
But I get this error:
File "/clusterfs/jgi/groups/science/homes/eolondela/.micromamba/envs/ppanggo/lib/python3.12/site-packages/ppanggolin/geneFamily.py", line 198, in representative
    raise Exception("Representative gene has not been set")
I'm using GFF3 files as input and I chose to provide the three-column clustering file as described in the documentation, so the representative genes are indicated in clusters.tsv. I get the same error if I try the two-column file and let ppanggolin take the first gene of the cluster as the representative.
PS: In the documentation it is said that the representative should be the second column, but it is actually the last column, as I found out from the cluster.py script. When I strictly followed the documentation, the error stated that protein IDs were duplicated.
I can provide more files if needed. The clustering file is attached, and the same command works if I don't input my clustering.
clusters.tsv.zip
The last lines from the log file were:
Thanks in advance, Eric