less genes clustered than genes in the pangenome

LemoAlex commented 1 year ago

Hello ppanggolin users,

I am trying to run a panrgp analysis on 40 bacterial genomes. However, after running the command:

ppanggolin panrgp --fasta fasta.input -o output_panrgp --cpu 12

everything goes well at first: [...]

cluster.py:l212 INFO Clustering all of the genes sequences...
2022-11-11 16:46:19 cluster.py:l48 INFO Creating sequence database...
2022-11-11 16:46:20 cluster.py:l58 INFO Clustering sequences...
2022-11-11 16:46:29 cluster.py:l60 INFO Extracting cluster representatives...
2022-11-11 16:46:30 cluster.py:l72 INFO Writing gene to family informations
2022-11-11 16:46:30 cluster.py:l220 INFO Associating fragments to their original gene family...
2022-11-11 16:46:30 cluster.py:l33 INFO Aligning cluster representatives...
2022-11-11 16:46:42 cluster.py:l38 INFO Extracting alignments...
2022-11-11 16:46:43 cluster.py:l104 INFO Starting with 12061 families
2022-11-11 16:46:43 cluster.py:l135 INFO Ending with 9651 gene families
2022-11-11 16:46:43 cluster.py:l163 INFO Adding protein sequences to the gene families
2022-11-11 16:46:43 cluster.py:l140 INFO Adding 132647 genes to the gene families
Traceback (most recent call last):

and then I get an error:

Traceback (most recent call last):
  File "/Users/opt/anaconda3/envs/Ppangolin-env/bin/ppanggolin", line 8, in <module>
    sys.exit(main())
  File "/Users/opt/anaconda3/envs/Ppangolin-env/lib/python3.8/site-packages/ppanggolin/main.py", line 247, in main
    ppanggolin.workflow.panRGP.launch(args)
  File "/Users/opt/anaconda3/envs/Ppangolin-env/lib/python3.8/site-packages/ppanggolin/workflow/panRGP.py", line 61, in launch
    clustering(pangenome, args.tmpdir, args.cpu, defrag=not args.no_defrag, disable_bar=args.disable_prog_bar)
  File "/Users/opt/anaconda3/envs/Ppangolin-env/lib/python3.8/site-packages/ppanggolin/cluster/cluster.py", line 226, in clustering
    read_gene2fam(pangenome, genes2fam, disable_bar=disable_bar)
  File "/Users/opt/anaconda3/envs/Ppangolin-env/lib/python3.8/site-packages/ppanggolin/cluster/cluster.py", line 145, in read_gene2fam
    raise Exception("Something unexpected happened during clustering "
Exception: Something unexpected happened during clustering (have less genes clustered than genes in the pangenome). A probable reason is that two genes in two different organisms have the same IDs; If you are sure that all of your genes have non identical IDs, please post an issue at https://github.com/labgem/PPanGGOLiN/

I am not providing any annotation files, just the fasta sequences, so this error is a bit surprising to me. What could be the reason for this?
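To illustrate the kind of mismatch the error message describes, here is a toy sketch of how two genes sharing one ID can leave fewer clustered genes than genes overall. This is not PPanGGOLiN's code; the genome names, gene IDs and the `gene_to_family` mapping are all made up for the example.

```python
# Toy illustration only: NOT PPanGGOLiN's actual implementation.
# Genome names and gene IDs below are hypothetical.
genes = [
    ("GenomeA", "gene_0001"),
    ("GenomeB", "gene_0002"),
    ("GenomeC", "gene_0001"),  # same ID as the GenomeA gene
]

# A mapping keyed only by gene ID silently collapses the two genes
# that share "gene_0001" into a single entry.
gene_to_family = {}
for genome, gene_id in genes:
    gene_to_family[gene_id] = f"family_of_{gene_id}"

n_genes = len(genes)
n_clustered = len(gene_to_family)
print(n_genes, n_clustered)
```

Any step that looks genes up by ID alone would then see fewer distinct genes than were loaded, which is the kind of inconsistency the exception reports.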

Thanks a lot for your help & time,

Best, Alexandre

axbazin commented 1 year ago

Hello, indeed it seems quite surprising. Which version are you using? I will try to replicate the error with that version.

Adelme

LemoAlex commented 1 year ago

Hello,

thanks for the prompt answer.

I am using ppanggolin 1.2.74, installed through conda in a virtual environment. I am running it on a MacBook with the M1 chip, in case it is relevant (it has been an issue with some other programs).

Thanks,

Alexandre

LemoAlex commented 1 year ago

Hello,

It turned out I had a duplicate chromosome name in the file passed to --fasta (and in one of the .fasta files). After renaming, the issue disappeared.

Sorry for the trouble!

Best,

Alexandre

axbazin commented 1 year ago

Hi,

What do you mean by "duplicate chromosome name in .fasta"? Was a genome listed twice with the same fasta file? Or was there a contig with the same identifier in 2 different fasta files?

I feel like it's something that ppanggolin should be able to detect, since there is some amount of "input verification" at the beginning. Depending on which case you meant, I'll see if I can add some warnings in the code for when that happens.
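As a rough idea of what such a check could look like on the user side, here is a minimal sketch that scans a genome list for repeated entries. It assumes each line holds a genome name, a FASTA path, and then optional contig identifiers, separated by whitespace; that layout, the function name `check_genome_list` and the sample lines are assumptions for illustration, not ppanggolin's own verification code.

```python
from collections import Counter


def check_genome_list(lines):
    """Return warning strings for a genome list given as text lines.

    Assumed layout per line (whitespace-separated):
    genome name, FASTA path, then optional contig identifiers.
    """
    names, paths = Counter(), Counter()
    warnings = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        name, path, contigs = fields[0], fields[1], fields[2:]
        names[name] += 1
        paths[path] += 1
        # Flag a contig identifier repeated on the same line.
        for contig, count in Counter(contigs).items():
            if count > 1:
                warnings.append(
                    f"line {lineno}: contig '{contig}' listed {count} times for '{name}'"
                )
    # Flag genome names or FASTA paths that appear on several lines.
    warnings += [f"genome name '{n}' appears {c} times" for n, c in names.items() if c > 1]
    warnings += [f"FASTA path '{p}' appears {c} times" for p, c in paths.items() if c > 1]
    return warnings


# Hypothetical sample input
sample = [
    "Genome1 a.fasta ChromA ChromB",
    "Genome2 b.fasta ChromC",
    "Genome2 b.fasta ChromE ChromE",
]
found = check_genome_list(sample)
for message in found:
    print(message)
```

Splitting on whitespace is a simplification: paths containing spaces would need a stricter, tab-based split.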

Adelme

LemoAlex commented 1 year ago

Hi,

Sorry, my explanation was indeed not very precise.

In my input --fasta file:

Genome1 path/to/file.fasta ChromA ChromB
Genome2 path/to/file.fasta ChromC ChromD
Genome3 path/to/file.fasta ChromE ChromE

So, here the Genome3 chromosomes are duplicated. And in the "real" fasta file of "Genome3", the input was actually:

>ChromE
ACGT....
>ChromE
ACGT....

So the chromosomes were duplicated in my original input.

I hope this is a bit clearer now!
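For anyone wanting to check their own files before running the analysis, here is a small sketch that looks for sequence identifiers occurring more than once, within a file or across files. It assumes standard FASTA headers (a line starting with ">" whose first word is the identifier); the helper names and the example data are hypothetical.

```python
from collections import defaultdict


def fasta_ids(lines):
    """Yield the sequence ID (first word after '>') of each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            yield header[0] if header else ""


def find_duplicate_ids(fasta_by_genome):
    """Map each sequence ID seen more than once to the genomes it occurs in.

    `fasta_by_genome` maps a genome name to an iterable of FASTA text lines,
    for example an open file handle.
    """
    seen = defaultdict(list)
    for genome, lines in fasta_by_genome.items():
        for seq_id in fasta_ids(lines):
            seen[seq_id].append(genome)
    return {seq_id: genomes for seq_id, genomes in seen.items() if len(genomes) > 1}


# Hypothetical in-memory example mirroring the case described above
example = {
    "Genome1": [">ChromA", "ACGT", ">ChromB", "ACGT"],
    "Genome3": [">ChromE", "ACGT", ">ChromE", "ACGT"],
}
duplicates = find_duplicate_ids(example)
print(duplicates)
```

With real data, each value can be an open file, e.g. `{"Genome3": open("path/to/file.fasta")}`, since iterating over a file yields its lines.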

Best,

Alexandre

axbazin commented 1 year ago

Hi,

Alright, thank you for the detailed explanation! I see what might have happened. I'll look into adding some warnings for those cases when we prepare a new release.

Adelme

axbazin commented 1 year ago

Closing as this is likely an edge case and should not happen in general.

If other people run into this problem, please don't hesitate to comment and we might consider working on adding some warnings :)

labgem / PPanGGOLiN

less genes clustered than genes in the pangenome #100