sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Unique genes issue #523

Open Sirbius opened 4 years ago

Sirbius commented 4 years ago

Hi there,

I'm running a pangenome analysis on 20 bacterial strains using Roary version 3.13 on a server running CentOS 7. I want to get the genes that are present in only one of those 20 strains but, as I've already asked here, I cannot get them.

When running: query_pan_genome -a difference --input_set_one 1.gff --input_set_two 2.gff 3.gff 4.gff .... -g clustered_proteins I get a CSV file with some clusters that are supposed to be unique to strain 1, but they are not! If I retrieve the sequence using the sqlite3 db suggested here and BLAST it, I find a perfect match in one of the other 19 strains (the reference one, by the way). Moreover, these genes are properly functionally annotated in the reference (e.g. short-chain dehydrogenase), while in the CSV they are listed as "hypothetical protein" (but that's probably a Prokka annotation failure). I also tried selecting the clusters found in only one strain from the clustered_proteins file, as suggested here, but I still get wrong ones. After reading this other issue, I tried the -s option, but I just got fewer "unique" clusters, and still wrong ones. What's the problem? Is Roary really supposed to behave like this or not?
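For cross-checking the `query_pan_genome` result, the single-strain clusters can also be pulled from Roary's `gene_presence_absence.csv`. The following is a minimal sketch, not part of Roary itself. It assumes the file has `Gene`, `Annotation` and `No. isolates` columns plus one column per input genome, and that each genome column is named after its GFF file; the function name `singleton_clusters` is hypothetical. Check the header of your own file before relying on it.

```python
import csv


def singleton_clusters(lines, strain):
    """Return (gene, annotation) pairs for clusters present only in `strain`.

    `lines` is any iterable of CSV lines, e.g. an open file handle.
    Column names are assumptions about gene_presence_absence.csv.
    """
    hits = []
    for row in csv.DictReader(lines):
        # Keep clusters reported in exactly one isolate ...
        in_one_isolate = int(row["No. isolates"]) == 1
        # ... where that isolate is the strain of interest (non-empty cell).
        in_this_strain = bool((row.get(strain) or "").strip())
        if in_one_isolate and in_this_strain:
            hits.append((row["Gene"], row["Annotation"]))
    return hits


# Example usage (file and column names are placeholders):
# with open("gene_presence_absence.csv", newline="") as fh:
#     for gene, annotation in singleton_clusters(fh, "1"):
#         print(gene, annotation)
```

A cluster listed here is only "unique" with respect to the clustering settings used for the run, so it is still worth BLASTing a few hits against the other genomes, as described above.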

Liao-PRIC commented 3 years ago

Hi Sirbius, I found the same issue in my case, and it is really misleading. Actually, not only Roary but also other pan-genome analysis programs, such as PGAP, which I tried before, have this issue. Most of the "unique" genes are not unique. Even the clustering of genes and the resulting core might be wrong. It's surprising that no one really pays attention to this. Do people just assume the results are correct and use them directly without any doubt? Have you found any solution? I hope someone will notice this issue.

Sirbius commented 3 years ago

Hi Liao, No, I haven't found a solution to that issue. In the meantime I discovered anvi'o for pangenome analysis, and I can say it finds more or less the same singletons as Roary, mostly hypothetical proteins. I really don't know what to say.

Liao-PRIC commented 3 years ago

Hi there,

Actually, I just discovered why the issue occurs. When I went back to read the paper describing the Roary algorithm, I found that the default amino acid identity for gene clusters is 95%! That is exactly the reason why the "unique" genes are not unique: they are unique only in the sense that they share less than 95% identity with the genes in the other strains. So now you know how to set your own threshold for clustering.

Hope it helps.

Regards,
Li Liao, Ph.D.
Associate Professor
SOA Key Laboratory for Polar Science
Polar Research Institute of China, Shanghai
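To try a different clustering threshold, Roary can be re-run with a lower identity cutoff. The sketch below only builds the command line. It assumes the `-i` option (minimum percentage identity for blastp, default 95) and the `-f` option (output directory) as listed in Roary's help text; confirm both with `roary -h` for your installed version. The helper name `build_roary_command` and the output directory name are hypothetical.

```python
import shlex


def build_roary_command(gff_files, identity=95, out_dir="roary_out"):
    """Build a Roary command line with a custom blastp identity cutoff.

    Assumes `-i` sets the minimum percentage identity and `-f` the
    output directory; verify against `roary -h`.
    """
    if not 0 < identity <= 100:
        raise ValueError("identity must be a percentage in (0, 100]")
    if not gff_files:
        raise ValueError("at least one GFF file is required")
    return ["roary", "-i", str(identity), "-f", out_dir, *gff_files]


# Example: cluster at 90% identity instead of the default 95%.
cmd = build_roary_command(["1.gff", "2.gff", "3.gff"], identity=90)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

With a lower cutoff, more divergent homologs should fall into the same cluster, so fewer genes would be expected to appear as strain-specific. It would not by itself explain a singleton that has a perfect BLAST match in another strain.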