sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Is there any way to get unique genome from Roary output #337

Open Norman-Kuo opened 7 years ago

Norman-Kuo commented 7 years ago

Hi

I have been able to successfully get the pan-genome, core genome and accessory genome from our isolates. I am also interested in getting the unique genes from each isolate, and I am wondering if there is any way to get that information.

My original idea was to subtract the core genome (the intersection) and the accessory genome from the whole genome (pan-genome) to get the unique genome, but I got 0 because the two add up exactly to the pan-genome. I am wondering whether this is because the accessory genome file also contains the genes unique to individual isolates, and whether there is any way to extract them.

Thank you

andrewjpage commented 7 years ago

The unique genes are also included in the accessory genome. You should be aware, however, that the unique genes can be quite noisy, containing assembly errors and bits of random contamination, so treat them with caution. You can extract genes using the query_pan_genome script.

Norman-Kuo commented 7 years ago

Thanks for the response. Sorry, but I am not sure how to extract unique genes from the accessory genome. Which option of query_pan_genome do I use to extract unique genes? Is it (-a difference)?

Thank you

FilipeMatteoli commented 5 years ago

This is old, but I will add an answer that might help someone. First, I would like to add that the query_pan_genome script also failed for me when I tried to retrieve unique IDs or FASTA sequences.

If you look into the clustered_proteins file, you'll notice that clusters are listed with their respective gene IDs, ordered from the largest clusters to the smallest. That means that from some point until the end of the file, only clusters with a single gene ID are listed. Once you find that point, cut those lines into a new file, then grep for the desired ID, and you will end up with a list of unique IDs. Later you can parse your GenBank file with a BioPerl script to retrieve the sequences.
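
The manual cut-and-grep steps could also be scripted. Below is a minimal Python sketch, not part of Roary itself. It assumes each line of clustered_proteins looks like `cluster_name: id1<TAB>id2<TAB>...`, and that gene IDs start with a per-isolate prefix (the `isoA_`-style IDs and the function names are made up for illustration). Rather than relying on the sort order, it simply keeps every cluster that contains exactly one gene ID.

```python
from typing import Dict, Iterable, List, Optional


def parse_clustered_proteins(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cluster name to its list of gene IDs.

    Assumed line format: 'cluster_name: id1<TAB>id2<TAB>...'.
    Lines without the ': ' separator are skipped.
    """
    clusters: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        name, sep, members = line.partition(": ")
        if not sep:
            continue
        clusters[name] = members.split()
    return clusters


def singleton_ids(clusters: Dict[str, List[str]],
                  prefix: Optional[str] = None) -> List[str]:
    """Return gene IDs from clusters holding exactly one gene.

    If prefix is given, keep only IDs starting with it
    (the scripted equivalent of grepping for one isolate).
    """
    ids = [genes[0] for genes in clusters.values() if len(genes) == 1]
    if prefix is not None:
        ids = [gene_id for gene_id in ids if gene_id.startswith(prefix)]
    return ids


# Small in-memory example with hypothetical IDs.
example = [
    "group_1: isoA_00001\tisoB_00001\tisoC_00001",
    "group_2: isoA_00002\tisoB_00002",
    "group_3: isoA_00003",
    "group_4: isoC_00002",
]
clusters = parse_clustered_proteins(example)
print(singleton_ids(clusters))           # ['isoA_00003', 'isoC_00002']
print(singleton_ids(clusters, "isoA_"))  # ['isoA_00003']
```

For a real run you would pass an open file handle instead of the example list, e.g. `parse_clustered_proteins(open("clustered_proteins"))`. Note that this only catches single-gene clusters; a cluster holding several paralogs from the same isolate would be missed, and the caution above about noisy unique genes still applies.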

Best,