sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

-b blastp executable [blastp] flag in roary #392

Open TimothyNA opened 6 years ago

TimothyNA commented 6 years ago

I am running a Roary pan-genome analysis on hp genomes and am getting several annotations for the same gene within individual genomes. For instance, vacA, which typically has a single copy in all genomes, now appears as several vacA genes of varying reference query coverage in each sample genome after running blastn. We think this may be partly due to the BLAST coverage of the reference proteins and to the percentage identity threshold used to call genes (the Roary "-i" option). I would be grateful for information on using the blast executable flag "-b" on the Roary command line so that we can include a BLAST query coverage cutoff of, e.g., 50% in the pan-genome analysis. Many thanks.

tseemann commented 6 years ago

Any chance of true paralogs? Are you sure the multiple vacA genes aren't just broken/frame-shifted parts of a single gene? Are they all full length in the annotation? Were they PacBio or Nanopore assemblies?

You could try

roary -b "blastp -some_option 42 -more_options 3" ...

TimothyNA commented 6 years ago

Thank you for your reply. vacA should be a single gene in all hp genomes. However, we noticed that some hp vacA annotations varied in locus length relative to the reference 26695; with blastn they matched different sections of the reference vacA. We used Illumina HiSeq assemblies. What do 42 and 3 mean in the command provided? Are they the blastp coverage?

tseemann commented 6 years ago

Those options are just made up; they were placeholders for you to figure out different blast settings which may achieve what you want.

But I now see your vacA gene was broken into 3 pieces in the assembly.

This can happen when there is extreme GC content within the locus, which Illumina can't sequence well, or when you have a mixed population with multiple vacA alleles.
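
To see whether a set of annotations really are fragments of one gene, it can help to compute the query coverage of each hit directly. The following is a minimal Python sketch, not part of Roary: the function names, the tuple layout, and the example coordinates are all hypothetical, and it assumes you have already extracted the query start, query end, and query length for each hit from your BLAST output. BLAST+ also provides a `-qcov_hsp_perc` option for blastp; whether passing it through `-b` has the intended effect in Roary is not confirmed in this thread.

```python
def query_coverage(qstart: int, qend: int, qlen: int) -> float:
    """Percentage of the query covered by a single alignment (HSP)."""
    if qlen <= 0:
        raise ValueError("qlen must be positive")
    lo, hi = sorted((qstart, qend))  # tolerate reversed coordinates
    return 100.0 * (hi - lo + 1) / qlen


def passes_coverage(hits, min_cov=50.0):
    """Keep hits whose query coverage is at least min_cov percent.

    Each hit is a hypothetical tuple: (subject_id, qstart, qend, qlen).
    """
    return [h for h in hits if query_coverage(h[1], h[2], h[3]) >= min_cov]


# Illustrative numbers only: two partial hits and one full-length hit
# against a query of length 1290.
hits = [
    ("frag_1", 1, 400, 1290),
    ("frag_2", 401, 900, 1290),
    ("full", 1, 1290, 1290),
]
kept = passes_coverage(hits, min_cov=50.0)
print(kept)
```

With a 50% cutoff, only the full-length hit is kept; the two partial hits each cover less than half of the query, which is the pattern you would expect from a gene split across contigs.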

liyangjie commented 6 years ago

I have a similar problem. I would also like to know how to control BLAST query coverage, because in my results some genes that are in the same group vary greatly in length. Also, some sequences are shorter than 120 nucleotides, yet your "Roary: Supplementary Material" says "Sequences where more than 5% of nucleotides are unknown, or that are less than 120 nucleotides, are excluded from further analysis". Have I misunderstood this? Many thanks!
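
The exclusion rule quoted above can be checked against your own sequences with a short script. This is a rough Python sketch of the rule as worded in the quote, not Roary's actual implementation: the function name `is_excluded` is hypothetical, and it assumes that "unknown" means any character other than A, C, G, or T.

```python
def is_excluded(seq: str, min_len: int = 120, max_unknown_frac: float = 0.05) -> bool:
    """Return True if a nucleotide sequence would be dropped under the quoted rule.

    Assumption: any character other than A, C, G, T counts as unknown.
    """
    seq = seq.upper()
    if len(seq) < min_len:
        return True  # shorter than 120 nucleotides (also covers empty input)
    unknown = sum(1 for base in seq if base not in "ACGT")
    return unknown / len(seq) > max_unknown_frac


# Illustrative sequences only.
short_seq = "ACGT" * 25           # 100 nt: too short
ok_seq = "ACGT" * 30              # 120 nt, no unknown bases
noisy_seq = "N" * 7 + "ACGT" * 30 # 127 nt, 7 unknown bases (about 5.5%)

for name, s in [("short", short_seq), ("ok", ok_seq), ("noisy", noisy_seq)]:
    print(name, len(s), is_excluded(s))
```

Running this over the nucleotide sequences of the short genes in a cluster would show whether they should have been filtered under this reading of the rule, or whether the threshold is applied at a different stage than expected.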