simroux / Inovirus

Set of scripts and data used to detect putative inovirus sequences and/or taxonomically classify them.

MSG: asking for tag value that does not exist locus_tag #6

Open mujiezhang opened 2 years ago

mujiezhang commented 2 years ago

When I run the script "Identify_candidate_fragments_from_gbk.pl", I got an exception like this:

    ------------- EXCEPTION -------------
    MSG: asking for tag value that does not exist locus_tag
    STACK Bio::SeqFeature::Generic::get_tag_values /dssg/home/acct-clsjhh/clsjhh/anaconda3/envs/inovirus_detector/lib/perl5/site_perl/Bio/SeqFeature/Generic.pm:604
    STACK toplevel /dssg/home/acct-clsjhh/clsjhh/zmj/software/inovirus_detector/srouxjgi-inovirus-dfc3d5c3b1ac/Inovirus_detector/Identify_candidate_fragments_from_gbk.pl:120

So what is the problem? Thanks a lot.

simroux commented 2 years ago

Hi,

This looks like a format issue in the GenBank file you are trying to use as input, as a tag is missing ("locus_tag" should be in there). My recommendation would be to try running the pipeline from a fasta file of the same sequence instead.

If you are already doing this, then I am not sure what is happening, and I would need to see the output folder.

Best, Simon
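
To check whether a GenBank file is affected before running the pipeline, something like the following could list the features that lack the tag. This is a minimal sketch, not part of the Inovirus scripts: the function name `cds_missing_locus_tag` is hypothetical, and it assumes that the script reads "locus_tag" from CDS features (the thread does not confirm which feature type is involved) and that the input follows the standard GenBank flat-file layout.

```python
import re

# A feature line has 5 leading spaces, then the feature key, then the location.
# Qualifier lines (e.g. /locus_tag="...") are indented further, so they never match.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+(\S.*)$")


def cds_missing_locus_tag(genbank_text):
    """Return the locations of CDS features that carry no /locus_tag qualifier.

    Assumption: CDS is the feature type that matters for the pipeline.
    """
    missing = []
    in_features = False
    current = None  # [feature key, location, has_locus_tag]

    def flush():
        # Record the feature collected so far if it is an untagged CDS.
        if current and current[0] == "CDS" and not current[2]:
            missing.append(current[1])

    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line[0].isspace():
            # A new top-level section (ORIGIN, CONTIG, //) ends the feature table.
            flush()
            current = None
            in_features = False
            continue
        match = FEATURE_KEY.match(line)
        if match:
            flush()
            current = [match.group(1), match.group(2), False]
        elif current and line.strip().startswith("/locus_tag="):
            current[2] = True
    flush()
    return missing
```

Calling it on the text of the input file, for instance `cds_missing_locus_tag(open("genome.gbk").read())` with a hypothetical file name, returns an empty list when every CDS is tagged.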

mujiezhang commented 2 years ago

Thanks for your suggestion!

  1. Do you mean I can use the fasta file to generate a new gbk file and run the pipeline again?
  2. I have another question. Take Shewanella WP3 as an example. Its chromosome contains an inovirus, SW1, which has been isolated and sequenced; its genome is about 7.7 kb long, with an att site longer than 10 bp. inovirus_detector succeeded in finding the main genome of SW1, but the predicted length is 10.6 kb, with no att site. In addition, I tried three gbk files of WP3: one from GenBank, one from RefSeq, and one generated by Prokka. The predicted lengths are 10.6 kb, 10.8 kb, and 11.9 kb, respectively. Compared with the true length of 7.7 kb, the 11.9 kb prediction is off by about 4 kb, about 50% of the true genome, which may seriously affect the subsequent analysis. Do you have any suggestions?
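
The size of the discrepancy for each annotation source can be worked out from the lengths quoted above. The following is only an illustrative calculation; the helper name `overestimate` is hypothetical, and the figures are the approximate kb values given in this comment.

```python
TRUE_LENGTH_KB = 7.7  # sequenced SW1 genome, as quoted in the comment

# Predicted lengths (kb) for the three gbk files, as quoted in the comment.
predicted_kb = {"GenBank": 10.6, "RefSeq": 10.8, "Prokka": 11.9}


def overestimate(predicted, reference):
    """Return (extra length in kb, extra length as a percentage of the reference)."""
    extra = predicted - reference
    return extra, 100.0 * extra / reference


for source, length in predicted_kb.items():
    extra, pct = overestimate(length, TRUE_LENGTH_KB)
    print(f"{source}: +{extra:.1f} kb ({pct:.0f}% of the reference length)")
```
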
simroux commented 2 years ago
  1. Yes, you can use Identify_candidate_fragments_from_fna.pl instead of Identify_candidate_fragments_from_gbk.pl in the first step (see https://github.com/simroux/Inovirus/tree/master/Inovirus_detector#example-with-a-fasta-file-as-input)

  2. I am not sure I understand the question. This set of scripts is an automated inovirus detector, and it is expected that the exact boundaries will not always be found. In this case the tool indicates that it could not identify att sites, so any boundary should be interpreted as "possible" at best. If there are better or refined coordinates for this inovirus, these should be used. And as you mention, any analysis (and specifically any interpretation of results) should be done carefully, always keeping in mind that prophage boundaries were identified by an automatic tool and are thus likely to include some errors.

mujiezhang commented 2 years ago
  1. Thanks for your suggestion. The Identify_candidate_fragments_from_fna.pl script was not in https://bitbucket.org/srouxjgi/inovirus/src/master/Inovirus_detector/, so I did not notice it. Thanks.
  2. Maybe I can describe this question more clearly. It is reasonable that this set of scripts will not always find the exact boundaries. My question is: for the same bacterial genome, if I use the Identify_candidate_fragments_from_gbk.pl script to predict inoviruses from the gbk file from GenBank, I get prediction 1, and if I use the Identify_candidate_fragments_from_fna.pl script to predict inoviruses from the fna file from GenBank, I get prediction 2. But prediction 1 is always different from prediction 2. I guess this is due to differences between the protein prediction tools. Do you have any suggestions for getting a more reasonable result?
    Thank you very much for your time!
simroux commented 2 years ago
  1. Yes, this is something we added later on, but we kept the bitbucket repo exactly as it was when the manuscript was published.
  2. I am not sure what a "reasonable" result is here. I think you are correct: different tools produce different gene predictions, which lead to different predicted boundaries. When you have an experimentally validated prophage, then these boundaries should be used and not the predicted ones. Without experimental validation, there is often no obvious way to pick which prediction is the correct one. One possible option is to look at the read data to see if you can identify reads spanning the prophage insertion site (see e.g. https://doi.org/10.1186/s40168-021-01033-w), or to look for a similar bacterium without the prophage (https://doi.org/10.1093/nar/gkaa156 - see Fig. 2). It is possible, however, that none of this works, and that there is no way to tell for sure what the exact boundaries of this element are.
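
When several annotations of the same genome give different boundaries, one way to summarise them is to report the region that all predictions share together with the widest span any of them covers. This is a sketch only and is not something the Inovirus scripts do; the function name `core_and_span` and the coordinates in the usage note are made up for illustration, and the shared region is not evidence of the true boundaries.

```python
def core_and_span(intervals):
    """Summarise predicted (start, end) coordinates for one candidate prophage.

    Returns (core, span): core is the region common to every prediction, or
    None if the predictions do not all overlap; span is the widest extent.
    """
    starts = [start for start, _ in intervals]
    ends = [end for _, end in intervals]
    core = (max(starts), min(ends))
    if core[0] > core[1]:
        core = None  # no region is shared by all predictions
    span = (min(starts), max(ends))
    return core, span
```

For example, with the made-up coordinates `[(1000, 11600), (900, 11700), (400, 12300)]`, the function returns `((1000, 11600), (400, 12300))`; the gap between the two intervals shows how much the boundaries depend on the gene predictions used.
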
mujiezhang commented 2 years ago

I got it. Thank you very much for your kind help!