linsalrob / PhiSpy

Prediction of prophages from bacterial genomes
MIT License

Output formatting and tempRepeatDNA #9

Closed JoeHeffron closed 4 years ago

JoeHeffron commented 5 years ago

Very helpful program! I could use a little help interpreting the output, though:

A) Could someone please provide a guide to the functional categories in the gene output (unfortunately named 'pp' in the "output_tbl.txt" file)? The phantome.org site seems to be out of commission, so I could not follow the link in the Akhter et al., 2012, paper.

B) Also, is there any use after evaluation for the numerous tempRepeatDNA.XXXX.pp.X.fasta files generated in the process? At first I assumed these were fasta files for putative prophages, but there is no connection between the regions listed in "output.tbl" and those in the file generation output, e.g.:

generated by PhiSpy.py:
...
Finding repeats in pp 4 from 597108 to 766988
Not checking repeats for pp 5 because it is too big: 242343
Finding repeats in pp 6 from 1896092 to 1944030
...
Finding repeats in pp 19 from 3506144 to 3527984
Finding repeats in pp 20 from 3625704 to 3715725
... etc.

vs.

from output.tbl:
MSMTP | pp | contig | start | end
RS05 | 0 | NZ_CP009505 | 1348725 | 1585237
RS14 | 1 | NZ_CP009505 | 3512461 | 3524330

Also, unless I'm misinterpreting the output, prophages seem to begin numbering at 0, not 1 as stated in the instructions.

Thank you!

deprekate commented 5 years ago

Joe,

We are glad you like the program, and sorry for the late reply. We have had other projects on the burner lately, but we are now focusing on overhauling PhiSpy.

A) Could someone please provide a guide to the functional categories in the gene output (unfortunately named 'pp' in the "output_tbl.txt" file)? The phantome.org site seems to be out of commission, so I could not follow the link in the Akhter et al., 2012, paper.

I will try to get a more detailed explanation of the various columns. The pp column in question is the score given to that gene, based on its function/name. A gene with the keyword 'integrase' gets a score of 1.5, a gene with a phage-related function ('capsid', 'tail', 'portal', etc.) gets a score of 1.0, a gene annotated as 'uncharacterized' or 'hypothetical' gets a score of 0.5, and all others get a score of 0.0.
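To make the scoring concrete, here is a minimal sketch of the rule described above. It is not PhiSpy's actual code: the function name `pp_score`, the case-insensitive substring matching, and the order in which the checks are applied are all assumptions, and the phage keyword tuple contains only the three examples named in this comment (the real list is longer).

```python
# Illustrative sketch only; names and matching strategy are assumptions,
# not taken from the PhiSpy source.

# Only the three examples given in the comment; the real list is longer ("etc.").
PHAGE_KEYWORDS = ("capsid", "tail", "portal")
UNKNOWN_KEYWORDS = ("uncharacterized", "hypothetical")


def pp_score(function_name: str) -> float:
    """Return the per-gene 'pp' score for an annotated gene function."""
    name = function_name.lower()
    # Assumed precedence: integrase first, then phage keywords, then unknowns.
    if "integrase" in name:
        return 1.5
    if any(keyword in name for keyword in PHAGE_KEYWORDS):
        return 1.0
    if any(keyword in name for keyword in UNKNOWN_KEYWORDS):
        return 0.5
    return 0.0


if __name__ == "__main__":
    for function in ("Phage integrase", "portal protein",
                     "hypothetical protein", "DNA gyrase subunit A"):
        print(function, pp_score(function))
```

Read this way, a 'pp' value of 1.5 or 1.0 in the gene table marks a gene whose annotation looks phage-like, 0.5 marks a gene of unknown function, and 0.0 marks everything else.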

B) Also, is there any use after evaluation for the numerous tempRepeatDNA.XXXX.pp.X.fasta files generated in the process? At first I assumed these were fasta files for putative prophages, but there is no connection between the regions listed in "output.tbl" and those in the file generation output, e.g.:

These files were supposed to be deleted by PhiSpy, but somewhere in the edits the delete command got removed. I just migrated the repeatfinder binary into a PyPI module, so in the next version (3.5) these files will not be created at all :)

You were correct in your guess that they were putative prophages. PhiSpy uses them to look for terminal repeats, so as to demarcate the prophage region. They might not have matched the regions in the output file because we also take an extra 2,000 bases outside the putative prophage region to search. Also, if a putative prophage did not pass the threshold, it is excluded from the output file.
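As a rough illustration of how the logged search windows relate to the reported regions, the sketch below pads a region by a flank and checks which logged window contains each reported region. The coordinates are the ones quoted in this issue. The helper names (`Region`, `padded_window`, `match_windows`) are hypothetical, and applying the 2,000-base flank to both sides of the region is an assumption; none of this is PhiSpy's own code.

```python
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    label: str
    start: int
    end: int


def padded_window(region: Region, flank: int = 2000) -> Region:
    """Extend a region by `flank` bases on each side (assumed: both sides)."""
    return Region(region.label, max(1, region.start - flank), region.end + flank)


def contains(window: Region, region: Region) -> bool:
    """True if `region` lies entirely inside `window`."""
    return window.start <= region.start and region.end <= window.end


def match_windows(reported: List[Region], windows: List[Region]) -> Dict[str, List[str]]:
    """Map each reported region to the logged windows that contain it."""
    return {
        r.label: [w.label for w in windows if contains(w, r)]
        for r in reported
    }


# Search windows from the PhiSpy.py log quoted in this issue (elided lines omitted).
log_windows = [
    Region("pp 4", 597108, 766988),
    Region("pp 6", 1896092, 1944030),
    Region("pp 19", 3506144, 3527984),
    Region("pp 20", 3625704, 3715725),
]

# Rows from output.tbl quoted in this issue, labelled by their first column.
reported_regions = [
    Region("RS05", 1348725, 1585237),
    Region("RS14", 3512461, 3524330),
]

if __name__ == "__main__":
    print(match_windows(reported_regions, log_windows))
    print(padded_window(reported_regions[1]))
```

With the numbers quoted here, RS14 falls inside the window logged for pp 19, while RS05 does not fall inside any of the windows shown in the truncated log. The pp numbers in the log therefore do not have to line up with the pp numbers in the output table, since candidates that fail the threshold are dropped.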

Also, unless I'm misinterpreting the output, prophages seem to begin numbering at 0, not 1 as stated in the instructions.

That file might have been numbered wrong. I don't have access to the code that created that particular file (output.tbl). I know some of the output files in the early versions were error-prone, which is why the code that created them was removed (which in turn caused issues, with people asking why the files were missing).

linsalrob commented 4 years ago

This has been implemented now, so these files are no longer created.