linsalrob / PhiSpy

Prediction of prophages from bacterial genomes
MIT License

Unable to finish + excessive RAM usage #45

Closed dcarrillouu closed 3 years ago

dcarrillouu commented 3 years ago

Hi,

I am running PhiSpy 4.2.6 on a set of contigs, one contig per PhiSpy run, and for some of them it is unable to finish the analysis. After identifying potential prophages, RAM usage escalates until the process gets killed. I have tried on several machines with up to 125 GB of RAM, with the same result.

Here is one of those problematic contigs, downloaded from NCBI and provided to PhiSpy directly. I have attached the phispy.log of the run. Installation was via conda, as described in the documentation. A test run with Streptococcus_pyogenes_M1_GAS ran smoothly; here is the log file: phispy.log

Any idea what could be causing this behaviour?

Thank you.

pdec commented 3 years ago

Hi,

Thank you for using PhiSpy and reporting this issue. It seems to be a problem in the step that finds repeats surrounding the indicated prophage region. I was not able to reproduce it, as I get different results...

Have you tried running PhiSpy on that contig again?

Best, Przemek

dcarrillouu commented 3 years ago

Hi Przemek,

Yes, I tried several times on the same machine and on different machines, with the same outcome. By "different results", do you mean that you were able to analyze the contig I provided?

Thanks!

pdec commented 3 years ago

Hey,

Yes, each time I ran the analysis of your contig it was successful. I tried two different machines, with two different installations (conda or setup.py). I ran PhiSpy dozens of times and it always finished without error.

Best, Przemek

dcarrillouu commented 3 years ago

Mmmm, interesting. I will try again with a fresh installation and will let you know soon.

Thank you!

dcarrillouu commented 3 years ago

Can you copy here the exact command you used?

beardymcjohnface commented 3 years ago

Does increasing --min_repeat_len help at all?

pdec commented 3 years ago

> Can you copy here the exact command you used?

Sure, the only thing I changed from the default settings was the number of threads:

PhiSpy.py -o PhiSpy_issue_45/ --threads 10 NZ_JAAIMV010000005.1.gb

I ran it on fresh PhiSpy installations.

beardymcjohnface's suggestion is a good one. Could you try different values of --min_repeat_len?

linsalrob commented 3 years ago

The issue is that this record, and others like it, are WGS Scaffold records, not complete GenBank records.

If you look at the bottom of the record, instead of the DNA sequence, it has this line:

CONTIG join(JAAIMV010000005.1:1..190533)

PhiSpy uses the DNA sequence for some of the analysis, and its absence breaks the parsing (we rely on Biopython to parse the record).
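For anyone who wants to screen input files before a run, here is a minimal sketch of a pre-flight check. It is not PhiSpy's own code: it scans a GenBank flat file with the Python standard library only and reports records that end without any sequence lines under ORIGIN, which is what a CONTIG-only scaffold record looks like. The function name records_without_sequence and the file name in the usage comment are made up for illustration.

```python
def records_without_sequence(handle):
    """Return the LOCUS names of GenBank records that carry no sequence.

    A record starts at a LOCUS line and ends at a line starting with "//".
    The sequence, when present, is listed after an ORIGIN line.
    """
    missing = []
    locus = None
    in_origin = False
    has_sequence = False
    for line in handle:
        if line.startswith("LOCUS"):
            parts = line.split()
            locus = parts[1] if len(parts) > 1 else "unknown"
            in_origin = False
            has_sequence = False
        elif line.startswith("//"):
            # End of record: remember it if no sequence line was seen.
            if locus is not None and not has_sequence:
                missing.append(locus)
            locus = None
            in_origin = False
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif in_origin and line.strip():
            has_sequence = True
    return missing


# Hypothetical usage (the file name is a placeholder):
# with open("my_records.gbk") as handle:
#     for name in records_without_sequence(handle):
#         print(f"WARNING: {name} has no DNA sequence")
```

A wrapper script could print such a warning and exit when the returned list is not empty, instead of starting the analysis.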

If you head back to the top of the record, it has this accession information:

ACCESSION NZ_JAAIMV010000005 NZ_JAAIMV010000000

If you click on that link it will take you to the master record for the WGS, which has two options:

WGS JAAIMV010000001-JAAIMV010000105
WGS_SCAFLD NZ_JAAIMV010000001-NZ_JAAIMV010000105

You want the WGS record, not the WGS Scaffold record. I created a simple Perl script that uses cURL to get the records. You can run it as:

perl get_wgs_eutils.pl JAAIMV010000001-JAAIMV010000105 30 JAAIMV010000000_WGS.gbk

It will write a file called JAAIMV010000000_WGS.gbk that contains all of the complete records, including DNA sequences, that you can run PhiSpy on.
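If you would rather fetch the records yourself, the first step is turning the range shown on the WGS line into individual accessions. A minimal Python sketch of that step follows. It is not a port of get_wgs_eutils.pl; the function name expand_accession_range is hypothetical, and it assumes both ends of the range share the same letter prefix and the same number of digits. Downloading each accession (for example through NCBI E-utilities) is left out.

```python
import re


def expand_accession_range(accession_range):
    """Expand 'PREFIX0001-PREFIX0105' into a list of single accessions."""
    start, end = accession_range.split("-")
    pattern = r"([A-Za-z_]+)(\d+)"
    m_start = re.fullmatch(pattern, start)
    m_end = re.fullmatch(pattern, end)
    if m_start is None or m_end is None:
        raise ValueError(f"unrecognised accession range: {accession_range}")
    if m_start.group(1) != m_end.group(1):
        raise ValueError("start and end accessions have different prefixes")

    prefix = m_start.group(1)
    width = len(m_start.group(2))
    if len(m_end.group(2)) != width:
        raise ValueError("start and end accessions have different lengths")

    first = int(m_start.group(2))
    last = int(m_end.group(2))
    if last < first:
        raise ValueError("end of range is before its start")

    # Zero-pad so every accession keeps the original digit width.
    return [f"{prefix}{number:0{width}d}" for number in range(first, last + 1)]


if __name__ == "__main__":
    accessions = expand_accession_range("JAAIMV010000001-JAAIMV010000105")
    print(len(accessions), accessions[0], accessions[-1])
```

Each accession in the list can then be requested individually or in batches, and the responses concatenated into a single GenBank file.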

There are probably ~20 contigs that look like prophages.

I have left this issue open, because we need to correctly handle WGS records by throwing a warning and stopping rather than crashing!

dcarrillouu commented 3 years ago

Thank you, Rob! Everything is clear now. Thank you also for the Perl script; I already have it running in my pipeline.

linsalrob commented 3 years ago

Glad that we resolved this issue.