linsalrob / PhiSpy

Prediction of prophages from bacterial genomes
MIT License
70 stars 21 forks source link

Huge memory consumption and runtime with NNNNs #52

Closed deprekate closed 3 years ago

deprekate commented 3 years ago

I have been waiting hours for one (our of 20k) genome to finish. There are stretches of N in the dna that I think is causing the problem.

The genome: g.500.gb

The output of phispy:

$ tail phispy.log 
2021-05-10 13:31:12 INFO     g_500_c_0  763490  764494  1   Dropped. Not enough genes
2021-05-10 13:31:12 INFO     g_500_c_0  556601  556960  1   Dropped. Not enough genes
2021-05-10 13:31:12 INFO     PROPHAGE: 1 Contig: g_500_c_0 Start: 160559 Stop: 171884
2021-05-10 14:07:05 INFO     There were 6 repeats with the same length as the best. One chosen somewhat randomly!
2021-05-10 14:07:05 INFO     PROPHAGE: 2 Contig: g_500_c_0 Start: 334826 Stop: 374323
2021-05-10 14:11:32 INFO     PROPHAGE: 3 Contig: g_500_c_0 Start: 444209 Stop: 475789
2021-05-10 14:40:12 INFO     There were 12 repeats with the same length as the best. One chosen somewhat randomly!
2021-05-10 14:40:12 INFO     PROPHAGE: 4 Contig: g_500_c_0 Start: 589283 Stop: 606509
2021-05-10 15:29:30 INFO     There were 12 repeats with the same length as the best. One chosen somewhat randomly!
2021-05-10 15:29:30 INFO     PROPHAGE: 5 Contig: g_500_c_0 Start: 622677 Stop: 710948

And the region that contains (some) the N's

$ grep '^   622[6789]' g.500.gb 
   622621 catcgattac aaatattgtc cgccatcaaa acacatccag cgggatggtt tgttgcatga
   622681 atcatgcgat attttgggtg ttggttctgt tggatacgga tcggcatatt caacacgtat
   622741 ttatagcttt cgctcaaacg ctnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
   622801 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
   622861 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
   622921 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
   622981 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn

How much resources PhiSpy is using:

$ top -n1 | grep "PID\|PhiSpy"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                           
 4345 katelyn   20   0  214.1g 123.3g   1560 R 100.0  97.9 142:11.65 PhiSpy.py
linsalrob commented 3 years ago

Yes, that makes sense. I will revert a change I made this week. I thought it might be problematic but my tests did not show an issue.

Rob

On Tue, 11 May 2021 at 09:28, Katelyn @.***> wrote:

I have been waiting hours for one (our of 20k) genome to finish. There are stretches of N in the dna that I think is causing the problem.

The genome: g.500.gb https://edwards.sdsu.edu/~katelyn/g.500.gb

The output of phispy:

$ tail phispy.log 2021-05-10 13:31:12 INFO g_500_c_0 763490 764494 1 Dropped. Not enough genes 2021-05-10 13:31:12 INFO g_500_c_0 556601 556960 1 Dropped. Not enough genes 2021-05-10 13:31:12 INFO PROPHAGE: 1 Contig: g_500_c_0 Start: 160559 Stop: 171884 2021-05-10 14:07:05 INFO There were 6 repeats with the same length as the best. One chosen somewhat randomly! 2021-05-10 14:07:05 INFO PROPHAGE: 2 Contig: g_500_c_0 Start: 334826 Stop: 374323 2021-05-10 14:11:32 INFO PROPHAGE: 3 Contig: g_500_c_0 Start: 444209 Stop: 475789 2021-05-10 14:40:12 INFO There were 12 repeats with the same length as the best. One chosen somewhat randomly! 2021-05-10 14:40:12 INFO PROPHAGE: 4 Contig: g_500_c_0 Start: 589283 Stop: 606509 2021-05-10 15:29:30 INFO There were 12 repeats with the same length as the best. One chosen somewhat randomly! 2021-05-10 15:29:30 INFO PROPHAGE: 5 Contig: g_500_c_0 Start: 622677 Stop: 710948

And the region that contains (some) the N's

$ grep '^ 622[6789]' g.500.gb 622621 catcgattac aaatattgtc cgccatcaaa acacatccag cgggatggtt tgttgcatga 622681 atcatgcgat attttgggtg ttggttctgt tggatacgga tcggcatatt caacacgtat 622741 ttatagcttt cgctcaaacg ctnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn 622801 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn 622861 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn 622921 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn 622981 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn

How much resources PhiSpy is using:

$ top -n1 | grep "PID|PhiSpy" PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4345 katelyn 20 0 214.1g 123.3g 1560 R 100.0 97.9 142:11.65 PhiSpy.py

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/linsalrob/PhiSpy/issues/52, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGMFB6C7D5GJVO5CXR53STTNBXKRANCNFSM44TJ5F2A .

linsalrob commented 3 years ago

Reverted, will try and implement a different algorithm

linsalrob commented 3 years ago

I looked at modifying repeatfinder to ignore the N's but it uses two-bit encoding of the DNA sequence to quickly find the repeats, so not obvious how to do that (without increasing to 3-bit encoding).

Also, its not clear that repeatfinder will discriminate a run of A from a run of N at the moment.

I think this example genome is particularly tricky and crashes without my python implementation looking for multiple N's.

@deprekate do you have any great ideas?

deprekate commented 3 years ago

How are N's treated in 2bit?

Rather than ignoring them, what if N's are converted to a random ACTG nucleotide? It should randomize a stretch of N's pretty well

linsalrob commented 3 years ago

I have implemented this in PhiSpy's repeatfinder module, and it now completes on g.500.gb in ~200 seconds,

$ /usr/bin/time -v PhiSpy.py -o g.5000 g.500.gb
        Command being timed: "PhiSpy.py -o g.5000 g.500.gb"
        User time (seconds): 194.23
        System time (seconds): 4.31
        Percent of CPU this job got: 190%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:44.30
        Maximum resident set size (kbytes): 579908

Note that in genomes with a lot of N's you will now find spurious, non-identical matches where one side is a sequence and the other a run of N's, but (a) this is less likely to happen with longer repeat lengths (essentially with frequency 1 in k4 bp) and (b) they are unlikely to be recorded by PhiSpy as they have to flank the phage.

Please let me know if this works.

linsalrob commented 3 years ago

I have implemented a faster version as noted above