Closed deprekate closed 3 years ago
Yes, that makes sense. I will revert a change I made this week. I thought it might be problematic but my tests did not show an issue.
Rob
On Tue, 11 May 2021 at 09:28, Katelyn @.***> wrote:
I have been waiting hours for one (our of 20k) genome to finish. There are stretches of N in the dna that I think is causing the problem.
The genome: g.500.gb https://edwards.sdsu.edu/~katelyn/g.500.gb
The output of phispy:
$ tail phispy.log 2021-05-10 13:31:12 INFO g_500_c_0 763490 764494 1 Dropped. Not enough genes 2021-05-10 13:31:12 INFO g_500_c_0 556601 556960 1 Dropped. Not enough genes 2021-05-10 13:31:12 INFO PROPHAGE: 1 Contig: g_500_c_0 Start: 160559 Stop: 171884 2021-05-10 14:07:05 INFO There were 6 repeats with the same length as the best. One chosen somewhat randomly! 2021-05-10 14:07:05 INFO PROPHAGE: 2 Contig: g_500_c_0 Start: 334826 Stop: 374323 2021-05-10 14:11:32 INFO PROPHAGE: 3 Contig: g_500_c_0 Start: 444209 Stop: 475789 2021-05-10 14:40:12 INFO There were 12 repeats with the same length as the best. One chosen somewhat randomly! 2021-05-10 14:40:12 INFO PROPHAGE: 4 Contig: g_500_c_0 Start: 589283 Stop: 606509 2021-05-10 15:29:30 INFO There were 12 repeats with the same length as the best. One chosen somewhat randomly! 2021-05-10 15:29:30 INFO PROPHAGE: 5 Contig: g_500_c_0 Start: 622677 Stop: 710948
And the region that contains (some) the N's
$ grep '^ 622[6789]' g.500.gb 622621 catcgattac aaatattgtc cgccatcaaa acacatccag cgggatggtt tgttgcatga 622681 atcatgcgat attttgggtg ttggttctgt tggatacgga tcggcatatt caacacgtat 622741 ttatagcttt cgctcaaacg ctnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn 622801 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn 622861 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn 622921 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn 622981 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
How much resources PhiSpy is using:
$ top -n1 | grep "PID|PhiSpy" PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4345 katelyn 20 0 214.1g 123.3g 1560 R 100.0 97.9 142:11.65 PhiSpy.py
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/linsalrob/PhiSpy/issues/52, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGMFB6C7D5GJVO5CXR53STTNBXKRANCNFSM44TJ5F2A .
Reverted, will try and implement a different algorithm
I looked at modifying repeatfinder
to ignore the N
's but it uses two-bit encoding of the DNA sequence to quickly find the repeats, so not obvious how to do that (without increasing to 3-bit encoding).
Also, its not clear that repeatfinder will discriminate a run of A
from a run of N
at the moment.
I think this example genome is particularly tricky and crashes without my python implementation looking for multiple N
's.
@deprekate do you have any great ideas?
How are N's treated in 2bit?
Rather than ignoring them, what if N's are converted to a random ACTG nucleotide? It should randomize a stretch of N's pretty well
I have implemented this in PhiSpy
's repeatfinder
module, and it now completes on g.500.gb
in ~200 seconds,
$ /usr/bin/time -v PhiSpy.py -o g.5000 g.500.gb
Command being timed: "PhiSpy.py -o g.5000 g.500.gb"
User time (seconds): 194.23
System time (seconds): 4.31
Percent of CPU this job got: 190%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:44.30
Maximum resident set size (kbytes): 579908
Note that in genomes with a lot of N
's you will now find spurious, non-identical matches where one side is a sequence and the other a run of N
's, but (a) this is less likely to happen with longer repeat lengths (essentially with frequency 1 in k4 bp) and (b) they are unlikely to be recorded by PhiSpy as they have to flank the phage.
Please let me know if this works.
I have implemented a faster version as noted above
I have been waiting hours for one (our of 20k) genome to finish. There are stretches of N in the dna that I think is causing the problem.
The genome: g.500.gb
The output of phispy:
And the region that contains (some) the N's
How much resources PhiSpy is using: