mossmatters / HybPiper

Recovering genes from targeted sequence capture data
GNU General Public License v3.0
111 stars 45 forks source link

exonerate_hits.py pauses at last gene #49

Open Paul-Donat opened 4 years ago

Paul-Donat commented 4 years ago

For multiple samples at the "last writing amino acid sequence" step, exonerate stops. It seems to be entering a waiting loop that it never exits from. the --timeout will stop exonerate but it doesn't resolve the issue.

I have manually been able to run the exonerate command that is stalling out.

running on a 48 core centos 7 server.

any insight into what might be happening would helpful.

mossmatters commented 4 years ago

How long does the exonerate command take when you run it manually? Is it getting stuck on the same gene for many samples? The issue leading to the --timeout workaround originally came up when folks were assembling using reference proteins that have very repetitive sequences, or even when the reference was not coding sequence at all. That would be my first thing to check.

joelnitta commented 2 years ago

I am having the same issue. I don't think it is a problem with my reference sequences because it only comes up in a subset of samples, not a subset of genes.

@mossmatters can you help with two things:

  1. How can I run exonerate manually (what command should be given?)

  2. How can I tell what gene it is getting stuck on?

I've been logging the stderr from reads_first.py. The second-to-last gene output looks like this:

ETA: 60s Left: 170 AVG: 0.37s  local:1/19/100%/0.4s 
ETA: 60s Left: 170 AVG: 0.37s  local:1/19/100%/0.4s 
ETA: 60s Left: 170 AVG: 0.37s  local:1/19/100%/0.4s 
ETA: 61s Left: 170 AVG: 0.37s  local:1/19/100%/0.4s 
ETA: 61s Left: 170 AVG: 0.37s  local:1/19/100%/0.4s 
ETA: 61s Left: 170 AVG: 0.37s  local:1/19/100%/0.4s 
ETA: 61s Left: 170 AVG: 0.37s  local:1/19/100%/0.4s 
ETA: 61s Left: 170 AVG: 0.37s  local:1/19/100%/0.4s 
ETA: 61s Left: 170 AVG: 0.37s  local:1/19/100%/0.4s There were 2 exonerate hits for cluster3423/cluster3423_baits.fasta.
maol-cluster3423Writing amino acid sequence, length: 183

So we can see that the gene called cluster3423 worked correctly.

But the last gene starts like this:

ETA: 61s Left: 169 AVG: 0.35s  local:1/20/100%/0.4s 
ETA: 60s Left: 169 AVG: 0.35s  local:1/20/100%/0.4s 
ETA: 60s Left: 169 AVG: 0.35s  local:1/20/100%/0.4s 
ETA: 60s Left: 169 AVG: 0.35s  local:1/20/100%/0.4s 

... and continues on and on for thousands of lines:

ETA: 294148s Left: 169 AVG: 1740.95s  local:1/20/100%/1741.0s 
ETA: 294157s Left: 169 AVG: 1741.00s  local:1/20/100%/1741.0s 
ETA: 294165s Left: 169 AVG: 1741.05s  local:1/20/100%/1741.1s 

ETA just keeps going up without ever finishing. Since it only prints the gene name when it finishes, it's not clear to me how to see what gene it is currently stuck on.

mossmatters commented 2 years ago

Hi Joel et al - We are currently working on a HybPiper 2.0 that addresses a lot of these issues. This same issue was coming up on testing and seems to be related to low-complexity DNA in the target file. If there is a repetitive region in the target, there are a lot of contigs with potential mappings and exonerate takes a long time.

In HybPiper 1.3 you can currently run --timeout 4000 in reads_first.py to kill any spades or exonerate jobs that are taking a long time (i.e. 4000% the average time). Killing the exonerate process will allow HybPiper to finish and create all the output reports you need.

In the future we are also working on building filters for low-complexity DNA/protein sequences in the target files.

mossmatters commented 2 years ago

As for running exonerate_hits.py manually, you just need to descend into the proper gene file and call the python script using the geneName_contigs.fasta and geneName_baits.fasta files.

You can find which gene is missing by looking at exonerate_genelist.txt and comparing that to the genes that actually have sequences generated - the one with the hangup is likely to be missing. It is also likely to have a contigs.fasta file that is very very big with lots of low-complexity DNA.