Open biomendi opened 6 years ago
Sorry for the late reply, teaching has taken a lot out of me this semester :)
The first one is easy-- the Biopython warning is expected and does not affect anything, it is just telling you that the length of some of your coding sequences are not a multiple of 3.
It seems like the main error is with GNU Parallel, and would explain why the assembly step fails. GNU Parallel is handling the assemblies based on the number of CPUs you have available. My first recommendation is to run the GNU Parallel command on its own, outside of HybPiper. First, go into your RS66eOsaxatilis
directory and then run:
time parallel --eta spades.py --only-assembler --threads 1 --cov-cutoff 8 --12 {}/{}_interleaved.fasta -o {}/{}_spades :::: spades_genelist.txt > spades.log
If you do that, do you get the same error from GNU Parallel?
Hi, the command on its own ran successfully. It seems that SPAdes was only executed on those 5655 genes mentioned before, but not on all the others (my reference file has 17879 genes). Why is that? Also, I have noticed that SPAdes did not output the file contigs.fasta in all the genes (it did in 4405 out of 5655 genes). Within those 4405 genes, the file contigs.fasta is empty in most of them (only in 350 genes the file contigs.fasta is not empty). What would be the next step?
Thanks a lot!
Next step would be to run exonerate_hits.py which needs two files: contigs.fasta and baits.fasta. However, I have found out that baits.fasta does not exist within any of the 17879 gene directories.
I guess that's why we see: [DISTRIBUTE]: 17879 proteins had no good matches.
So after that, I really have no idea on how to proceed... Please help!
Are you running this on a cluster? We have noticed issues before with files not properly syncing when HybPiper is running on a cluster, where the login/storage is separated from the compute nodes.
HybPiper creates a lot of files very fast, and sometimes the scripts that sync the storage on compute nodes with storage in your home directory do not keep up. Then HybPiper moves on to another stage of the pipeline, expecting the files to be there. If the genes missing the _baits.fasta
files are listed in the file genes_with_reads.txt
then I suspect the file system is the cause of the error.
On many clusters there is the option to use /scratch
space for temporary storage of files within a job. Because of these syncing issues I always use /scratch
when I run HybPiper.
Hi again and thanks for your reply, Matt!
I have now successfully run Spades and found out 708 genes with a gene*_contig.fasta
file. I created the exonerate_genelist.txt
using those genes and I am now doing the analysis with exonerate_hits.py
. Interestingly, exonerate will run as usual for a few genes:
There were 3 exonerate hits for gene11074/gene11074_baits.fasta.
Ocruralis-gene11074Writing amino acid sequence, length: 35
While, for some other genes, it will give me an error. Actually, I have observed two kinds of errors and one warning. The first one (very frequent!) saying "KeyError" (Biopython?):
Ocruralis-gene105Traceback (most recent call last):
File "../../HybPiper/exonerate_hits.py", line 630, in <module>
if __name__ == "__main__":main()
File "../../HybPiper/exonerate_hits.py", line 570, in main
paralog_test(proteinHits[prot],protein_dict[prot],prefix)
KeyError: 'Ocruralis-gene105'
The second one (less frequent) is this:
Ocruralis-gene10971Traceback (most recent call last):
File "../../HybPiper/exonerate_hits.py", line 630, in <module>
if __name__ == "__main__":main()
File "../../HybPiper/exonerate_hits.py", line 605, in main
nucl_sequence = fullContigs(proteinHits[prot],sequence_dict,assembly_dict,protein_dict,prefix,args.threshold)
File "../../HybPiper/exonerate_hits.py", line 156, in fullContigs
return str(sequence_dict[prot["assemblyHits"][0]].seq.reverse_complement())
File "/anaconda2/lib/python2.7/site-packages/Bio/Seq.py", line 890, in reverse_complement
return self.complement()[::-1]
File "/anaconda2/lib/python2.7/site-packages/Bio/Seq.py", line 848, in complement
raise ValueError("Mixed RNA/DNA found")
ValueError: Mixed RNA/DNA found
And the warning:
There were 1 exonerate hits for gene10667/gene10667_baits.fasta.
Ocruralis-gene10667/anaconda2/lib/python2.7/site-packages/Bio/Seq.py:2309: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
Traceback (most recent call last):
File "../../HybPiper/exonerate_hits.py", line 630, in <module>
if __name__ == "__main__":main()
File "../../HybPiper/exonerate_hits.py", line 610, in main
amino_sequence = myTranslate(nucl_sequence)
File "../../HybPiper/exonerate_hits.py", line 472, in myTranslate
aminoseq = nucseq.translate()
File "/anaconda2/lib/python2.7/site-packages/Bio/Seq.py", line 1105, in translate
cds, gap=gap)
File "/anaconda2/lib/python2.7/site-packages/Bio/Seq.py", line 2338, in _translate_str
"Codon '{0}' is invalid".format(codon))
Bio.Data.CodonTable.TranslationError: Codon 'T10' is invalid
Any ideas? Thanks a lot!
EDIT: error solved - the issue was due to incorrect naming of --prefix
destination
Another weird behavior found - I'm so sorry for being so annoying (but I really wanna understand all the steps in HybPiper). This time when running intronerate.py
[CMD] exonerate -m protein2genome -q sequences/FAA/gene10691.FAA -t sequences/intron/gene10691_supercontig.fasta --verbose 0 --showalignment no --showvulgar no --showtargetgff yes > intronerate_raw.gff
Traceback (most recent call last):
File "../../HybPiper/intronerate.py", line 307, in <module>
if __name__ == "__main__":main()
File "../../HybPiper/intronerate.py", line 286, in main
kept_hits = filter_gff(hits,merge=args.merge)
File "../../HybPiper/intronerate.py", line 119, in filter_gff
kept_indicies = range_connectivity(range_list)
File "/Users/server_mac1/SANTI/HybPiper/exonerate_hits.py", line 324, in range_connectivity
max_length = max(ends) - min(starts)
ValueError: max() arg is an empty sequence
And the AA sequence in question is:
>gene10691
DR*LTYT*NIEAKPLMD*SIN*
The following AA sequence will also give the same error
>gene1362
TVTS*HLIGTSPVSSSIILLVSSRIFCS
Regarding the error for intronerate.py
, it looks like something is off with the annotation of intron and exon sequences. Could you post the intronerate_raw.gff
file from gene10691
? That may give us some clues. I suspect it has something to do with the protein sequence having stop codons, but there may be something else going on.
Unfortunately, intronerate_raw.gff
turns out to be empty in both genes. But I have noticed that many other genes (which did not fail) also have stop codons (even on the test dataset some AA sequences that I randomly checked will have some stop codons).
Hey Matt!
I'm a new user of HybPiper too, and I managed to run the pipeline for most of the way. However, when I am trying to do Intronerate.py,Ii get the same error as biomendi, and my intronerate_raw.gff file is also blank. I have no stop codons in my file, so I'm not really sure what's going on.
Could I please trouble you to help us with it?
Thanks for your help!
Randolph.
So sorry for the delayed reply (again) but I've finally got some time to work out some issues on HybPiper. It is curious that exonerate would find enough evidence to extract a (coding) sequence using exonerate_hits.py
in the main script, but then fail to annotate the same thing during intronerate.py
.
If either of you who had this issue can send me a directory for a gene where this is happening (one level in from the prefix
directory, so prefix/geneName
) , I would really appreciate it!
I am running HybPiper using a reference that has 17879 genes. I'm testing the pipeline on one sample before going for more. First, I got the following BiopythonWarning which I am not sure how crucial it is:
Then, SPAdes assembly encountered an error with 5655 genes (no contigs found due to low coverage). Here is the output:
And finally, HybPiper dies with the following error:
I would really appreciate any help! :)