mossmatters / HybPiper

Recovering genes from targeted sequence capture data
GNU General Public License v3.0
111 stars 45 forks source link

SPAdes assembly ERROR due to low coverage #35

Open biomendi opened 6 years ago

biomendi commented 6 years ago

I am running HybPiper using a reference that has 17879 genes. I'm testing the pipeline on one sample before going for more. First, I got the following BiopythonWarning which I am not sure how crucial it is:

/anaconda2/lib/python2.7/site-packages/Bio/Seq.py:2309: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  BiopythonWarning)
[DISTRIBUTE]: 17879 proteins had no good matches.

Then, SPAdes assembly encountered an error with 5655 genes (no contigs found due to low coverage). Here is the output:

Running SPAdes on 5655 genes
time parallel --eta spades.py --only-assembler --threads 1 --cov-cutoff 8 --12 {}/{}_interleaved.fasta -o {}/{}_spades :::: spades_genelist.txt > spades.log
: This should not happen. You have found a bug.
Please contact <parallel@gnu.org> and follow
https://www.gnu.org/software/parallel/man.html#REPORTING-BUGS

Include this in the report:
* The version number: 
* The bugid: Can't dup STDIN: Bad file descriptor
* The command line being run
* The files being read (put the files on a webserver if they are big)

If you get the error on smaller/fewer files, please include those instead.

real    0m0.986s
user    0m0.116s
sys     0m0.025s
ERROR: One or more genes had an error with SPAdes assembly. This may be due to low coverage. No contigs found for the following genes:
gene100
gene10004
gene10005
gene10007
... etc

And finally, HybPiper dies with the following error:

...
gene9997
gene9998
gene9999
Traceback (most recent call last):
  File "/Users/server_mac1/SANTI/HybPiper/spades_runner.py", line 189, in <module>
    if __name__ == "__main__":main()
  File "/Users/server_mac1/SANTI/HybPiper/spades_runner.py", line 183, in main
    spades_failed,spades_duds = rerun_spades("failed_spades.txt",cov_cutoff=args.cov_cutoff,paired=is_paired,cpu=args.cpu)
  File "/Users/server_mac1/SANTI/HybPiper/spades_runner.py", line 97, in rerun_spades
    all_kmers = [int(x[1:]) for x in os.listdir(os.path.join(gene,"{}_spades".format(gene))) if x.startswith("K")]
OSError: [Errno 2] No such file or directory: 'gene100/gene100_spades'
WARNING: Something went wrong with the assemblies! Check for failed assemblies and re-run! 
HybPiper was called with these arguments:
../reads_first.py -r RS66eOsaxatilis_left_P_qtrim.fq RS66eOsaxatilis_right_P_qtrim.fq -b RefSeq_cruralis.fa --prefix RS66eOsaxatilis --bwa

Making nucleotide bwa index in current directory.
[CMD]: bwa index RefSeq_cruralis.fa
[CMD]: time bwa mem -t 24 RefSeq_cruralis.fa /Users/server_mac1/SANTI/HybPiper/oreobates_GREEN_data/RS66eOsaxatilis_left_P_qtrim.fq /Users/server_mac1/SANTI/HybPiper/oreobates_GREEN_data/RS66eOsaxatilis_right_P_qtrim.fq  | samtools 
[CMD] time python /Users/server_mac1/SANTI/HybPiper/distribute_reads_to_targets_bwa.py RS66eOsaxatilis.bam /Users/server_mac1/SANTI/HybPiper/oreobates_GREEN_data/RS66eOsaxatilis_left_P_qtrim.fq /Users/server_mac1/SANTI/HybPiper/oreo

[DISTRIBUTE]: time python /Users/server_mac1/SANTI/HybPiper/distribute_targets.py /Users/server_mac1/SANTI/HybPiper/oreobates_GREEN_data/RefSeq_cruralis.fa --bam RS66eOsaxatilis.bam
ERROR: No genes had assembled contigs! Exiting!
ERROR: cannot find genes_with_paralog_warnings.txt in RS66eOsaxatilis!
Traceback (most recent call last):
  File "../intronerate.py", line 307, in <module>
    if __name__ == "__main__":main()
  File "../intronerate.py", line 268, in main
    genelist = [x.split()[0] for x in open('genes_with_seqs.txt').readlines()]
IOError: [Errno 2] No such file or directory: 'genes_with_seqs.txt'

I would really appreciate any help! :)

mossmatters commented 6 years ago

Sorry for the late reply, teaching has taken a lot out of me this semester :)

The first one is easy-- the Biopython warning is expected and does not affect anything, it is just telling you that the length of some of your coding sequences are not a multiple of 3.

It seems like the main error is with GNU Parallel, and would explain why the assembly step fails. GNU Parallel is handling the assemblies based on the number of CPUs you have available. My first recommendation is to run the GNU Parallel command on its own, outside of HybPiper. First, go into your RS66eOsaxatilis directory and then run:

time parallel --eta spades.py --only-assembler --threads 1 --cov-cutoff 8 --12 {}/{}_interleaved.fasta -o {}/{}_spades :::: spades_genelist.txt > spades.log

If you do that, do you get the same error from GNU Parallel?

biomendi commented 6 years ago

Hi, the command on its own ran successfully. It seems that SPAdes was only executed on those 5655 genes mentioned before, but not on all the others (my reference file has 17879 genes). Why is that? Also, I have noticed that SPAdes did not output the file contigs.fasta in all the genes (it did in 4405 out of 5655 genes). Within those 4405 genes, the file contigs.fasta is empty in most of them (only in 350 genes the file contigs.fasta is not empty). What would be the next step?

Thanks a lot!

biomendi commented 6 years ago

Next step would be to run exonerate_hits.py which needs two files: contigs.fasta and baits.fasta. However, I have found out that baits.fasta does not exist within any of the 17879 gene directories.

I guess that's why we see: [DISTRIBUTE]: 17879 proteins had no good matches.

So after that, I really have no idea on how to proceed... Please help!

mossmatters commented 6 years ago

Are you running this on a cluster? We have noticed issues before with files not properly syncing when HybPiper is running on a cluster, where the login/storage is separated from the compute nodes.

HybPiper creates a lot of files very fast, and sometimes the scripts that sync the storage on compute nodes with storage in your home directory do not keep up. Then HybPiper moves on to another stage of the pipeline, expecting the files to be there. If the genes missing the _baits.fasta files are listed in the file genes_with_reads.txt then I suspect the file system is the cause of the error.

On many clusters there is the option to use /scratch space for temporary storage of files within a job. Because of these syncing issues I always use /scratch when I run HybPiper.

biomendi commented 6 years ago

Hi again and thanks for your reply, Matt!

I have now successfully run Spades and found out 708 genes with a gene*_contig.fasta file. I created the exonerate_genelist.txt using those genes and I am now doing the analysis with exonerate_hits.py. Interestingly, exonerate will run as usual for a few genes:

There were 3 exonerate hits for gene11074/gene11074_baits.fasta.
Ocruralis-gene11074Writing amino acid sequence, length: 35

While, for some other genes, it will give me an error. Actually, I have observed two kinds of errors and one warning. The first one (very frequent!) saying "KeyError" (Biopython?):

Ocruralis-gene105Traceback (most recent call last):
  File "../../HybPiper/exonerate_hits.py", line 630, in <module>
    if __name__ == "__main__":main()
  File "../../HybPiper/exonerate_hits.py", line 570, in main
    paralog_test(proteinHits[prot],protein_dict[prot],prefix)
KeyError: 'Ocruralis-gene105' 

The second one (less frequent) is this:

Ocruralis-gene10971Traceback (most recent call last):
  File "../../HybPiper/exonerate_hits.py", line 630, in <module>
    if __name__ == "__main__":main()
  File "../../HybPiper/exonerate_hits.py", line 605, in main
    nucl_sequence = fullContigs(proteinHits[prot],sequence_dict,assembly_dict,protein_dict,prefix,args.threshold)
  File "../../HybPiper/exonerate_hits.py", line 156, in fullContigs
    return str(sequence_dict[prot["assemblyHits"][0]].seq.reverse_complement())
  File "/anaconda2/lib/python2.7/site-packages/Bio/Seq.py", line 890, in reverse_complement
    return self.complement()[::-1]
  File "/anaconda2/lib/python2.7/site-packages/Bio/Seq.py", line 848, in complement
    raise ValueError("Mixed RNA/DNA found")
ValueError: Mixed RNA/DNA found

And the warning:

There were 1 exonerate hits for gene10667/gene10667_baits.fasta.
Ocruralis-gene10667/anaconda2/lib/python2.7/site-packages/Bio/Seq.py:2309: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
  BiopythonWarning)
Traceback (most recent call last):
  File "../../HybPiper/exonerate_hits.py", line 630, in <module>
    if __name__ == "__main__":main()
  File "../../HybPiper/exonerate_hits.py", line 610, in main
    amino_sequence = myTranslate(nucl_sequence)
  File "../../HybPiper/exonerate_hits.py", line 472, in myTranslate
    aminoseq = nucseq.translate()
  File "/anaconda2/lib/python2.7/site-packages/Bio/Seq.py", line 1105, in translate
    cds, gap=gap)
  File "/anaconda2/lib/python2.7/site-packages/Bio/Seq.py", line 2338, in _translate_str
    "Codon '{0}' is invalid".format(codon))
Bio.Data.CodonTable.TranslationError: Codon 'T10' is invalid

Any ideas? Thanks a lot!

EDIT: error solved - the issue was due to incorrect naming of --prefix destination

biomendi commented 6 years ago

Another weird behavior found - I'm so sorry for being so annoying (but I really wanna understand all the steps in HybPiper). This time when running intronerate.py

[CMD] exonerate -m protein2genome -q sequences/FAA/gene10691.FAA -t sequences/intron/gene10691_supercontig.fasta --verbose 0 --showalignment no --showvulgar no --showtargetgff yes > intronerate_raw.gff
Traceback (most recent call last):
  File "../../HybPiper/intronerate.py", line 307, in <module>
    if __name__ == "__main__":main()
  File "../../HybPiper/intronerate.py", line 286, in main
    kept_hits = filter_gff(hits,merge=args.merge)
  File "../../HybPiper/intronerate.py", line 119, in filter_gff
    kept_indicies = range_connectivity(range_list)
  File "/Users/server_mac1/SANTI/HybPiper/exonerate_hits.py", line 324, in range_connectivity
    max_length = max(ends) - min(starts)
ValueError: max() arg is an empty sequence

And the AA sequence in question is:

>gene10691
DR*LTYT*NIEAKPLMD*SIN*

The following AA sequence will also give the same error

>gene1362
TVTS*HLIGTSPVSSSIILLVSSRIFCS
mossmatters commented 6 years ago

Regarding the error for intronerate.py, it looks like something is off with the annotation of intron and exon sequences. Could you post the intronerate_raw.gff file from gene10691? That may give us some clues. I suspect it has something to do with the protein sequence having stop codons, but there may be something else going on.

biomendi commented 6 years ago

Unfortunately, intronerate_raw.gff turns out to be empty in both genes. But I have noticed that many other genes (which did not fail) also have stop codons (even on the test dataset some AA sequences that I randomly checked will have some stop codons).

RandolphQuek commented 6 years ago

Hey Matt!

I'm a new user of HybPiper too, and I managed to run the pipeline for most of the way. However, when I am trying to do Intronerate.py,Ii get the same error as biomendi, and my intronerate_raw.gff file is also blank. I have no stop codons in my file, so I'm not really sure what's going on.

Could I please trouble you to help us with it?

Thanks for your help!

Randolph.

mossmatters commented 5 years ago

So sorry for the delayed reply (again) but I've finally got some time to work out some issues on HybPiper. It is curious that exonerate would find enough evidence to extract a (coding) sequence using exonerate_hits.py in the main script, but then fail to annotate the same thing during intronerate.py.

If either of you who had this issue can send me a directory for a gene where this is happening (one level in from the prefix directory, so prefix/geneName) , I would really appreciate it!