mossmatters / HybPiper

Recovering genes from targeted sequence capture data
GNU General Public License v3.0
108 stars 45 forks source link

issue running get_seq_lengths.py #43

Open angelajmcd opened 5 years ago

angelajmcd commented 5 years ago

Hi Matt,

I've been trying to run get_seq_lengths.py and I keep getting this error:

Traceback (most recent call last): File "/Applications/HybPiper/get_seq_lengths.py", line 68, in seq_length = len(SeqIO.read(read_file,'fasta').seq) File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/Bio/SeqIO/init.py", line 730, in read raise ValueError("More than one record found in handle") ValueError: More than one record found in handle

I have checked my files, updated software and dependencies, and am not sure what to make of the "More than one record found in handle" message... any thoughts?

Hope you're well! Angela

mossmatters commented 5 years ago

SeqIO.read() is expecting there to be only one sequence in the FASTA file. In this stage it's reading one of the HybPiper outputs, which should only have one sequence. It means that somehow multiple sequences were written to one or more of your sequence output files. For a HybPiper output directory with the prefix SampleA, the nucleotide sequence for gene001 this would be a file located at SampleA/gene001/SampleA/sequences/FNA/gene001.FNA The file should contain one and only one sequence.

From the output of the get_seq_lengths.py script you should be able to determine which sample is the one causing trouble-- it's the one in namelist.txt that's after the last one in the output. You'll need to look at the .FNA files for each of the genes for that sample to see if there are any that have more than one sequence. Here's a goofy but effective one-liner command, which you should run from within the HybPiper output directory for the offending sample:

parallel --tag 'grep ">" {}/SRR1855705/sequences/FNA/{}.FNA | wc -l' ::: $(cut -f 1 genes_with_seqs.txt)

Let me know what you find, as there should not be more than one sequence in those files!

angelajmcd commented 5 years ago

Thanks for the quick response; I really appreciate it! Admittedly, I didn't get the oneliner parallel command to work right away... I'm going to try it again this afternoon but wanted to get back to you in the meantime. I did look in the FNA folders and found some sequences were just strings of inverted "?"(and, previously when I've run hybpiper, I think the file is just empty or nonexistent, so I thought that was weird) but didn't find anything (yet) with multiple sequences. This was or the first sample in my namelist.txt file which may be why I couldn't get get_seq_lengths.py to work at all, probably. When I edited namelist.txt to move the first sample to the fourth line, I would get output from the first three samples, which is good.

So... then I tried to just re-run reads_first.py for that one sample. I then found that I seem to have some problem with my mac-installed python and I couldn't get biopython (or pip, for that matter) to work. I kept getting errors that reads couldn't get distributed to directories, which should be a biopython issue, I think. I ended up homebrew installing python2 and biopython and I think the reads_first script is running normally now.

(I know this is off-topic and unrelated to your software, but do you know of good online resources available for checking whether the python(s) you have installed are ok/functional, and also maybe how to fix them if they're busted? Is it possible to repair the python that comes installed on mac?)

Thanks again!

angelajmcd commented 5 years ago

Alright, back at it! Re-running the previous sample worked beautifully. BUT, I have another sample that did the same thing and I was able to get the parallel line to work (was in the wrong directory earlier, as I often am...).

For the next sample, there are three gene regions with >1 sequence in them; one has 19 sequences while the others have 23 and 25 respectively. The sequences are all the same length... is it possible they are just reads? Not sure what's up with that... or how to explain it better! But I will re-run reads_first.py for the individual samples for now, no problem.

mossmatters commented 5 years ago

Are you running HybPiper on a cluster, or is it running on your Mac? The weird "?" and other strange file results remind me of problems other have had running HybPiper on a cluster. There can be weird syncing issues because files are generated so fast. The solution is to have HybPiper work in a "scratch" space on the local compute node instead.

As for Python, one standard these days is conda: https://docs.conda.io/en/latest/

You can install most of the packages in HybPiper via bioconda: https://bioconda.github.io/recipes/mira/README.html

One thing conda allows you to do is create separate, clean environments for running python, and they won't interact with any other environments.

Let me know how the re-run went!

angelajmcd commented 5 years ago

It's running locally on a Mac. Not sure what would have caused the issue. Thanks for the python advice! Will look into bioconda, which I don't believe I've tried.

The re-runs are great! Not totally sure what happened initially or why re-running fixed things, but I'm glad it worked. I ran your parallel code for some of the re-runs and they look good. Thanks, again, for your help.

angelajmcd commented 5 years ago

Hey Matt,

Sorry to keep bugging you, but I'm running intronerate.py now and getting errors I'm not sure what to do with. This has happened for a couple of samples so far, it runs and then stops part way through and I get this error: Traceback (most recent call last): File "/Applications/HybPiper/intronerate.py", line 307, in if name == "main":main() File "/Applications/HybPiper/intronerate.py", line 286, in main kept_hits = filter_gff(hits,merge=args.merge) File "/Applications/HybPiper/intronerate.py", line 194, in filter_gff return [hits[kept_indicies[x]] for x in sorted(non_overlapping_indicies)]#.sort()]
IndexError: list index out of range

I think it's happened multiple times for the same locus... could there be a problem with that locus? I get that this step is Exonerate... I think the .FNA sequence length and the exon length in the genes_with_seqs.txt file match. The intronerate.gff file didn't get made for the locus it quit on but otherwise the rest of the files that result from the script (compared to a successfully completed locus) look to be there.

Let me know what you think.