linsalrob / PhiSpy

Prediction of prophages from bacterial genomes
MIT License

Error when sequence ID is too long #53

Open jcmckerral opened 3 years ago

jcmckerral commented 3 years ago

There is a small issue: one of the Biopython functions has a character length limit on sequence IDs, and the resulting error is hard to interpret. A more informative error message would be useful. A FASTA ID such as

>SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT

results in a GenBank file that makes PhiSpy fail with the following traceback:

[USERID]$ PhiSpy.py testgenome.gb -o phispyTest
Traceback (most recent call last):
  File "$PATH/anaconda3/bin/PhiSpy.py", line 125, in <module>
    main(sys.argv)
  File "$PATH/anaconda3/bin/PhiSpy.py", line 48, in main
    args_parser.record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > args_parser.min_contig_size, SeqIO.parse(handle, "genbank")))
  File "$PATH/anaconda3/lib/python3.8/site-packages/PhiSpyModules/seqio_filter.py", line 33, in __init__
    for n, item in enumerate(content):
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/SeqIO/Interfaces.py", line 73, in __next__
    return next(self.records)
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
    record = self.parse(handle, do_features)
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
    if self.feed(handle, consumer, do_features):
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 465, in feed
    self._feed_first_line(consumer, self.line)
  File "$PATH/anaconda3/lib/python3.8/site-packages/Bio/GenBank/Scanner.py", line 1572, in _feed_first_line
    raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS       SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT bp   DNA linear

Changing the ID to

>SEQID_SHORT

resolves the problem.
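
For the more informative error message requested above, one option would be to inspect the LOCUS line before the file reaches the parser. The sketch below is only an illustration and is not PhiSpy or Biopython code: `locus_problem` is a hypothetical helper, and it assumes (from the LOCUS line in the traceback above) that the failure shows up as a LOCUS line with no standalone integer length in front of the `bp` unit. The well-formed line in the example is made up for comparison.

```python
def locus_problem(line):
    """Return a hint if a LOCUS line has no separate length field, else None.

    Hypothetical helper: the check is a heuristic based on the LOCUS line
    shown in this thread, not on the parser's actual rules.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        return "not a LOCUS line"
    # Find the unit token ("bp" for nucleotides, "aa" for proteins).
    unit_at = next((i for i, t in enumerate(tokens) if t in ("bp", "aa")), None)
    if unit_at is None:
        return "no 'bp' or 'aa' unit found on the LOCUS line"
    # A well-formed line has LOCUS, a name, then an integer length before the unit.
    if unit_at < 3 or not tokens[unit_at - 1].isdigit():
        return (
            f"sequence ID {tokens[1]!r} ({len(tokens[1])} characters) appears to "
            "have overwritten the length field; try a shorter sequence ID"
        )
    return None


# LOCUS line taken from the traceback in this thread.
bad = "LOCUS       SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT bp   DNA linear"
# Made-up well-formed line, for comparison only.
good = "LOCUS       SEQID_SHORT  5000 bp    DNA     linear"

print(locus_problem(bad))
print(locus_problem(good))
```

A check like this could run on the first line of each record and stop with the hint instead of the bare ValueError.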

liaochenlanruo commented 2 years ago

Traceback (most recent call last):
  File "/home/liu/miniconda3/envs/component/bin/PhiSpy.py", line 10, in <module>
    sys.exit(run())
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 122, in run
    main(sys.argv)
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 44, in main
    args_parser.record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > args_parser.min_contig_size, SeqIO.parse(handle, "genbank")))
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/seqio_filter.py", line 33, in __init__
    for n, item in enumerate(content):
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 74, in __next__
    return next(self.records)
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
    record = self.parse(handle, do_features)
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
    if self.feed(handle, consumer, do_features):
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 465, in feed
    self._feed_first_line(consumer, self.line)
  File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 1571, in _feed_first_line
    raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
ValueError: Did not recognise the LOCUS line layout:
LOCUS NODE_52_length_15591_cov_14.37480715591 bp DNA linear

qianxin-kxy commented 1 year ago

I have also encountered this issue, but I have hundreds of gbk files to process. Is there any way to batch shorten the IDs in the files?

ShanlinKe commented 1 year ago

I have also encountered this issue, but I have hundreds of gbk files to process, so is there any way to batch shorten the IDs in the files

I met the same issue. Any clues on this?
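
For batch fixing of existing gbk files, one possible approach is to edit the LOCUS lines as plain text, since the parser cannot read the files in the first place. This is a rough sketch under several assumptions: `shorten_locus_names`, `shorten_directory`, the `ctg` prefix and the `.ids.tsv` mapping file are all hypothetical names; each record is assumed to carry its sequence in an ORIGIN block, from which the length is recounted; the unit is assumed to be `bp` when none is found; and only the LOCUS line is touched, so the long ID may still appear in other fields. It has not been checked against PhiSpy, so treat it as a starting point only.

```python
import re
from pathlib import Path


def _rebuild_locus(line, new_name, length):
    """Return (old_name_field, new_locus_line) for one LOCUS line."""
    tokens = line.split()
    old_name = tokens[1] if len(tokens) > 1 else ""
    # Keep whatever followed the unit (molecule type, topology, date) untouched.
    match = re.search(r"\s(bp|aa)\b(.*)$", line)
    # Assumption: fall back to "bp" if no unit is present on the line.
    unit, tail = (match.group(1), match.group(2)) if match else ("bp", "")
    return old_name, f"LOCUS       {new_name:<16} {length:>11} {unit}{tail}"


def shorten_locus_names(gbk_text, prefix="ctg"):
    """Give every record a short LOCUS name and a recounted length.

    Returns (new_text, mapping), where mapping is {new_name: old_name_field}.
    If the old ID ran into the length, the old field holds both fused together.
    """
    lines = gbk_text.splitlines()
    mapping = {}
    locus_at = None      # index of the current record's LOCUS line
    in_origin = False    # True while reading the sequence block
    n_bases = 0
    for i, line in enumerate(lines):
        if line.startswith("LOCUS"):
            locus_at, in_origin, n_bases = i, False, 0
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: now the sequence length is known.
            if locus_at is not None:
                new_name = f"{prefix}{len(mapping) + 1}"
                old_name, lines[locus_at] = _rebuild_locus(
                    lines[locus_at], new_name, n_bases
                )
                mapping[new_name] = old_name
            locus_at, in_origin = None, False
        elif in_origin:
            # Sequence lines hold position numbers, spaces and letters.
            n_bases += sum(ch.isalpha() for ch in line)
    return "\n".join(lines) + "\n", mapping


def shorten_directory(src_dir, dst_dir, pattern="*.gbk"):
    """Write fixed copies of all matching files, plus a new-to-old ID table."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob(pattern)):
        new_text, mapping = shorten_locus_names(path.read_text())
        (dst / path.name).write_text(new_text)
        table = "".join(f"{new}\t{old}\n" for new, old in mapping.items())
        (dst / (path.stem + ".ids.tsv")).write_text(table)
```

Calling `shorten_directory("gbk_in", "gbk_fixed")` (directory names are placeholders) would leave the originals untouched and write the rewritten files next to a table that maps each short name back to the original ID field.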

linsalrob commented 1 year ago

Can you point me to a file where this issue occurs so that I can fix it?

TSZUoE commented 1 year ago

Hi, I also had this issue. I initially tried to add the whitespace manually but that didn't work. My genbank files were annotated in PROKKA. Re-annotating using the --compliant flag for PROKKA fixed the issue for me as it parses the locus line in a different way.

ghost commented 1 week ago

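@linsalrob @qianxin-kxy @jcmckerral Thank you. An easy workaround is to shorten the FASTA headers before running:

# replace every space with a pipe (sed -i edits the files in place)
for i in *.fasta; do sed -i -e "s/ /|/g" "${i}"; done
# keep only the text before the first pipe; cut prints to stdout,
# so redirect the result into a new file
for i in *.fasta; do cut -f 1 -d "|" "${i}" > "${i%.fasta}.short.fa"; done
# headers are now cut at the first space in the *.short.fa files
# note: headers that contain no space are left unchanged

Thank you,
Gaurav
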
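
The shell loops above cut each header at its first space, which does not help when the ID itself is long and contains no spaces. As an alternative, here is a minimal sketch that replaces every header with a short sequential ID and keeps a table of the original headers. The names `shorten_fasta_headers` and `contig_` are made up for illustration, and `max_len=16` is an assumed, adjustable cap, since this thread does not state the actual limit.

```python
def shorten_fasta_headers(fasta_text, prefix="contig_", max_len=16):
    """Replace each FASTA header with a short sequential ID.

    Returns (new_text, mapping), where mapping is {new_id: original_header}.
    max_len is an assumed cap on ID length, not a documented limit.
    """
    mapping = {}
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            new_id = f"{prefix}{len(mapping) + 1}"
            if len(new_id) > max_len:
                raise ValueError(f"generated ID {new_id!r} exceeds {max_len} characters")
            # Remember the full original header so results can be traced back.
            mapping[new_id] = line[1:].strip()
            out.append(">" + new_id)
        else:
            # Sequence lines are copied through unchanged.
            out.append(line)
    return "\n".join(out) + "\n", mapping


# Example using the two IDs from the first post in this thread.
text = ">SEQID_TOO_LONG_BIOPY_HAS_CHAR_LIMIT\nACGT\n>SEQID_SHORT\nGGCC\n"
short_text, id_map = shorten_fasta_headers(text)
print(short_text)
print(id_map)
```

Keeping the mapping matters because any prophage coordinates reported later will refer to the short IDs, not the original contig names.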
ghost commented 1 week ago

@ShanlinKe @TSZUoE see my response in this thread above.

If you have C++ code, such as a pointer declaration snippet, paste it here and I will convert it in the same way.

Thank you Gaurav