Euroscarf plasmids do not follow GenBank format specification

Related to #175

TL;DR: importing Euroscarf plasmids throws parsing errors. We should decide how to deal with them (handle or pass) so that we can extract the header information and support Euroscarf plasmids.

I tried to load several Euroscarf plasmids from file and got parsing errors:

First:

Traceback (most recent call last):
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/main.py", line 234, in read_from_file
    for parsed_seq in seqio_parse(handle, sequence_file_format):
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/SeqIO/Interfaces.py", line 91, in __next__
    return next(self.records)
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/GenBank/Scanner.py", line 512, in parse_records
    record = self.parse(handle, do_features)
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/GenBank/Scanner.py", line 495, in parse
    if self.feed(handle, consumer, do_features):
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/GenBank/Scanner.py", line 461, in feed
    self._feed_first_line(consumer, self.line)
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/GenBank/Scanner.py", line 1299, in _feed_first_line
    raise ValueError(
ValueError: LOCUS line does not contain valid entry (linear, circular, ...):
LOCUS       pCM189       8374 bp    DNA   CIRCULAR SYN        15-SEP-2011

This has an easy workaround: add the uppercase LINEAR and CIRCULAR to the allowed topologies in Bio.GenBank.Scanner. Alternatively, we could "pre-parse" the first line of a GenBank file and convert the topology to lowercase before handing the file to Biopython.
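The "pre-parse" option could look roughly like the sketch below. The function name `normalize_locus_topology` is hypothetical (it is not part of Biopython or ShareYourCloning), and it assumes the only deviation is the uppercase topology keyword on the LOCUS line. It does nothing about the column-position error described next.

```python
import re

# Hypothetical helper, not part of Biopython or ShareYourCloning.
# Matches the topology keyword in any letter case.
_TOPOLOGY = re.compile(r"\b(linear|circular)\b", re.IGNORECASE)


def normalize_locus_topology(genbank_text: str) -> str:
    """Return the text with the topology keyword lowercased on LOCUS lines.

    Only lines starting with 'LOCUS' are touched. The replacement has the
    same length as the original word, so column positions do not shift.
    Assumption: the record name itself is not the bare word LINEAR/CIRCULAR.
    """
    fixed_lines = []
    for line in genbank_text.splitlines(keepends=True):
        if line.startswith("LOCUS"):
            line = _TOPOLOGY.sub(lambda m: m.group(1).lower(), line)
        fixed_lines.append(line)
    return "".join(fixed_lines)


# LOCUS line copied from the traceback above
locus = "LOCUS       pCM189       8374 bp    DNA   CIRCULAR SYN        15-SEP-2011\n"
print(normalize_locus_topology(locus), end="")
```

The normalized text could then be wrapped in `io.StringIO` and passed to the parser in place of the original file handle.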

However, after fixing this, I got a second error with some of them:

Traceback (most recent call last):
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/main.py", line 234, in read_from_file
    for parsed_seq in seqio_parse(handle, sequence_file_format):
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/SeqIO/Interfaces.py", line 91, in __next__
    return next(self.records)
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/GenBank/Scanner.py", line 512, in parse_records
    record = self.parse(handle, do_features)
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/GenBank/Scanner.py", line 495, in parse
    if self.feed(handle, consumer, do_features):
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/GenBank/Scanner.py", line 461, in feed
    self._feed_first_line(consumer, self.line)
  File "/home/daniel/dev/Genestorian/ShareYourCloning_backend/.venv/lib/python3.10/site-packages/Bio/GenBank/Scanner.py", line 1305, in _feed_first_line
    raise ValueError(
ValueError: LOCUS line does not contain space at position 52:
LOCUS       pCM189       8374 bp    DNA   CIRCULAR SYN        15-SEP-2011

Apparently the GenBank format specification is a nightmare. Someone is working on a solution that returns a warning instead of an error, so we should keep an eye on it.

Examples of plasmids that gave me these problems, with their respective headers:

http://www.euroscarf.de/plasmid_details.php?accno=P30326 (the one from the traceback): LOCUS pCM189 8374 bp DNA CIRCULAR SYN 15-SEP-2011
http://www.euroscarf.de/plasmid_details.php?accno=P30083 (both errors): LOCUS pKM265 4536 bp DNA CIRCULAR SYN 21-JUN-2013
http://www.euroscarf.de/plasmid_details.php?accno=P30385: LOCUS pUG6-tTA 5823 bp DNA CIRCULAR SYN 20-JAN-2010
http://www.euroscarf.de/plasmid_details.php?accno=P30413 (only first error): LOCUS pFA6a-13myc-natMX6 4469 bp DNA CIRCULAR SYN 02-JUL-2014

I didn't ping the Biopython people because this seems to be a problem of Euroscarf not following the specification.

As a general question, should we try to fix these format errors in plasmids that we retrieve from somewhere else, or just ignore the problems and get the key information for SYC?

Hi @dgruano, I have come across things like this in the past. I think the easiest thing is to create a custom parser for the first line. For that, you would have to override _feed_first_line in GenBankScanner and then override parse in GenBankIterator so that it uses the custom scanner:

from Bio.SeqIO.InsdcIO import GenBankIterator, GenBankScanner
import glob
import re

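class MyGenBankScanner(GenBankScanner):
    def _feed_first_line(self, consumer, line):
        # Consumer methods that the LOCUS line can populate:
        # consumer.data_file_division
        # consumer.date
        # consumer.locus
        # consumer.molecule_type
        # consumer.residue_type
        # consumer.size
        # consumer.topology
        # A regex for lines like:
        # LOCUS       pKM265       4536 bp    DNA   circular  SYN        21-JUN-2013
        m = re.match(r'LOCUS\s+(\S+)\s+(\d+ bp)\s+(\S+)\s+(\S+)\s+(\S+)', line)
        if m is None:
            # Fail with a readable message instead of an AttributeError on None
            raise ValueError(f'Could not parse LOCUS line:\n{line}')
        # The fifth group (the division, e.g. SYN) and the date are not used here
        name, size, molecule_type, topology, _ = m.groups()
        consumer.locus(name)
        consumer.size(size[:-3])  # drop the trailing ' bp'
        consumer.molecule_type(molecule_type)
        # Lowercase so that CIRCULAR / LINEAR are passed on as circular / linear
        consumer.topology(topology.lower())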

class MyGenBankIterator(GenBankIterator):

    def parse(self, handle):
        """Start parsing the file, and return a SeqRecord generator."""
        records = MyGenBankScanner(debug=0).parse_records(handle)
        return records

files = glob.glob('plasmids/*.dna')

for f in files:
    print(f)
    it = MyGenBankIterator(f)
    try:
        next(it.parse(f))
    except Exception as e:
        print(e)

This passes the tests for the examples you put there. I'm not sure it will work for every Euroscarf file, but it looks like their LOCUS lines all follow this pattern.
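As a quick check that does not need Biopython, the same regular expression can be run directly against the four headers listed in the issue. This is only a sketch: `parse_locus` is a hypothetical standalone function that mirrors the field extraction in `_feed_first_line`, and the headers are typed as they appear in the issue text, so their spacing may differ from the real files.

```python
import re

# Same pattern as in the custom scanner above
LOCUS_PATTERN = r"LOCUS\s+(\S+)\s+(\d+ bp)\s+(\S+)\s+(\S+)\s+(\S+)"


def parse_locus(line):
    """Hypothetical helper: extract the LOCUS fields into a dict."""
    m = re.match(LOCUS_PATTERN, line)
    if m is None:
        raise ValueError(f"Could not parse LOCUS line: {line!r}")
    name, size, molecule_type, topology, division = m.groups()
    return {
        "name": name,
        "size": int(size[:-3]),  # drop the trailing ' bp'
        "molecule_type": molecule_type,
        "topology": topology.lower(),  # CIRCULAR -> circular
        "division": division,
    }


# Headers as listed in the issue (whitespace as shown there)
headers = [
    "LOCUS pCM189 8374 bp DNA CIRCULAR SYN 15-SEP-2011",
    "LOCUS pKM265 4536 bp DNA CIRCULAR SYN 21-JUN-2013",
    "LOCUS pUG6-tTA 5823 bp DNA CIRCULAR SYN 20-JAN-2010",
    "LOCUS pFA6a-13myc-natMX6 4469 bp DNA CIRCULAR SYN 02-JUL-2014",
]

for header in headers:
    print(parse_locus(header))
```

A LOCUS line without a topology field would shift the groups, so records from other sources would need their own check.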

manulera / ShareYourCloning_backend

Euroscarf plasmids do not follow GenBank format specification #197