Open dgruano opened 1 month ago
Hi @dgruano, I have come across things like this in the past. I think the easiest thing is to create a custom parser for the first line. For that, you would have to override _feed_first_line in GenBankScanner and then override parse in GenBankIterator:
from Bio.SeqIO.InsdcIO import GenBankIterator, GenBankScanner
import glob
import re


class MyGenBankScanner(GenBankScanner):
    def _feed_first_line(self, consumer, line):
        # All the things you may set
        # consumer.data_file_division
        # consumer.date
        # consumer.locus
        # consumer.molecule_type
        # consumer.residue_type
        # consumer.size
        # consumer.topology
        # A regex for LOCUS pKM265 4536 bp DNA circular SYN 21-JUN-2013
        m = re.match(r'LOCUS\s+(\S+)\s+(\d+ bp)\s+(\S+)\s+(\S+)\s+(\S+)', line)
        if m is None:
            # Fail with a readable message instead of an AttributeError on None
            raise ValueError(f"Unexpected LOCUS line: {line!r}")
        name, size, molecule_type, topology, _ = m.groups()
        consumer.locus(name)
        consumer.size(size[:-3])  # strip the trailing " bp"
        consumer.molecule_type(molecule_type)
        # Lowercase, since the Euroscarf headers use LINEAR/CIRCULAR
        consumer.topology(topology.lower())


class MyGenBankIterator(GenBankIterator):
    def parse(self, handle):
        """Start parsing the file, and return a SeqRecord generator."""
        records = MyGenBankScanner(debug=0).parse_records(handle)
        return records


files = glob.glob('plasmids/*.dna')
for f in files:
    print(f)
    it = MyGenBankIterator(f)
    try:
        next(it.parse(f))
    except Exception as e:
        print(e)
This passes the tests for the examples you put there. Not sure if it will work for all, but it looks like they are all like that.
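To check the first-line pattern without Biopython installed, here is a minimal standalone sketch. It assumes the headers look like the four examples quoted in the issue. `LOCUS_RE` and `parse_locus_line` are hypothetical names, not part of Biopython, and the extra `date` group is an addition to the regex above.

```python
import re

# Hypothetical helper, not a Biopython API: same idea as the regex above,
# but with named groups and the trailing date captured as well.
LOCUS_RE = re.compile(
    r"LOCUS\s+(?P<name>\S+)\s+(?P<size>\d+)\s+bp\s+(?P<molecule_type>\S+)"
    r"\s+(?P<topology>\S+)\s+(?P<division>\S+)\s+(?P<date>\S+)"
)


def parse_locus_line(line):
    """Split a Euroscarf-style LOCUS line into a dict of header fields."""
    m = LOCUS_RE.match(line)
    if m is None:
        raise ValueError(f"Unrecognised LOCUS line: {line!r}")
    fields = m.groupdict()
    fields["size"] = int(fields["size"])
    # Normalise LINEAR/CIRCULAR to lowercase
    fields["topology"] = fields["topology"].lower()
    return fields


# The example headers quoted in the issue
headers = [
    "LOCUS pCM189 8374 bp DNA CIRCULAR SYN 15-SEP-2011",
    "LOCUS pKM265 4536 bp DNA CIRCULAR SYN 21-JUN-2013",
    "LOCUS pUG6-tTA 5823 bp DNA CIRCULAR SYN 20-JAN-2010",
    "LOCUS pFA6a-13myc-natMX6 4469 bp DNA CIRCULAR SYN 02-JUL-2014",
]

for header in headers:
    print(parse_locus_line(header))
```

Because the pattern relies on whitespace-separated tokens rather than fixed columns, it would not cope with a header that omits one of the fields.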
Related to #175
tldr: importing Euroscarf plasmids throws parsing errors. We should decide how to deal with them (handle or pass) to extract header information and support Euroscarf plasmids.
I tried to load several Euroscarf plasmids from file and got parsing errors:
First:
This has an easy workaround, which is adding the uppercase LINEAR and CIRCULAR to the allowed topologies in Bio.GenBank.Scanner. Alternatively, we could "pre-parse" the first line of a gb file to modify the topology to lowercase.
However, after fixing this, I got a second error with some of them:
Apparently the gb format specification is a nightmare. Someone is working on a solution to return a warning instead of an error, so we should keep an eye on it.
Examples of plasmids that gave me these problems, with their respective headers:
LOCUS pCM189 8374 bp DNA CIRCULAR SYN 15-SEP-2011
LOCUS pKM265 4536 bp DNA CIRCULAR SYN 21-JUN-2013
LOCUS pUG6-tTA 5823 bp DNA CIRCULAR SYN 20-JAN-2010
LOCUS pFA6a-13myc-natMX6 4469 bp DNA CIRCULAR SYN 02-JUL-2014
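The "pre-parse" idea mentioned above (lowercasing the topology before the file reaches the parser) could be sketched roughly as follows. This is only an illustration: `normalise_locus_topology` is a hypothetical helper, it assumes a single-record file whose first line is the LOCUS line, and it addresses only the topology error, not the second one.

```python
import io
import re

# A whitespace-delimited LINEAR/CIRCULAR token, in any case.
_TOPOLOGY_RE = re.compile(r"(?<=\s)(linear|circular)(?=\s)", re.IGNORECASE)


def normalise_locus_topology(text):
    """Return `text` with the topology token on the LOCUS line lowercased."""
    first, sep, rest = text.partition("\n")
    if first.startswith("LOCUS"):
        first = _TOPOLOGY_RE.sub(lambda m: m.group(0).lower(), first)
    return first + sep + rest


raw = "LOCUS pCM189 8374 bp DNA CIRCULAR SYN 15-SEP-2011\nDEFINITION .\n"
fixed = normalise_locus_topology(raw)
print(fixed.splitlines()[0])
# prints: LOCUS pCM189 8374 bp DNA circular SYN 15-SEP-2011

# A file-like object that could then be handed to a GenBank parser
handle = io.StringIO(fixed)
```

Working on the text before parsing keeps the fix outside Biopython, at the cost of reading the whole file into memory first.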
I didn't ping the Biopython people because the problem seems to be Euroscarf not following the specification.
As a general question, should we try to fix these format errors in plasmids that we retrieve from somewhere else, or just ignore the problems and get the key information for SYC?