Closed GoogleCodeExporter closed 8 years ago
Jenny, I believe this is a bug in seqdb __repr__, and I've submitted a fix.
Could
you try
git://github.com/ctb/pygr.git
branch 'seqdb_review'?
(I don't have seqregion so I can't replicate your bug; could you remind me of
where
it is if you still have problems?)
Original comment by the.good...@gmail.com
on 24 Mar 2009 at 11:04
Original comment by the.good...@gmail.com
on 25 Mar 2009 at 1:00
Hi Titus,
I cloned your seqdb_review brach:
qing@1[seqdb_review]$ git clone git://github.com/ctb/pygr.git seqdb_review
Initialized empty Git repository in
/home/qing/workspace2009/seqdb_review/seqdb_review/.git/
remote: Counting objects: 4649, done.
remote: Compressing objects: 100% (1695/1695), done.
remote: Total 4649 (delta 3227), reused 4200 (delta 2929)
Receiving objects: 100% (4649/4649), 2.16 MiB | 319 KiB/s, done.
Resolving deltas: 100% (3227/3227), done.
Unfortunately, I still get the same error message:
qing@1[ensembl]$ export
PYTHONPATH='/home/qing/workspace2009/seqdb_review/seqdb_review/'
qing@1[ensembl]$ time python issue76.py
Traceback (most recent call last):
File "issue76.py", line 37, in <module>
orientation='seq_region_strand'))
File "/home/qing/workspace2009/seqdb_review/seqdb_review/pygr/annotation.py", line
137, in __init__
(repr(seqDB),))
KeyError: ' cannot create annotation object; sequence database {} may not be
correct'
Attached is the seqregion.py. Or, you can get it from my
git://github.com/jqian/pyensembl.git jenny_pyensembl_new branch.
Thanks,
Jenny
Original comment by jqian....@gmail.com
on 25 Mar 2009 at 7:04
Attachments:
This does not appear to be a Pygr bug -- the error message appears to be
correct. I
used the python debugger to inspect the cause of the error more closely (which
was
partly obscured by our error trapping):
>>> pdb.pm()
> /Users/leec/projects/pygr/pygr/annotation.py(137)__init__()
-> (repr(seqDB),))
(Pdb) self.get_annot_obj(k, self.sliceDB[k])
*** KeyError: 'unknown coordinate system 15'
The implication is that when you created the SeqRegion, you needed to include
coord
system 15 in the coordSystems dictionary argument.
I used the mysql client to verify this:
mysql> select * from seq_region where seq_region_id=225896;
+---------------+-----------+-----------------+--------+
| seq_region_id | name | coord_system_id | length |
+---------------+-----------+-----------------+--------+
| 225896 | NT_113964 | 15 | 204131 |
+---------------+-----------+-----------------+--------+
Original comment by cjlee...@gmail.com
on 25 Mar 2009 at 7:50
Sorry, Chris, but I don't think that's the problem.
The same script (attached) has no problem to create an annotation object of a
feature
table from an older human core database, the homo_sapiens_core_47_36i. Even
though
the seq_region table of homo_sapiens_core_47_36i also contains coord_system_id
15.
mysql> use homo_sapiens_core_47_36i;
mysql> select * from seq_region where seq_region_id=225896;
+---------------+-----------+-----------------+--------+
| seq_region_id | name | coord_system_id | length |
+---------------+-----------+-----------------+--------+
| 225896 | NT_113964 | 15 | 204131 |
+---------------+-----------+-----------------+--------+
1 row in set (0.16 sec)
Original comment by jqian....@gmail.com
on 26 Mar 2009 at 6:34
Attachments:
Hi Jenny,
just because the issue76.py script doesn't crash on the 36i dataset does not
alter
the analysis of *why* it fails on the original dataset you reported the bug for,
namely that the script fails to specify one of the coordinate systems (15)
required
for using that dataset. You should understand that the error message arose
only by
chance; AnnotationDB.__init__ takes the very first annotation row returned by
the
iterator and checks that it constructs successfully, so that users will get an
error
message immediately if they have given wrong information to the constructor.
If this
specific exon had not by chance been the first annotation row, this crash would
not
have occurred. So it's not too surprising that the same script doesn't
generate this
crash on a different dataset.
There is a simple fix for the crash that you reported: in your issue76.py
script, add
coord system 15 to the coordSystems argument passed to your SeqRegion
constructor.
However, there is no way I can make this fix as part of Pygr, because the
issue76.py
script is not part of Pygr; it is not even in the pygr git repository. In
fact, the
class that raises the initial error (ensembl.SeqRegion) is not part of Pygr,
either...
The error message reported by AnnotationDB in this case seems like correct
behavior.
Original comment by cjlee...@gmail.com
on 26 Mar 2009 at 7:58
Hi Chris,
I took your advice and added all the coord systems listed in the ensembl
coord_system
table to the coordSystems argument passed to my SeqRegion constructor (see
attached
issue76.py). I still got the same error message. I'd highly appreciate it if
you
could suggest possible causes of the problem.
Thanks,
Jenny
mysql> select * from coord_system;
+-----------------+------------+-------------+---------+------+-----------------
---------------+
| coord_system_id | species_id | name | version | rank | attrib
|
+-----------------+------------+-------------+---------+------+-----------------
---------------+
| 17 | 1 | chromosome | NCBI36 | 1 | default_version
|
| 15 | 1 | supercontig | NULL | 2 | default_version
|
| 4 | 1 | contig | NULL | 4 |
default_version,sequence_level |
| 11 | 1 | clone | NULL | 3 | default_version
|
| 101 | 1 | chromosome | NCBI35 | 101 |
|
+-----------------+------------+-------------+---------+------+-----------------
---------------+
5 rows in set (0.16 sec)
Original comment by jqian....@gmail.com
on 27 Mar 2009 at 6:42
Attachments:
Hi Jenny,
I'm sorry this is such a frustrating process. The reason you're still getting
this
error message is that the table you added as coord system 15 does *not* contain
this
sequence ID -- so pygr correctly gives you the same error messsage. Your
issue76.py
coordSystems does not actually provide the 5 coord systems required by ensembl,
but
only two. It treats coord systems 4, 15 and 11 as if they were identical in
ensembl,
which they are not (and similarly for coord systems 17 and 101). Presumably
ensembl
would not have created 4, 15, and 11 as separate coordinate systems if they were
actually identical.
FYI, here is how I analyzed this error message to see exactly what's going on:
leec$ python -i issue76.py
Traceback (most recent call last):
File "issue76.py", line 36, in <module>
orientation='seq_region_strand'))
File "/Users/leec/projects/pygr/pygr/annotation.py", line 137, in __init__
(repr(seqDB),))
KeyError: ' cannot create annotation object; sequence database {} may not be
correct'
>>> import pdb
>>> pdb.pm()
> /Users/leec/projects/pygr/pygr/annotation.py(137)__init__()
-> (repr(seqDB),))
(Pdb) k
292783L
(Pdb) self.sliceDB[k]
<pygr.classutil.TupleO_homo_sapiens_core_53_36o.exon object at 0x2db9030>
(Pdb) self.sliceDB[k].id
292783L
(Pdb) self.getSliceAttr(self.sliceDB[k], 'id')
225896L
(Pdb) self.seqDB[225896L]
*** KeyError: 'id 225896 non-existent or not unique'
(Pdb) self.seqDB.seqRegionDB[225896L]
<pygr.classutil.TupleO_homo_sapiens_core_53_36o.seq_region object at 0x10bfcd0>
(Pdb) self.seqDB.seqRegionDB[225896L].coord_system_id
15L
(Pdb) self.seqDB.prefixDict[15]
(Pdb) self.seqDB.prefixDict[15] is None
True
(Pdb) self.seqDB.coordSystems[15][225896L]
*** KeyError: 'id 225896 non-existent or not unique'
I then verified directly using the mysql client that this ID is indeed missing
from
the ensembl dna table:
mysql> use homo_sapiens_core_53_36o
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> desc dna;
+---------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------+-------+
| seq_region_id | int(10) unsigned | NO | PRI | 0 | |
| sequence | longtext | NO | | | |
+---------------+------------------+------+-----+---------+-------+
2 rows in set (0.17 sec)
mysql> select * from dna where seq_region_id=225896;
Empty set (0.17 sec)
It is unfortunate that ensembl has such a complicated set of coordinate systems
for
their annotations. However, to make your code work with their database, you'll
have
to provide coord system interfaces that work successfully with the coord
systems that
they mandate.
Original comment by cjlee...@gmail.com
on 27 Mar 2009 at 5:23
Original issue reported on code.google.com by
jqian....@gmail.com
on 24 Mar 2009 at 5:59Attachments: