ensembl.SeqRegion gives KeyError if missing coord system or bad seqID

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

Please see the attached script.

Error message:
qing@1[ensembl]$ python issue76.py
Traceback (most recent call last):
  File "issue76.py", line 37, in <module>
    orientation='seq_region_strand'))
  File "/home/qing/workspace2009/pyensembl/pygr/seqdb.py", line 507, in
__init__
    (repr(seqDB),))
KeyError: ' cannot create annotation object; sequence database {} may not
be correct'

Original issue reported on code.google.com by jqian....@gmail.com on 24 Mar 2009 at 5:59

Attachments:

issue76.py

GoogleCodeExporter commented 8 years ago

Jenny, I believe this is a bug in seqdb __repr__, and I've submitted a fix.  
Could
you try

git://github.com/ctb/pygr.git

branch 'seqdb_review'?

(I don't have seqregion so I can't replicate your bug; could you remind me of 
where
it is if you still have problems?)

Original comment by the.good...@gmail.com on 24 Mar 2009 at 11:04

GoogleCodeExporter commented 8 years ago

Original comment by the.good...@gmail.com on 25 Mar 2009 at 1:00

Added labels: ReviewBy-cjlee112

GoogleCodeExporter commented 8 years ago

Hi Titus,

I cloned your seqdb_review brach:

qing@1[seqdb_review]$ git clone git://github.com/ctb/pygr.git seqdb_review
Initialized empty Git repository in
/home/qing/workspace2009/seqdb_review/seqdb_review/.git/
remote: Counting objects: 4649, done.
remote: Compressing objects: 100% (1695/1695), done.
remote: Total 4649 (delta 3227), reused 4200 (delta 2929)
Receiving objects: 100% (4649/4649), 2.16 MiB | 319 KiB/s, done.
Resolving deltas: 100% (3227/3227), done.

Unfortunately, I still get the same error message:  

qing@1[ensembl]$ export 
PYTHONPATH='/home/qing/workspace2009/seqdb_review/seqdb_review/'

qing@1[ensembl]$ time python issue76.py
Traceback (most recent call last):
  File "issue76.py", line 37, in <module>
    orientation='seq_region_strand'))
  File "/home/qing/workspace2009/seqdb_review/seqdb_review/pygr/annotation.py", line
137, in __init__
    (repr(seqDB),))
KeyError: ' cannot create annotation object; sequence database {} may not be 
correct'

Attached is the seqregion.py.  Or, you can get it from my
git://github.com/jqian/pyensembl.git jenny_pyensembl_new branch.

Thanks,
Jenny

Original comment by jqian....@gmail.com on 25 Mar 2009 at 7:04

Attachments:

seqregion.py

GoogleCodeExporter commented 8 years ago

This does not appear to be a Pygr bug -- the error message appears to be 
correct.  I
used the python debugger to inspect the cause of the error more closely (which 
was
partly obscured by our error trapping):

>>> pdb.pm()
> /Users/leec/projects/pygr/pygr/annotation.py(137)__init__()
-> (repr(seqDB),))
(Pdb) self.get_annot_obj(k, self.sliceDB[k])
*** KeyError: 'unknown coordinate system 15'

The implication is that when you created the SeqRegion, you needed to include 
coord
system 15 in the coordSystems dictionary argument.

I used the mysql client to verify this:

mysql> select * from seq_region where seq_region_id=225896;
+---------------+-----------+-----------------+--------+
| seq_region_id | name      | coord_system_id | length |
+---------------+-----------+-----------------+--------+
|        225896 | NT_113964 |              15 | 204131 | 
+---------------+-----------+-----------------+--------+

Original comment by cjlee...@gmail.com on 25 Mar 2009 at 7:50

Changed state: Invalid
Added labels: Milestone-Release0.8

GoogleCodeExporter commented 8 years ago

Sorry, Chris, but I don't think that's the problem.  

The same script (attached) has no problem to create an annotation object of a 
feature
table from an older human core database, the homo_sapiens_core_47_36i.  Even 
though
the seq_region table of homo_sapiens_core_47_36i also contains coord_system_id 
15.

mysql> use homo_sapiens_core_47_36i;
mysql> select * from seq_region where seq_region_id=225896;
+---------------+-----------+-----------------+--------+
| seq_region_id | name      | coord_system_id | length |
+---------------+-----------+-----------------+--------+
|        225896 | NT_113964 |              15 | 204131 |
+---------------+-----------+-----------------+--------+
1 row in set (0.16 sec)

Original comment by jqian....@gmail.com on 26 Mar 2009 at 6:34

Changed state: Assigned

Attachments:

issue76_homo_sapiens_47_36i.py

GoogleCodeExporter commented 8 years ago

Hi Jenny,
just because the issue76.py script doesn't crash on the 36i dataset does not 
alter
the analysis of *why* it fails on the original dataset you reported the bug for,
namely that the script fails to specify one of the coordinate systems (15) 
required
for using that dataset.  You should understand that the error message arose 
only by
chance; AnnotationDB.__init__ takes the very first annotation row returned by 
the
iterator and checks that it constructs successfully, so that users will get an 
error
message immediately if they have given wrong information to the constructor.  
If this
specific exon had not by chance been the first annotation row, this crash would 
not
have occurred.  So it's not too surprising that the same script doesn't 
generate this
crash on a different dataset.

There is a simple fix for the crash that you reported: in your issue76.py 
script, add
coord system 15 to the coordSystems argument passed to your SeqRegion 
constructor. 
However, there is no way I can make this fix as part of Pygr, because the 
issue76.py
script is not part of Pygr; it is not even in the pygr git repository.  In 
fact, the
class that raises the initial error (ensembl.SeqRegion) is not part of Pygr, 
either...

The error message reported by AnnotationDB in this case seems like correct 
behavior.

Original comment by cjlee...@gmail.com on 26 Mar 2009 at 7:58

Changed title: ensembl.SeqRegion gives KeyError if missing coord system
Changed state: Invalid
Added labels: ReviewBy-marecki
Removed labels: ReviewBy-cjlee112

GoogleCodeExporter commented 8 years ago

Hi Chris,

I took your advice and added all the coord systems listed in the ensembl 
coord_system
table to the coordSystems argument passed to my SeqRegion constructor (see 
attached
issue76.py).  I still got the same error message.  I'd highly appreciate it if 
you
could suggest possible causes of the problem.

Thanks,
Jenny

mysql> select * from coord_system;
+-----------------+------------+-------------+---------+------+-----------------
---------------+
| coord_system_id | species_id | name        | version | rank | attrib          

          |
+-----------------+------------+-------------+---------+------+-----------------
---------------+
|              17 |          1 | chromosome  | NCBI36  |    1 | default_version 

          |
|              15 |          1 | supercontig | NULL    |    2 | default_version 

          |
|               4 |          1 | contig      | NULL    |    4 |
default_version,sequence_level |
|              11 |          1 | clone       | NULL    |    3 | default_version 

          |
|             101 |          1 | chromosome  | NCBI35  |  101 |                 

          |
+-----------------+------------+-------------+---------+------+-----------------
---------------+
5 rows in set (0.16 sec)

Original comment by jqian....@gmail.com on 27 Mar 2009 at 6:42

Attachments:

issue76.py

GoogleCodeExporter commented 8 years ago

Hi Jenny,
I'm sorry this is such a frustrating process.  The reason you're still getting 
this
error message is that the table you added as coord system 15 does *not* contain 
this
sequence ID -- so pygr correctly gives you the same error messsage.  Your 
issue76.py
coordSystems does not actually provide the 5 coord systems required by ensembl, 
but
only two.  It treats coord systems 4, 15 and 11 as if they were identical in 
ensembl,
which they are not (and similarly for coord systems 17 and 101).  Presumably 
ensembl
would not have created 4, 15, and 11 as separate coordinate systems if they were
actually identical.

FYI, here is how I analyzed this error message to see exactly what's going on:
leec$ python -i issue76.py 
Traceback (most recent call last):
  File "issue76.py", line 36, in <module>
    orientation='seq_region_strand'))
  File "/Users/leec/projects/pygr/pygr/annotation.py", line 137, in __init__
    (repr(seqDB),))
KeyError: ' cannot create annotation object; sequence database {} may not be 
correct'
>>> import pdb
>>> pdb.pm()
> /Users/leec/projects/pygr/pygr/annotation.py(137)__init__()
-> (repr(seqDB),))
(Pdb) k
292783L
(Pdb) self.sliceDB[k]
<pygr.classutil.TupleO_homo_sapiens_core_53_36o.exon object at 0x2db9030>
(Pdb) self.sliceDB[k].id
292783L
(Pdb) self.getSliceAttr(self.sliceDB[k], 'id')
225896L
(Pdb) self.seqDB[225896L]
*** KeyError: 'id 225896 non-existent or not unique'
(Pdb) self.seqDB.seqRegionDB[225896L]
<pygr.classutil.TupleO_homo_sapiens_core_53_36o.seq_region object at 0x10bfcd0>
(Pdb) self.seqDB.seqRegionDB[225896L].coord_system_id
15L
(Pdb) self.seqDB.prefixDict[15]
(Pdb) self.seqDB.prefixDict[15] is None
True
(Pdb) self.seqDB.coordSystems[15][225896L]
*** KeyError: 'id 225896 non-existent or not unique'

I then verified directly using the mysql client that this ID is indeed missing 
from
the ensembl dna table:
mysql> use homo_sapiens_core_53_36o
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> desc dna;
+---------------+------------------+------+-----+---------+-------+
| Field         | Type             | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------+-------+
| seq_region_id | int(10) unsigned | NO   | PRI | 0       |       | 
| sequence      | longtext         | NO   |     |         |       | 
+---------------+------------------+------+-----+---------+-------+
2 rows in set (0.17 sec)

mysql> select * from dna where seq_region_id=225896;
Empty set (0.17 sec)

It is unfortunate that ensembl has such a complicated set of coordinate systems 
for
their annotations.  However, to make your code work with their database, you'll 
have
to provide coord system interfaces that work successfully with the coord 
systems that
they mandate.

Original comment by cjlee...@gmail.com on 27 Mar 2009 at 5:23

Changed title: ensembl.SeqRegion gives KeyError if missing coord system or bad seqID

ofanoyi / pygr

ensembl.SeqRegion gives KeyError if missing coord system or bad seqID #76