inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License
625 stars 292 forks source link

dbquery: run_sql() should optionally return list-of-dictionaries #830

Closed tiborsimko closed 10 years ago

tiborsimko commented 10 years ago

Originally

0) Prelude and motivation. run_sql() currently returns tuple-of-tuples:

In [21]: res = run_sql("SELECT id,name,dbquery FROM collection WHERE id<3");

In [22]: res
Out[22]:
((1, 'Atlantis Institute of Fictive Science', None),
 (2, 'Preprints', 'collection:PREPRINT'))

The returned values are usually transformed intonamed symbols in the business logic, as it would not be very readable to work withposition of entities in the resulting tuple representing rows.

One can transform tuple-of-tuples into list-of-dictionaries from the get go, via:

In [23]: resd = [dict(id=row[0],name=row[1],dbquery=row[2]) for row in run_sql("SELECT id,name,dbquery FROM collection WHERE id<3")]

In [24]: resd
Out[24]:
[{'dbquery': None, 'id': 1, 'name': 'Atlantis Institute of Fictive Science'},
 {'dbquery': 'collection:PREPRINT', 'id': 2, 'name': 'Preprints'}]

This format of SQL results is especially nice for the forthcoming migration to Jinja templates, because one will then be able to use symbols in templates easily, see collection.name and friends in the following example:

#!python
from jinja2 import Template
from invenio.dbquery import run_sql

tmpl_collection_table = Template("""\
<table border="1">
  <thead>
    <tr>
      <th>ID</th>
      <th>Name</th>
      <th>Query Definition</th>
    </tr>
  </thead>
  <tbody>
  {% for collection in collections %}
    <tr>
      <td>{{ collection.id }}</td>
      <td>{{ collection.name }}</td>
      <td>{{ collection.dbquery }}</td>
    </tr>
  {% endfor %}
  </tbody>
</table>
""")

collections = [dict(id=row[0],name=row[1],dbquery=row[2]) \
               for row in run_sql("SELECT id,name,dbquery FROM collection")]

print tmpl_collection_table.render(collections=collections)

It is useful to generalise this technique of using SQL results throughout Invenio, but the dict() boilerplate code should then be eliminated.

-1)* The goal of this ticket is therefore to enrich run_sql() with a new option called say with_dict=True that would return directly list-of-dictionaries instead of tuple-of-tuples. This will enable programmers to write elegantly:

#!python
collections = run_sql("SELECT id,name,dbquery FROM collection", with_dict=True)

instead of currently convoluted:

#!python
collections = [dict(id=row[0],name=row[1],dbquery=row[2]) \
               for row in run_sql("SELECT id,name,dbquery FROM collection")]

The with_dict option would be set to False by default, for backwards compatibility, and for use cases when speed difference may be important. Otherwise, once introduced, the majority of run_sql() callers should probably switch to using it, for better code readability.

Note that the new option with_dict would behave somewhat similarly to how current with_desc option behaves, but it would return more directly exploitable results.

-2)* Beware of SQL queries like:

SELECT DATE_FORMAT(creation_date, '%%Y') FROM bibrec

or:

SELECT id+nbrecs,name FROM collection

or (even though the following technique is bad style):

SELECT * FROM collection

when constructing names of keys of the resulting dictionaries representing rows.

-3)* Extensive regression test cases covering the above examples should be naturally added.

-P.S.* We can also inspire ourselves from how tornado.database DB wrapper behaves in this respect, see [[wiki:Talk/WebFrameworks#a4.6.Tornado]].

invenio-developers commented 10 years ago

Originally by Raquel Jimenez Encinar raquel.jimenez.encinar@cern.ch on 2011-12-08

In [95e3d250ee142a73ac64c1be66a7222d330206f3]:

#CommitTicketReference repository="" revision="95e3d250ee142a73ac64c1be66a7222d330206f3"
dbquery: add option with_dict to run_sql()

- Enriches run_sql() with a new option called
  with_dict that returns directly list-of-dictionaries
  instead of tuple-of-tuples. (closes #830)
* Adds 4 regression test cases.
* Changes with_desc parameter type from integer to boolean
  in run_sql and run_sql_with_limit in order to
  keep coherence with the with_dict parameter.
* Updates miscutil-dbquery.webdoc.
badzil commented 10 years ago

Originally on 2011-12-22

MySQLdb has another type of cursor named DictCursor that returns dictionaries, thus removing the additional step of building the dictionary. It might be worth looking into it as it shows a significant speed difference.

In [10]: %time res = test_with_dictcursor("SELECT * FROM bibrec_bib03x LIMIT 1000000")
CPU times: user 4.53 s, sys: 0.02 s, total: 4.56 s
Wall time: 5.00 s

In [11]: %time res = invenio.dbquery.run_sql("SELECT * FROM bibrec_bib03x LIMIT 1000000", with_dict=True)
CPU times: user 10.34 s, sys: 0.14 s, total: 10.48 s
Wall time: 10.90 s

See this blog post for an example.