Getting OID of the table from which the given column was fetched

xpuu commented 6 years ago

Our team recently switched from PHP to Python3. In PHP we were in some cases using function pg_field_table, which basically calls to PQftable returning the OID of the parent table (or 0 for formula/function).

We think, that getting this kind of information might be useful for other developers too. Because it's already present in libpq result attributes anyway, it's also cheap to obtain. The easiest and most logical (for us) seemed to add table OID as a new item to cursor.description tuple. We made a few lines patch and succesfully tested it in real life.

Do you think, that providing table OID (in whatever way you decide is correct), could make it's way to official release? It'd makes things much easier for us. And if so, how can we help to let this happen?

fogzot commented 6 years ago

cursor.description is well defined in PEP 249 and I don't think it is a good idea to extend it: if some future PEP adds elements/attributes then we'll have to make backward incompatible changes to psycopg to maintain PEP compatibility.

I can think of two ways to add such kind of information:

1) create a new cursor.extendedDescription attribute that holds a tuple of psycopg-specific column descriptors. 2) add a new method cursor.describe(column) that returns an extended description for the column. The description can include data from cursor.description to make all data accessible from a single place.

I better like (2) because being an explicit call it will allow for future expansion like adding data that requires a roundtrip to the backend without any extra overhead if not called.

If you want to send a patch remember to also include tests and documentation. :)

xpuu commented 6 years ago

Makes sense, molte grazie. We'll go for second option. Describe method will return a tuple (or namedtuple if available) containing (PQftable, PQftablecol) - parent table OID and column number within parent table. Is that ok?

fogzot commented 6 years ago

I think it is better if the content of the tuple is an object , not another tuple. Explicit naming is better that positional. Something along the lines of:

({'table_oid': 234, 'table_column': 1}, {'table_oid': 234, 'table_column': 3, ...)

xpuu commented 6 years ago

Well, this is "describe" for the whole resultset. But in previous round, we've agreed (to keep things simple) that describe method will return info just for one specific column. Does it mean, that result for one column should be dict too? My boss would be pleased, he doesn't like tuples at all. On the other hand dicts can't be unpacked. Some says a named tuple should be used in this case. What's your preference?

fogzot commented 6 years ago

Yes, sorry - I was still thinking about the cursor.description format. I don't like simple tuples either because their content is non discoverable: you need to know that element at index 0 is the oid, and even if your editor/IDE is smart it has no way to suggest anything useful (except for showing you the documentation). A named tuple is fino, IMHO.

dvarrazzo commented 6 years ago

Oh, I thought I had added something to this conversation yesterday. I lost it somehow :\ ok...

The conversation is getting confused, let's make some order :)

cursor.description is a sequence - I seem to remember a list, and this is fine.
its items are mandated by the DBAPI to be 7-items sequences
psycopg implements the latter with namedtuples with names copied from the specs name type_code display_size internal_size precision scale null_ok, which still fit the specs requirement because they are also sequences (item[0] == item.name, although the latter is easier to use)
previously we used to use simple tuples because we supported Python versions that didn't have namedtuples and this is no more the case.

@xpuu: the info you want to add come from functions such as PQftable, PQftablecol and it's worth taking a look if there is any other useful in these family. They are per column, so they should be "added to the tuple" as you suggested initally. However, we cannot extend the tuple to be >7 elements because it violates the DBAPI.

What we should do is:

forget the tuples
forget the namedtuples :)
we make an object ColumnDescription which:
- inherits from the namedtuple-7 as defined above (either in practice or it just implements the same interface)
- has extra attributed populated from the PQf* function we notice we can support.

this way eventual code unpacking the description with idioms like for (name, oid, _, _, _, _, _) in cur.description: ... will not break, item[0], item.name will work as before, plus you can have item.table not accessible as an item but nobody cares.

Notice that inheriting from a nameduple is not difficult in Python itself:

class ColumnDescription(namedtuple('ColumnDescription', 'name type_code display_size internal_size precision scale null_ok')):
    def __init__(self, *args):
        self.table = None

cd = ColumnDescription(*range(7))
cd.table = 'asdf'

>>> cd
ColumnDescription(name=0, type_code=1, display_size=2, internal_size=3, precision=4, scale=5, null_ok=6)

>>> cd.name
0

>>> cd.table
'asdf'

never done it in C but it should be doable (also the above Python representation is not necessarily right because that object is immutable so it should actually implement __new__ not __init__ but can't looks for the docs right at the moment...

fogzot commented 6 years ago

@dvarrazzo very good summary but I don't agree with your proposed solution because:

1) it extends an API that has been fixed in stone by the PEP and even if it will probably never change the most probable change would be along the lines of "please, use named tuples to maintain backward compatibility and add the following attributes..." and that would break our ColumnDescription. 2) if we decide to include any kind of information that requires a network trip we will need to add yet another method/function.

That's why I proposed a new method cursor.describe(column_index) that returns a named tuple with the information you have in ColumnDescription above.

dvarrazzo commented 6 years ago

@fogzot about newer versions on the api, I don't hold my breath. The api is firmly rooted in the '90s, it doesn't mention 1: unicode support, 2: python 3 changes, 3: asynchronous communications, and the last attempts of discussions on the dbsig ml orbited around... placeholders. All in all I don't see them dictating to add something mandatory as implement description items as namedtuple; at most they would demand an objects with these attributes which could be implemented as a namedtuple or not, and all that can be asked to maintain backward compatibility is for the objects to be also a 7-items sequence. Making it a namedtuple was an extension of ours from several years ago; other adapters may have used lists, or their own objects implementing the sequence protocol etc.

About your point 2: these functions in my understanding don't take a network trip. If they did I would prefer too to have them exposed as functions instead of smuggled as attributes. But in my understanding the attribute requested are on the result objects already on the client side. For instance, we already use PQfname (for the name of the description item) but we don't use PQftable. Time to make a list! From libpq 10 docs:

PQntuples(PGresult *), PQnfields(PGresult *): used to establish number of rows and columns in the result
PQfname(PGresult *, int): we already expose it as cur.description's name
PQfnumber(PGresult *, char *): function mapping column names to field index. not useful for us
PQftable(PGresult *, int): oid of the table returning the column, if a simple reference. We can expose this
PQfformat(PGresult *, int): informs if the columns it textual (0) or binary (1). psycopg only asks for textual columns
PQftype(PGresult *, int): we already expose it for cur.description's type_code
PQfmod(PGresult *, int): we use it to calculate internal_size, precision, scale but we don't expose it directly.
PQfsize(PGresult *, int): we use it to calculate internal_size, but we don't expose it directly (it is an internal representation anyway).

...other functions work on the result records, or on prepared statements, so I guess the buck stops here.

I'd say the results of all the functions with interface (PGresult *, int) are good candidates to be exposed as attributes of ColumnDescription. name and type_code we already do it. We can expose table_oid and table_col and instruct the users about how to query pg_class and pg_attribute to retrieve the names for them. For completeness and for people to deal with their own data we may expose PQfmod and PQfsize. So we have the current state:

name or [0]-> PQfname
type_code or [1] -> PQftype
display_size or [2] -> calculated from results
internal_size or [3] -> calculated from PQfmod, PQfsize
precision or [4] -> calculated from PQfmod only for decimal
scale or [5] -> calculated from PQfmod only for decimal
null_ok or [6] -> always None

to which we can add

table_oid -> PQtable (maps into pg_class.oid, pg_attribute.attrelid)
table_col -> PQtablecol (maps into pg_attribute.attnum)
modifier -> PQfmod
size -> PQfsize (although it's more an internal size of internal_size...)
format -> PQfformat (long shot with this becoming != 0)

Adding these attribute has no overhead: the functions only use the client-side result. They are ordered for usefulness (for me) and we may implement all the 5 or only some of them.

dvarrazzo commented 5 years ago

Added table_oid, table_column attributes. I don't think others are necessary, but let me know if anyone thinks otherwise.

adityatoshniwal commented 5 years ago

Hey @dvarrazzo,

Few of our users get the table_oid as negative value. It mostly happens with large DB's with lot many tables and columns. Doing a SELECT (-123434)::oid returns correct result, where -123434 is the table_oid returned. I'm sorry I am not able to share the simulation data, but could you please have go through on the changes again and see if we are using signed int for OIDs may be ?

Thanks for the great work :)

dvarrazzo commented 5 years ago

@adityatoshniwal does it mean that if you select 'my_table_name'::regclass::oid you get returned "4294843862"?

dvarrazzo commented 5 years ago

@adityatoshniwal you are right: the Oid type is unsigned. Opened #961

psycopg / psycopg2

Getting OID of the table from which the given column was fetched #661