Open rizar opened 4 years ago
Thanks for the question! The primary keys can be multi-part, so in that case the key is actually (mountain_name
, state_name
) which is necessary to handle cases where mountains with the same name appear in multiple states (or a single mountain crosses multiple states). This doesn't actually come up in the data for mountains, but does for rivers (which have a very similar schema).
Wow, that was a quick response, thanks!
Is this also the reason why primary key constraints are not declared in https://github.com/jkkummerfeld/text2sql-data/blob/master/data/geography-db.sql ?
Lucky timing with when I took a break to check email :)
On not being declared, that's an interesting point. I'm not sure what the MySQL default behaviour is (use all fields?) but it certainly would be better to have the DB file and our text schema match. I'll label this issue as an enhancement and keep it open.
Following up on this for my own future reference.
Suhr et al.'s work also explored improvements to the schemas, see the data at https://github.com/google-research/language/tree/master/language/xsp
There is some subtlety to the process though, as raised in this issue: https://github.com/google-research/language/issues/80
Great to see that you want to get things right. Let me also share some extra analysis here.
In principle, the purpose of primary and foreign key constraints is to inform database software and allow it perform consistency checks. But in the context of Lang2SQL translation they start playing a new role. Especially when we consider 0-shot evaluation on a unseen database, foreign key links let the model guess how to join tables in the query. There is typically close to no information on how tables should be joined in the question. The column names sometimes give a hint in the right direction, but sometimes the column names are not informative (check out e.g. aid
, pid
, sid
, msid
columns in IMDB and YELP schemas).
So it is important to get the foreign key links right in the database in order for Lang2SQL on a unseen database to be feasible. But when the database schema is not fully normalized, this can be non-trivial. For example, in the IMDB database the msid
columns appear to refer to either a movie id (Movie.mid
) or a TV series id (TV_Series.sid
). In principle, one can not impose any foreign key constraint here, because an msid
does not have to be neither an mid
nor an msid
, though it must appear in one of said columns. But if you do not feed a foreign key link to your model in one way or another, you can not expect it to work on this database, unless you add some IMDB-specific training data.
The few options available in such situation are as follows:
As far as I understand, Suhr et al chose the option of adding "virtual" foreign key links (see e.g. the tuple ('made_by', 'movie', 'msid', 'mid')
here). They are also direction-less (as confirmed in https://github.com/google-research/language/issues/80). My understanding is that these virtual foreign keys are effectively cues for the heuristic algorithms that fill the from clause based on the model's predictions. In our current project we are also trying to do 0-shot generalization on the datasets that you are kindly curating, and we also decided to add a minimal number of virtual foreign keys to keep the schema graph connected.
So far I have performed a detailed audit of GeoQuery, Restaurants, IMDB, Yelp and Academic. Yelp and Academic are perfectly normalized, and legit foreign key constraints are easy to define (although it would be nice in they were defined in the MySQL dump). I have described the situation with IMDB above. As for GeoQuery and Restaurants, even though these databases are not normalized at all, it is quite clear what foreign key links are sensible (both as the constraints on DB content and as hints to Lang2SQL models).
Interesting, and tricky! You may want to contact the authors of https://arxiv.org/abs/2010.02840 as I spoke to them and they had to do some similar work (in their case, adapting the schemas to match the format of Spider).
Also, I should note, I don't expect either Cathy or myself to have the time to make this update in the near future (6 months at least). She is busy with her work at IBM and I am on the job market this cycle. We'd be happy to review and accept a pull request though or will hopefully get to it down the line.
It appears to me that primary key information in geography-schema.csv is wrong. For example, how can
state_name
be the primary key in themountain
table?