iucogs / pubsplay

Reimplementation of the Pubs Citations Tracker using Scala and the Play framework
0 stars 0 forks source link

The Author Sorting Problem #6

Open JaimieMurdock opened 10 years ago

JaimieMurdock commented 10 years ago

The Author Sorting Problem

The Schema

citation_authors citation_id INT author_id INT rank TINYINT

authors author_id INT firstname VARCHAR lastname VARCHAR ...

The Abstract Query

SELECT *
FROM citation
ORDER BY citation.year DESC, authors ASC

What does ORDER BY authors ASC mean?

ORDER BY author.rank, author.lastname, author.firstname ASC

PROBLEM: must be recursive, gives an order of citation objects, not author objects SOLUTION: something with GROUP BY authors_citation.cit_id???

PROBLEM: objects may have asymmetrical number of ranks. if same order does not exist, citation with fewer authors is ranked above one with more, if all authors for paper with fewer authors shares all authors SOLUTION: ???

Possible Solutions

juliusakula commented 10 years ago

I don't quite understand the schema, but you should consider adding this to the question: http://www.sqlfiddle.com/#!2/b582e . SQLFiddle is very cool.

JaimieMurdock commented 10 years ago

Thanks for the tip. I've got a clearer schema with a sketch at the query: http://www.sqlfiddle.com/#!2/014ad9/1

JaimieMurdock commented 10 years ago

Getting close using GROUP_CONCAT(): http://www.sqlfiddle.com/#!2/014ad9/10

JaimieMurdock commented 10 years ago

and we have a winner: http://www.sqlfiddle.com/#!2/014ad9/11

Working on Indexes for author table now.

JaimieMurdock commented 10 years ago

And a final solution : http://www.sqlfiddle.com/#!2/1bf6fd/1

How do we implement this?

pjcraig commented 10 years ago

Worst case, we can use raw SQL with slick. I'm not sure at this point slick can map Scala onto that sophisticated of a query, though I'll look into it.

Can you give us a SQL script to update the DB schema to reflect this new model?

JaimieMurdock commented 10 years ago

Here is the final version of the SQLFiddle playground: http://www.sqlfiddle.com/#!2/0e562/1

I updated the database by adding an index to the authors table in the pubs database. I did not run this on pubscopy or any others:

ALTER TABLE authors ADD INDEX `last_first` (lastname, firstname);

The sort query runs in 1.37 seconds over the entire database now:

SELECT citations.citation_id, 
       GROUP_CONCAT(authors.lastname, ', ', authors.firstname
                    ORDER BY author_of.position_num, authors.lastname, authors.firstname
                    SEPARATOR ", ") AS author_string,
       citations.year
FROM citations
INNER JOIN author_of ON citations.citation_id = author_of.citation_id
INNER JOIN authors ON author_of.author_id = authors.author_id
GROUP BY author_of.citation_id
ORDER BY author_string, year;