rahuldave / semflow

Semantic Pipeline for ADS
Multiple authors #2

Open DougBurke opened 13 years ago

DougBurke commented 13 years ago

Just so that we don't forget from this morning's meeting.

I have just loaded in Chandra and HUT (see issue #1) and notice that I see the "multiple occurrence of an author" issue (using rdf2solr5.py). For 2002ApJ...573..157N the author list is reported to be:

Drake, J ; Fiore, F ; Fruscione, A ; Mathur, S ; Drake, J ; Elvis, M ; Bianchi, S ; Fiore, F ; Nicastro, F ; Elvis, M ; Elvis, M ; Marengo, M ; Fruscione, A ; Drake, J ; Marengo, M ; Nicastro, F ; Mathur, S ; Marengo, M ; Elvis, M ; Nicastro, F ; Drake, J ; Elvis, M ; Mathur, S ; Mathur, S ; Bianchi, S ; Drake, J ; Zezas, A ; Fiore, F ; Zezas, A ; Bianchi, S ; Zezas, A ; Zezas, A ; Marengo, M ; Marengo, M ; Nicastro, F ; Fiore, F ; Fiore, F ; Mathur, S ; Bianchi, S ; Fruscione, A ; Nicastro, F ; Fruscione, A ; Fruscione, A ; Zezas, A ; Bianchi, S

where you can see that we get many - not two - copies of the same name.

kayebohemier commented 13 years ago

I'll link to this in the meeting notes gDoc.

DougBurke commented 13 years ago

It appears to be related to a paper occurring in multiple missions. I have EUVE, FUSE, HPOL, HUT, WUPPE and IUE loaded (loaded in reverse order to this list).

A paper which is FUSE only shows no author duplication.

A paper which is EUVE only shows no author duplication.

A paper which is in both EUVE and FUSE shows the author duplication.

DougBurke commented 13 years ago

Looking at the RDF I can see why this is happening: when an author is created for the paper, the URI has a uuid appended to it, e.g. http://ads.harvard.edu/sem/agents/PersonName/Elvis%2C_M/ccf57a95-70c3-4bdd-885c-e42f3e226b26. Since this uuid is different each time the paper is added to the store (once per mission it is associated with), you get multiple copies of the same author. There are similar issues with hasAbstract and hasAggregation: you get multiple versions of them too, but in that case it is because they point to blank nodes, so the store cannot identify them as the same thing.
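
To make the mechanism concrete, here is a minimal sketch in plain Python (no RDF library). The set of tuples stands in for the triple store, and `random_author_uri`, `stable_author_uri` and `load_paper` are hypothetical names for illustration, not functions from this repository. The deterministic variant (a `uuid5` derived from bibcode plus author name) is only one possible way to get a stable URI, not necessarily how it should be fixed here.

```python
import uuid
from urllib.parse import quote

BASE = "http://ads.harvard.edu/sem/agents/PersonName/"


def _name_part(name):
    # "Elvis, M" -> "Elvis%2C_M", matching the URI shape quoted above.
    return quote(name.replace(" ", "_"), safe="_")


def random_author_uri(bibcode, name):
    # Mimics the behaviour described above: a fresh uuid on every call,
    # so the same author of the same paper gets a new URI each time.
    return BASE + _name_part(name) + "/" + str(uuid.uuid4())


def stable_author_uri(bibcode, name):
    # Assumption: derive the suffix from (bibcode, name), so reloading the
    # same paper always yields the same URI.
    suffix = uuid.uuid5(uuid.NAMESPACE_URL, bibcode + "|" + name)
    return BASE + _name_part(name) + "/" + str(suffix)


def load_paper(store, bibcode, authors, make_uri):
    # The "store" is just a set of (subject, predicate, object) tuples;
    # identical triples collapse, different URIs do not.
    for name in authors:
        store.add((bibcode, "authoredBy", make_uri(bibcode, name)))


if __name__ == "__main__":
    authors = ["Elvis, M", "Drake, J"]
    for make_uri in (random_author_uri, stable_author_uri):
        store = set()
        for _ in range(2):  # the paper is loaded once per mission
            load_paper(store, "2002ApJ...573..157N", authors, make_uri)
        print(make_uri.__name__, len(store))
```

With the random suffix the store ends up with one `authoredBy` triple per author per load; with the stable suffix the repeated loads collapse to one triple per author.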

DougBurke commented 13 years ago

The changeset above removes repeated authors, but it should be considered something of a hack: the real fix is to avoid adding multiple authoredBy statements for the same author to the RDF store in the first place.

EDIT: The commit has now been merged, but I am leaving this open since it should be fixed upstream.
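
For reference, the general shape of this kind of workaround is an order-preserving de-duplication of the author names before they are written out. The sketch below is not the changeset itself; `unique_authors` is a hypothetical helper and it assumes the author list is available as a plain list of name strings.

```python
def unique_authors(names):
    # dict.fromkeys keeps the first occurrence of each name and preserves
    # insertion order (guaranteed for dicts since Python 3.7).
    return list(dict.fromkeys(names))


if __name__ == "__main__":
    # First few entries of the list reported for 2002ApJ...573..157N.
    reported = ["Drake, J", "Fiore, F", "Fruscione, A",
                "Mathur, S", "Drake, J", "Elvis, M"]
    print(" ; ".join(unique_authors(reported)))
```

Note that this only hides the symptom, and the resulting order is the order in which the store happened to return the names, not necessarily the paper's author order.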