rahuldave / appsem

js server and ui for semflow

percent encoding and Unicode characters and HTML oddities #4

Open DougBurke opened 13 years ago

DougBurke commented 13 years ago

EDIT: converting this to a general-purpose issue for oddities in the output.

These are likely two different issues but mentioned here as they look the same to users. It's also a known issue but I wanted it recorded.

a) some facet values contain percent-encoded values - e.g.

%20 is space, %3B is ; and %2F is /.

b) unicode

Things like an acute a (á) being displayed as a, and the 'undisplayable' Unicode symbol appearing in place of the fancy double-quote character, the Angstrom sign, and similar characters.
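
For reference, a minimal Python sketch of both symptoms. It assumes the facet values are percent-encoded UTF-8 and that the Unicode problems come from text being decoded with the wrong codec; neither is confirmed above. `decode_facet` is a hypothetical helper, not a function from semflow or appsem.

```python
from urllib.parse import unquote


def decode_facet(value):
    """Percent-decode a facet value (hypothetical helper; assumes UTF-8)."""
    return unquote(value, encoding="utf-8", errors="replace")


# (a) percent-encoded characters: %20 is space, %3B is ; and %2F is /
print(decode_facet("a%20b%3Bc%2Fd"))

# (b) two ways a wrong codec can damage text
# UTF-8 bytes read as Latin-1: one character turns into two wrong ones
mojibake = "\u00e1".encode("utf-8").decode("latin-1")
# Latin-1 bytes read as UTF-8: the invalid byte becomes U+FFFD
replaced = "5 \u00c5".encode("latin-1").decode("utf-8", errors="replace")
print(repr(mojibake), repr(replaced))
```

U+FFFD is the 'undisplayable' symbol, so case (b) covers both kinds of damage a user might see.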

DougBurke commented 13 years ago

These all appear to be due to issues in semflow/rdf2solr5.py rather than in appsem. I have made changes that should fix both issues - e.g. this change, which builds on the preceding commits.

Will close when the changes get merged into main.

DougBurke commented 13 years ago

My testing on Linux looks good.

DougBurke commented 13 years ago

Still need to be vigilant. With the new "display the title of a saved paper" functionality we see the same problem:

search for 2001A&A...380..251G - e.g. http://localhost:3000/semantic2/alpha/explorer/publications#fq=bibcode%3A2010A%26A...514A..64P&q=_%3A_

title="Physical and morphological properties of z ~ 3 Lyman break galaxies: dependence on Lyα line emission"

If this is saved and 'saved search' is then selected, you can see that the \alpha is not displayed correctly.

DougBurke commented 13 years ago

The 'saved paper title' issue should be fixed by

https://github.com/DougBurke/appsem/commit/ad98e0f609dac8070a57020f32f0fcd8765e466e

I note there are several places where the charset is not explicitly set to UTF-8 in server.js so leaving this bug open as a reminder.
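
To illustrate why the explicit charset matters, here is a Python sketch (server.js itself is JavaScript, so this shows only the principle, not the actual change). `utf8_response` and `charset_of` are hypothetical helpers, and the Latin-1 fallback is an assumption about how a client might guess when no charset is declared.

```python
from email.message import Message


def utf8_response(text, mime="text/html"):
    """Encode the body as UTF-8 and say so in the Content-Type header."""
    body = text.encode("utf-8")
    headers = {
        "Content-Type": f"{mime}; charset=utf-8",
        "Content-Length": str(len(body)),
    }
    return headers, body


def charset_of(content_type, default="latin-1"):
    """Return the declared charset, or an assumed client fallback."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


headers, body = utf8_response("Ly\u03b1 line emission")
# With the charset declared, the client decodes the alpha correctly
print(body.decode(charset_of(headers["Content-Type"])))
# Without it, a client falling back to Latin-1 shows two wrong characters
print(body.decode(charset_of("text/html")))
```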

Edit: tested this fix on OS-X and Linux and looks good.

Edit: sample bibcodes with UTF-8 characters include

2006Natur.441..724R (beta)
2005A&A...433.1031D (beta)
2004ApJ...606.1174B (Angstrom)
2005ApJ...622..680R (Angstrom)
2005ApJ...633L..37I (eta)
2006ApJ...642.1098H (eta)
2004ApJ...606...85C (alpha)
2009Ap&SS.320..145B (degree symbol)

DougBurke commented 13 years ago

Here's another problematic paper: bibcode 2005PASP..117...13S which we have a title of

A Revised Geometry for the Magnetic Wind of &thetas;1 Orionis C

rather than

A Revised Geometry for the Magnetic Wind of θ1 Orionis C

from http://labs.adsabs.harvard.edu/ui/abs/2005PASP..117...13S
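
A possible way to decode such entities, sketched in Python. The `EXTRA_ENTITIES` table is an assumption: it treats `&thetas;` as the ISO-style name for a small theta, handles it explicitly rather than relying on the standard HTML5 entity table, and leaves everything else to `html.unescape`. `decode_entities` is a hypothetical helper, not existing code.

```python
import html
import re

# Assumed extra mapping for entity names handled explicitly here
EXTRA_ENTITIES = {"thetas": "\u03b8"}

ENTITY = re.compile(r"&([A-Za-z0-9]+);")


def decode_entities(text):
    """Replace the extra entities first, then the standard HTML5 ones."""
    def repl(match):
        # Unknown names are left untouched for html.unescape to handle
        return EXTRA_ENTITIES.get(match.group(1), match.group(0))

    return html.unescape(ENTITY.sub(repl, text))


title = "A Revised Geometry for the Magnetic Wind of &thetas;1 Orionis C"
print(decode_entities(title))
```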

DougBurke commented 13 years ago

2007ApJS..168...58S has a title of

Abundances and Behavior of 12CO, 13CO, and C2 in Translucent Sight Lines

where the 12 and 13 are superscripts and the 2 is a subscript. When saved, the title is rendered (on the search page) as

 Abundances and Behavior of <SUP>12</SUP>CO, <SUP>13</SUP>CO, and C<SUB>2</SUB> in Translucent Sight Lines 

2006A&A...458..541B is another example since its title is

Establishing HZ43 A, Sirius B, and RX J185635-3754 as soft X-ray standards: a cross-calibration between the Chandra LETG+HRC-S, the EUVE spectrometer, and the ROSAT PSPC

vs

 Establishing <ASTROBJ>HZ43 A</ASTROBJ>, <ASTROBJ>Sirius B</ASTROBJ>, and <ASTROBJ>RX J185635-3754</ASTROBJ> as soft X-ray standards: a cross-calibration between the Chandra LETG+HRC-S, the EUVE spectrometer, and the ROSAT PSPC 

Example bibcodes showing this include:

2005PASP..117...13S (sup)
2007ApJS..168...58S (sup)
2006A&A...458..541B (ASTROBJ)
2005PASP..117...13S (&thetas;)

This has been fixed in https://github.com/rahuldave/appsem/commit/4336a4a37de4b878e566fef826094fda8fba0297 (the display, that is, not the data ingest issue that leads to the ASTROBJ tags).
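
For the record, one way to handle such titles at display time, sketched in Python. This is not necessarily what the commit above does; it only illustrates a whitelist approach. `ALLOWED_TAGS` and `clean_title` are hypothetical names, and the sketch does not escape the remaining text.

```python
import re

# Assumption: only these presentational tags should survive in a title
ALLOWED_TAGS = {"sup", "sub"}

# Matches simple tags without attributes, e.g. <SUP> or </ASTROBJ>
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\s*>")


def clean_title(title):
    """Lowercase whitelisted tags and drop all others, keeping their content."""
    def repl(match):
        slash, name = match.group(1), match.group(2).lower()
        return f"<{slash}{name}>" if name in ALLOWED_TAGS else ""

    return TAG.sub(repl, title).strip()


print(clean_title(" C<SUB>2</SUB> in Translucent Sight Lines "))
print(clean_title("Establishing <ASTROBJ>HZ43 A</ASTROBJ> as a standard"))
```

The result still has to be inserted into the page as HTML, not as text, for the remaining sup and sub tags to render.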