simile-widgets / exhibit

Publishing Framework for Large-Scale Data-Rich Interactive Web Pages
MIT License

Bookmark Base64 encoding #182

Open jjon opened 7 years ago

jjon commented 7 years ago

Here's a corner case bug, but I'm not sure where, exactly, it lies.

When integrating an Exhibit into a WordPress site, I discovered that the URL generated by the "bookmark" function merely opened the page with the default set of data items, ignoring the bookmarked state. Tom Woodward, on the simile-widgets list, very helpfully pointed out that the Base64 payload of the generated URL was corrupted.

Exhibit.History.getState() retrieves an object with a title property. When the dataset has been filtered, the title property is a string composed of the page title followed by a subtitle generated by Exhibit.History.pushState():

title += " {" + subtitle + "}"; (line 235 in history.js)

In WordPress, that title string is a concatenation of the page template's "slug" and the site name. By some off-stage PHP chicanery, these are joined with an en dash (\u2013) as the separator. It is this character (as well as the em dash, \u2014) that causes Bookmark.generateBookmarkHash(state) to produce a corrupted Base64 string. When a browser tries to interpret the URL so generated, it simply ignores the corrupted payload and loads the default page and dataset.

Working at the browser console, I observe the following:

state = Exhibit.History.getState()
Object {normalized: true, title: "Collection Exhibit – Ocean Acidification Curriculum Collection {Text search foo}", url: "http://www.oacurriculumcollection.org/collection-exhibit/", hash: ".//collection-exhibit/?&_suid=148763896047903145240874606472", data: Object…}

Note the en dash in the title. Then if we generate the Base64 string for the URL and decode it we get gibberish:

Base64.decode(Exhibit.Bookmark.generateBookmarkHash(state))
"{"normalized":true,"title":"Collection Exhibit$ÈØÙX[ˆXÚYYšXØ][ۈÝ\œšXÝ[[HÛÛXÝ[ۈ‹\›Žˆš‹ËÝÝݲØXÝ\œšXÝ[[XÛÛXÝ[ۋ›Ü™ËØÛÛXÝ[ۋY^Xš]ȋš\ÚŽˆ‹‹ËØÛÛXÝ[ۋY^Xš]ÏɗÜÝZYLM
͍͌LÍMMMŒŽNLŒÌNLÌLMŒȋ™]HŽžÈ˜ÛÛ\ۙ[ȎžßKœÝ]HŽN_KšYŽˆŒM
͍͌LÍMMMŒŽNLŒÌNLÌLMŒȋ˜ÛX[•\›Žˆš‹ËÝÝݲØXÝ\œšXÝ[[XÛÛXÝ[ۋ›Ü™ËØÛÛXÝ[ۋY^Xš]ȋš\ÚY\›Žˆš‹ËÝÝݲØXÝ\œšXÝ[[XÛÛXÝ[ۋ›Ü™ËØÛÛXÝ[ۋY^Xš]ËØÛÛXÝ[ۋY^Xš]ÏɗÜÝZYLM
͍͌LÍMMMŒŽNLŒÌNLÌLMŒȟ"

If we then alter the title property of the state object thus:

state.title = state.title.replace("\u2013", "--")
"Collection Exhibit -- Ocean Acidification Curriculum Collection {Text search foo}"

Then do encode/decode as before:

Base64.decode(Exhibit.Bookmark.generateBookmarkHash(state))
"{"normalized":true,"title":"Collection Exhibit -- Ocean Acidification Curriculum Collection {Text search foo}","url":"http://www.oacurriculumcollection.org/collection-exhibit/","hash":".//collection-exhibit/?&_suid=148763731640905208773945768443","data":{"components":{"facet-text--default-0":{"type":"facet","state":{"text":"foo"}}},"state":61,"lengthy":true},"id":"148763731640905208773945768443","cleanUrl":"http://www.oacurriculumcollection.org/collection-exhibit/","hashedUrl":"http://www.oacurriculumcollection.org/collection-exhibit//collection-exhibit/?&_suid=148763731640905208773945768443"}"

We get the uncorrupted JSON string we need for the bookmark URL. My work-around for this is crude but effective: I simply execute document.title = document.title.replace(/\u2013/, "--"); in an onLoad function, and all is well. But I found it strange that ONLY \u2013 and \u2014 corrupt the JSON string produced via Exhibit.Bookmark.generateBookmarkHash(state). So far as I can tell, literally ANY other character will work, whether ASCII or not. Is the Base64 function at fault?
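One caveat about the work-around as written: String.prototype.replace with a non-global regex only replaces the first match, and the pattern does not touch em dashes. A slightly more thorough variant might look like the sketch below. The helper name normalizeDashes is hypothetical and not part of Exhibit.

```javascript
// Hypothetical helper (not part of Exhibit): replace every en dash (U+2013)
// and em dash (U+2014) in a string with two ASCII hyphens.
function normalizeDashes(title) {
    // The "g" flag makes replace() act on all matches, not just the first.
    return title.replace(/[\u2013\u2014]/g, "--");
}

// In the page this would be applied once the document has loaded, e.g.:
//   document.title = normalizeDashes(document.title);
```

This only covers the two dash characters discussed here; it does not address any other characters a theme might put in the title.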

Anyway, not exactly crucial, inasmuch as there's an easy fix, but puzzling nonetheless.

jjon commented 7 years ago

So, yes. After a little further experimentation it becomes clear that the Base64 methods are giving incorrect results for em dash and en dash. Using those methods (from http://api.simile-widgets.org/exhibit/STABLE/lib/base64.js) at the Chrome console I get the following results:

Base64.encode('—') // em dash
"A=="
Base64.encode('–') // en dash
"w=="
Base64.encode('-') // hyphen
"LQ=="

Whereas, using the python base64 module, I get this:

>>> base64.b64encode('—') # em dash
'4oCU'
>>> base64.b64encode('–') # en dash
'4oCT'
>>> base64.b64encode('-') # hyphen
'LQ=='

Unfortunately, I don't know nearly enough about the bitwise manipulation of strings to offer a solution.
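For what it's worth, the "A==" and "w==" results are exactly what an encoder produces if it assumes every character code fits in 8 bits. The sketch below follows a widely copied JavaScript Base64 pattern; that Exhibit's base64.js works the same way is an assumption, since its source is not quoted here, and the names KEY and naiveBase64Encode are hypothetical.

```javascript
// Sketch of a common 8-bit Base64 encoder. NOT copied from Exhibit's base64.js;
// it is an assumption that the library behaves like this.
var KEY = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";

function naiveBase64Encode(input) {
    var output = "";
    var i = 0;
    while (i < input.length) {
        // charCodeAt returns a 16-bit code unit (or NaN past the end of the string).
        var chr1 = input.charCodeAt(i++);
        var chr2 = input.charCodeAt(i++);
        var chr3 = input.charCodeAt(i++);

        // These shifts are only valid if each code is below 256.
        var enc1 = chr1 >> 2;
        var enc2 = ((chr1 & 3) << 4) | (chr2 >> 4);
        var enc3 = ((chr2 & 15) << 2) | (chr3 >> 6);
        var enc4 = chr3 & 63;

        if (isNaN(chr2)) {
            enc3 = enc4 = 64; // index 64 is the "=" padding character
        } else if (isNaN(chr3)) {
            enc4 = 64;
        }

        // For U+2014, enc1 is 0x2014 >> 2 = 2053, far outside the alphabet,
        // so charAt returns "" and the first output character silently vanishes.
        output += KEY.charAt(enc1) + KEY.charAt(enc2) +
                  KEY.charAt(enc3) + KEY.charAt(enc4);
    }
    return output;
}
```

With this sketch, the em dash (0x2014) has low two bits 00, giving index 0 ("A"), and the en dash (0x2013) has low two bits 11, giving index 48 ("w"), which matches the console output above. The hyphen (0x2D) fits in 8 bits and encodes correctly.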

j

jjon commented 7 years ago

Hmm. I guess nobody wanted to embarrass me by pointing out that base64 is for encoding 8-bit characters! So, there's nothing at all wrong with the base64 methods. My problem is thus not a bug in Exhibit; however, it does seem that Exhibit.History.init should be armored against this sort of thing. It seems like if Exhibit.Bookmark.generateBookmarkHash is going to return base64, then the title property of the state object should be sanitized somewhere along the line.
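One possible way to do that sanitizing, sketched here under assumptions: convert the string to its UTF-8 bytes before handing it to an 8-bit Base64 encoder, and reverse the step after decoding. The helper names toUtf8ByteString and fromUtf8ByteString are hypothetical, and where such a step would belong inside Exhibit is not established in this thread.

```javascript
// Hypothetical helpers (not part of Exhibit).

// Turn a JavaScript string into a "byte string": one character per UTF-8 byte,
// so every char code is below 256 and safe for an 8-bit Base64 encoder.
function toUtf8ByteString(str) {
    // encodeURIComponent emits UTF-8 bytes as %XX escapes (uppercase hex);
    // each escape is mapped back to a single character with that code.
    // Note: it throws URIError on lone surrogates.
    return encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function (match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
}

// Reverse step: interpret each character as one UTF-8 byte and decode.
function fromUtf8ByteString(bytes) {
    var escaped = "";
    for (var i = 0; i < bytes.length; i++) {
        var hex = bytes.charCodeAt(i).toString(16);
        escaped += "%" + (hex.length < 2 ? "0" + hex : hex);
    }
    return decodeURIComponent(escaped);
}

// Intended use, assuming Base64.encode/Base64.decode are the 8-bit methods
// discussed in this issue:
//   var payload = Base64.encode(toUtf8ByteString(jsonString));
//   var json    = fromUtf8ByteString(Base64.decode(payload));
```

For the en dash this yields the three bytes E2 80 93, which are the bytes behind the '4oCT' result from the Python session above.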