greghendershott / frog

Frog is a static blog generator implemented in Racket, targeting Bootstrap and able to use Pygments.

Accented chars in title lead to 404 #174

Closed logc closed 7 years ago

logc commented 7 years ago

I am using Frog to write my blog in Spanish. I just noticed that if an article title includes accented characters, e.g. 'día', the generated site under blog/ includes the accented char as part of the subdirectory name, e.g. blog/2016/11/17/día.

When this generated site is deployed, the URL to reach the article is correctly url-encoded into blog/2016/11/17/d%C3%ADa, but this address leads to a "404 - Not Found" error when the article title is clicked.

In fact, I was trying to avoid this problem by naming the files without accents, e.g. '2016-11-17-dia.md'. Wouldn't it be easier if the generated site took its URLs from the file names, instead of the titles?

gerdint commented 7 years ago

My blog is in Swedish, so I ran into similar problems at some point. In your case, could the problem be the coding system of the file system (either on your development machine or on the web server)?

I cannot recall how I made this work (I develop on macOS and deploy to Github Pages), but on my blog (see for instance this post on my blog) it indeed does seem to work.

logc commented 7 years ago

Thanks for the hint! It may be the case. I write the site on macOS, and deploy to a Debian server via rsync. I will have a look at your post.

gerdint commented 7 years ago

Right. I suppose both filesystems support Unicode of some sort. But since you're dealing with diacritics, you may also want to make sure the filename string is [Unicode] normalized.

tautologico commented 7 years ago

I ran into this kind of trouble too, developing on OS X and deploying on Linux (using another site generator instead of Frog). As @tger commented, it has to do with conventions for representing diacritics, even when both systems are using UTF-8. OS X and Linux seem to use different conventions (single character versus character + modifier), so eventually I resorted to developing on my Mac and then building the version for deployment on a Linux machine.

gerdint commented 7 years ago

Apparently macOS HFS+ does Unicode normalization for you, while on Linux you probably need to do it yourself (could depend on the filesystem). One solution could be to have Frog normalize the filename.

I understand that APFS (Apple's new filesystem) will NOT normalize automatically, which would make the behaviour more in line with Linux (meaning the file system just treats the file name as an array of bytes, and it's up to the application to do any normalization).
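
To make the "single character versus character + modifier" difference concrete, here is a minimal sketch in Racket. It only uses the built-in `string-normalize-nfc` and `string-normalize-nfd`; the variable names are illustrative and nothing here is taken from Frog's source.

```racket
#lang racket/base
;; Illustration only: one visible title, two Unicode normalization forms.
(define title "d\u00EDa")                 ; "día" written with precomposed U+00ED

(define nfc (string-normalize-nfc title)) ; í stays one code point (U+00ED)
(define nfd (string-normalize-nfd title)) ; í becomes i + combining acute (U+0301)

(string-length nfc)                       ; 3
(string-length nfd)                       ; 4
(string=? nfc nfd)                        ; #f

;; A filesystem that treats names as raw bytes sees two different names:
(bytes->list (string->bytes/utf-8 nfc))   ; '(100 195 173 97)
(bytes->list (string->bytes/utf-8 nfd))   ; '(100 105 204 129 97)
```

If the file on disk is named with one byte sequence and the URL decodes to the other, a byte-oriented server will not find it, which would be consistent with the 404 described above.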

greghendershott commented 7 years ago
  1. I'm trying to learn about Unicode normalization and understand the correct thing to do, here.

    I see these four forms and it looks like Racket provides a function to convert to each.

    Does anyone know which of the four forms we ought to use?

  2. Having said all that, maybe what you said @logc would be sufficient:

    Wouldn't it be easier if the generated site took its URLs from the file names, instead of the titles?

greghendershott commented 7 years ago
  1. Having said all that, maybe what you said @logc would be sufficient:

    Wouldn't it be easier if the generated site took its URLs from the file names, instead of the titles?

Sorry I was a bit sleepy when I wrote that. The URI path is taken from a file name -- but it's the destination file name, which is created by running things through the permalink pattern.

Some people want the full title in the URI path because it looks better, and/or (IIRC) because they believe it's helpful for search engine results.


Also: ~AFAIK something like blog/2016/11/17/día is a perfectly valid URI and need not be percent-encoded. So, I'll try to figure out where/why/how that is happening.~ EDIT: I'm an idiot; it does need to be percent-encoded.

gerdint commented 7 years ago

I see these four forms and it looks like Racket provides a function to convert to each. Does anyone know which of the four forms we ought to use?

HFS+ apparently uses a modified form of Form D (see https://en.wikipedia.org/wiki/HFS_Plus). So using that may solve the problem for people developing on macOS and deploying on Linux. Not sure if this is a good solution though.

greghendershott commented 7 years ago

@tger Thank you for the information!

@logc By the way, I mentioned the permalink pattern -- in your .frogrc:

# Pattern for blog post permalinks
# Optional: Default is "/blog/{year}/{month}/{title}.html".
# Here's an example of the Jekyll "pretty" style:
permalink = /blog/{year}/{month}/{day}/{title}/index.html
# There is also {filename}, which is the `this-part` portion of
# your post's YYYY-MM-DD-this-part.md file name. This is in case
# you don't like Frog's encoding of your post title and want to
# specify it exactly yourself, e.g. to match a previous blog URI.

Note the {filename} option. In other words, you could change this from the default using {title} to instead use {filename}. Like say /blog/{year}/{month}/{day}/{filename}/index.html.
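
As a rough illustration of what switching the placeholder changes, here is a small Racket sketch of pattern expansion. `expand-permalink` and the `fields` table are hypothetical names made up for this example; Frog's real implementation may work quite differently.

```racket
#lang racket/base
(require racket/string)

;; Hypothetical expander, not Frog's actual code: replaces each {key}
;; in the pattern with the corresponding value from the fields table.
(define (expand-permalink pattern fields)
  (for/fold ([s pattern]) ([(k v) (in-hash fields)])
    (string-replace s (string-append "{" k "}") v)))

;; Assumed values for a post stored as 2016-11-17-dia.md and titled "día".
(define fields
  (hash "year" "2016" "month" "11" "day" "17"
        "title" "d\u00EDa" "filename" "dia"))

(define with-filename
  (expand-permalink "/blog/{year}/{month}/{day}/{filename}/index.html" fields))

with-filename ; "/blog/2016/11/17/dia/index.html"
```

With `{filename}` the accented title never reaches the output path, so the normalization question does not arise for that post.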

I think this would let you use the work-around you asked about originally:

In fact, I was trying to avoid this problem by naming the files without accents, e.g. '2016-11-17-dia.md'. Wouldn't it be easier if the generated site took its URLs from the file names, instead of the titles?

I'm sorry this didn't occur to me earlier!

(Of course, even if that work-around is successful for you, I don't think you ought to need to do this -- I still want to figure out what's happening with the percent-encoded URIs and 404s.)

greghendershott commented 7 years ago

I've been learning/thinking/working on this. Currently I think the right things to do are:

  1. Normalize (to NFD) the filenames generated by Frog -- or at least the portion of the filenames that originates from a post title (via the {title} portion of a permalink pattern).

  2. Percent-encode all the path segments in URIs generated by Frog. So for example http://example.com/blog/2016/11/17/día is not a valid URI -- even though browsers seem to be tolerant of that. Instead it should be http://example.com/blog/2016/11/17/di%CC%81a.html -- a percent-encoding of the NFD normalization we used to name the file.

    These percent-encoded URIs should be used throughout files generated by Frog:

    • href and meta in HTML
    • XML feed files
    • sitemap.txt

I have some code to do all this, that I expect to push as a commit on a test branch.
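
The two steps can be sketched roughly as follows. This is not the code referred to above; `title->file-segment` and `encode-uri-path` are hypothetical helpers written for illustration, built on Racket's `string-normalize-nfd` and `uri-path-segment-encode` from `net/uri-codec`.

```racket
#lang racket/base
(require net/uri-codec racket/string)

;; Step 1 (hypothetical helper): NFD-normalize the title-derived part of
;; the file name, so the name on disk is predictable on any filesystem.
(define (title->file-segment title)
  (string-normalize-nfd title))

;; Step 2 (hypothetical helper): percent-encode every segment of a
;; "/"-separated URI path, leaving the separators themselves alone.
(define (encode-uri-path path)
  (string-join (map uri-path-segment-encode
                    (string-split path "/" #:trim? #f))
               "/"))

(define file-path
  (string-append "/blog/2016/11/17/" (title->file-segment "d\u00EDa") ".html"))

;; The same encoded string would be written into hrefs, feeds and the sitemap.
(define uri-path (encode-uri-path file-path))
uri-path
```

Encoding segment by segment, rather than the whole string at once, keeps the `/` separators from being escaped.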

gerdint commented 7 years ago

I have some code in responsive image support that does that:

(require net/uri-codec)

;; URI encode path to handle spaces and non-ascii characters
(define (uri-encode-path path)
  ;; (absolute-path? . -> . path?)
  (let ([ps (for/list ([ps (explode-path path)])
              (uri-path-segment-encode (path->string ps)))])
    (apply build-path "/" (cdr ps))))

It is used when generating the img srcset URLs. It would be nice if this were part of Frog. But as I understand it, "Frog extensions" should not import any standard Frog modules, so perhaps I would need to duplicate it anyhow.

Currently the responsive image support also makes use of some path-related utility procedures from the frog/path module. I guess the clean way is to pass all needed paths as parameters to the plugin enhance-body procedure. A bit unwieldy, but I suppose that's the price you pay for having this kind of extension system (and no public plugin API). Or do you think it would be OK to expose some central functionality useful for extensions? (implying API stability guarantees)

greghendershott commented 7 years ago

@tger Maybe we could open a fresh issue to discuss that since it's mostly OT for this? But: frog/paths and frog/params may be required by the user's frog.rkt -- and by your "extension" package. You'll have access to the functions and parameters. Note that because frog requires the user's frog.rkt, which in turn requires your extension, all these modules will share the same instance of the parameters (have the same values). So, I don't think you will need to be passed an excessive number of arguments. Maybe we could discuss more on an issue dedicated to you trying to do this, after I merge #194 and any URI commits for this issue #174.