sveltejs / sapper

The next small thing in web development, powered by Svelte
https://sapper.svelte.dev

export: Exclude certain paths from being crawled #1081

Open deanpress opened 4 years ago

deanpress commented 4 years ago

Is your feature request related to a problem? Please describe.

I have some pages that fetch data from an external API using a case-sensitive URL parameter.

When I export my app, the dynamic pages that Sapper can find are crawled and cached, generating paths like user/myUserId/index.html.

The problem is that the URLs become inconsistent: non-crawled pages can only be accessed using the case-sensitive MyUserId, while crawled pages can also be accessed with myuserid (since these are also static paths stored by the crawler).

I don't believe that crawling these dynamic pages makes sense for our use case.

Describe the solution you'd like

Exclude certain paths from the export feature's crawler so that the affected URLs remain case-sensitive and dynamic.
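
For context, here is a minimal sketch of the kind of route involved (the route name, API URL, and fields are hypothetical illustrations, not taken from the report):

```html
<!-- src/routes/user/[id].svelte -->
<script context="module">
  // `params.id` is case-sensitive: the external API distinguishes
  // /user/MyUserId from /user/myuserid, but the export crawler
  // freezes whichever casing it discovers into user/myUserId/index.html.
  export async function preload({ params }) {
    const res = await this.fetch(`https://api.example.com/users/${params.id}`);
    return { user: await res.json() };
  }
</script>

<script>
  export let user;
</script>

<h1>{user.name}</h1>
```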

acewf commented 4 years ago

If I understand the behaviour you describe correctly, you want to avoid building specific pages. Did you try using the --entry parameter to build only the ones you need? https://sapper.svelte.dev/docs#How_it_works

I didn't see a way to exclude pages in the documentation, but entry works as a list of the ones you want to include!
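
For reference, the flag is passed to sapper export, along these lines (the routes are made-up examples):

```sh
# Crawl starting only from the listed entry points
# (space-separated); "/about" and "/contact" are hypothetical.
npx sapper export --entry "/about /contact"
```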

There also seem to be a couple of PRs trying to implement similar behaviour:

https://github.com/sveltejs/sapper/pull/856 https://github.com/sveltejs/sapper/pull/1020

and also an issue: https://github.com/sveltejs/sapper/issues/1019

deanpress commented 4 years ago

@acewf I did try setting an entry, but from my experience, and after reviewing how entry is handled (https://github.com/sveltejs/sapper/blob/910d28e3419409b5498933e8c13d37cc03b652a0/src/api/export.ts#L138), the entry setting is just a starting point for the crawler.

In my case, I'd only want the index.svelte routes crawled but not the [id] routes.

It seems those PRs would fix my issue, thanks! I might have to fork Sapper and include the patch manually if it's not merged into the main repo.

acewf commented 4 years ago

The entry parameter supports multiple routes, so you can have, for example, something like --entry "/home /contact". Even if you are using slugs, you can write those slugs in the entry parameter.

But I guess it would be easier with an exclude parameter.

deanpress commented 4 years ago

@acewf But the paths entered in entry will cause Sapper to crawl the anchors it finds on each page. If all your entry pages have anchors, then it will still crawl all the pages that you don't want crawled, correct?

vipero07 commented 4 years ago

This sounds like it could be an issue with how you have the MyUserId page configured. Per the docs, you don't want to export if there are user sessions or authentication (if MyUserId is the id of a logged-in user, this is exactly what they say not to do). However, if MyUserId is just some user id in a link on the site, you could try using a preload fetch in the MyUserId page.

deanpress commented 4 years ago

@vipero07 It's just an example. There are no user sessions, only public data fetched from third-party APIs. Using a preload fetch or an onMount fetch makes no difference, since the page is always crawled and an index.html is generated for it.

vipero07 commented 4 years ago

I believe I understand now. As part of your build process you could just delete the generated files for those routes, but a better solution would be some way to not include the route, as you said. I'm not sure if <a rel=nofollow ...> would work, but you could try it; if it doesn't, it or something similar probably should (perhaps something more Sapper-specific like nocrawl). That would align with the rel=prefetch idiom.
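
A rough sketch of that delete-after-export idea (the user route and the __sapper__/export output location are assumptions about the project layout):

```js
// cleanup.js - run after `sapper export` to remove the statically
// generated copies of the dynamic [id] pages, leaving only the
// truly static routes in the exported output.
const fs = require('fs');
const path = require('path');

const userDir = path.join('__sapper__', 'export', 'user');

if (fs.existsSync(userDir)) {
  for (const entry of fs.readdirSync(userDir)) {
    // each entry is a crawled id, e.g. user/myUserId/index.html
    fs.rmdirSync(path.join(userDir, entry), { recursive: true });
  }
}
```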

lunchboxer commented 4 years ago

I also want to exclude certain links from being crawled. I have pages that I want served under the same root but that aren't handled by Sapper; they're processed separately and put in the static folder. When the crawler gets to links to them, it breaks with this error:

> The "url" argument must be of type string. Received null

I'd like to just add something like nocrawl to the link, since an automatic solution seems unlikely to me.

lunchboxer commented 4 years ago

Giving export a regexp string of URLs to ignore would also work.
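
To illustrate the proposal (no such option exists in Sapper today; the pattern and paths are invented):

```js
// Hypothetical: a pattern the exporter could test each discovered
// URL against before crawling it and saving an index.html for it.
const exclude = /^\/user\//;

exclude.test('/user/MyUserId'); // true  -> skip this path
exclude.test('/about');         // false -> crawl as usual
```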

roblevintennis commented 4 years ago

I haven't yet got to the bottom of why, but FWIW, all of a sudden I too am getting this, when I've previously been able to build my site (so I'm wondering if something updated on a yarn install or something 🤷‍♂️):

> The "url" argument must be of type string. Received null

UPDATE: I just did a yarn upgrade sapper which resulted in this in my lock file:

 sapper@^0.27.0:
-  version "0.27.13"
-  resolved "https://registry.yarnpkg.com/sapper/-/sapper-0.27.13.tgz#75c965ea28c052ead9e6094e0b77a189199f534c"
-  integrity sha512-LSx7kAE/ukcGLrcKxwoU45puj76HYVm/zQlwiYUHvZ9kRCZbzRAhrIaCna+3BqS0iH6IsAy3aTguZ002SkAc6A==
+  version "0.27.16"
+  resolved "https://registry.yarnpkg.com/sapper/-/sapper-0.27.16.tgz#df2854853f11b968f5ad9d54354fe7dc0cd57680"
+  integrity sha512-q8dohkbhga6xO+0a8h84odFyoilQ0D0vJtF8NHra/DQmSeN2R2MXUfwhw3EyvLms3T1x8H3v+qw642Qf5JXA9g==

All of this to say: I went from 0.27.13 to the latest 0.27.16, and running export no longer produces the crawl error, so perhaps something was fixed between patch versions 13 and 16 (at the time of writing).

Unfortunately, though, now when I run npx serve __sapper__/export it looks like my main index.html file is no longer built and I just see some static assets.

UPDATE 2: OK, this is rather embarrassing, but perhaps it's best I leave this up in case someone else googles their way here: I had another server running on localhost:3000 that I forgot about :-) and after shutting that down, sapper export is working again.

Perhaps it's concerning that there's no port-conflict detection, as one usually gets when trying to start a server on a popular port that already has one running. Because I just saw:

> Crawling http://localhost:3000/

I had none of the usual port-conflict help I'm used to (I know, I'm being spoiled, but it's food for thought on what might be ideal). I haven't investigated how Sapper handles all this, so if a port check is too hard or unrealistic, so be it. Anyway, my site is back to building.
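
As a sketch of what such a pre-flight check could look like (plain Node, not anything Sapper actually does; the port number is assumed):

```js
// Fail fast if something is already listening on the port the
// exporter is about to serve and crawl.
const net = require('net');

function portIsFree(port) {
  return new Promise((resolve, reject) => {
    const probe = net.createServer();
    probe.once('error', (err) =>
      err.code === 'EADDRINUSE' ? resolve(false) : reject(err));
    probe.once('listening', () => probe.close(() => resolve(true)));
    probe.listen(port);
  });
}

portIsFree(3000).then((free) => {
  if (!free) {
    console.error('Port 3000 is already in use; is another server running?');
    process.exit(1);
  }
});
```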

finnbear commented 3 years ago

My website contains megabytes of static images, and I wanted to reference one of them in an <img src="/image/path.webp"/>. I get this error as a result: 1.41 kB (404) /image/path.webp/index.html.

I can't put all of the images in Sapper's static directory, because I deploy that entire directory to the cloud on each rebuild, and deploying the images would take too long (their metadata is also standardized with Terraform).

It would be nice if the exclude-path feature were per <a>/<img>, so I wouldn't have to change the sapper export command, or alternatively, if the feature accepted a regular expression of paths to exclude.