kiliman / tailwindui-crawler

tailwindui-crawler downloads the component HTML files locally
MIT License

Crawler hangs on download or throws ETIMEDOUT error (REVISITED) #49

Closed jensolafkoch closed 3 years ago

jensolafkoch commented 3 years ago

Today, the ETIMEDOUT error was back again. The difference is that I crawled all languages today (instead of just HTML), but I don't think that has anything to do with the timeout, since a lot of directories were processed correctly before the error:

‼️ FetchError: request to https://tailwindui.com/img/category-thumbnails/sections/content-sections.png failed, reason: connect ETIMEDOUT 172.67.217.20:443
    at ClientRequest.<anonymous> (D:\laragon\www\xxx-tailwindui-crawler\node_modules\node-fetch\lib\index.js:1461:11)
    at ClientRequest.emit (events.js:315:20)
    at TLSSocket.socketErrorListener (_http_client.js:469:9)
    at TLSSocket.emit (events.js:315:20)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:80:21) {
  type: 'system',
  errno: 'ETIMEDOUT',
  code: 'ETIMEDOUT'
}

(Maybe it would be helpful to show the timeout value as part of the error output by default?)
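
One way the timeout value could be surfaced is to apply an explicit, application-level timeout and put its value in the error message. The sketch below is only an illustration: `withTimeout` and the `timeoutMs` property are hypothetical names, not part of the crawler or of node-fetch. The error in the trace above is reported with `type: 'system'`, so the crawler itself may not know what limit was actually hit.

```javascript
// Hypothetical helper: rejects with a descriptive error if `promise`
// does not settle within `timeoutMs` milliseconds.
// Note: this does not cancel the underlying request, it only stops waiting.
function withTimeout(promise, timeoutMs, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`${label} timed out after ${timeoutMs} ms`);
      err.code = 'ETIMEDOUT';
      err.timeoutMs = timeoutMs; // assumed property name, for logging
      reject(err);
    }, timeoutMs);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage (assuming some fetch-like function is available):
//   await withTimeout(fetch(url), 10000, url);
```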

jensolafkoch commented 3 years ago

I believe it hangs when accessing images. Next time it was:

FetchError: request to https://tailwindui.com/img/category-thumbnails/sections/heroes.png failed,

kiliman commented 3 years ago

Sorry. That’s another place where fetch is called. I’ll add retry logic.

kiliman commented 3 years ago

I refactored the code to use fetchWithRetry everywhere. Please try the latest on master branch.
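
For readers who have not looked at the repository, a retry wrapper of this kind might look roughly like the sketch below. This is not the crawler's actual `fetchWithRetry`; the signature, the `delayMs` pause, and the injectable `fetchImpl` parameter are assumptions made for illustration. The cap of 3 retries matches what is mentioned later in this thread.

```javascript
// Illustrative sketch only, not the code in the repository.
// `fetchImpl` defaults to the global fetch (Node 18+); pass node-fetch
// or a stub instead if needed.
async function fetchWithRetry(
  url,
  options = {},
  { retries = 3, delayMs = 500, fetchImpl = globalThis.fetch } = {}
) {
  let lastError;
  // One initial attempt plus up to `retries` further attempts.
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchImpl(url, options);
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      console.warn(`retry ${attempt + 1}/${retries} for ${url}: ${err.code || err.message}`);
      // Short pause before trying again (assumed; the real delay is unknown).
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

Routing every download (pages and images alike) through one wrapper like this is what avoids the situation described above, where one call site was left without retry handling.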

jensolafkoch commented 3 years ago

Both attempts (with just html and then all languages) finished successfully. (FYI: It took a minute or two after the last directory/file was written locally before the "Done" message appeared finally.)

kiliman commented 3 years ago

Yes. If you set BUILDINDEX=1, then building the final index.html page also downloads the thumbnails used for the component categories.

There are 64 images there so if your Internet is slow, it may take a while.

Right now it downloads everything all the time. I can add a check to see if the file already exists and send an If-Modified-Since header to only download updated files.

I logged timings and yeah, it adds up.

333 ms https://tailwindui.com/img/category-thumbnails/sections/heroes.png
286 ms https://tailwindui.com/img/category-thumbnails/sections/feature-sections.png
307 ms https://tailwindui.com/img/category-thumbnails/sections/cta-sections.png
191 ms https://tailwindui.com/img/category-thumbnails/sections/pricing.png
360 ms https://tailwindui.com/img/category-thumbnails/sections/header.png
499 ms https://tailwindui.com/img/category-thumbnails/sections/faq-sections.png
178 ms https://tailwindui.com/img/category-thumbnails/sections/newsletter-sections.png
192 ms https://tailwindui.com/img/category-thumbnails/sections/stats-sections.png
316 ms https://tailwindui.com/img/category-thumbnails/sections/testimonials.png
168 ms https://tailwindui.com/img/category-thumbnails/sections/blog-sections.png
159 ms https://tailwindui.com/img/category-thumbnails/sections/contact-sections.png
151 ms https://tailwindui.com/img/category-thumbnails/sections/team-sections.png
277 ms https://tailwindui.com/img/category-thumbnails/sections/content-sections.png
341 ms https://tailwindui.com/img/category-thumbnails/sections/footers.png
395 ms https://tailwindui.com/img/category-thumbnails/sections/logo-clouds.png
165 ms https://tailwindui.com/img/category-thumbnails/elements/headers.png
398 ms https://tailwindui.com/img/category-thumbnails/elements/banners.png
182 ms https://tailwindui.com/img/category-thumbnails/elements/flyout-menus.png
182 ms https://tailwindui.com/img/category-thumbnails/page-examples/landing-pages.png
182 ms https://tailwindui.com/img/category-thumbnails/page-examples/pricing-pages.png
171 ms https://tailwindui.com/img/category-thumbnails/page-examples/contact-pages.png
294 ms https://tailwindui.com/img/category-thumbnails/application-shells/stacked.png
412 ms https://tailwindui.com/img/category-thumbnails/application-shells/sidebar.png
284 ms https://tailwindui.com/img/category-thumbnails/application-shells/multi-column.png
415 ms https://tailwindui.com/img/category-thumbnails/headings/page-headings.png
302 ms https://tailwindui.com/img/category-thumbnails/headings/card-headings.png
434 ms https://tailwindui.com/img/category-thumbnails/headings/section-headings.png
154 ms https://tailwindui.com/img/category-thumbnails/data-display/description-lists.png
284 ms https://tailwindui.com/img/category-thumbnails/data-display/stats.png
303 ms https://tailwindui.com/img/category-thumbnails/lists/tables.png
365 ms https://tailwindui.com/img/category-thumbnails/lists/stacked-lists.png
111 ms https://tailwindui.com/img/category-thumbnails/lists/grid-lists.png
177 ms https://tailwindui.com/img/category-thumbnails/lists/feeds.png
198 ms https://tailwindui.com/img/category-thumbnails/forms/form-layouts.png
411 ms https://tailwindui.com/img/category-thumbnails/forms/input-groups.png
273 ms https://tailwindui.com/img/category-thumbnails/forms/select-menus.png
393 ms https://tailwindui.com/img/category-thumbnails/forms/sign-in-forms.png
161 ms https://tailwindui.com/img/category-thumbnails/forms/radio-groups.png
353 ms https://tailwindui.com/img/category-thumbnails/forms/toggles.png
159 ms https://tailwindui.com/img/category-thumbnails/forms/action-panels.png
158 ms https://tailwindui.com/img/category-thumbnails/feedback/alerts.png
278 ms https://tailwindui.com/img/category-thumbnails/navigation/navbars.png
189 ms https://tailwindui.com/img/category-thumbnails/navigation/pagination.png
332 ms https://tailwindui.com/img/category-thumbnails/navigation/tabs.png
296 ms https://tailwindui.com/img/category-thumbnails/navigation/vertical-navigation.png
291 ms https://tailwindui.com/img/category-thumbnails/navigation/sidebar-navigation.png
345 ms https://tailwindui.com/img/category-thumbnails/navigation/breadcrumbs.png
168 ms https://tailwindui.com/img/category-thumbnails/navigation/steps.png
339 ms https://tailwindui.com/img/category-thumbnails/overlays/modals.png
163 ms https://tailwindui.com/img/category-thumbnails/overlays/slide-overs.png
309 ms https://tailwindui.com/img/category-thumbnails/overlays/notifications.png
400 ms https://tailwindui.com/img/category-thumbnails/elements/avatars.png
161 ms https://tailwindui.com/img/category-thumbnails/elements/dropdowns.png
177 ms https://tailwindui.com/img/category-thumbnails/elements/badges.png
462 ms https://tailwindui.com/img/category-thumbnails/elements/buttons.png
170 ms https://tailwindui.com/img/category-thumbnails/elements/button-groups.png
278 ms https://tailwindui.com/img/category-thumbnails/layout/containers.png
341 ms https://tailwindui.com/img/category-thumbnails/layout/panels.png
177 ms https://tailwindui.com/img/category-thumbnails/layout/list-containers.png
163 ms https://tailwindui.com/img/category-thumbnails/layout/media-objects.png
403 ms https://tailwindui.com/img/category-thumbnails/layout/dividers.png
353 ms https://tailwindui.com/img/category-thumbnails/page-examples/home-screens.png
179 ms https://tailwindui.com/img/category-thumbnails/page-examples/detail-screens.png
337 ms https://tailwindui.com/img/category-thumbnails/page-examples/settings-screens.png
📝  Writing /components/index.html

🏁  Done!
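
Per-request timings like the ones in the log above can be collected with a small wrapper around each download. The following is a minimal sketch under assumptions: `timed` is a hypothetical name, and the `now` and `log` parameters exist only so the helper can be exercised without a real clock or console.

```javascript
// Hypothetical helper: runs an async task and logs how long it took,
// in the same "<ms> ms <label>" shape as the log above.
async function timed(label, task, { now = () => Date.now(), log = console.log } = {}) {
  const start = now();
  const result = await task();
  log(`${now() - start} ms ${label}`);
  return result;
}

// Usage (assuming a fetch-like function is in scope):
//   const response = await timed(url, () => fetch(url));
```

If the downloads run one after another, the individual times add up, which is consistent with the pause before the "Done" message reported earlier in the thread.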

kiliman commented 3 years ago

Ok, I added support for If-Modified-Since, but it doesn't seem to improve performance all that much. The files are pretty small, so most of the time goes into making the request, not into transferring the data.

Anyway, I also added more logging, so hopefully you can see what's going on.

Get latest and let me know how it goes for you.
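
As a rough illustration of the If-Modified-Since approach described in this comment, a conditional download could be sketched as follows. The function names and the use of the local file's modification time are assumptions, not the crawler's actual code.

```javascript
const fs = require('node:fs');

// Builds headers for a conditional GET from the local file's mtime.
// Date.prototype.toUTCString() produces the HTTP date format,
// e.g. "Tue, 15 Nov 1994 08:12:31 GMT".
function conditionalHeaders(filePath) {
  if (!fs.existsSync(filePath)) return {};
  const { mtime } = fs.statSync(filePath);
  return { 'If-Modified-Since': mtime.toUTCString() };
}

// Hypothetical helper: downloads `url` to `filePath` unless the server
// answers 304 Not Modified. `fetchImpl` is injectable for testing.
async function downloadIfModified(url, filePath, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, { headers: conditionalHeaders(filePath) });
  if (response.status === 304) return 'not-modified';
  if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
  fs.writeFileSync(filePath, Buffer.from(await response.arrayBuffer()));
  return 'downloaded';
}
```

A 304 response still costs a full round trip, which fits the observation that the saving is small for files of this size.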

jensolafkoch commented 3 years ago

Looks fine. There are still some long delays (much longer than the ms values shown), so maybe my local dev environment (Windows, with Laragon as a kind of XAMPP) has something to do with it? Anyway, as long as it runs till the end, I'm happy! :-)

The messages are cut off, no big deal:

[image: cutoff]

kiliman commented 3 years ago

The time displayed is only for the last successful attempt, so it doesn't include any earlier attempts that timed out.

I guess I can log when it times out and has to retry. I'm pretty much brute forcing the connection: it keeps trying until it succeeds or has retried 3 times.

Yes, I truncated to 80 characters to prevent terminal wrapping since these URLs were getting very long.
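
The truncation itself is simple. A minimal sketch is shown below; the function name and the trailing ellipsis are assumptions, since the thread only states the 80-character limit.

```javascript
// Hypothetical helper: shortens a log line so it fits on one terminal row.
// Note: String.length counts UTF-16 code units, so lines with emoji may
// end up slightly narrower than `width` on screen.
function truncateForTerminal(message, width = 80) {
  if (message.length <= width) return message;
  return message.slice(0, width - 1) + '…';
}

// Usage:
//   console.log(truncateForTerminal(`333 ms ${url}`));
```

When output goes to a terminal, Node also exposes the actual width as `process.stdout.columns`, which could be used in place of the fixed 80.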

jensolafkoch commented 3 years ago

I am absolutely happy as it is - thanks again for your package! :-)

jensolafkoch commented 3 years ago

If you check for existing files, do you also purge files, directories, and preview images which no longer exist? I'm just curious whether to start from scratch every now and then or just update the current tree.

kiliman commented 3 years ago

Not at the moment, but I'm rethinking the process. If you use the GitHub action, it always does a fresh checkout of the target repository, so every file's modification time will be the time of the checkout. When the crawler then sends the If-Modified-Since header, you'll get a 304 Not Modified result 99.9% of the time, whether or not the local copy is actually current.

A couple of options.

  1. Just always download the latest version, even if the file exists. This was the old behavior. I was hoping it would be faster to check first before downloading, but it looks like the connection time is roughly the same, since the files are relatively small.
  2. Create an assets.json file that stores a list of all the URLs downloaded and their associated etag. Then send the If-None-Match: etag header. Still not sure if it buys us anything in time saved, but it's more accurate. Also, I can compare the assets just downloaded with the files on disk and remove any that are not in the list. Again, the files are small, so the space saving is negligible.
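
Option 2 could be sketched roughly as below. This is an illustration under assumptions, not the implementation that was later committed: the manifest shape (`{ [url]: { etag, file } }`) and the function names are made up for the example, and only the `assets.json` idea and the `If-None-Match` header come from the description above.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Loads the manifest, or returns an empty one on the first run.
function loadManifest(manifestPath) {
  try {
    return JSON.parse(fs.readFileSync(manifestPath, 'utf8'));
  } catch {
    return {};
  }
}

function saveManifest(manifestPath, manifest) {
  fs.writeFileSync(manifestPath, JSON.stringify(manifest, null, 2));
}

// Downloads `url` unless the stored ETag still matches on the server.
// Returns true if the file was (re)written, false on 304 Not Modified.
async function fetchAsset(url, filePath, manifest, fetchImpl = globalThis.fetch) {
  const known = manifest[url];
  const headers = {};
  if (known && known.etag && fs.existsSync(filePath)) {
    headers['If-None-Match'] = known.etag;
  }
  const response = await fetchImpl(url, { headers });
  if (response.status === 304) return false;
  if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  fs.writeFileSync(filePath, Buffer.from(await response.arrayBuffer()));
  manifest[url] = { etag: response.headers.get('etag'), file: filePath };
  return true;
}
```

Unlike file modification times, the stored ETag survives a fresh checkout as long as the manifest is committed with the files, which is the accuracy advantage mentioned in option 2.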

I'll have to think on it.

kiliman commented 3 years ago

I implemented option 2. It uses etags and will remove any files that are no longer referenced.
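
The cleanup step might look something like this sketch, which walks the output directory and deletes anything not in the list of files referenced during the current run. The function name and arguments are hypothetical, and the real implementation may differ.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: deletes every file under `rootDir` that is not in
// `referencedFiles`, then removes directories left empty.
// Returns the list of removed file paths.
function pruneUnreferenced(rootDir, referencedFiles) {
  const keep = new Set(referencedFiles.map((f) => path.resolve(f)));
  const removed = [];
  const walk = (dir) => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(full);
        if (fs.readdirSync(full).length === 0) fs.rmdirSync(full);
      } else if (!keep.has(path.resolve(full))) {
        fs.unlinkSync(full);
        removed.push(full);
      }
    }
  };
  walk(rootDir);
  return removed;
}
```

Anything that lives under `rootDir` but is not a downloaded asset (for example the manifest file itself) would need to be included in `referencedFiles`, or stored elsewhere, to survive the cleanup.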

kiliman commented 3 years ago

I'm going to close this. Re-open if you experience more timeouts. Thanks!