jbolda / gatsby-source-airtable

MIT License

Check the Cache before Re-downloading Attachments #190

Closed socmov closed 4 years ago

socmov commented 4 years ago

When connecting to bases with a lot of attachments, develop keeps failing because the connection to Airtable times out while re-downloading every image on every build.

Issues

  1. Gatsby-Source-Airtable doesn't look up the cache when downloading images. The localFileCheck function (line 294) doesn't check whether images already exist; it downloads them every time. Gatsby-source-wordpress does a lookup to see if the file already exists in the cache before re-downloading: https://github.com/gatsbyjs/gatsby/blob/3af9ec0c9a81196a3054be336633d8adae024d9f/packages/gatsby-source-wordpress/src/normalize.js#L514

  2. Using a last modified time to validate the cache. With the approach above there is a risk that images will never get updated (without a manual wipe of the cache). Airtable doesn't have a per-attachment last modified time, but it does have a per-row last modified time. We could use the same modified-time approach that the WordPress plugin uses when looking up the cache.

  3. Gatsby clears the cache on every build. Highlighted in this issue - with workaround https://github.com/gatsbyjs/gatsby/issues/10161
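The lookup proposed in items 1 and 2 could look roughly like the following. This is a minimal sketch, not the plugin's code: `getCachedOrDownload`, the cache key format, and the injected `download` function are all hypothetical, and a plain `Map` stands in for whatever cache Gatsby provides.

```javascript
// Hypothetical sketch: reuse a cached attachment unless the row's modified time changed.
// `cache` is any object with get/set methods (a Map works for illustration).
async function getCachedOrDownload({ cache, attachment, rowModifiedTime, download }) {
  const cacheKey = `airtable-attachment-${attachment.id}`;
  const cached = await cache.get(cacheKey);

  // Cache hit: the row has not been modified since the file was stored.
  if (cached && cached.rowModifiedTime === rowModifiedTime) {
    return { file: cached.file, downloaded: false };
  }

  // Cache miss or stale entry: download again and remember the modified time.
  const file = await download(attachment.url);
  await cache.set(cacheKey, { file, rowModifiedTime });
  return { file, downloaded: true };
}
```

Keying on the attachment ID and comparing the row's modified time means an unchanged row never triggers a second download, while any edit to the row invalidates all of its attachments.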

Let me know if this is something that you would be open to. If so, I can see if I can hire someone to create a PR (this is above what I can do).

jbolda commented 4 years ago

We are using createRemoteFileNode which is a function within gatsby-source-filesystem. To my understanding, it is caching any of the files downloaded. We do not need to check and manage the file cache as this function handles it for us.

To my knowledge, Airtable doesn't give us (last I checked) any timestamp in the API that we can hook into to check the last modified date, so we pull the data via the API and create nodes every time.

We have had reports of rate limiting which could very well be the issue here. You may want to play around with the concurrency option. Also it seems that there is something going on that it clears your cache on every build. (I haven't seen that myself in any of my sites.)

Hopefully that helps give you some direction.

jbolda commented 4 years ago

Also, it may be worth testing out the next version (see #187), as we now send the file ext through to createRemoteFileNode. I am wondering if that may resolve some of these weird file node issues that seem to be cropping up.

socmov commented 4 years ago

I have it working now but it still takes about 10 mins to start up / build every time. Airtable lets you add a Last Updated field to a table. It isn't mandatory but it tracks the last updated date. Could it be an option to let people enter the Last Updated field that they have added to their table if they have one?

jbolda commented 4 years ago

As a built-in piece or are you manually creating a table with that field? For it to be usable, we would also need to be able to query and only return results modified after that datetime. Then from there, we would need to be able to replace stale nodes (which I think would be mostly taken care of by gatsby-core).
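To make the "only return results modified after that datetime" requirement concrete, here is a rough sketch of building such a filter. It assumes the user has added their own "Last Updated" style column and that the Airtable formula functions `IS_AFTER` and `DATETIME_PARSE` behave as expected with an ISO timestamp; `buildModifiedSinceFormula` is a hypothetical helper, not part of this plugin.

```javascript
// Hypothetical helper: build an Airtable formula matching rows changed after `since`.
// `fieldName` is a user-created column (e.g. "Last Updated"), not a built-in field.
function buildModifiedSinceFormula(fieldName, since) {
  const iso = new Date(since).toISOString();
  return `IS_AFTER({${fieldName}}, DATETIME_PARSE('${iso}'))`;
}

// The resulting string would be sent as the `filterByFormula` query parameter.
const formula = buildModifiedSinceFormula("Last Updated", "2020-05-01T00:00:00Z");
```

Even with such a filter, the plugin would still need a way to keep the unchanged nodes from the previous build, which is the second half of the concern above.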

Is it working due to switching to the next version, v2.1.1? Is the 10 minutes the time taken for build without a cache? How long does it take with the cache existing? The images should be getting cached so I would expect there to be a difference.

benknight commented 4 years ago

I'm having similar issues as described here. I have a large base with about 1300 records all with multiple attachments. On each Gatsby build, with or without cache, this plugin redownloads images:

(my local machine:)
fetch all Airtable rows from 7 tables: 17.966s
success source and transform nodes - 175.397s

(Github action is much faster:)
fetch all Airtable rows from 7 tables: 1.476s
success source and transform nodes - 11.954s

However the bigger issue comes later on in the build. Since the thumbnails are being redownloaded each time, thumbnails are being regenerated as well, and this step alone takes as long as 15-20 minutes.

success Generating image thumbnails - 922.476s - 12546/12546 13.60/s

I've also tried enabling incremental builds via GATSBY_EXPERIMENTAL_PAGE_BUILD_ON_DATA_CHANGES=true and I believe because the images are redownloaded on every build, Gatsby considers each and every page to have changed and therefore rebuilds 100% of the site every time, regardless of the incremental builds flag.

@jbolda can you advise?

jbolda commented 4 years ago

@benknight is this on the latest version of gatsby-source-airtable?

benknight commented 4 years ago

@jbolda yes this is 2.1.1

    "gatsby": "^2.21.31",
    "gatsby-image": "^2.4.3",
    "gatsby-plugin-algolia": "^0.5.0",
    "gatsby-plugin-manifest": "^2.4.3",
    "gatsby-plugin-offline": "^3.2.2",
    "gatsby-plugin-prefetch-google-fonts": "^1.4.3",
    "gatsby-plugin-react-helmet": "^3.3.1",
    "gatsby-plugin-robots-txt": "^1.5.0",
    "gatsby-plugin-sass": "^2.3.1",
    "gatsby-plugin-sharp": "^2.6.3",
    "gatsby-plugin-sitemap": "^2.4.2",
    "gatsby-source-airtable": "^2.1.1",
    "gatsby-source-filesystem": "^2.3.3",
    "gatsby-transformer-sharp": "^2.5.2",

jbolda commented 4 years ago

@benknight Do you have a specific example that you can point to where this happens? It looks like you are using Github Actions so being able to see the full logs from that would be helpful.

Assuming we are talking about your cocolist project, in all the examples I can find it looks like Gatsby is scrubbing your cache: https://github.com/benknight/cocolist/runs/682645699?check_suite_focus=true#step:8:29 . Also note, you need to keep your public directory in your cache action, as that is also used to store final output, and its absence may be part of the reason that your .cache thinks it needs to scrub.

Everything in success Generating image thumbnails is sort of out of this plugin's scope, which is unfortunate as it surfaces here. All we are doing is passing the url to createRemoteFileNode, which could possibly have a bug in it not persisting the cache. It seems that it now expects an extension, which it didn't before, and which we added in v2.1.1. The other possibility is that something else in your instance is deleting those nodes or marking them to be deleted. It would be nice to know what is happening so at the very least we can add it to the docs.

benknight commented 4 years ago

Hey @jbolda let me back up, I think I may have provided too much info in my original post, which confused the central issue.

This is not related to Github Actions, I'm seeing the same behavior on my local machine.

What seems to be the issue, as the OP described, is that this plugin is re-downloading attachments on every build. I believe because of this, later down the line thumbnails are also regenerated, which is a 15-20 minute step alone. The key lines of the output were what I provided above:

fetch all Airtable rows from 7 tables: 17.966s
success source and transform nodes - 175.397s  <--- attachments redownloaded ~3 minutes

… a few moments later …

success Generating image thumbnails - 922.476s - 12546/12546 13.60/s <--- 15 minutes spent regenerating thumbnail for re-downloaded attachments

jbolda commented 4 years ago

@benknight I don't believe you have confused me 😄 . I was specifically looking for Github Actions as then we have a shared "state" of logs that we both can look at. A snippet of logs with number of seconds gives very little context to go off.

To be clear, this plugin uses gatsby-source-filesystem for the createRemoteFileNode function. This handles the image / file downloading and caching as previously noted. I have tested it locally and in CI / Github Actions and have observed the expected behavior. The WordPress plugin uses its own cache and checks the file nodes there. It doesn't appear Airtable has a global option in the API to pull data in any time-based way.

I specifically pointed out that line in the CI because something in your setup was deleting your cache, which would cause the images to be downloaded and processed again. This plugin does query and create the full set of nodes every time, but that only includes the file url, which is then passed to createRemoteFileNode. That function creates its own node with a check in cache for the images.
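The hand-off described here can be sketched as follows. This is an illustration under assumptions, not the plugin's actual source: `attachmentToFileNode` is a hypothetical name, the attachment is assumed to carry `url` and `filename` fields, and `createRemoteFileNode` is injected so the snippet runs without installed packages (in a real plugin it would come from gatsby-source-filesystem).

```javascript
const path = require("path");

// Sketch: pass an attachment URL (plus its extension) to createRemoteFileNode.
// All Gatsby helpers are supplied by the caller via `helpers`.
async function attachmentToFileNode(attachment, parentNodeId, helpers) {
  const { createRemoteFileNode, store, cache, createNode, createNodeId } = helpers;
  return createRemoteFileNode({
    url: attachment.url,
    parentNodeId,
    store,
    cache,
    createNode,
    createNodeId,
    // Extension taken from the file name, e.g. ".png".
    ext: path.extname(attachment.filename),
  });
}
```

The plugin side only supplies the URL and extension; whether a download actually happens is decided inside createRemoteFileNode and its cache.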

this plugin is re-downloading attachments on every build

I am not disagreeing that it appears your images are being reprocessed. I am disagreeing on this quoted point (with the stipulation that I am open to being proved wrong). My proposition is that the issue is within gatsby-core and a general cache issue, or with createRemoteFileNode and how it handles its specific cache. Investigating any of these needs a lot of context, ideally a consistent reproduction, and a bunch of someone's time. None of these have surfaced yet.

benknight commented 4 years ago

Fair enough, thanks for taking the time to thoroughly respond. I admit I got "excited" when I saw this issue and assumed lazily it must be the diagnosis for my long build times, but I think I need to dig a bit deeper to find out what's going on here.

jbolda commented 4 years ago

Certainly not unfounded, as bugs are sure to abound: cache is a really hard thing to get right.

Is this on the cocolist project of yours? I would be interested in looking at some logs to see if anything jumps out at me.

benknight commented 4 years ago

@jbolda I certainly appreciate it if you want to take a look. You're correct, it's the Cocolist project. The most recent build log will be found here:

https://github.com/benknight/cocolist/runs/685181371?check_suite_focus=true

You can see I already modified my action config to cache the Gatsby public directory as well.

jbolda commented 4 years ago

If you want additional examples to check out, I just ran a build with the cache on Github Actions. While my image step (and number of images) is much less intensive than yours, it cuts my build time in half. With incremental builds on, it only updates the two pages that have changed.

Build without existing cache which creates the cache: https://github.com/jbolda/jacobbolda.com/runs/685912076?check_suite_focus=true#step:9:37 (This is the image process step, 510 of them). Build with cache: https://github.com/jbolda/jacobbolda.com/runs/686009444?check_suite_focus=true#step:9:35 (No images are created in this build, this is the step right before images.)

benknight commented 4 years ago

Okay well I guess I have no idea what's going on with my build then ^^ It's completely redownloading nodes and regenerating ~13,000 thumbnails with/without cache. Thanks anyway for helping me troubleshoot.

jbolda commented 4 years ago

Looks like your most recent builds are working mostly. Build times about a third. https://github.com/benknight/cocolist/runs/701946567?check_suite_focus=true#step:9:61

I would recommend adding a cache key based on your lock file hash to both the public and .cache as well. As is, that cache will never be updated so all your future builds will start from a cache from the first time it built successfully.
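The idea behind a lock-file-based cache key is illustrated below in plain Node. In a GitHub Actions workflow this would normally be done with the built-in `hashFiles` expression instead; `buildCacheKey` and the `gatsby-cache` prefix are hypothetical names used only for this sketch.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derive a cache key that changes whenever the lock file changes.
function buildCacheKey(prefix, lockFileContents) {
  const hash = crypto.createHash("sha256").update(lockFileContents).digest("hex");
  return `${prefix}-${hash}`;
}

// Identical lock file contents give the same key; any dependency change gives a new one.
const keyBefore = buildCacheKey("gatsby-cache", "gatsby@2.21.31\n");
const keyAfter = buildCacheKey("gatsby-cache", "gatsby@2.21.32\n");
```

Because the key changes with the dependencies, a fresh cache entry is saved after each dependency update rather than every build restoring the very first one.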

Also with incremental builds, --log-pages is nice so you can see what pages were actually rebuilt (and if the number of related assets built make sense).

benknight commented 4 years ago

@jbolda yeah I realized shortly after my last message that the build times suddenly cut down, so it looks like my configurations are finally mostly correct. Thanks again for helping me troubleshoot. This issue can probably be closed :)

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any recent activity for 30 days. It will be closed if no further activity occurs within 7 days. Remove stale label, comment, and/or apply "ongoing issue" label to prevent this from being closed. Thank you for your contributions.