luttje / glua-api-snippets

Scrapes the Garry's Mod Wiki in order to build Lua Language Server comments that will provide IDE suggestions and autocompletion.
MIT License

Optimize scraping #8 #31

Open luttje opened 6 months ago

luttje commented 6 months ago

In #8 I stated that it would be perfectly achievable to scrape only the changed pages instead of the entire wiki. This PR attempted to implement that, but I ran into an issue:

I seem to have been mistaken in thinking there would be a list on the gmod wiki with all changes, which would let us scrape only the updates. However, the only list I can find is https://wiki.facepunch.com/gmod/~recentchanges, which shows only recent changes (last 30 days?) and doesn't allow pagination to discover older ones.

If anyone has got any ideas around this I'm open to suggestions.

I'll leave this PR as a draft until a solution is found. I won't actively look into a solution myself, so help is greatly appreciated. In any case, this is marked low-priority, since scraping the entire wiki works fine (besides being a bit wasteful).

aske02 commented 6 months ago

The updateCount field in wiki.facepunch.com/gmod/~pagelist?format=json could work. I haven't checked whether the counts actually change when a page is edited, but I would assume so. It would be as simple as saving each page's count when scraping and comparing it against the count on the next run.

luttje commented 6 months ago

@aske02 Wow, I can't believe I missed that. I even looked at this data and somehow concluded "this is not useful".

You're right, we could put the update count into the __metadata.json and use that to figure out what's entirely new, updated, or deleted.
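
For illustration, here is a rough sketch of that comparison in Python. It is not this project's code, and it rests on assumptions: each entry in the page list is assumed to carry an "address" key identifying the page next to its updateCount, and the previous counts are assumed to be stored as a simple address-to-count mapping (for example inside __metadata.json). The names diff_page_list and PageDiff are made up for this example.

```python
from typing import Dict, Iterable, List, Mapping, NamedTuple


class PageDiff(NamedTuple):
    """Hypothetical result type: page addresses grouped by what happened."""
    new: List[str]
    updated: List[str]
    deleted: List[str]


def diff_page_list(
    previous: Mapping[str, int],
    current_pages: Iterable[Mapping[str, object]],
) -> PageDiff:
    """Compare stored update counts against a freshly fetched page list.

    `previous` maps page address -> updateCount saved by the last scrape.
    `current_pages` is the parsed page list; the "address" key is an
    assumption, "updateCount" is the field mentioned in this thread.
    """
    current: Dict[str, int] = {
        str(page["address"]): int(page["updateCount"]) for page in current_pages
    }

    # Pages that were not known during the previous scrape.
    new = sorted(addr for addr in current if addr not in previous)
    # Pages known before whose counter differs now.
    updated = sorted(
        addr for addr in current
        if addr in previous and current[addr] != previous[addr]
    )
    # Pages that were known before but are gone from the list.
    deleted = sorted(addr for addr in previous if addr not in current)

    return PageDiff(new=new, updated=updated, deleted=deleted)


# Example with made-up data:
saved_counts = {"Global.print": 3, "Entity:GetPos": 7, "Old.Page": 1}
fetched = [
    {"address": "Global.print", "updateCount": 3},
    {"address": "Entity:GetPos", "updateCount": 8},
    {"address": "Player:Nick", "updateCount": 1},
]
result = diff_page_list(saved_counts, fetched)
```

Only the pages in result.new and result.updated would need to be scraped again, and the output for result.deleted could be removed. This only works if updateCount really changes on every edit, which nobody in this thread has verified yet.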

I probably won't implement this until I have some more time available; I've got a ton of other projects that I want to focus my attention on right now.

Nevertheless, thanks so much for helping out with this, and so quickly as well! Much appreciated!