patrickhulce / third-party-web

Data on third party entities and their impact on the web.
https://www.thirdpartyweb.today/
MIT License

refactor: use query stream for automatic data update script #224

Closed Nigui closed 1 month ago

Nigui commented 1 month ago

Hello,

While customizing SQL queries for future PRs, I hit an error caused by very large query results from BigQuery.

The issue occurs because JSON.stringify can't handle very large JSON results: it throws RangeError: Invalid string length. This makes the script unable to scale.
Moving to a stream fixes the issue because data is handled row by row (each row is written to file and inserted into the db).
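
To illustrate the row-by-row approach, here is a minimal sketch (not the code from this PR). It assumes the rows arrive as an object-mode stream, which is what the BigQuery client's createQueryStream is expected to provide. The helper names rowToLine and writeRowsToFile are hypothetical, and only the file-writing half is shown, not the db insert.

```javascript
const fs = require('node:fs');
const { Transform } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Serialize a single row as one newline-delimited JSON line.
// Only this one row is ever turned into a string, never the full result set.
function rowToLine(row) {
  return JSON.stringify(row) + '\n';
}

// Hypothetical helper: drain an object-mode stream of rows into a file,
// one line per row, and resolve with the number of rows written.
async function writeRowsToFile(rowStream, filePath) {
  let count = 0;
  const toNdjson = new Transform({
    writableObjectMode: true, // accepts row objects, emits strings
    transform(row, _encoding, callback) {
      count += 1;
      callback(null, rowToLine(row));
    },
  });
  // pipeline handles backpressure and propagates errors from any stage.
  await pipeline(rowStream, toNdjson, fs.createWriteStream(filePath));
  return count;
}
```

Because each row is serialized on its own, memory use stays bounded by the largest single row, so the string-length limit that JSON.stringify hits on the full result no longer applies.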

Override lighthouse-infrastructure project

This PR also adds a new environment variable, OVERRIDE_LH_PROJECT, to override the project containing the third-party table. The script creates a new table in that project (the one storing the mapping between each observed domain and its canonical one), which is then queried by entity-per-page.sql. This helps when the script runner has no write access to the project hardcoded in the SQL script (i.e., lighthouse-infrastructure).
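
One way such an override could work is sketched below. This is an assumption about the mechanism, not the code from this PR: the thread does not show how the script applies the variable. The function names, dataset name, and table name are hypothetical.

```javascript
// Project name hardcoded in the SQL script, as described above.
const DEFAULT_LH_PROJECT = 'lighthouse-infrastructure';

// Pick the project to use: the override if set and non-empty, else the default.
function resolveProject(env = process.env) {
  return env.OVERRIDE_LH_PROJECT || DEFAULT_LH_PROJECT;
}

// Hypothetical helper: swap the hardcoded project for the resolved one
// everywhere it appears in a SQL string. split/join replaces every
// occurrence without treating the project name as a regex or pattern.
function applyProjectOverride(sql, env = process.env) {
  return sql.split(DEFAULT_LH_PROJECT).join(resolveProject(env));
}

// Example with a hypothetical dataset and table name:
// applyProjectOverride(
//   'SELECT * FROM `lighthouse-infrastructure.some_dataset.some_table`',
//   { OVERRIDE_LH_PROJECT: 'my-project' }
// );
```

When the variable is unset, the SQL is returned unchanged, so existing runs against lighthouse-infrastructure would behave as before.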

github-actions[bot] commented 3 weeks ago

:tada: This PR is included in version 0.25.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket: