gratipay / gratipay.com

Here lieth a pioneer in open source sustainability. RIP
https://gratipay.news/the-end-cbfba8f50981
MIT License

Integrate npm #4148

Closed chadwhitacre closed 7 years ago

chadwhitacre commented 7 years ago

✈️ This is the flight deck for the Integrate npm project. ✈️


Current open-source crowdfunding options (Kickstarter, Patreon, Gratipay, OpenCollective, etc.) are consumer-grade. Our hunch is that a business-grade product with better aggregation can better serve the companies that want to pay for open source, because companies use hundreds or thousands of open source packages, not just a few.

Picking up from https://github.com/gratipay/gratipay.com/pull/4135#issuecomment-255122149 and https://github.com/gratipay/inside.gratipay.com/issues/852#issuecomment-255098337 ...

For wider context see:

JavaScript is the most popular language in open source and npm is the most popular package manager for JavaScript. A good first concrete step towards helping companies pay for open source (#4135), therefore, will be to add the ability to pay for any package on npm. Once we have npm deployed, we will have enough experience to inform a partnership with Libraries.io for the rest of the package managers.

Target

Our goal is to announce this feature in my lightning talk on Thursday, October 26 at Red Hat's All Things Open conference (https://github.com/gratipay/inside.gratipay.com/issues/757).

Our goal is to incrementally improve this feature throughout the first half of 2017, with an eye towards OSCON and $ustain in May.

Package names to test with

From https://github.com/gratipay/gratipay.com/pull/4135#issuecomment-262672635:

http://localhost:8537/on/npm/async/
http://localhost:8537/on/npm/iframe-resizer/
http://localhost:8537/on/npm/mongoose/
http://localhost:8537/on/npm/nodemon/
http://localhost:8537/on/npm/react/
http://localhost:8537/on/npm/react-helmet/
http://localhost:8537/on/npm/react-modal/
http://localhost:8537/on/npm/react-redux/
http://localhost:8537/on/npm/react-router/
http://localhost:8537/on/npm/react-router-redux/
http://localhost:8537/on/npm/redux/
http://localhost:8537/on/npm/redux-thunk/
http://localhost:8537/on/npm/webpack/

Todo

Prerequisites

Checkpoint 1: Inert /on/npm/foo/ Pages

Checkpoint 2: Giving to Packages

Checkpoint 3: Easy Sign-up

Nice to Have

Promotion



chadwhitacre commented 7 years ago
2016-10-27T00:35:26.231474+00:00 app[scheduler.9462]: processed 372271 packages in 247 seconds
2016-10-27T00:35:35.102567+00:00 heroku[scheduler.9462]: source=scheduler.9462 dyno=heroku.1942532.7efd96ec-cd0b-4718-b884-a779ee6f7de1 sample#load_avg_1m=0.77 sample#load_avg_5m=0.98
2016-10-27T00:35:35.102790+00:00 heroku[scheduler.9462]: source=scheduler.9462 dyno=heroku.1942532.7efd96ec-cd0b-4718-b884-a779ee6f7de1 sample#memory_total=140.20MB sample#memory_rss=34.25MB sample#memory_cache=105.94MB sample#memory_swap=0.00MB sample#memory_pgpgin=220943pages sample#memory_pgpgout=185564pages sample#memory_quota=512.00MB
2016-10-27T00:35:37.420115+00:00 heroku[scheduler.9462]: State changed from up to complete
2016-10-27T00:35:37.400471+00:00 heroku[scheduler.9462]: Process exited with status 0
chadwhitacre commented 7 years ago

Okay, so that looks like 247 seconds for the download/serialization, and another 11 seconds for the upsert. Call it ~260 seconds == 4.333 minutes.

chadwhitacre commented 7 years ago

We have a nightly backup from last night. I am taking another one just for safety.

chadwhitacre commented 7 years ago

Looks like packages adds 15 MB, or 21%.

screen shot 2016-10-26 at 8 48 25 pm

chadwhitacre commented 7 years ago

Curveball: some packages are unpublished without having a security package in their place. They show up in all and also on the registry, but their JSON is actually 404 and they have an unpublished time. They don't show up on site search or HTML.

chadwhitacre commented 7 years ago

I'm adding the node buildpack back to our app so we can use npm install marky-markdown.

[gratipay] $ heroku run "npm install marky-markdown && sync-npm readmes"
Running npm install marky-markdown && sync-npm readmes on ⬢ gratipay... up, run.4576 (Hobby)
bash: npm: command not found
[gratipay] $ heroku buildpacks
=== gratipay Buildpack URL
https://github.com/gratipay/buildpack-python.git#gratipay-prod
[gratipay] $ heroku buildpacks:add heroku/nodejs
Buildpack added. Next release on gratipay will use:
  1. https://github.com/gratipay/buildpack-python.git#gratipay-prod
  2. heroku/nodejs
Run git push heroku master to create a new release using these buildpacks.
[gratipay] $ git push heroku master
chadwhitacre commented 7 years ago
gratipay::MAROON=> select count(*) from packages where readme_raw is not null;
┌───────┐
│ count │
├───────┤
│    22 │
└───────┘
(1 row)

gratipay::MAROON=>
chadwhitacre commented 7 years ago

Okay! Gonna set this up in the scheduler.

chadwhitacre commented 7 years ago

Blorg. Hitting a memory limit. :-/

2016-10-27T03:30:45.084125+00:00 app[scheduler.6359]: svg2css
2016-10-27T03:30:45.088103+00:00 app[scheduler.6359]: splat-points-2d
2016-10-27T03:30:45.092127+00:00 app[scheduler.6359]: signeds3
2016-10-27T03:30:45.125973+00:00 app[scheduler.6359]: saltyrtc-task-webrtc
2016-10-27T03:30:45.127890+00:00 app[scheduler.6359]: resst-request
2016-10-27T03:30:45.723554+00:00 app[scheduler.6359]: 404 for zzzzzzzzzzzzzzzzzzz
2016-10-27T03:30:45.724534+00:00 app[scheduler.6359]: zzzttt
2016-10-27T03:30:45.979220+00:00 app[scheduler.6359]: 404 for zzzttt
2016-10-27T03:30:45.979404+00:00 app[scheduler.6359]: zzzss
2016-10-27T03:31:00.393328+00:00 heroku[scheduler.6359]: source=scheduler.6359 dyno=heroku.1942532.ee6fcf22-109a-49cf-aa0a-0136911a86ac sample#memory_total=641.52MB sample#memory_rss=103.62MB sample#memory_cache=4.98MB sample#memory_swap=532.91MB sample#memory_pgpgin=251526pages sample#memory_pgpgout=223723pages sample#memory_quota=512.00MB
2016-10-27T03:31:00.394012+00:00 heroku[scheduler.6359]: Process running mem=641M(124.3%)
2016-10-27T03:31:00.394066+00:00 heroku[scheduler.6359]: Error R14 (Memory quota exceeded)
chadwhitacre commented 7 years ago

Started again under scheduler ...

chadwhitacre commented 7 years ago
[gratipay] $ heroku logs --tail | grep scheduler
2016-10-27T03:50:20.064405+00:00 heroku[api]: Starting process with command `sync-npm readmes` by scheduler@addons.heroku.com
2016-10-27T03:50:27.921066+00:00 heroku[scheduler.4775]: Starting process with command `sync-npm readmes`
2016-10-27T03:50:28.518982+00:00 heroku[scheduler.4775]: State changed from starting to up
2016-10-27T03:50:35.273088+00:00 heroku[scheduler.4775]: source=scheduler.4775 dyno=heroku.1942532.ea097600-5c50-446b-bc32-d9ebbacfd652 sample#memory_total=35.95MB sample#memory_rss=31.01MB sample#memory_cache=4.95MB sample#memory_swap=0.00MB sample#memory_pgpgin=14112pages sample#memory_pgpgout=5419pages sample#memory_quota=512.00MB
2016-10-27T03:50:42.900770+00:00 app[scheduler.4775]: zzzzzzzzzzzzzzzzzzz
2016-10-27T03:50:42.933254+00:00 app[scheduler.4775]: utilise.clone
2016-10-27T03:50:42.936615+00:00 app[scheduler.4775]: svg2base64
2016-10-27T03:50:42.948038+00:00 app[scheduler.4775]: semver-bumper-for-file-text
2016-10-27T03:50:43.226081+00:00 app[scheduler.4775]: 404 for zzzzzzzzzzzzzzzzzzz
2016-10-27T03:50:43.226131+00:00 app[scheduler.4775]: zzzttt
2016-10-27T03:50:43.458823+00:00 app[scheduler.4775]: 404 for zzzttt
2016-10-27T03:50:43.458926+00:00 app[scheduler.4775]: zzzazzz
2016-10-27T03:50:44.033136+00:00 app[scheduler.4775]: 404 for zzzazzz
2016-10-27T03:50:44.033150+00:00 app[scheduler.4775]: zzz_012_censorify
2016-10-27T03:50:54.621207+00:00 app[scheduler.4775]: utilise.client
2016-10-27T03:50:55.719408+00:00 heroku[scheduler.4775]: source=scheduler.4775 dyno=heroku.1942532.ea097600-5c50-446b-bc32-d9ebbacfd652 sample#memory_total=360.68MB sample#memory_rss=355.62MB sample#memory_cache=5.00MB sample#memory_swap=0.06MB sample#memory_pgpgin=158661pages sample#memory_pgpgout=67876pages sample#memory_quota=512.00MB
2016-10-27T03:50:55.774905+00:00 app[scheduler.4775]: svg2android
2016-10-27T03:50:56.292132+00:00 app[scheduler.4775]: semver-bumper
chadwhitacre commented 7 years ago
gratipay::MAROON=> select count(*) from packages where readme_raw is not null;
┌───────┐
│ count │
├───────┤
│    82 │
└───────┘
(1 row)

gratipay::MAROON=>
chadwhitacre commented 7 years ago

Call it 50 per minute. That's over five days. 😶
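
For reference, the five-day figure follows from the package count logged earlier (372,271) and the observed rate. A quick back-of-the-envelope check, assuming the rate stays constant, which it may well not:

```python
# Rough estimate of how long the initial readme sync would take.
# Figures come from the logs above; a constant rate is an assumption.
PACKAGES = 372271        # "processed 372271 packages" in the scheduler log
RATE_PER_MINUTE = 50     # observed readme fetch rate, rounded

def estimate_days(packages, rate_per_minute):
    """Return the estimated run time in days at a constant rate."""
    minutes = packages / float(rate_per_minute)
    return minutes / 60 / 24

days = estimate_days(PACKAGES, RATE_PER_MINUTE)
print("%.1f days" % days)
```

At 50 per minute this comes out a little over five days.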

chadwhitacre commented 7 years ago

Alright, I'm about done here. We didn't make it.

Oh well. :-)

chadwhitacre commented 7 years ago

I've turned off both recurring tasks.

chadwhitacre commented 7 years ago

Sorry, I mean I've removed them from the scheduler so that they don't recur. The existing process is still chugging along. It just crossed 1,000 readmes, and we're coming up on 1,100 at 20 minutes. It's holding steady at about 54 per minute, so I guess we let this run for five days?

mattbk commented 7 years ago

So is it going to run weekly rather than daily then?

kaguillera commented 7 years ago

I don't think so, @mattbk. If I am not mistaken, @whit537, the five days is for the initial loading and processing of the npm readmes. Future updates should take less time, since they should only update the readmes that have changed and add new ones, and there should not be that many of those.

kaguillera commented 7 years ago

!m @whit537

mattbk commented 7 years ago

Thank goodness.

chadwhitacre commented 7 years ago

Yes, though I had forgotten that the Heroku scheduler kills processes when the time limit is up, so I am just restarting it now.

kaguillera commented 7 years ago

😞


chadwhitacre commented 7 years ago
gratipay::MAROON=> select (select count(*) from packages where readme_raw is not null) / (select count(*) from packages)::float;
┌───────────────────┐
│     ?column?      │
├───────────────────┤
│ 0.114843756295817 │
└───────────────────┘
(1 row)

gratipay::MAROON=>
chadwhitacre commented 7 years ago

26%

chadwhitacre commented 7 years ago

35%

I count 109,354 unique email addresses.

select unnest(emails) as email, count(name) from packages group by email;
chadwhitacre commented 7 years ago
| npackages | nemails |
|-----------|---------|
| 1,000+    | 8       |
| 100-999   | 462     |
| 10-99     | 10,266  |
| 2-9       | 52,567  |
| 1         | 46,051  |
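
The bucketing in the table above could be reproduced from per-email package counts with something like the following. This is a sketch only: `bucket_counts` and the sample data are hypothetical, and the bucket edges are simply read off the table.

```python
from collections import Counter

# Bucket edges as in the table above: (label, lower bound, upper bound).
BUCKETS = [
    ("1,000+", 1000, None),
    ("100-999", 100, 999),
    ("10-99", 10, 99),
    ("2-9", 2, 9),
    ("1", 1, 1),
]

def bucket_counts(packages_per_email):
    """Map {email: npackages} to {bucket label: number of emails}."""
    result = Counter()
    for npackages in packages_per_email.values():
        for label, low, high in BUCKETS:
            if npackages >= low and (high is None or npackages <= high):
                result[label] += 1
                break
    return dict(result)

# Hypothetical sample input, e.g. rows from the unnest(emails) query.
sample = {"a@example.com": 1, "b@example.com": 7, "c@example.com": 1500}
print(bucket_counts(sample))
```

As a sanity check, the five buckets in the table sum to 109,354, which matches the unique-email count.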
chadwhitacre commented 7 years ago

Looks like npm is stripping a leading H1.

screen shot 2016-10-28 at 4 17 13 pm

screen shot 2016-10-28 at 4 16 51 pm

chadwhitacre commented 7 years ago

Hmm ... https://www.npmjs.com/package/marky-markdown#npm-packages

chadwhitacre commented 7 years ago

Syncers are running again! https://github.com/gratipay/inside.gratipay.com/issues/884#issuecomment-259320916

I guess when we upsert we'll want to go off the time from back when we first started loading things up a couple weeks ago, back at https://github.com/gratipay/gratipay.com/issues/4148#issuecomment-256370122, to make sure we pick up any changes since then.

chadwhitacre commented 7 years ago

Fetcher crashed!

2016-11-09T02:45:47.778173+00:00 app[scheduler.7320]: Traceback (most recent call last):
2016-11-09T02:45:47.779195+00:00 app[scheduler.7320]:   File "/app/.heroku/python/bin/sync-npm", line 9, in <module>
2016-11-09T02:45:47.779275+00:00 app[scheduler.7320]:     load_entry_point('gratipay', 'console_scripts', 'sync-npm')()
2016-11-09T02:45:47.779327+00:00 app[scheduler.7320]:   File "/app/gratipay/package_managers/sync.py", line 160, in main
2016-11-09T02:45:47.779601+00:00 app[scheduler.7320]:     globals()[args.command.replace('-', '_')](env, args, db)
2016-11-09T02:45:47.779619+00:00 app[scheduler.7320]:   File "/app/gratipay/package_managers/sync.py", line 140, in fetch_readmes
2016-11-09T02:45:47.779835+00:00 app[scheduler.7320]:     k-component-page_readmes.fetch(db)
2016-11-09T02:45:47.779868+00:00 app[scheduler.7320]:
2016-11-09T02:45:47.779887+00:00 app[scheduler.7320]:   File "/app/gratipay/package_managers/readmes.py", line 93, in fetch
2016-11-09T02:45:47.781538+00:00 app[scheduler.7320]:     threaded_map(Fetcher(db), dirty, 4)
2016-11-09T02:45:47.781590+00:00 app[scheduler.7320]:   File "/app/gratipay/utils/threaded_map.py", line 23, in threaded_map
2016-11-09T02:45:47.781767+00:00 app[scheduler.7320]:     raise e.args[0]
2016-11-09T02:45:47.782241+00:00 app[scheduler.7320]: psycopg2.ProgrammingError: can't adapt type 'dict'
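
psycopg2 raises this error when a Python dict is passed as a query parameter without an adapter. The thread does not say which value was a dict. One plausible guess is that a field expected to be text occasionally arrives as a JSON object. A minimal sketch of a defensive coercion under that assumption, where `to_db_text` is a hypothetical helper and not part of the Gratipay codebase:

```python
import json

def to_db_text(value):
    """Coerce a value into something a text column can accept.

    Strings and None pass through unchanged; dicts and lists are
    serialized to JSON text so that the database driver never sees them.
    """
    if value is None or isinstance(value, str):
        return value
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return str(value)

print(to_db_text("# My Package"))
print(to_db_text({"unexpected": "object"}))
```

psycopg2 also ships `psycopg2.extras.Json`, which wraps a dict for a `json` column, if storing the structure is preferable to flattening it.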
chadwhitacre commented 7 years ago

Meanwhile the Processor is starting.

chadwhitacre commented 7 years ago

Also crashed!

2016-11-09T02:50:45.011994+00:00 app[scheduler.7918]: Traceback (most recent call last):
2016-11-09T02:50:45.012249+00:00 app[scheduler.7918]:   File "/app/.heroku/python/bin/sync-npm", line 9, in <module>
2016-11-09T02:50:45.012320+00:00 app[scheduler.7918]:     load_entry_point('gratipay', 'console_scripts', 'sync-npm')()
2016-11-09T02:50:45.012355+00:00 app[scheduler.7918]:   File "/app/gratipay/package_managers/sync.py", line 160, in main
2016-11-09T02:50:45.012543+00:00 app[scheduler.7918]:     globals()[args.command.replace('-', '_')](env, args, db)
2016-11-09T02:50:45.012573+00:00 app[scheduler.7918]:   File "/app/gratipay/package_managers/sync.py", line 144, in process_readmes
2016-11-09T02:50:45.012635+00:00 app[scheduler.7918]:     _readmes.process(db)
2016-11-09T02:50:45.012663+00:00 app[scheduler.7918]:   File "/app/gratipay/package_managers/readmes.py", line 97, in process
2016-11-09T02:50:45.012887+00:00 app[scheduler.7918]:     dirty = db.all('SELECT id, package_manager, name, description, readme_raw '
2016-11-09T02:50:45.012917+00:00 app[scheduler.7918]:   File "/app/.heroku/python/lib/python2.7/site-packages/postgres/__init__.py", line 548, in all
2016-11-09T02:50:45.013917+00:00 app[scheduler.7918]:     return cursor.all(sql, parameters)
2016-11-09T02:50:45.013947+00:00 app[scheduler.7918]:   File "/app/.heroku/python/lib/python2.7/site-packages/postgres/cursors.py", line 145, in all
2016-11-09T02:50:45.014009+00:00 app[scheduler.7918]:     self.execute(sql, parameters)
2016-11-09T02:50:45.014036+00:00 app[scheduler.7918]:   File "/app/.heroku/python/lib/python2.7/site-packages/psycopg2/extras.py", line 288, in execute
2016-11-09T02:50:45.014475+00:00 app[scheduler.7918]:     return super(NamedTupleCursor, self).execute(query, vars)
2016-11-09T02:50:45.014555+00:00 app[scheduler.7918]: psycopg2.ProgrammingError: syntax error at or near "BY"
2016-11-09T02:50:45.014557+00:00 app[scheduler.7918]: LINE 1: ... packages WHERE readme_needs_to_be_processedORDER BY package...
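
The error message shows `readme_needs_to_be_processedORDER BY`, which is what happens when adjacent Python string literals are concatenated without a separating space. The sketch below reproduces the pattern. The column list is taken from the traceback, while the `FROM` clause and the `ORDER BY` columns are assumptions because the log truncates them.

```python
# Adjacent string literals are joined with no separator, so a missing
# trailing space fuses the last token of one line with the first of the next.
broken = ('SELECT id, package_manager, name, description, readme_raw '
          'FROM packages WHERE readme_needs_to_be_processed'
          'ORDER BY package_manager, name')   # ORDER BY columns are assumed

fixed = ('SELECT id, package_manager, name, description, readme_raw '
         'FROM packages WHERE readme_needs_to_be_processed '
         'ORDER BY package_manager, name')    # note the space before the quote

print('processedORDER' in broken)
print('processed ORDER' in fixed)
```

A test that merely executes the query against a database would be enough to catch this class of bug.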
chadwhitacre commented 7 years ago

I'm surprised the test suite didn't catch that one. Hmm ...

chadwhitacre commented 7 years ago

Ah! Now I see that we weren't testing that code. 😊

chadwhitacre commented 7 years ago

PR in https://github.com/gratipay/gratipay.com/pull/4176 ...

chadwhitacre commented 7 years ago

Hmm ... readme processor is running out of memory.

$ heroku logs --tail | grep scheduler
2016-11-09T03:50:28.311155+00:00 heroku[api]: Starting process with command `sync-npm process-readmes` by scheduler@addons.heroku.com
2016-11-09T03:50:35.164998+00:00 heroku[scheduler.8487]: Starting process with command `sync-npm process-readmes`
2016-11-09T03:50:35.846204+00:00 heroku[scheduler.8487]: State changed from starting to up
2016-11-09T03:50:47.749213+00:00 heroku[scheduler.8487]: source=scheduler.8487 dyno=heroku.1942532.1fbd95bf-81f1-4a26-aa9f-a30e88765924 sample#memory_total=142.36MB sample#memory_rss=137.41MB sample#memory_cache=4.95MB sample#memory_swap=0.00MB sample#memory_pgpgin=41624pages sample#memory_pgpgout=5691pages sample#memory_quota=512.00MB
2016-11-09T03:51:07.948771+00:00 heroku[scheduler.8487]: source=scheduler.8487 dyno=heroku.1942532.1fbd95bf-81f1-4a26-aa9f-a30e88765924 sample#memory_total=371.86MB sample#memory_rss=366.91MB sample#memory_cache=4.95MB sample#memory_swap=0.00MB sample#memory_pgpgin=100375pages sample#memory_pgpgout=5691pages sample#memory_quota=512.00MB
2016-11-09T03:51:28.261222+00:00 heroku[scheduler.8487]: source=scheduler.8487 dyno=heroku.1942532.1fbd95bf-81f1-4a26-aa9f-a30e88765924 sample#load_avg_1m=0.29
2016-11-09T03:51:28.261321+00:00 heroku[scheduler.8487]: source=scheduler.8487 dyno=heroku.1942532.1fbd95bf-81f1-4a26-aa9f-a30e88765924 sample#memory_total=1107.29MB sample#memory_rss=494.66MB sample#memory_cache=0.00MB sample#memory_swap=612.63MB sample#memory_pgpgin=406200pages sample#memory_pgpgout=280077pages sample#memory_quota=512.00MB
2016-11-09T03:51:28.261959+00:00 heroku[scheduler.8487]: Process running mem=1107M(216.3%)
2016-11-09T03:51:28.261959+00:00 heroku[scheduler.8487]: Error R15 (Memory quota vastly exceeded)
2016-11-09T03:51:28.262044+00:00 heroku[scheduler.8487]: Stopping process with SIGKILL
2016-11-09T03:51:28.473986+00:00 heroku[scheduler.8487]: Process exited with status 137
2016-11-09T03:51:28.519921+00:00 heroku[scheduler.8487]: State changed from up to complete
chadwhitacre commented 7 years ago

It's never even getting to the point where it emits a log line about processing even a single readme.

chadwhitacre commented 7 years ago

Ah, I bet it's because we're including readme_raw in the initial query over all unprocessed packages.
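
The idea, sketched with the standard-library `sqlite3` module so it runs anywhere: select only the small identifying columns up front, then load each `readme_raw` one at a time just before it is processed. The real code talks to Postgres, and the table and rows here are hypothetical stand-ins.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE packages (id INTEGER PRIMARY KEY, name TEXT, "
           "readme_raw TEXT, readme_needs_to_be_processed INTEGER)")
db.executemany(
    "INSERT INTO packages (name, readme_raw, readme_needs_to_be_processed) "
    "VALUES (?, ?, ?)",
    [("alpha", "# alpha " * 1000, 1), ("beta", "# beta " * 1000, 1),
     ("gamma", "# gamma", 0)],
)

def dirty_ids(db):
    """Return only ids of unprocessed packages; no readme text is loaded."""
    rows = db.execute("SELECT id FROM packages "
                      "WHERE readme_needs_to_be_processed = 1 ORDER BY id")
    return [row[0] for row in rows]

def load_readme(db, package_id):
    """Fetch a single readme just before it is processed."""
    row = db.execute("SELECT readme_raw FROM packages WHERE id = ?",
                     (package_id,)).fetchone()
    return row[0]

processed = 0
for package_id in dirty_ids(db):
    raw = load_readme(db, package_id)   # one readme loaded per iteration
    processed += 1
print(processed)
```

The trade-off is one extra round trip per package in exchange for a small initial result set.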

chadwhitacre commented 7 years ago

PR in https://github.com/gratipay/gratipay.com/pull/4177.

chadwhitacre commented 7 years ago

Blorg.

Traceback (most recent call last):
  File "/app/.heroku/python/bin/sync-npm", line 9, in <module>
    load_entry_point('gratipay', 'console_scripts', 'sync-npm')()
  File "/app/gratipay/package_managers/sync.py", line 160, in main
    globals()[args.command.replace('-', '_')](env, args, db)
  File "/app/gratipay/package_managers/sync.py", line 144, in process_readmes
    _readmes.process(db)
  File "/app/gratipay/package_managers/readmes.py", line 100, in process
    threaded_map(Processor(db), dirty, 4)
  File "/app/gratipay/utils/threaded_map.py", line 23, in threaded_map
    raise e.args[0]
OSError: net.js:639
    throw new TypeError('invalid data');
    ^

TypeError: invalid data
  at Socket.write (net.js:639:11)
  at /app/bin/our-marky-markdown.js:20:18
  at FSReqWrap.readFileAfterClose [as oncomplete] (fs.js:404:3)
chadwhitacre commented 7 years ago

Unable to reproduce locally. Too jetlagged, more tomorrow ...

chadwhitacre commented 7 years ago

Lessee here ...

chadwhitacre commented 7 years ago

I can repro the bug using a one-off dyno.

chadwhitacre commented 7 years ago

Still works locally. Hmm ...

chadwhitacre commented 7 years ago

https://github.com/gratipay/gratipay.com/issues/4148#issuecomment-259330483 points to net.js and fs.js. Let's see if we can get a better error log ...
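
The `OSError` in the traceback above carries what looks like Node's stderr, which suggests the Python side shells out to `our-marky-markdown.js` and raises whatever the child prints on failure. The real `render_like_npm` is not shown in this thread, so the following is only a sketch of that pattern. `render_via_subprocess` is a hypothetical name, and a Python child process stands in for Node so the example runs without it.

```python
import subprocess
import sys

def render_via_subprocess(argv, raw):
    """Pipe raw text to a child process and return its stdout.

    Anything the child writes to stderr is raised as OSError, which is
    how a Node stack trace can end up inside a Python traceback.
    """
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate(raw.encode("utf8"))
    if proc.returncode != 0 or err:
        raise OSError(err.decode("utf8"))
    return out.decode("utf8")

# Stand-in child process: Python upper-casing stdin, instead of Node.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(render_via_subprocess(child, "hello"))
```

Including the package name in the raised message would make it easier to tell which readme triggered the failure.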

chadwhitacre commented 7 years ago
~ $ sync-npm process-readmes
zzzzzzzzzzzzzzzzzzz
utilise.emitterify
svg2png-cli
semver-max
zzzzzzxl
react-router-native
plane-to-polygon
Traceback (most recent call last):
  File "/app/gratipay/utils/threaded_map.py", line 15, in g
    return func(*a, **kw)
  File "/app/gratipay/package_managers/readmes.py", line 72, in process
    processed = markdown.render_like_npm(raw)
  File "/app/gratipay/utils/markdown.py", line 44, in render_like_npm
    raise OSError(err)
OSError: net.js:639
    throw new TypeError('invalid data');
    ^

TypeError: invalid data
  at Socket.write (net.js:639:11)
  at /app/bin/our-marky-markdown.js:20:18
  at FSReqWrap.readFileAfterClose [as oncomplete] (fs.js:404:3)

Traceback (most recent call last):
  File "/app/.heroku/python/bin/sync-npm", line 9, in <module>
    load_entry_point('gratipay', 'console_scripts', 'sync-npm')()
  File "/app/gratipay/package_managers/sync.py", line 160, in main
    globals()[args.command.replace('-', '_')](env, args, db)
  File "/app/gratipay/package_managers/sync.py", line 144, in process_readmes
    _readmes.process(db)
  File "/app/gratipay/package_managers/readmes.py", line 100, in process
    threaded_map(Processor(db), dirty, 4)
  File "/app/gratipay/utils/threaded_map.py", line 23, in threaded_map
    raise e.args[0]
OSError: net.js:639
    throw new TypeError('invalid data');
    ^

TypeError: invalid data
  at Socket.write (net.js:639:11)
  at /app/bin/our-marky-markdown.js:20:18
  at FSReqWrap.readFileAfterClose [as oncomplete] (fs.js:404:3)

planet-news
planet-names
planet-maps-raster-2
planet-maps-raster
planet-maps
planet-icons
planet-hacker
planet-forty
planet-feeds
planetfall-archive
planet-facts
planet-express
planet-digest
planet-css
planet-client
planet-classic
planet-blank
planetary-system
planetary.js
planetary
planet
planer
planepacker
planemo
plane-mesh
chadwhitacre commented 7 years ago

Is it a problem with stdout?

chadwhitacre commented 7 years ago

The 1st argument given to .write() is expected to be a String or Buffer.

http://stackoverflow.com/a/26851940

chadwhitacre commented 7 years ago

What is marky returning that is not a String or Buffer?

chadwhitacre commented 7 years ago

I made a local modification to just process packages with names starting with pla, but wasn't able to trigger the error that way.