chadwhitacre closed this 7 years ago
I overwrote the our-marky-markdown.js script on a Heroku dyno (using here strings) to give some debugging output.
html for null isn't a string, it's a function: function (selector, context, r, opts)
Alright, so we need to call .html() on the output of marky(). But why doesn't it fail locally? I have node 4.3.1 where Heroku has 5.11.1. Maybe the write API changed?
Nope.
Maybe I don't have the right data loaded up locally?
Well, they do differ.
gratipay=# select count(*) from packages;
┌────────┐
│ count │
├────────┤
│ 372147 │
└────────┘
(1 row)
gratipay=# select count(*) from packages where readme_raw is not null;
┌───────┐
│ count │
├───────┤
│ 134 │
└───────┘
(1 row)
gratipay=#
gratipay::MAROON=> select count(*) from packages;
┌────────┐
│ count │
├────────┤
│ 372271 │
└────────┘
(1 row)
gratipay::MAROON=> select count(*) from packages where readme_raw is not null;
┌────────┐
│ count │
├────────┤
│ 138260 │
└────────┘
(1 row)
gratipay::MAROON=>
Oh! I'm seeing marky-markdown v9.0.1 locally, but 8.1.0 on Heroku.
Cached build?
For the fetcher bug (https://github.com/gratipay/gratipay.com/issues/4148#issuecomment-259322332), I think we'll need to unthread in order to get better error messaging.
Actually, threaded_map is supposed to output the original traceback to stdout. Are we losing that? I guess we'll have to reschedule to see ...
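For reference, here is a minimal sketch of how a threaded map can surface the original traceback on stdout. This is not the actual gratipay/utils/threaded_map.py; the name threaded_map_sketch and the default pool size are assumptions for illustration, and it is written for Python 3 rather than the Python 2.7 that runs on the dyno.

```python
import sys
import traceback
from multiprocessing.pool import ThreadPool


def threaded_map_sketch(func, iterable, threads=5):
    """Map func over iterable in threads, printing worker tracebacks to stdout.

    Hypothetical helper, not the project's real implementation.
    """

    def g(*a, **kw):
        # Print the traceback inside the worker, where the original frames
        # are still available, then re-raise so the caller also fails.
        try:
            return func(*a, **kw)
        except Exception:
            traceback.print_exc(file=sys.stdout)
            raise

    pool = ThreadPool(threads)
    try:
        return pool.map(g, iterable)
    finally:
        pool.close()
        pool.join()
```

With a wrapper like this, the log would be expected to show two tracebacks per failure: the one printed in the worker and the one from the re-raise in the calling thread, which is easy to mistake for a dupe.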
Well that is consternating. Even after https://github.com/gratipay/gratipay.com/pull/4178 I am still seeing 8.1.0 on a Heroku dyno. Hmm ...
sync-npm process-readmes is working from a one-off dyno. 👍
I'm not seeing an error from sync-npm fetch-readmes in a one-off dyno. I've rescheduled both that and process-readmes.
Alright, the original traceback is in there, I had just confused it for a dupe.
2016-11-09T23:51:23.098776+00:00 app[scheduler.8465]: jstuningTraceback (most recent call last):
2016-11-09T23:51:23.098779+00:00 app[scheduler.8465]: File "/app/gratipay/utils/threaded_map.py", line 15, in g
2016-11-09T23:51:23.098780+00:00 app[scheduler.8465]: return func(*a, **kw)
2016-11-09T23:51:23.098781+00:00 app[scheduler.8465]: File "/app/gratipay/package_managers/readmes.py", line 55, in fetch
2016-11-09T23:51:23.098781+00:00 app[scheduler.8465]: , dirty.name
2016-11-09T23:51:23.098782+00:00 app[scheduler.8465]: File "/app/.heroku/python/lib/python2.7/site-packages/postgres/__init__.py", line 374, in run
2016-11-09T23:51:23.098783+00:00 app[scheduler.8465]: cursor.run(sql, parameters)
2016-11-09T23:51:23.098784+00:00 app[scheduler.8465]: File "/app/.heroku/python/lib/python2.7/site-packages/postgres/cursors.py", line 92, in run
2016-11-09T23:51:23.098784+00:00 app[scheduler.8465]: self.execute(sql, parameters)
2016-11-09T23:51:23.098785+00:00 app[scheduler.8465]: File "/app/.heroku/python/lib/python2.7/site-packages/psycopg2/extras.py", line 288, in execute
2016-11-09T23:51:23.098786+00:00 app[scheduler.8465]: return super(NamedTupleCursor, self).execute(query, vars)
2016-11-09T23:51:23.098787+00:00 app[scheduler.8465]: ProgrammingError: can't adapt type 'dict'
Interesting. So maybe name or something can be a dict coming from npm?
That doesn't sound right ...
I'm working on a PR to add Sentry support to these procs so we have better visibility into and resiliency in the face of errors (surely these won't be the last).
PR in #4179.
So it turns out that the registry includes packages that have been unpublished—a lot of them, from what I can tell. They appear as JSON with a 404. E.g. below.
Currently we log "404" for these and leave readme_raw untouched, which means it'll still be null next time around and we'll refetch the same 404. We should notice this case and probably drop the record from our database, though we'll want to be careful to bring it back again when someone else claims it.
I'm a little surprised at how many 404s I'm seeing. How does unpublishing relate to deleting or removing a package?
http://registry.npmjs.com/mysql-schema https://www.npmjs.com/package/mysql-schema
{
  "_id":"mysql-schema",
  "_rev":"4-c527f6a64778f8d0afbaf6fd4754085e",
  "name":"mysql-schema",
  "time":{
    "modified":"2013-12-08T03:17:01.297Z",
    "created":"2013-12-08T03:16:59.506Z",
    "0.0.1":"2013-12-08T03:17:01.297Z",
    "unpublished":{
      "name":"carlosmarte",
      "time":"2014-07-26T06:27:28.217Z",
      "tags":{
        "latest":"0.0.1"
      },
      "maintainers":[
        {
          "name":"carlosmarte",
          "email":"dev@carlosmarte.me"
        }
      ],
      "description":"mysql queries helper",
      "versions":[
        "0.0.1"
      ]
    }
  },
  "_attachments":{
  }
}
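One way the fetcher could tell these cases apart is sketched below. This is an illustration, not the code that went into the fetcher: the function names and the three labels are hypothetical, and it assumes the README sits under a top-level readme key in the registry document (Python 3).

```python
def is_unpublished(doc):
    """True if the registry document records an unpublish event.

    Looks for time.unpublished, as in the mysql-schema example above.
    """
    time_info = doc.get('time')
    return isinstance(time_info, dict) and 'unpublished' in time_info


def classify_registry_response(status_code, doc):
    """Return 'delete', 'no-readme', or 'store' for one registry response.

    Hypothetical labels; the caller decides what to do with each.
    """
    if status_code == 404:
        # Unpublished packages come back as a 404 with a JSON body, so
        # drop the record instead of refetching the same 404 forever.
        return 'delete'
    readme = doc.get('readme')  # assumption: README is a top-level key
    if not isinstance(readme, str) or not readme.strip():
        return 'no-readme'
    return 'store'
```

Deleting on 404 only stays correct if the sync step re-inserts the package when someone else later claims the name, which is the caveat raised above.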
Okay! Fetcher and processor appear to be chugging along. We want to see these both hit zero, though they won't until we delete or skip 404s.
gratipay::MAROON=> select count(*) from packages where readme_raw is null;
┌────────┐
│ count │
├────────┤
│ 223221 │
└────────┘
(1 row)
gratipay::MAROON=> select count(*) from packages where readme_needs_to_be_processed;
┌────────┐
│ count │
├────────┤
│ 323828 │
└────────┘
(1 row)
gratipay::MAROON=>
Done in https://github.com/gratipay/gratipay.com/pull/4181 and deployed.
Deleting! 👍
2016-11-10T07:50:24.124188+00:00 app[scheduler.4899]: fetching practiceone
2016-11-10T07:50:23.863759+00:00 app[scheduler.9696]: fetching node-u2f
2016-11-10T07:50:23.939573+00:00 app[scheduler.9696]: fetching node-typograph
2016-11-10T07:50:23.974219+00:00 app[scheduler.9696]: no readme in killdrev
2016-11-10T07:50:23.974273+00:00 app[scheduler.9696]: fetching kill-desktop-osx
2016-11-10T07:50:24.020858+00:00 app[scheduler.9696]: fetching infinigon-tag
2016-11-10T07:50:24.027898+00:00 app[scheduler.9696]: fetching kill-dash-nine
2016-11-10T07:50:24.037882+00:00 app[scheduler.9696]: 404 for spurious-js-aws-sdk-helper
2016-11-10T07:50:24.038176+00:00 app[scheduler.9696]: fetching spur-di
2016-11-10T07:50:24.081055+00:00 app[scheduler.9696]: fetching node-typo
2016-11-10T07:50:24.152517+00:00 app[scheduler.4899]: yet-another-module is 404; deleting
2016-11-10T07:50:24.155182+00:00 app[scheduler.4899]: fetching yet-another-friendly-dependency
2016-11-10T07:50:24.185761+00:00 app[scheduler.4899]: fetching practice-npm-package
2016-11-10T07:50:24.203793+00:00 app[scheduler.4899]: fetching jspm-nodelibs-process
2016-11-10T07:50:24.228972+00:00 app[scheduler.4899]: fetching jspm-nodelibs-path
2016-11-10T07:50:24.273577+00:00 app[scheduler.4899]: fetching hwsl2
2016-11-10T07:50:24.279907+00:00 app[scheduler.4899]: fetching jspm-nodelibs-os
2016-11-10T07:50:24.304598+00:00 app[scheduler.4899]: yet-another-friendly-dependency is 404; deleting
2016-11-10T07:50:24.308818+00:00 app[scheduler.4899]: fetching yet-another-express-routing
2016-11-10T07:50:24.348231+00:00 app[scheduler.4899]: fetching practice_npm
2016-11-10T07:50:24.358175+00:00 app[scheduler.4899]: fetching jspm-nodelibs-net
2016-11-10T07:50:24.367534+00:00 app[scheduler.4899]: yet-another-express-routing is 404; deleting
2016-11-10T07:50:24.370328+00:00 app[scheduler.4899]: fetching yet-another-express-router
2016-11-10T07:50:24.132520+00:00 app[scheduler.9696]: fetching node-typhoon
2016-11-10T07:50:24.162294+00:00 app[scheduler.9696]: 404 for spur-di
2016-11-10T07:50:24.162538+00:00 app[scheduler.9696]: fetching sptitesmith-stylus-retina-template
2016-11-10T07:50:24.182196+00:00 app[scheduler.9696]: fetching kill-combo
2016-11-10T07:50:24.218516+00:00 app[scheduler.9696]: 404 for sptitesmith-stylus-retina-template
2016-11-10T07:50:24.218923+00:00 app[scheduler.9696]: fetching spruce
2016-11-10T07:50:24.253262+00:00 app[scheduler.9696]: no readme in spruce
2016-11-10T07:50:24.253270+00:00 app[scheduler.9696]: fetching sprout-object
2016-11-10T07:50:24.268288+00:00 app[scheduler.9696]: fetching node-typewriter
2016-11-10T07:50:24.291538+00:00 app[scheduler.9696]: 404 for sprout-object
I'm curious to see how many records we're left with after weeding out the 404s.
Okay! I'm gonna let this run overnight ...
Almost done fetching! Processing is slower to catch up ...
gratipay::MAROON=> select count(*) from packages where readme_raw is null;
┌───────┐
│ count │
├───────┤
│ 3349 │
└───────┘
(1 row)
gratipay::MAROON=> select count(*) from packages where readme_needs_to_be_processed;
┌────────┐
│ count │
├────────┤
│ 258387 │
└────────┘
(1 row)
gratipay::MAROON=> select count(*) from packages;
┌────────┐
│ count │
├────────┤
│ 345044 │
└────────┘
(1 row)
No movement on first and third. I think we have some packages that aren't 404 but also maybe don't have a readme? I think that's how readme_raw is ending up null for a percentage.
gratipay::MAROON=> select count(*) from packages where readme_raw is null;
┌───────┐
│ count │
├───────┤
│ 3349 │
└───────┘
(1 row)
gratipay::MAROON=> select count(*) from packages where readme_needs_to_be_processed;
┌────────┐
│ count │
├────────┤
│ 230868 │
└────────┘
(1 row)
gratipay::MAROON=> select count(*) from packages;
┌────────┐
│ count │
├────────┤
│ 345044 │
└────────┘
(1 row)
27,519 readmes processed in five hours, call it 5,000 an hour, so ... 45-50 hours remaining? Should be done over the weekend?
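The estimate can be checked with a few lines of Python 3. The helper name hours_remaining is made up for this sketch; the counts are the two readme_needs_to_be_processed snapshots above, and the five-hour gap is the rough figure from this comment.

```python
def hours_remaining(count_before, count_after, elapsed_hours):
    """Estimate hours left from two snapshots of the unprocessed count."""
    processed = count_before - count_after
    rate_per_hour = processed / elapsed_hours
    return count_after / rate_per_hour


# Snapshots from the two queries above, taken about five hours apart.
processed = 258387 - 230868  # 27,519 readmes
eta = hours_remaining(258387, 230868, 5)
print(processed, round(eta, 1))
```

At the measured rate this comes out to roughly 42 hours; rounding the rate down to 5,000 an hour gives about 46, which is where the 45-50 hour guess comes from.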
230,868 - 207,714 = 23,154 in about five hours. 👍
Fetcher crashed!
Captured with #4179. 👍
I don't see a repo for mysql-prettify. I was gonna check package.json to see if that's where the bad value is coming from. Do we want to stringify it or count it as "no readme"?
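Both options fit in one small helper. This is only a sketch of the choice, assuming the bad value is a non-string readme (like the dict behind the earlier "can't adapt type 'dict'" error); normalize_readme and its stringify flag are hypothetical names, and it targets Python 3.

```python
import json


def normalize_readme(value, stringify=False):
    """Coerce a registry readme value into text that is safe to store.

    Returns None to mean "no readme".
    """
    if isinstance(value, str):
        return value
    if value is None:
        return None
    if stringify:
        # Option 1: keep the odd value, serialized as JSON text.
        return json.dumps(value, sort_keys=True)
    # Option 2: treat any non-string value as "no readme".
    return None
```

Either way the database driver only ever sees a string or NULL, so the fetcher shouldn't crash on this kind of value again.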
Alright, the {"private": true} issue should be fixed in #4182.
I'm going to fix the other Sentry bug that I introduced: https://sentry.io/gratipay/gratipay-com/issues/179441800/.
PR in #4183. Waiting for Travis.
gratipay::MAROON=> select count(*) from packages where readme_raw is null;
┌───────┐
│ count │
├───────┤
│ 2 │
└───────┘
(1 row)
gratipay::MAROON=> select count(*) from packages where readme_needs_to_be_processed;
┌───────┐
│ count │
├───────┤
│ 97562 │
└───────┘
(1 row)
gratipay::MAROON=> select count(*) from packages;
┌────────┐
│ count │
├────────┤
│ 345044 │
└────────┘
(1 row)
gratipay::MAROON=>
207,714 - 97,476 = 110,238 in about 19 hours = 5,802/hr. On track! 👍
gratipay::MAROON=> select count(*) from packages where readme_raw is not null;
┌────────┐
│ count │
├────────┤
│ 345042 │
└────────┘
(1 row)
gratipay::MAROON=> select count(*) from packages where readme_needs_to_be_processed;
┌───────┐
│ count │
├───────┤
│ 3 │
└───────┘
(1 row)
gratipay::MAROON=> select count(*) from packages;
┌────────┐
│ count │
├────────┤
│ 345044 │
└────────┘
(1 row)
gratipay::MAROON=>
Here are the three remaining to be processed:
https://www.npmjs.com/package/testing233
https://www.npmjs.com/package/testing234
https://www.npmjs.com/package/kendo-ui-react-jquery-stockchart
The first two give "ERROR: No README data found!" The third results in a 504!
Is this part of a plan to build an index of all the popular libraries? Is the Maven repo for Java next, if this works out?
Yeah, something like that. Ideally we can partner with Libraries.io and bring a bunch online at once.
¯\_(ツ)_/¯
gratipay::MAROON=> select count(*) from packages where readme_raw is not null;
┌────────┐
│ count │
├────────┤
│ 345044 │
└────────┘
(1 row)
gratipay::MAROON=> select count(*) from packages where readme_needs_to_be_processed;
┌───────┐
│ count │
├───────┤
│ 0 │
└───────┘
(1 row)
gratipay::MAROON=> select count(*) from packages;
┌────────┐
│ count │
├────────┤
│ 345044 │
└────────┘
(1 row)
gratipay::MAROON=>
Alright! READMEs initially loaded! 💃
Let's get some pages on the site! #4151
Slight change of plans. #4151 is too much of a rabbit hole. We don't want to get ourselves into the business of processing and securing READMEs across 30+ package managers. Our existing /on/network/foo/ pages aren't very contentful, and there's no reason /on/npm/foo/ pages need be, either. My current plan is to make a PR to remove the README processing that is already deployed (we still want package fetching and syncing, since emails are in there and that's our key for linking with users), and then move on to Checkpoint 2: Giving to Packages.
@kaguillera and I are talking IRL about how much tech debt we want to take on underneath Relax Open Work Requirement in order to fast-track the npm feature here. Over there, we are renaming Teams to projects and members to collaborators, and we are also now talking about removing tip migration. It's basically a question of whether we change names in the UI only, or also remove code and drop/rename database tables and such. The trade-off is that if we only make surface changes over there, then the next @JessaWitzel that comes along will have even more confusion to deal with ("Wait—projects are stored in the teams table? WTF!")—but I am going to be so incredibly mad if we're not first to market with npm pledging. :rage4:
@JessaWitzel Can we please please please go into debt here? I PROMISE I will fix it and make it all better in January or February. Or March. 🙏
Discussing in slack.
January or February. Or March.
😄
Ok. I will bless this technical debt with my fairy wand but request a deadline for fixing it: 12 weeks after feature launch.
@JessaWitzel at slack
We have reached Checkpoint 1: Inert /on/npm/foo/ Pages! 💃
Anything I can do to accelerate this? It seems top priority for now.
Some questions:
Thanks for bumping this, @nobodxbodon! As mentioned in slack, I'm hoping to spec this out this week while also bringing Relax Open Work Requirement in for landing (this was blocked on that).
is NPM cool with this, and are they willing to partner/coordinate in any way?
I emailed them and didn't hear back. I think if we get some traction with this, that will be the time to reapproach a conversation with them.
any workload estimate and roadmap with timeline?
Last month I guesstimated (one, two) that the two projects together would take six weeks of calendar time. It's now been four and a half weeks and we're not done with either yet. We're likely to finish the first by the end of this week (Week 5), with next week being Week 1 on Integrate npm. It seems unlikely to take a week. ;-) I was off by a factor of 2.5 on the Relax Open Work time estimate (I figured two out of the six). That suggests 10 weeks of calendar time for Integrate npm.
Roadmap and further estimation tbd when I can spec this out.
who are main developers and any dividing of tasks?
Me and maybe @aandis? Anybody else wanna volunteer? :-) Division of tasks tbd.
any blockers?
I've put both projects on the calendar, with an initial target date of March 17 for Integrate npm.
✈️ This is the flight deck for the Integrate npm project. ✈️
Current open-source crowdfunding options (Kickstarter, Patreon, Gratipay, OpenCollective, etc.) are consumer-grade. Our hunch is that a business-grade product with better aggregation can better serve the companies that want to pay for open source, because companies use hundreds or thousands of open source packages, not just a few.
Picking up from https://github.com/gratipay/gratipay.com/pull/4135#issuecomment-255122149 and https://github.com/gratipay/inside.gratipay.com/issues/852#issuecomment-255098337 ...
For wider context see:
JavaScript is the most popular language in open source and npm is the most popular package manager for JavaScript. A good first concrete step towards helping companies pay for open source (#4135), therefore, will be to add the ability to pay for any package on npm. Once we have npm deployed, we will have enough experience to inform a partnership with Libraries.io for the rest of the package managers.
Target
Our goal is to announce this feature in my lightning talk on Thursday, October 26 at Red Hat's All Things Open conference (https://github.com/gratipay/inside.gratipay.com/issues/757). Our goal is to incrementally improve this feature throughout the first half of 2017, with an eye towards OSCON and $ustain in May.
Package names to test with
From https://github.com/gratipay/gratipay.com/pull/4135#issuecomment-262672635:
http://localhost:8537/on/npm/async/
http://localhost:8537/on/npm/iframe-resizer/
http://localhost:8537/on/npm/mongoose/
http://localhost:8537/on/npm/nodemon/
http://localhost:8537/on/npm/react/
http://localhost:8537/on/npm/react-helmet/
http://localhost:8537/on/npm/react-modal/
http://localhost:8537/on/npm/react-redux/
http://localhost:8537/on/npm/react-router/
http://localhost:8537/on/npm/react-router-redux/
http://localhost:8537/on/npm/redux/
http://localhost:8537/on/npm/redux-thunk/
http://localhost:8537/on/npm/webpack/
Todo
Prerequisites
Checkpoint 1: Inert /on/npm/foo/ Pages
/on/npm/foo pages (#4212)
Checkpoint 2: Giving to Packages
/on/npm/foo/ pages via email verification
/on/npm/foo/ pages
Checkpoint 3: Easy Sign-up
/on/npm/ pages
Nice to Have
{% content %} (leaving social network jump in sidebar)
Promotion