ofek / pypinfo

Easily view PyPI download statistics via Google's BigQuery.
MIT License

Exact definition for the download count #46

Closed raamana closed 6 years ago

raamana commented 6 years ago

Hi There,

Thanks for this package, it's very helpful. It's unclear to me exactly what this tool outputs. Is it a sum of download counts from various sources? How does that differ from the "pip only" option, -p? Is there a way to get a unique download count? Some more details would be helpful.

The default output for my packages is almost 6-10 times higher than if I use the --pip option. Any idea why that is? I have only ever recommended that people install my package via pip install. Although there are some devs who clone and install locally outside pip, that number is likely very small. So the --pip count should closely match the default count (unless there is a lot more going on that I don't understand).

Also, if I would like to estimate "usage" (which is a higher bar than a download) from this, would it make sense?

Thanks for your help.

hugovk commented 6 years ago

It's all downloads logged by PyPI. By default, that also includes mirrors, other clients, and very old versions of pip. By far the biggest share will be mirrors; you can check by adding installer to the command.

See https://github.com/ofek/pypinfo/issues/4.

It would be good to add something about this to the README.

ofek commented 6 years ago

I'd really like to have --pip be the default behavior in the next release with a new --all to match the current default behavior.

@hugovk Are you fine with that?

raamana commented 6 years ago

Thanks for the info.

May I also suggest changing the default time window to everything, instead of filtering to the last 30 days? Most devs (of not-crazily-popular packages) care about that, I think.

raamana commented 6 years ago

I am also running into a quota error when I specify a time window. When I run it without a time window, it returns results without error, which makes me think it's not an issue of quotas. What's going on?

Also, what is Browser here, which appears as an installer? Is it a count of tarball downloads through the browser?

$ 10:46:52 miner ~ >>  pypinfo pyradigm installer
Served from cache: True
Data processed: 0.00 B
Data billed: 0.00 B
Estimated cost: $0.00

| installer_name | download_count |
| -------------- | -------------- |
| bandersnatch   |            371 |
| Browser        |             14 |
| requests       |              4 |
| pip            |              3 |

$ 10:47:01 miner ~ >>  pypinfo -d 999  pyradigm installer
Traceback (most recent call last):
  File "/home/praamana/anaconda2/envs/py36/bin/pypinfo", line 11, in <module>
    sys.exit(pypinfo())
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/click/core.py", line 1043, in invoke
    return Command.invoke(self, ctx)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/pypinfo/cli.py", line 106, in pypinfo
    query_rows = query_job.result(timeout=timeout // 1000)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/cloud/bigquery/job.py", line 2344, in result
    super(QueryJob, self).result(timeout=timeout)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/cloud/bigquery/job.py", line 640, in result
    return super(_AsyncJob, self).result(timeout=timeout)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/api_core/future/polling.py", line 115, in result
    self._blocking_poll(timeout=timeout)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/cloud/bigquery/job.py", line 2318, in _blocking_poll
    super(QueryJob, self)._blocking_poll(timeout=timeout)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/api_core/future/polling.py", line 94, in _blocking_poll
    retry_(self._done_or_raise)()
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/api_core/retry.py", line 260, in retry_wrapped_func
    on_error=on_error,
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/api_core/retry.py", line 177, in retry_target
    return target()
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/api_core/future/polling.py", line 73, in _done_or_raise
    if not self.done():
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/cloud/bigquery/job.py", line 2306, in done
    location=self.location)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 556, in _get_query_results
    retry, method='GET', path=path, query_params=extra_params)
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 311, in _call_api
    return call()
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/api_core/retry.py", line 260, in retry_wrapped_func
    on_error=on_error,
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/api_core/retry.py", line 177, in retry_target
    return target()
  File "/home/praamana/anaconda2/envs/py36/lib/python3.6/site-packages/google/cloud/_http.py", line 293, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/bigquery/v2/projects/pypistatsraamana/queries/90b09b2d-f62e-4f76-adbc-8c6abe201ee9?maxResults=0&timeoutMs=10000: Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
$ 10:47:20 miner ~ >>

ofek commented 6 years ago

@raamana I will definitely not make that the default, as it takes a while to execute and those queries may incur unexpectedly large costs (as your example shows :slightly_smiling_face:). You'll have to wait until you get more free quota (a few days perhaps) or enable billing.

raamana commented 6 years ago

I did enable billing. The web interface says "You have $384.24 in credit and 362 days left of your free trial.", so I'm not sure why I get errors.

raamana commented 6 years ago

Does that mean the option "-d 999" will cost me more than $380?

ofek commented 6 years ago

Yeah, this is a common issue afaik. Credits still imply the free tier, which inherits the quotas. You'll need to contact support, unfortunately.
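As a rough way to reason about what a query might cost, here is a minimal sketch of the arithmetic. The rate of $5.00 per TiB scanned is an assumption, not something stated in this thread; it was chosen because it is consistent with the "142.76 GiB billed, estimated cost $0.70" output that appears in a later comment. The function name `estimate_cost` is hypothetical and is not part of pypinfo.

```python
# Assumed on-demand rate in USD per TiB scanned. This value is an assumption;
# check the current BigQuery pricing before relying on it.
ASSUMED_USD_PER_TIB = 5.00

GIB = 1024 ** 3
TIB = 1024 ** 4


def estimate_cost(bytes_billed, usd_per_tib=ASSUMED_USD_PER_TIB):
    """Return the estimated cost in USD for a given number of billed bytes."""
    return bytes_billed / TIB * usd_per_tib


if __name__ == "__main__":
    # Figure taken from the neuropredict query output later in this thread.
    billed = 142.76 * GIB
    print(f"Estimated cost: ${estimate_cost(billed):.2f}")
```

Under that assumed rate, a single query would have to scan tens of TiB to approach a few hundred dollars, so a quota error is more likely to reflect a free-tier limit than the size of the bill.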

hugovk commented 6 years ago

> I'd really like to have --pip be the default behavior in the next release with a new --all to match the current default behavior.

Yes, I'm fine with that, I always use it with --pip anyway.

raamana commented 6 years ago

Thanks @ofek, the queries started working again (it seems my fiddling with the billing account worked).

Can you point me to more details on what the different installers are? I think None refers to older versions of pip (as noted in #4, which are legitimate installs, right?), and I am curious about the Browser and requests installers, which have a high count for my packages. Thanks.

$ 11:21:38 miner ~ >>  pypinfo -sd -400 neuropredict installer
Served from cache: False
Data processed: 142.76 GiB
Data billed: 142.76 GiB
Estimated cost: $0.70

| installer_name | download_count |
| -------------- | -------------- |
| bandersnatch   |         13,971 |
| requests       |            294 |
| Browser        |            287 |
| pip            |            104 |
| None           |             69 |
| z3c.pypimirror |             43 |
| pep381client   |             28 |
| Artifactory    |              7 |

ofek commented 6 years ago

raamana commented 6 years ago

Thanks Ofek, that's what I thought, I just wanted to double check.

I simply can't imagine why my target users (not professional software developers), especially over 200 of them, would use requests to get the package and install it manually when pip install is so much easier. Assuming the clones performed by the CI servers are counted under pip, the counts for Browser and requests seem very high to me. Barring excessive repeat/duplicate counts, this seems like a bug somewhere to me.

raamana commented 6 years ago

Any comments, @ofek and @hugovk, on my question above? I want to be sure I estimate the install/download counts with < 5-10% error.

hugovk commented 6 years ago

I've no idea, but it's not necessarily only your target user downloading stuff.

The requests library is used a lot for all sorts of things, it may be used by the mirror services or anything really.

You can add --test to the pypinfo call to see the query it uses, and then use that to query BigQuery in another way to see if the numbers are different.
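To put the installer numbers in proportion, the neuropredict table posted earlier in this thread can be tallied by hand. The sketch below copies those counts and computes each group's share of the total. Treating bandersnatch, z3c.pypimirror and pep381client as mirroring tools is an assumption based on what those projects are generally used for; the grouping and the helper name `share` are illustrative only and not part of pypinfo.

```python
# Download counts copied from the neuropredict installer table in this thread.
counts = {
    "bandersnatch": 13971,
    "requests": 294,
    "Browser": 287,
    "pip": 104,
    "None": 69,
    "z3c.pypimirror": 43,
    "pep381client": 28,
    "Artifactory": 7,
}

# Assumption: these installers are mirroring tools rather than end users.
ASSUMED_MIRROR_TOOLS = {"bandersnatch", "z3c.pypimirror", "pep381client"}


def share(counts, names):
    """Return the fraction of all downloads attributed to the given installers."""
    total = sum(counts.values())
    selected = sum(n for name, n in counts.items() if name in names)
    return selected / total


if __name__ == "__main__":
    print(f"total downloads:   {sum(counts.values()):,}")
    print(f"assumed mirrors:   {share(counts, ASSUMED_MIRROR_TOOLS):.1%}")
    print(f"pip:               {share(counts, {'pip'}):.1%}")
    print(f"requests, Browser: {share(counts, {'requests', 'Browser'}):.1%}")
```

With these numbers, the assumed mirror tools account for roughly 95% of all logged downloads, which is in line with the earlier remark that mirrors are by far the biggest contributor to the default count.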

hugovk commented 6 years ago

Please see PR #51.