jupyter / docs-team-compass

Documentation Work Group Discussions
BSD 3-Clause "New" or "Revised" License

Capture (and possibly automate) traffic CSVs from readthedocs #14

Open ericsnekbytes opened 8 months ago

ericsnekbytes commented 8 months ago

ReadTheDocs offers traffic and search stats that Jupyter subprojects can use to direct their docs improvement efforts. Right now, these metrics are not widely used (as indicated by discussions in group meetings) and are not easily accessible (they're locked behind an admin panel). They can be made easily available and usable from a central location so that subprojects can better benefit from the insights they contain.
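
As a rough illustration of the kind of insight these exports could give, here is a minimal sketch that totals page views per path from one traffic CSV. The column names (`Path`, `Views`) and the sample rows are assumptions made up for this example, not taken from a real export, so adjust them to match the actual file header.

```python
import csv
import io
from collections import Counter


def top_pages(csv_file, limit=10, path_column="Path", views_column="Views"):
    """Return the `limit` most-viewed paths as (path, total_views) pairs.

    The default column names are assumptions; pass the real header names
    if the export uses different ones.
    """
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row[path_column]] += int(row[views_column])
    return totals.most_common(limit)


# Made-up sample data, only to show the expected shape of the input.
sample = io.StringIO(
    "Date,Version,Path,Views\n"
    "2024-03-01,latest,/index.html,12\n"
    "2024-03-01,latest,/install.html,5\n"
    "2024-03-02,latest,/index.html,8\n"
)
print(top_pages(sample, limit=2))  # [('/index.html', 20), ('/install.html', 5)]
```

The same function works on a real file by passing `open(path, newline="")` instead of the in-memory sample.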

ericsnekbytes commented 8 months ago

@jtpio mentioned Chris Holdgraf's repo metrics notebooks; we can look at those for inspiration.

krassowski commented 8 months ago

This is the kind of data that is available from Read the Docs, using JupyterLab as an example:

Summary: [image]
The CSV head: [image]

krassowski commented 8 months ago

Previously I brought up adding a footer like on GitHub:

image

ericsnekbytes commented 3 months ago

This is in progress here:

image

blink1073 commented 3 months ago

readthedocs_traffic_analytics_jupyter-server_2023-12-27_2024-03-26.csv

blink1073 commented 3 months ago

readthedocs_traffic_analytics_jupyter-enterprise-gateway_2023-12-27_2024-03-26.csv

ericsnekbytes commented 3 months ago

@blink1073 Thank you, it's much appreciated!

blink1073 commented 3 months ago

readthedocs_traffic_analytics_jupyterlab-server_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_jupyterlab_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_jupyter-client_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_jupyter_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_ipywidgets_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_ipykernel_2023-12-27_2024-03-26.csv

blink1073 commented 3 months ago

readthedocs_traffic_analytics_traitlets_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_terminado_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_nbformat_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_nbconvert_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_lumino_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_jupyter-notebook_2023-12-27_2024-03-26.csv

Okay that's all I have access to. :smile:

choldgraf commented 3 months ago

Two quick thoughts:

Inspiration via jupyter book

If you want some inspiration, I often use Jupyter Book for this kind of thing. For example, here's a dashboard I've used in the past for tracking activity within the Jupyter ecosystem (it's now out of date so there's an error message but you get the idea):

https://chrisholdgraf.com/jupyter-activity-snapshot/jupyter.html#merged-pull-requests

source: https://github.com/choldgraf/jupyter-activity-snapshot

That uses papermill to run a GitHub organization stats template once per organization. The executed notebooks become the source files for the pages, which then go into a jupyter-book build process.

Plausible?

Historically, we've used Google Analytics to track user behavior across our websites, including docs. This was very useful for things like generating impact reports for grants. We moved away from Google Analytics for privacy reasons, but some folks mentioned that https://plausible.io/ was an attractive alternative that wouldn't have the same concerns.[^1]

[^1]: Another option is Matomo, no strong opinions from me.

Would it be less work if Jupyter self-hosted a plausible instance that generated dashboards for all of the sub-project docs sites? Apologies if this has already been discussed and decided on, just wanted to throw it out there in case it creates an "ah-ha that would be way easier" response.

ericsnekbytes commented 3 months ago

@blink1073 AWESOME, thank you!

ericsnekbytes commented 3 months ago

@choldgraf These look like just the thing (I was pondering something similar here), thanks for linking. I may contact you further down the line.

@blink1073 Also I just noticed and hate to bother you further but there should be a second SEARCH csv for all those sites also if you are able to provide those 😅

blink1073 commented 3 months ago

readthedocs_search_analytics_traitlets_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_terminado_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_nbformat_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_nbconvert_2023-12-27_2024-03-26.csv
readthedocs_traffic_analytics_jupyter-events_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_lumino_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_jupyter-server_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_jupyter-notebook_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_jupyterlab-server_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_jupyterlab_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_jupyter-events_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_jupyter-enterprise-gateway_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_jupyter-client_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_jupyter_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_ipywidgets_2023-12-27_2024-03-26.csv
readthedocs_search_analytics_ipykernel_2023-12-27_2024-03-26.csv
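
The attachment names in this thread all follow the pattern `readthedocs_<kind>_analytics_<project>_<start>_<end>.csv`, so a collection script could sort them by project and kind without opening the files. Here is a minimal sketch of that idea; the `Export` type and both helper names are hypothetical, and the pattern is inferred only from the filenames posted here.

```python
import re
from datetime import date
from typing import NamedTuple


class Export(NamedTuple):
    kind: str      # "traffic" or "search"
    project: str   # e.g. "jupyter-enterprise-gateway"
    start: date
    end: date


# Pattern inferred from the attachment names posted in this thread.
_EXPORT_NAME = re.compile(
    r"^readthedocs_(?P<kind>traffic|search)_analytics_"
    r"(?P<project>.+)_(?P<start>\d{4}-\d{2}-\d{2})_(?P<end>\d{4}-\d{2}-\d{2})\.csv$"
)


def parse_export_name(filename: str) -> Export:
    """Split an export filename into kind, project, and date range."""
    match = _EXPORT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized export name: {filename!r}")
    return Export(
        kind=match["kind"],
        project=match["project"],
        start=date.fromisoformat(match["start"]),
        end=date.fromisoformat(match["end"]),
    )


def kinds_by_project(filenames) -> dict[str, set[str]]:
    """Map each project to the set of export kinds available for it."""
    result: dict[str, set[str]] = {}
    for name in filenames:
        export = parse_export_name(name)
        result.setdefault(export.project, set()).add(export.kind)
    return result
```

Calling `kinds_by_project` on a directory listing would make it easy to spot a project that has a traffic export but no search export, or the reverse.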

ericsnekbytes commented 3 months ago

@blink1073 Geez, this is fantastic, thanks for single-handedly knocking this problem out of the park :D

choldgraf commented 3 months ago

In case it's helpful @ericsnekbytes:

Here are the templates that are used to generate org-specific pages: https://github.com/choldgraf/jupyter-activity-snapshot/tree/main/monthly_update/templates

Specifically here's the one that generates the org reports I mentioned before: https://github.com/choldgraf/jupyter-activity-snapshot/blob/main/monthly_update/templates/org_report.ipynb

You can see where the templates have variables to be inserted later within {{ }}, for example:

[image: CleanShot 2024-03-26 at 13 28 11@2x]

You can then generate pages using that template with code like this:

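from pathlib import Path

import nbformat as nbf
import papermill as pm

# `github_orgs` (list of org names) and `n_days` are assumed to be defined
# earlier in the notebook that runs this loop.
path_book = Path("generated/book")
for org in github_orgs:
    # Execute the template once per org, injecting the parameters
    parameters = dict(github_org=org, n_days=n_days)
    path_out = path_book.joinpath(f"{org}.ipynb")
    ntbk = pm.execute_notebook(
        "./templates/org_report.ipynb",
        str(path_out),
        parameters=parameters,
        nest_asyncio=True,
        cwd="./templates/",
    )

    # Remove the param cell so it doesn't show up
    (param_cell,) = [
        cell for cell in ntbk.cells if "injected-parameters" in cell.metadata.tags
    ]
    param_cell.metadata.tags.append("remove-cell")
    # Fill in the {{ github_org }} placeholders in the serialized notebook
    nbs = nbf.writes(ntbk)
    nbs = nbs.replace("{{ github_org }}", org)
    path_out.write_text(nbs)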
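
To make the two post-processing steps easier to see in isolation (filling `{{ }}` placeholders and hiding the papermill-injected cell), here is a minimal sketch that works on a notebook as plain JSON, without papermill or nbformat. The helper names and the tiny demo notebook are hypothetical; only the `injected-parameters` and `remove-cell` tags come from the code above.

```python
import json


def fill_placeholders(notebook: dict, values: dict[str, str]) -> dict:
    """Replace every `{{ name }}` marker in the notebook's JSON text."""
    text = json.dumps(notebook)
    for name, value in values.items():
        # JSON-escape the value so quotes or backslashes cannot break the text.
        escaped = json.dumps(value)[1:-1]
        text = text.replace("{{ " + name + " }}", escaped)
    return json.loads(text)


def hide_injected_parameters(notebook: dict) -> dict:
    """Add the `remove-cell` tag to the cell papermill injected."""
    for cell in notebook["cells"]:
        tags = cell.setdefault("metadata", {}).setdefault("tags", [])
        if "injected-parameters" in tags and "remove-cell" not in tags:
            tags.append("remove-cell")
    return notebook


# A made-up two-cell notebook, just to exercise the helpers.
demo = {
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Activity for {{ github_org }}"]},
        {"cell_type": "code", "metadata": {"tags": ["injected-parameters"]},
         "source": ['github_org = "jupyter"'], "outputs": [], "execution_count": None},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}
page = hide_injected_parameters(fill_placeholders(demo, {"github_org": "jupyter"}))
print(page["cells"][0]["source"])            # ['# Activity for jupyter']
print(page["cells"][1]["metadata"]["tags"])  # ['injected-parameters', 'remove-cell']
```

Doing the replacement on the serialized text, as the loop above does, means markers in markdown cells, code cells, and outputs are all handled in one pass.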

And then these two GitHub Actions steps are used in the CI/CD workflow to generate the pages from the template and then build the book:

    - name: Generate book pages with latest data
      run: |
        papermill --cwd monthly_update monthly_update/run_template.ipynb -
      env:
        GITHUB_ACCESS_TOKEN: ${{ secrets.ACCESS_TOKEN }}

    # Build the book
    - name: Build the book
      run: |
        jb toc from-project monthly_update/generated/book -e .ipynb -e .md -e .rst --guess-titles > monthly_update/generated/book/_toc.yml
        jb build monthly_update/generated/book

I think that's the core of the logic there. A lot of the code there is very stale which is why I'm trying to point out the details here. If you really wanna get fancy you could also try the new MyST build engine at https://mystmd.org :-)

minrk commented 3 months ago

I grabbed the stats for the docs I have access to here: https://gist.github.com/minrk/c1df933c520f9a51ee2bf474817a20bb

including the notebook I used to get them. It seems the traffic data isn't in the API, so I needed to script it with playwright.

ericsnekbytes commented 3 months ago

@choldgraf Thanks for the additional details.

@minrk I'll be digging through these and may ping you again for some additional info, thanks for providing these 👍

ericsnekbytes commented 2 weeks ago

Edit: Check below

ericsnekbytes commented 1 week ago

@minrk @blink1073 We need another CSV dump (there's a ticket on RTD that would create an API call for this). We should also make a service account and grant it permissions to download these (since the API call does not exist yet), which I cannot do. I can make the account and add it to the Jupyter password manager, if one of you can grant permissions to it...

minrk commented 1 week ago

I've updated the data in the gist today. If you create the bot account, I can add it to some projects.

blink1073 commented 1 week ago

I'm also happy to add the account to the projects I own.

ericsnekbytes commented 1 week ago

@minrk @blink1073 The new GitHub user is @jupyterautomation (jupyterautomation on RTD as well), and should be ready to be added to ReadTheDocs sites :) I've added that account to the Jupyter password manager (and the underlying email address, jupyterautomation@gmail.com). Thanks!

blink1073 commented 1 week ago

Okay, I added it to all the projects I maintain

minrk commented 6 days ago

Sent all my invitations, too, I think.

ericsnekbytes commented 5 days ago

@blink1073 @minrk Thank you! I'll share progress when I've got something up and running.