python / cpython

The Python programming language
https://www.python.org

"Data model" page is too long #126053

Open barneygale opened 1 week ago

barneygale commented 1 week ago

Documentation

The Data model document is very long, and as a result it basically never shows up in search engine results, because 90% of the page is considered irrelevant for any query like "python __hash__".

I suggest we split it up by top-level topic, e.g. we add a dedicated page for "Special method names".

See also #126052

rhettinger commented 1 week ago

Try not to break all the external links going into the pages. We don't want to invalidate all the references from blogs, tweets, stackoverflow answers, etc.

With regard to search engine results, I don't think we can or should engage in SEO. There is no promise that rearrangements will lead to being a top hit for a search.

barneygale commented 1 week ago

> Try not to break all the external links going into the pages. We don't want to invalidate all the references from blogs, tweets, stackoverflow answers, etc.

Presumably this is impossible, right @picnixz?

ncoghlan commented 1 week ago

My suggestion for refactoring these large pages while mitigating the damage to existing deep links was to:

  1. Make the existing page name an orphan that exists solely as a navigation page to get from stale deep links to updated semantic references
  2. Ensure the new subfolder name can exist in parallel with the old file name (e.g. data_model/ in this case)

The damage to existing deep links that can't (or won't) be changed is still a good reason to tread carefully, but never being able to split pages as they grow over time isn't a great situation either.

For more background on why we should preserve link integrity as much as we can, the World Wide Web Consortium has a decent page here on why "Cool URIs Don't Change": https://www.w3.org/Provider/Style/URI

picnixz commented 1 week ago

> Presumably this is impossible, right

Mmmh. It could be possible actually but this would require a custom Sphinx extension and custom redirection at the nginx / apache level where old URLs would redirect to new ones (the Sphinx extension will be used to extract the mapping). It's also a bit of a hacky solution but I don't have a better alternative (a pure Sphinx solution may not be possible because we don't want a dead link if an article cites something like https://docs.python.org/3/reference/datamodel.html#numbers-number; auto-generated doc using :class:`numbers.Number` would be fine since the intersphinx inventory would be updated but raw links won't).

If you want to improve SEO, isn't there a way to indicate in an HTML document that this or that text is more important than something else (e.g., with some aria label or whatever HTML feature we may have)?

More generally, if you want to split the HTML, it's more of a server-side issue rather than a Sphinx issue (where the server would redirect to the appropriate page). So some redirect rules will need to be rewritten (and I don't know how much it could slow down the entire docs website).


Alyssa's suggestion on having a page serving as a hub is possible but it will be a bit ugly (because we still need to make all possible anchors available on that page so that users can re-click on them to have the expanded content).

barneygale commented 1 week ago

> Alyssa's suggestion on having a page serving as a hub is possible but it will be a bit ugly (because we still need to make all possible anchors available on that page so that users can re-click on them to have the expanded content).

Could the Sphinx extension glue together several pages to form datamodel.html? It would resemble the existing page (perhaps with a small amount of jankiness), but it would be an "orphan" page with no incoming links from the rest of the Python docs. At the top we could add a banner:

> The Python data model documentation has been split into several chapters. This page combines those chapters into a single document; it exists solely to keep existing links working.

ncoghlan commented 1 week ago

The original Py2-as-default -> Py3-as-default switch in https://peps.python.org/pep-0430/ was certainly all server-side redirect config. And yeah, I agree the orphaned navigation page isn't a good solution; it's just a better option than leaving people with either a 404 or an unanchored link to the start of a page with less inline content.

Unfortunately, web server rewrite rules can't help us here, as the anchor tag part is never sent to the server - it's handled by the browser after downloading the page. HTTP redirects don't help either, as they also operate at the page level.

It should be possible to do something clever with client side JavaScript: https://stackoverflow.com/questions/1305211/javascript-to-redirect-from-anchor-to-a-separate-page (and that could potentially be extended further to handle smaller cases like the deep links I recently broke by moving the Py_Main C API docs to a different page in #78387).
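
To make the point about fragments concrete, here is a minimal Python sketch (the helper name `request_target` is hypothetical, not part of any docs tooling). It splits a deep link roughly the way a browser does before issuing the request: only the path and query reach the server, while the fragment stays on the client. That is why rewrite rules and HTTP redirects cannot act on it.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return (target, fragment): what the server sees and what stays in the browser."""
    parts = urlsplit(url)
    # The HTTP request line carries only the path and the optional query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # The fragment is never transmitted; the browser applies it after the page loads.
    return target, parts.fragment


target, fragment = request_target(
    "https://docs.python.org/3/reference/datamodel.html#numbers-number"
)
print(target)    # /3/reference/datamodel.html
print(fragment)  # numbers-number
```

Any fragment-aware redirect therefore has to run in the browser, for example as a small script that reads `location.hash` and consults a mapping shipped with the page.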

picnixz commented 1 week ago

> Could the Sphinx extension glue together several pages to form datamodel.html

If you're worried about the length of datamodel.rst, then you can do it natively using .. include:: directives.


Ah yes, I had forgotten about the redirection using JS. I was confused because I was actually thinking about server-side rendering. Now, using JS can be integrated in Sphinx directly (IIRC).

hugovk commented 1 week ago

> If you want to improve SEO, isn't there a way to indicate in an HTML document that this or that text is more important than something else (e.g., with some aria label or whatever HTML feature we may have)?

We've no way of knowing which of the 18k words (or 25k in https://github.com/python/cpython/issues/126052) is the important text that any given visitor is interested in. That's why more granular pages will help.

ncoghlan commented 1 week ago

(We may want to break out a separate pre-requisite issue for this, but continuing here for now)

Summarising a potential solution for moving link targets between pages, or making other changes (like updating section headings), without breaking deep links to those anchors:

This is still @picnixz's "custom Sphinx extension" idea, just with a better idea of what that extension would need to offer to enable docs refactoring without worrying about breaking existing deep links. If this existed, my orphaned navigation hub idea wouldn't be needed.

barneygale commented 1 week ago

I like the idea of using the intersphinx data. Here's a script that uses sphobjinv to print links that have died in the 3.14 docs:

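```python
from sphobjinv.inventory import Inventory


def load(url):
    # Download an objects.inv file and return the set of page-relative URLs it lists.
    inv = Inventory(url=url)
    return {obj.uri_expanded for obj in inv.objects}


old_urls = load('https://docs.python.org/3.13/objects.inv')
new_urls = load('https://docs.python.org/3.14/objects.inv')
# Anything present in 3.13 but missing from 3.14 is a dead deep link.
dead_urls = old_urls - new_urls

for url in sorted(dead_urls):
    print(url)
```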
Current output

```
library/asyncio-policy.html#asyncio-watchers
library/asyncio-policy.html#asyncio.AbstractChildWatcher
library/asyncio-policy.html#asyncio.AbstractChildWatcher.add_child_handler
library/asyncio-policy.html#asyncio.AbstractChildWatcher.attach_loop
library/asyncio-policy.html#asyncio.AbstractChildWatcher.close
library/asyncio-policy.html#asyncio.AbstractChildWatcher.is_active
library/asyncio-policy.html#asyncio.AbstractChildWatcher.remove_child_handler
library/asyncio-policy.html#asyncio.AbstractEventLoopPolicy.get_child_watcher
library/asyncio-policy.html#asyncio.AbstractEventLoopPolicy.set_child_watcher
library/asyncio-policy.html#asyncio.FastChildWatcher
library/asyncio-policy.html#asyncio.MultiLoopChildWatcher
library/asyncio-policy.html#asyncio.PidfdChildWatcher
library/asyncio-policy.html#asyncio.SafeChildWatcher
library/asyncio-policy.html#asyncio.ThreadedChildWatcher
library/asyncio-policy.html#asyncio.get_child_watcher
library/asyncio-policy.html#asyncio.set_child_watcher
library/collections.abc.html#collections.abc.ByteString
library/dis.html#opcode-BEFORE_ASYNC_WITH
library/dis.html#opcode-BEFORE_WITH
library/dis.html#opcode-BUILD_CONST_KEY_MAP
library/dis.html#opcode-LOAD_ASSERTION_ERROR
library/dis.html#opcode-RETURN_CONST
library/json.html#cmdoption-json.tool-arg-infile
library/json.html#cmdoption-json.tool-arg-outfile
library/json.html#cmdoption-json.tool-h
library/json.html#cmdoption-json.tool-indent
library/json.html#cmdoption-json.tool-json-lines
library/json.html#cmdoption-json.tool-no-ensure-ascii
library/json.html#cmdoption-json.tool-sort-keys
library/sqlite3.html#sqlite3.version
library/sqlite3.html#sqlite3.version_info
library/subprocess.html#disable-vfork
library/typing.html#typing.ByteString
using/configure.html#cmdoption-without-freelists
```

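The set difference in the script cannot tell an anchor that was deleted from one that merely moved to another page, which is the case a page split would produce. As a rough sketch, assuming that an anchor counts as "moved" when the same fragment still exists under exactly one other page, the dead URLs could be classified like this. The function name `classify_dead_urls` and the page `reference/special-methods.html` are hypothetical, and the sample data is made up rather than taken from a real inventory.

```python
def classify_dead_urls(old_urls, new_urls):
    """Split URLs missing from new_urls into moved and removed anchors.

    Assumption: an anchor has "moved" when the same fragment still exists
    in the new docs under exactly one different page.
    """
    # Index the new docs by fragment so moved anchors can be looked up.
    pages_by_fragment = {}
    for url in new_urls:
        page, _, fragment = url.partition("#")
        if fragment:
            pages_by_fragment.setdefault(fragment, set()).add(page)

    moved, removed = {}, set()
    for url in sorted(old_urls - new_urls):
        page, _, fragment = url.partition("#")
        candidates = pages_by_fragment.get(fragment, set()) - {page}
        if len(candidates) == 1:
            # Unambiguous new home: record old URL -> new URL.
            moved[url] = f"{candidates.pop()}#{fragment}"
        else:
            # No match, or several possible targets: treat as removed.
            removed.add(url)
    return moved, removed


# Made-up sample data; "reference/special-methods.html" is a hypothetical page.
old = {
    "reference/datamodel.html#object.__hash__",
    "library/sqlite3.html#sqlite3.version",
}
new = {"reference/special-methods.html#object.__hash__"}
moved, removed = classify_dead_urls(old, new)
print(moved)
print(removed)
```

The `moved` mapping is the kind of data a client-side redirect would need; the `removed` set still needs a human decision.
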
barneygale commented 1 week ago

A very basic solution might be to redirect users to search.html, and supply the URL fragment as the search query. This would work OK for terms and python references, but not heading permalinks.
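
A minimal sketch of that fallback, written in Python only to show the URL transformation (the real redirect would have to run in the browser). It assumes the first path segment is the docs version and that the search page reads its query from the `q` parameter, as the stock Sphinx search page does; the function name `search_fallback` is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def search_fallback(dead_url):
    """Turn a dead deep link into a search URL in the same docs version."""
    parts = urlsplit(dead_url)
    # Assumption: the first path segment is the version, e.g. "/3/" or "/3.13/".
    version = parts.path.split("/")[1]
    # Assumption: the search page takes its query from the "q" parameter.
    query = "q=" + quote(parts.fragment)
    return urlunsplit((parts.scheme, parts.netloc, f"/{version}/search.html", query, ""))


print(search_fallback("https://docs.python.org/3/library/sqlite3.html#sqlite3.version_info"))
# https://docs.python.org/3/search.html?q=sqlite3.version_info
```

A heading permalink such as `#set-types` would produce a query like `set-types`, which is less likely to find the right section than a dotted Python name.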

picnixz commented 1 week ago

This one is weird: library/json.html#cmdoption-json.tool-indent. Nothing seems to have changed in the rst between 3.13 and 3.14 and this could be a Sphinx issue. I think we had an issue for that somewhere but I forgot. I'll need to investigate.

NVM, the program was changed.

rhettinger commented 1 week ago

@nedbat Does the docs WG want to take a position with regard to docs stability versus refactoring into smaller chunks in hopes that SEO will be improved?

JelleZijlstra commented 1 week ago

I think this should be motivated not just by SEO, but also by improving the usability of the docs. It's a very large file that covers a lot of ground, and the way it's organized isn't necessarily the best. That may be bad for SEO, but it's also not ideal for human readers.

Currently the file has not just a discussion of Python's general "data model", the way data is represented, but also detailed documentation about some precise types, such as code objects. That documentation might fit better at https://docs.python.org/3/library/types.html#types.CodeType, so the data model page can focus more on behavior of the core language. Similarly, the data model page has discussion of numbers.Number and similar classes, which feels a bit out of place, as those are library ABCs, not core parts of the language. On the other hand, memoryview, a builtin, isn't mentioned as part of the "standard type hierarchy". Some of the file also duplicates the stdtypes page: compare https://docs.python.org/3/reference/datamodel.html#set-types and https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset.

I also agree we should avoid breaking links. If we want to be very strict in this, we could build some tooling that records e.g. all anchor targets in an old version of the docs and asserts they continue to work.
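
One possible shape for such tooling, as a sketch rather than a proposal for the actual docs build: collect every `id` attribute from the old and new HTML of a page and report the ones that disappeared. The names `AnchorCollector`, `anchors_in` and `missing_anchors` are hypothetical, and the two HTML snippets are made-up stand-ins for rendered pages.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every id attribute, i.e. every possible #fragment target."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.anchors.add(value)


def anchors_in(html):
    collector = AnchorCollector()
    collector.feed(html)
    return collector.anchors


def missing_anchors(old_html, new_html):
    """Return anchors present in the old page but absent from the new one."""
    return anchors_in(old_html) - anchors_in(new_html)


# Made-up snippets standing in for the old and new rendering of one page.
old_page = '<section id="set-types"><dt id="object.__hash__"></dt></section>'
new_page = '<section id="set-types"></section>'
print(missing_anchors(old_page, new_page))  # {'object.__hash__'}
```

A CI check could fail whenever the result is non-empty and the missing anchor has no recorded redirect.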

willingc commented 1 week ago

A few general considerations on splitting up a long page in the Language Reference (which this is). I'm speaking from my perspective and not for the entire @python/editorial-board. I would urge us to be more conservative with the Language Reference docs than the Library docs since it is the definition of the Python language.

  1. User experience and discoverability are more important than SEO.
  2. SEO is important, but we should have a plan of how we will use the SEO information.
  3. Like @hugovk mentions, changes, if made, would need to account for orphaned pages to keep existing links working.

willingc commented 1 week ago

> Currently the file has not just a discussion of Python's general "data model", the way data is represented, but also detailed documentation about some precise types, such as code objects. That documentation might fit better at https://docs.python.org/3/library/types.html#types.CodeType, so the data model page can focus more on behavior of the core language.

@JelleZijlstra's example is in line with my thinking when it comes to Language Reference changes vs. Library Doc changes.

barneygale commented 1 week ago

> User experience and discoverability are more important than SEO.

To be clear, UX and discoverability are the entire reason I care about SEO here!

willingc commented 1 week ago

> To be clear, UX and discoverability are the entire reason I care about SEO here!

I understand your intent. To restate, if improvements to SEO negatively impact UX and discoverability, we should pass until the negatives are mitigated. As an aside, the exclamation point wasn't necessary in the earlier response.

barneygale commented 1 week ago

Sorry!

nedbat commented 1 week ago

I think the page is too long, and splitting it up would improve both UX and SEO. It sounds like there is probably a way to reasonably preserve old links, though that still needs some investigation. It's a big job that should be done with care.

ncoghlan commented 1 week ago

As there seems to be consensus that a technical improvement around preserving deep links is needed before we embark on any major layout changes, I filed that request as a docsbuild-scripts issue: https://github.com/python/docs-community/issues/134 (even if using the technical solution ends up being a CPython change, creating that solution seemed more like a docs build question to me).

nedbat commented 1 week ago

Another good first step is making a concrete proposal about how the page would be split up. I know from my own work on the devguide that it's easy to look through an existing document and be certain that it could be reshaped into something better. When you actually sit down to do the reshaping, difficulties arise, decisions have to be made, and so on. Does someone want to write a doc somewhere that shows how a split page would be structured?

picnixz commented 1 week ago

My first impression is: split them by classes first. They are good on their own IMO. And each class can be regrouped by topic (e.g. strings, numerics, collections, etc). I can sketch a rough idea if you want (maybe by the end of the afternoon).

ncoghlan commented 1 week ago

I'm not sure about the Data Model page, but @nedbat's question prompted me to add a draft split for the builtin types page in https://github.com/python/cpython/issues/126052#issuecomment-2447175975 (giving str its own page would also mean we could finally move the details of the format string syntax out of the string module docs).

willingc commented 1 week ago

Perhaps the most conservative first iteration after getting the linking resolved would be to split the doc where there are natural breaks: 3.1, 3.2, 3.3 and 3.4. This will keep familiarity initially, and it does not preclude us from further splitting classes and 3.2 in future iterations.