pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Broken API links in the user guide (404 page not found) + stale documentation example (fetch function) #18852

Open npielawski opened 2 hours ago

npielawski commented 2 hours ago

Description

I noticed a dead link in the user guide, so I wrote a script to probe all links in docs/source/_build/API_REFERENCE_LINKS.yml. It turns out that there are 22 dead links (HTTP response != 200) in the user guide. Many links are stale and need to be updated, and there are a few typos, too. The issue concerns both Python and Rust links.

As an example, the Expressions / Aggregation page has a broken link: click on the API link for Categorical under the Python code example.

I made another script to look for stale tags, i.e. tags in the YAML file that are not referenced anywhere in the docs, and there are 17 such instances. This assumes that the tags are only used in docs/ and that every line referencing one contains code_block. I am double-checking the hits manually to make sure there are no false positives.

Finally, the fetch link (which gives a 404) doesn't have an API documentation page anymore, likely because the function was deprecated. It would be best to rewrite the section in docs/source/user-guide/lazy/execution.md (L52-79) and use head + collect instead, since this is what the source code recommends.

If this issue is accepted, I can submit a PR to update the links (I have already done the work), and I can start writing a new "Execution on a partial dataset" section as well. I am wondering whether the stale tags should be removed at all (their links all return HTTP 200), and I am not 100% certain I won't break something by removing them.

Here is the list of links:

https://docs.pola.rs/api/python/stable/reference/api/polars.Categorical.html
https://docs.pola.rs/api/python/stable/lazyframe/api/polars.lazyframe.engine_config.GPUEngine.html
https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.col.html
https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.prefix.html
https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.suffix.html
https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_alias.html
https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.fetch.html
https://docs.pola.rs/api/python/stable/reference/sql
https://docs.pola.rs/api/python/stable/reference/api/polars.SQLContext.register.html#polars.SQLContext.register
https://docs.pola.rs/api/python/stable/reference/api/polars.SQLContext.register_many.html
https://docs.pola.rs/api/python/stable/reference/api/polars.SQLContext.query.html
https://docs.pola.rs/api/python/stable/reference/api/polars.SQLContext.execute.html
https://docs.pola.rs/api/python/stable/reference/api/polars.date_range.html
https://docs.pola.rs/api/python/stable/reference/api/polars.Array.html
https://docs.pola.rs/api/rust/dev/polars_core/frame/hash_join/index.html
https://docs.pola.rs/api/python/stable/reference/sql.html
https://docs.pola.rs/api/rust/dev/polars_io/csv/struct.CsvReader.html
https://docs.pola.rs/api/rust/dev/polars_io/csv/struct.CsvWriter.html
https://docs.pola.rs/api/rust/dev/polars_io/parquet/struct.ParquetReader.html
https://docs.pola.rs/api/rust/dev/polars_io/parquet/struct.ParquetWriter.html
https://docs.pola.rs/api/rust/dev/polars_io/prelude/struct.IpcReader.html
https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/fn.concat_lst.html

Here is the list of unused tags:

GPUEngine
Config
min
max
prefix
suffix
map_alias
concat_list
implode
read_database_connectorx
Series.dt.day
min
max
implode
arr.eval
concat_list
Series.dt.day

The code to find all broken links (run in docs/source/_build):

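# Print every URL in API_REFERENCE_LINKS.yml that does not return HTTP 200.
for url in $(grep -o "https://.*$" API_REFERENCE_LINKS.yml)
do
  # Compare the status code itself, so the check works for HTTP/1.1 and HTTP/2 alike.
  status=$(curl -s -o /dev/null -w "%{http_code}" "$url")
  if [ "$status" != "200" ]
  then
    echo "$url"
  fi
done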
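For anyone who would rather run the same check without curl, here is a rough Python sketch of the idea using only the standard library. The function names are my own, not anything from the Polars repo. One difference to be aware of: urllib follows redirects, whereas curl without -L does not, so a redirecting link may count as alive here but dead in the shell version.

```python
import re
import urllib.error
import urllib.request
from typing import Optional

# Matches each https URL up to the next whitespace character.
URL_PATTERN = re.compile(r"https://\S+")


def extract_urls(yaml_text: str) -> list[str]:
    """Return every https URL found in the text, in order of appearance."""
    return URL_PATTERN.findall(yaml_text)


def status_of(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request failed outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 for a dead link
    except OSError:
        return None  # DNS failure, timeout, refused connection, ...


def dead_links(yaml_text: str) -> list[str]:
    """Return the URLs whose status is anything other than 200."""
    return [url for url in extract_urls(yaml_text) if status_of(url) != 200]
```

Calling dead_links on the contents of API_REFERENCE_LINKS.yml should give a list comparable to the one above, network conditions permitting.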

The code to find stale tags (run in docs/source/_build, using ripgrep):

#!/bin/bash
for tag in `yq '.[] | keys | .[]' API_REFERENCE_LINKS.yml`
do
  if ! rg -q "code_block.*'$tag'" ../..
  then
    echo $tag
    # Uncomment to go manually through the hits and avoid false positives
    # rg "$tag" --iglob "!API_REFERENCE_LINKS.yml" ../..
  fi
done
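The same check can be sketched in Python to make the assumption explicit: a tag counts as used only if it appears in quotes on a line that also contains code_block. The function name and the sample data below are hypothetical, and the YAML is assumed to be already parsed into a mapping of language to tags. Unlike the shell version, the tag is escaped here, so a dot in Series.dt.day only matches a literal dot.

```python
import re


def unused_tags(links: dict[str, dict[str, object]], docs_text: str) -> list[str]:
    """Return tags that never appear, quoted, on a line containing code_block.

    `links` mirrors the parsed YAML file: language -> {tag: link info}.
    """
    unused = []
    for tags in links.values():
        for tag in tags:
            # "." does not match newlines, so the search stays within one line.
            pattern = re.compile(r"code_block.*'" + re.escape(tag) + "'")
            if not pattern.search(docs_text):
                unused.append(tag)
    return unused


# Made-up sample input, not taken from the real docs.
sample_links = {
    "python": {"col": "link-1", "prefix": "link-2"},
    "rust": {"col": "link-3"},
}
sample_docs = (
    "{{code_block('user-guide/example','select',['col'])}}\n"
    "Plain text that mentions 'prefix' outside of any code block macro.\n"
)

print(unused_tags(sample_links, sample_docs))
```

A tag that is used for only one language is still reported for the other, which matches the duplicates in the list above.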

Link

https://docs.pola.rs/user-guide/expressions/aggregation/

rodrigogiraoserrao commented 2 hours ago

Hey there, thanks for this!

Regarding the link for fetch / the section “Execution on a partial dataset”, see https://github.com/pola-rs/polars/pull/18033. My suggestion is that you share with the OP of that PR what you intended to do with head + collect.

As for the broken links, please do submit the corrected links.

When you talk about “stale tags”, I assume you are talking about entries in the YAML file that are not referenced in code blocks, is that it?

npielawski commented 2 hours ago

> When you talk about “stale tags”, I assume you are talking about entries in the YAML file that are not referenced in code blocks, is that it?

Yes exactly

rodrigogiraoserrao commented 1 hour ago

> Yes exactly

Ok, I see. To be honest with you, I am not 100% sure if those tags are relevant elsewhere, so if those links are all just working fine, I'd recommend we keep them for now.

Fixing the links that are both broken and in use seems more useful in the short term, and since you were kind enough to share the scripts you used to check the links, we can always go through the tags again later.