hynek / doc2dash

Create docsets for Dash.app-compatible API browsers.
https://doc2dash.hynek.me/
MIT License
558 stars 38 forks

CatBoost documentation doesn't contain a known documentation format #206

Closed capac closed 2 months ago

capac commented 2 months ago

I'm trying to generate Dash documentation for CatBoost. I built the CatBoost documentation successfully by following the instructions in its README, but when I run doc2dash on the result I get the following error message:

doc2dash -n "catboost 1.2.5" -d "/Users/angelo/Library/Application Support/doc2dash/DocSets/catboost/1-2-5/" --icon-2x "/Users/angelo/Pictures/Icons/dash/catboost/icon@2x.png" -v -j -u "https://catboost.ai/en/docs/" -I "/Users/angelo/Programming/docs/catboost/docs-gen/en/index.html" ./ -a -f

"/Users/angelo/Programming/docs/catboost/docs-gen/en" does not contain a known documentation format.

Any suggestions?

Angelo

hynek commented 2 months ago

Looks like they're using their own format for docs: https://github.com/catboost/catboost/tree/master/catboost/docs

You can probably only import it as a generic HTML document, as documented here: https://kapeli.com/docsets#dashDocset
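For reference, the generic route on that page boils down to filling a small SQLite index by hand. The following is a minimal sketch of that step, assuming the `searchIndex` schema described in the linked Kapeli guide; the function name `write_search_index` and the sample entry are hypothetical and only serve as illustration.

```python
import sqlite3


def write_search_index(db_path, entries):
    """Create the Dash searchIndex table (if needed) and add entries.

    `entries` is an iterable of (name, type, path) tuples, where `path`
    is relative to the docset's Documents folder. Returns the total
    number of rows in the index afterwards.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Schema as given in the Kapeli docset guide.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS searchIndex("
            "id INTEGER PRIMARY KEY, name TEXT, type TEXT, path TEXT)"
        )
        # The unique index makes INSERT OR IGNORE skip duplicate rows.
        conn.execute(
            "CREATE UNIQUE INDEX IF NOT EXISTS anchor "
            "ON searchIndex (name, type, path)"
        )
        conn.executemany(
            "INSERT OR IGNORE INTO searchIndex(name, type, path) "
            "VALUES (?, ?, ?)",
            entries,
        )
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM searchIndex").fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical entry; ":memory:" avoids touching the file system.
    print(write_search_index(
        ":memory:", [("Overview", "Guide", "concepts/installation.html")]
    ))
```

In a real docset the database would live at `<name>.docset/Contents/Resources/docSet.dsidx`, and the entries would have to come from whatever can be extracted from the CatBoost pages.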

capac commented 2 months ago

I tried following your suggestion to import the generic HTML files, but as far as I can tell the HTML may not be so "generic". I know very little about JavaScript, but I think the HTML uses a JavaScript library (probably app.client.js) to extract the data from a code block embedded in the HTML file itself and then render it in the browser. This can already be seen in the index.html file in the root of the CatBoost documentation directory. The Python script I found that populates the SQLite index (which uses BeautifulSoup) can't parse any of the HTML from the index.html file. To make this clearer, I've attached a portion of the HTML from the index.html file:


        <!DOCTYPE html>
        <html lang="en">
            <head>
                <meta charset="utf-8">
                <meta name="title" content="CatBoost">
                <meta name="viewport" content="width=device-width, initial-scale=1.0">
                <title>CatBoost | CatBoost</title>
                <style type="text/css">
                    body {
                        height: 100vh;
                    }
                </style>
                <link type="text/css" rel="stylesheet" href="../_bundle/app.client.css" />

            </head>
            <body class="yc-root yc-root_theme_light">
                <div id="root"></div>
                <script type="application/javascript">
                   window.STATIC_CONTENT = false
                   window.__DATA__ = {"data":{"leading":true,"toc":{"title":"CatBoost","href":"index.html","items":[{"name":"Installation","expanded":true,"items":[{"href":"concepts/installation.html","name":"Overview","id":"Overview-0-0.47378348458599273"},

[...]

                  {"title":"Videos","href":"concepts/educational-materials-videos"}]}]},"meta":{"title":"CatBoost","style":[],"script":[]}},"router":{"pathname":"index.html"},"lang":"en"};
                </script>
                <script type="application/javascript" src="../_bundle/app.client.js"></script>
            </body>
        </html>

The data block looks a lot like JSON, so it would probably be a good idea to modify the script to parse that JSON block. Do you have any other suggestions? Thanks a lot.

Cheers, Angelo
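The idea of parsing the embedded JSON block could be sketched roughly as follows. This is a minimal sketch, assuming the page assigns a plain JSON object to `window.__DATA__` as in the excerpt above, and that the table of contents sits under `data -> toc -> items` with `name` (or `title`) and `href` keys. The helper names `extract_page_data` and `iter_toc_entries` are hypothetical, and the elided parts of the real file may be structured differently.

```python
import json
import re

# Matches the start of the assignment seen in the excerpt.
_DATA_RE = re.compile(r"window\.__DATA__\s*=\s*")


def extract_page_data(html):
    """Return the object assigned to window.__DATA__ in `html`."""
    match = _DATA_RE.search(html)
    if match is None:
        raise ValueError("no window.__DATA__ assignment found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ';' and '</script>').
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


def iter_toc_entries(items):
    """Yield (label, href) for every TOC item that links to a page."""
    for item in items:
        label = item.get("name") or item.get("title")
        href = item.get("href")
        if label and href:
            yield label, href
        # Section nodes nest their children under "items".
        yield from iter_toc_entries(item.get("items", []))


if __name__ == "__main__":
    # Shortened stand-in for the index.html excerpt above.
    sample = (
        '<script>window.STATIC_CONTENT = false\n'
        'window.__DATA__ = {"data":{"toc":{"title":"CatBoost",'
        '"href":"index.html","items":[{"name":"Installation",'
        '"expanded":true,"items":[{"href":"concepts/installation.html",'
        '"name":"Overview"}]}]}},"lang":"en"};</script>'
    )
    data = extract_page_data(sample)
    for label, href in iter_toc_entries(data["data"]["toc"]["items"]):
        print(label, href)
```

The resulting pairs only cover the table of contents. Whether the page bodies are also available in JSON form would need to be checked against the generated files.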

hynek commented 2 months ago

I mean, if it is important to you and you're so inclined, you can try to write a custom parser: https://doc2dash.hynek.me/en/stable/extending/ If it's some client-side shenanigans, there might be a chance to find the data somewhere in JSON form or something.