Write script to validate hyperlinks in docs translation

akien-mga commented 4 years ago

This is mainly needed for https://github.com/godotengine/godot-docs-l10n, but that repo exists mostly for practical reasons, so the actual discussion should likely happen here.

I've noticed that in our docs translations, it's common to find broken links, e.g. (screenshots: Screenshot_20190725_104032, Screenshot_20190725_104050).

In the above two examples, the cause is either a formatting issue (Japanese doesn't separate words with spaces, but reST seems to require a space after the trailing _ of an external link) or a translator mistake from unfamiliarity with the markup (the French translator simply removed the markup).

I guess Sphinx might be able to warn about some of these, but likely not all cases (e.g. the French example above would not be seen as invalid formatting).

I think the best solution would be a script (likely in Python) that I could run over all .po files in https://github.com/godotengine/godot-docs-l10n/tree/master/weblate to find such formatting issues, or mismatches between the source string (msgid) and its translation (msgstr). For mismatches, gettext may already have a feature for that, though I'm not sure it would support reST-specific markup (it does have a feature to warn about mismatches in trailing spaces or newlines, for example).

.po files are wrapped, so to parse the markup properly the parser should be able to unwrap the lines and consider the whole string. It could then also be used on the .pot template to check for badly formatted links in the English source.
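As a rough starting point, here is a minimal sketch of such a check using only the Python standard library. It is not an existing tool: `parse_po`, `link_targets`, `find_link_mismatches` and the `SAMPLE` catalog are hypothetical names made up for illustration. It assumes simple entries only (plural forms and `msgctxt` are ignored), unquotes strings with `ast.literal_eval` as an approximation of PO escaping, and approximates the reST recognition rules by requiring a non-word character on both sides of a link.

```python
import ast
import re

# `text <target>`_ (or `__`), not glued to a word character on either side.
# This only approximates the docutils inline markup recognition rules.
LINK_RE = re.compile(r"(?<![\w\\])`[^`<]+<([^`>]+)>`__?(?!\w)")


def parse_po(text):
    """Return (msgid, msgstr) pairs with wrapped lines joined back together."""
    entries = []
    current = None  # lists of string fragments for the entry being read
    field = None    # "msgid", "msgstr", or None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            current = {"msgid": [], "msgstr": []}
            entries.append(current)
            field, line = "msgid", line[len("msgid "):]
        elif line.startswith("msgstr "):
            field, line = "msgstr", line[len("msgstr "):]
        elif not line.startswith('"'):
            field = None  # comment, blank line, or unsupported keyword
        if field and current is not None and line.startswith('"'):
            # Assumption: PO string escapes are close enough to Python's.
            current[field].append(ast.literal_eval(line))
    return [("".join(e["msgid"]), "".join(e["msgstr"])) for e in entries]


def link_targets(string):
    """Set of targets of well-formed external links found in string."""
    return set(LINK_RE.findall(string))


def find_link_mismatches(entries):
    """List (msgid, missing_targets) for translated entries that lost links."""
    problems = []
    for msgid, msgstr in entries:
        if not msgstr:
            continue  # untranslated entry, nothing to compare
        missing = link_targets(msgid) - link_targets(msgstr)
        if missing:
            problems.append((msgid, sorted(missing)))
    return problems


# Made-up catalog: markup removed, link glued to Japanese text, correct link.
SAMPLE = r'''
#: docs/index.rst:10
msgid ""
"See the `Godot website <https://godotengine.org>`_ for "
"details."
msgstr ""
"Voir le site web de Godot pour "
"plus de détails."

msgid "Read `the FAQ <https://example.org/faq>`_ first."
msgstr "まず`FAQ <https://example.org/faq>`_を読んでください。"

msgid "Read `the FAQ <https://example.org/faq>`_ first."
msgstr "Lisez d'abord `la FAQ <https://example.org/faq>`_."
'''

if __name__ == "__main__":
    for msgid, missing in find_link_mismatches(parse_po(SAMPLE)):
        print(f"{msgid!r}: missing or malformed link(s) {missing}")
```

Running `link_targets` over the msgid strings of the .pot template alone would be one way to look for badly formatted links in the English source as well.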

Any Python and parser-loving volunteer? :P

akien-mga commented 4 years ago

The rules for inline markup recognition are in http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-markup-recognition-rules

They do mention:

For languages that don't use whitespace between words (e.g. Japanese or Chinese) it is recommended to set simple-inline-markup to True and eventually escape inline markup characters. The examples breaking rules 6 and 7 above show which constructs may need special attention.

But that's for docutils. I'm not sure we can enable it for Sphinx, nor whether it would be wise, as we have a lot of code examples that would then need to be escaped manually.

Also found "Gotchas" here: https://www.sphinx-doc.org/en/2.0/usage/restructuredtext/basics.html#gotchas

Separation of inline markup: As said above, inline markup spans must be separated from the surrounding text by non-word characters, you have to use a backslash-escaped space to get around that. See the reference for the details.

That's probably what Japanese and Chinese translators should then do to avoid having to put extra visible spaces around links.
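To illustrate the backslash-escaped space, here is a small sketch that inserts `\ ` wherever a link touches a letter or digit. The helper name `escape_link_boundaries` and the sample sentence are made up for this example, and the regex is a simplification that only looks for backtick-delimited references ending in `_` or `__`.

```python
import re

# Backtick-delimited reference ending in `_` or `__` (simplified pattern).
LINK_RE = re.compile(r"`[^`]+`__?")


def escape_link_boundaries(text):
    """Add a backslash-escaped space where a link touches a letter or digit."""
    def fix(match):
        start, end = match.span()
        before = "\\ " if start > 0 and text[start - 1].isalnum() else ""
        after = "\\ " if end < len(text) and text[end].isalnum() else ""
        return before + match.group(0) + after

    return LINK_RE.sub(fix, text)


if __name__ == "__main__":
    broken = "まず`FAQ <https://example.org/faq>`_を読んでください。"
    print(escape_link_boundaries(broken))
```

The escaped space satisfies the separation rule for the parser but should not show up as a visible space in the rendered page.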

omicron321 commented 3 years ago

that would be useful for general purposes

i am running into dead links in the English content

i have done a check pass with a limited online tool (brokenlinkcheck), which found broken links:

https://www.360toolkit.co/convert-cubemap-to-spherical-equirectangular.html
https://github.com/godotengine/godot/blob/master/drivers/gles2/shaders/copy.glsl
https://github.com/GodotNativeTools/GDNative-demos/tree/master/c/SimpleDemo
https://aur.archlinux.org/packages/mingw-w64-gcc/
https://github.com/godotengine/godot/blob/master/core/pool_vector.cpp
https://github.com/godotengine/godot/blob/master/scene/audio/audio_player.cpp
https://github.com/godotengine/godot/blob/master/core/message_queue.cpp
https://godot.eska.me/irc-logs/
https://godot.readthedocs.io/en/latest/tutorials/misc/running_code_in_the_editor.html
https://blog.escapecreative.com/customizing-mailto-links/
https://docs.godotengine.org/en/latest/tutorials/viewports/multiple_resolutions.html
https://docs.godotengine.org/en/latest/getting_started/workflow/assets/importing_images.html
https://docs.godotengine.org/en/latest/classes/class_@c
https://docs.godotengine.org/en/latest/getting_started/workflow/export/feature_tags.html

some third party links are obsolete, some gh resources were moved slightly and are now broken links in the docs, and the C# class reference link is dead (the docs generate a link with an unescaped # char), for example
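for a scripted version of that pass, a minimal sketch with the Python standard library could look like the following. `check_url` and `find_dead_links` are hypothetical helper names, not part of any existing tool, and the `opener` parameter is only there so the network call can be swapped out. it assumes servers answer HEAD requests, which some do not, so a GET fallback may be needed in practice.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response was obtained."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, e.g. with 404
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None  # DNS failure, refused connection, timeout, malformed URL


def find_dead_links(urls, **kwargs):
    """Return (url, status) pairs for URLs that failed or answered with >= 400."""
    dead = []
    for url in urls:
        status = check_url(url, **kwargs)
        if status is None or status >= 400:
            dead.append((url, status))
    return dead


if __name__ == "__main__":
    for url, status in find_dead_links(["https://godot.eska.me/irc-logs/"]):
        print(status, url)
```

note that everything after an unescaped # is treated as a URL fragment and is not sent to the server, so a link like the C# one would need separate handling.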

godotengine / godot-docs

Write script to validate hyperlinks in docs translation #2654