openSUSE / obs-docu

Official Open Build Service Documentation. Content gets reviewed and edited. Generated books are available at http://www.openbuildservice.org
34 stars 70 forks source link

SEO: rewrite certain xml:id values and validate against Geekodoc #364

Closed smithfarm closed 4 months ago

smithfarm commented 4 months ago

For SEO optimization, we need to avoid the use of dots ('.') and underscores ('_') in our xml:id values.

To be merged together with companion PR: https://github.com/openSUSE/obs-landing/pull/466

hennevogel commented 4 months ago

Phew this will break a lot of external links right?

smithfarm commented 4 months ago

Did we ever say that our links live forever? All the other SUSE documentation had to take this poison pill, so why not ours?

hennevogel commented 4 months ago

Did we ever say that our links live forever?

No but that doesn't mean we should not give a 💩 either. obs-landing knows redirects. At least for the sections etc. (everything but anchors) we should do those. Like

/help/manuals/obs-user-guide/par.first_steps /help/manuals/obs-user-guide/par-first-steps
smithfarm commented 4 months ago

@hennevogel OK, I'll create a companion PR at obs-landing.

tomschr commented 4 months ago

Did we ever say that our links live forever? All the other SUSE documentation had to take this poison pill, so why not ours?

That's true, we faced this problem in the past. We mitigated this by adding redirect rules in our Apache .htaccess file. For the future, we introduced a "don't change your ID!" rule.

Still we face this problem from time to time when some clever managers want to change the product names. :wink:

If IDs are part of URLs, there is not much you can do to avoid this problem. Sometimes it's inevitable. You can make people aware of that to mitigate the issue.

For the record, our styleguide contains a section about Identifiers. Just to give you an idea how we use and create IDs.

smithfarm commented 4 months ago

That's true, we faced this problem in the past. We mitigated this by adding redirect rules in our Apache .htaccess file.

As already suggested by @hennevogel I have opened a companion PR with corresponding redirects: https://github.com/openSUSE/obs-landing/pull/466

tomschr commented 4 months ago

I have opened a companion PR with corresponding redirects: openSUSE/obs-landing#466

Yes, saw it. Just wanted to give you some context from our side. :slightly_smiling_face:

smithfarm commented 4 months ago

We seem to have reached a consensus that migrating to Geekodoc is the way to go, so as long as there are no objections to the additional changes I have added, this PR seems ready to go in (together with its counterpart PR over at obs-landing).

tomschr commented 4 months ago

I've did a quick validation check and there is more work ahead. :wink:

I forgot to mention that the xml:id is only one part. You also need to modify the value of the linkend attribute. It appears inside a <xref linkend="..."/> element. These are internal cross references to other parts of the document. I'm sorry for the additional work.

@smithfarm Nathan, I've experimented a bit to make this work more efficient. After I was unsuccessful with sed and xmlstarlet, I wrote a small Python script:

# modifydocbook.py
import re
import sys

pattern = re.compile(r'(xml:id|linkend)="([^"]+)"')  # Regular expression pattern

def modify_docbook_xml(file_path):
  """
  Read a DocBook XML file by replacing dots and underscores with dashes
  in xml:id and linkend attributes.
  Prints the modified content to standard output.

  Args:
    file_path: Path to the original DocBook XML file.
  """

  with open(file_path, 'r') as fh:
    for line in fh:
        match = pattern.search(line)
        if match:
            # line = line.rstrip()
            attr_value = match.group(2).replace('.', '-').replace('_', '-')
            replacement = rf'\1="{attr_value}"'
            line = pattern.sub(replacement, line)

        print(line, end="")

if __name__ == "__main__":
    modify_docbook_xml(sys.argv[1])

If you install sponge (package name is the same) you can do it in one go:

$ for xml in xml/*.xml; do 
  python3 modifydocbook.py $xml | sponge $xml
done

After this process, there is one ID (co-obs-sserv-struct-name) which occurs twice. If I fix that as well and correct the ROOTID in the DC files, both admin and user guide are valid.

smithfarm commented 4 months ago

You also need to modify the value of the linkend attribute.

@tomschr Thanks! But I believe I did that already... Do you see any linkend attribute value that I missed?

tomschr commented 4 months ago

Thanks! But I believe I did that already...

daps thinks otherwise. :wink: It reports some xml:id and linkend errors. I count 14 files that are modified after I apply my script. :slightly_smiling_face:

smithfarm commented 4 months ago

@tomschr Instead of me duplicating your work, could you just push the changes right into the PR? Or just push a branch with the commit and I'll cherry-pick it in...

smithfarm commented 4 months ago

@tomschr The odd thing is, Geekodoc no longer complains about the . and _ characters in the xml:id attribute values.

I mean, I'm trying to get to them all, but I found I missed some - yet Geekodoc (daps) did not complain.

smithfarm commented 4 months ago

daps thinks otherwise. 😉 It reports some xml:id and linkend errors.

Where? Here in the PR it says this:

[INFO] Building DC-obs-all
[DEBUG] Validate inside container docker exec fd9b38640981f186599fbbaca8e9a91df877a2104d61d86db9027b55b2ae5468 daps -d /daps_temp-dkjdVP1OK7/source/DC-obs-all validate 
[DEBUG] Validation for DC-obs-all result was 0
[DEBUG] Checking validation result...
daps -d /daps_temp-dkjdVP1OK7/source/DC-obs-all html  
Your output documents are:
/tmp/daps2docker-2DhNs4mF/obs-all/html/obs-all/

I thought this meant that daps thinks the XML is valid.

tomschr commented 4 months ago

daps thinks otherwise. 😉 It reports some xml:id and linkend errors.

Where? Here in the PR it says this: ... I thought this meant that daps thinks the XML is valid.

Use the option -vv and look at the beginning. You will see probably something like this:

     DOCBOOK_VERSION: 5
        DOCBOOK5_RNG: /usr/share/xml/docbook/schema/rng/5.2/docbookxi.rng

If that's the case, you still validate against DocBook 5. However, DocBook doesn't have any restriction regarding IDs.

To really validate against Geekdoc, you need daps to tell what schema to use. In your ~/.config/daps/dapsrc file, add the following line:

DOCBOOK5_RNG_URI="urn:x-suse:rng:v2:geekodoc-flat"

Then validate again and you should hopefully see some messages. For me, it looks like this:

[ERROR]: The document contains XML errors:
/home/toms/repos/GH/opensuse/obs-docu/build/.profiled/bogus_opensuse_novell/book-obs-user-guide.xml:6:217: error: value of attribute "xml:id" is invalid; must be an XML name without colons matching the regular expression "[\-0-9a-zA-Z]+"
tomschr commented 4 months ago

@tomschr Instead of me duplicating your work, could you just push the changes right into the PR? Or just push a branch with the commit and I'll cherry-pick it in...

@smithfarm Absolutely! I just didn't want to intervene with your work so my next questions would be exactly that. :+1: I've committed the changes in commit 05a8958, see the commit message for details.

smithfarm commented 4 months ago

@tomschr Thanks! Much appreciated.