rwth-iat / word2asciidoc

Apache License 2.0

Fix Image/Table/Section links in asciidoc #1

Open zrgt opened 11 months ago

zrgt commented 11 months ago

Currently, all references to images, tables and sections in the generated AsciiDoc are just static text.

We need to make these references interactive, so that clicking one jumps to the corresponding image, table or section. To do that, we need to define a link name for each image and table referenced in the text, and use that name for the links in the text.

Former issue: https://github.com/rwth-iat/aas-specs/issues/2

monsieuremre commented 11 months ago

I have tackled this issue locally, but before creating a pull request I want to get some things straight. Here is my approach so far.

Create a Python function that does the following, in order, to an AsciiDoc file:

Now every mention of a figure refers to the respective figure. I have the script and it works perfectly. I have not tested it on tables yet, but it should work there too.
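A minimal sketch of what such a regex pass could look like. The anchor id scheme (`figure-3`, `table-12`) and the function name `link_mentions` are assumptions for illustration, not taken from the actual script; `<<id,text>>` is the standard AsciiDoc cross-reference syntax.

```python
import re

# Hypothetical id scheme: "Figure 3" -> anchor id "figure-3".
# The lookbehind skips text that already sits inside <<id,Figure 3>>,
# so running the function twice does not nest links.
MENTION = re.compile(r"(?<!,)\b(Figure|Table) (\d+)\b")


def link_mentions(text: str) -> str:
    """Turn plain 'Figure N' / 'Table N' mentions into AsciiDoc cross-references."""
    out = []
    for line in text.splitlines():
        # Block titles (".Figure 3 ...") are captions, i.e. link targets: leave them alone.
        if line.startswith("."):
            out.append(line)
            continue
        out.append(
            MENTION.sub(
                lambda m: f"<<{m.group(1).lower()}-{m.group(2)},{m.group(0)}>>",
                line,
            )
        )
    return "\n".join(out)
```

This only works if every figure and table carries an anchor with the matching id, which is exactly why the approach depends on sequential numbering.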

The reason I haven't created a pull request yet is that this solution is limited to references that can be counted sequentially. We can't use the same solution for sections, subsections, chapters and clauses, because these are not numbered with a single running count. Take, for example, a reference to Section 5.4.3.2. We can enumerate sections and subsections with a counter to get this hierarchical structure.

I am also not fully satisfied with the solution I have for tables and figures. It does work, but I would rather have something more flexible that refers to figure titles instead of relying on figure numbers. I think I can do this as well with a new script.

What would your advice be? Should I create a pull request for figures and tables and leave sections and subsections as they are for now, or should I try a new implementation that addresses all problems at once? I have some ideas for the latter, but I would like to hear from you first.
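To illustrate the counter idea for hierarchical numbers, here is a rough sketch that derives numbers such as 5.4.3 from AsciiDoc heading levels. It assumes that `==` marks a top-level section and that headings use the `=` prefix style; `number_sections` is a hypothetical helper, not part of the existing script.

```python
import re

# '==' to '======' followed by a title; a single '=' is the document title.
HEADING = re.compile(r"^(={2,6}) +(\S.*)$")


def number_sections(lines):
    """Yield (number, title) for each AsciiDoc section heading, e.g. ('2.1.1', 'Title')."""
    counters = []  # counters[i] is the running count at depth i + 1
    for line in lines:
        m = HEADING.match(line)
        if not m:
            continue
        depth = len(m.group(1)) - 1  # '==' is depth 1
        del counters[depth:]  # forget counts of deeper levels
        while len(counters) < depth:  # pad with zeros if a level was skipped
            counters.append(0)
        counters[depth - 1] += 1
        yield ".".join(map(str, counters)), m.group(2)
```

The resulting numbers could then be matched against mentions like "Section 5.4.3.2" in the text, provided the numbering in the Word source follows the heading hierarchy.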

zrgt commented 11 months ago

Let's discuss it tomorrow in the meeting

monsieuremre commented 11 months ago

There is no clean way to address the issue directly during the initial conversion. Pandoc deliberately removes all cross-references from the documents. See here.

So the only way to make the references functional again seems to be the method I suggested initially: iterating through the lines and modifying the text using regular expressions. This may or may not be doable for sections and chapters with their hierarchical structure, but I will try to find a solution.

I am positive I can write a script to do these changes, but by its very nature it won't be clean. There is going to be some complication added on top that would normally be unnecessary, but there is no other way.

zrgt commented 11 months ago

Try opening the Word file with python-docx (https://python-docx.readthedocs.io/en/latest/) and adding identifiers to all internal references, so that you can find them again once you have converted the Word file to AsciiDoc.

monsieuremre commented 11 months ago

Fixed with this commit. Please test and report anything that needs changing.

monsieuremre commented 11 months ago

> Try opening the Word file with python-docx (https://python-docx.readthedocs.io/en/latest/) and adding identifiers to all internal references, so that you can find them again once you have converted the Word file to AsciiDoc.

This wasn't necessary. At first I tried to open and modify the docx, but this causes a great many side effects, some of which are practically impossible to fix. Any other library or method would come with its own problems. Luckily, this turned out to be completely superfluous anyway. Everything is handled in the AsciiDoc.

The references for everything are in the table of contents in the AsciiDoc. What I do is read this TOC, create a dictionary with the necessary replacements, and run them on the text. Then we delete the TOC like before.

Unfortunately this isn't enough, because the element being referred to also needs to carry the same id as its TOC entry for the reference to resolve. That is not the case out of the box, but the id is there, right next to these elements, and can be moved into the required order to make the links work. That is what the script does. This required changes to the old implementation, because it destroyed every reference header completely instead of fixing its formatting and order.

With the default pandoc behavior, only the text that refers to these elements is converted into plain text; the referenced element and its id stay in the AsciiDoc. The previous implementation completely destroyed these ids, which are meant for referencing and which pandoc does not delete. So I had to remove the code block responsible for this behavior entirely. The ids are now formatted, and the sequence of reference id, block name and the image itself is made consistent across the document, in the desired order. The script now handles the formatting, ordering and functionality of these reference headers.
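The TOC-driven replacement can be sketched roughly as follows. The TOC entry shape `link:#<id>[Figure 1 <title> <page>]` is an assumption about what the conversion emits, and `apply_toc_references` is a hypothetical name; the real script may parse a different format.

```python
import re

# Assumed TOC entry shape (not confirmed in this thread):
#   link:#_Toc100[Figure 1 Overview 4]
TOC_ENTRY = re.compile(
    r"^link:#(?P<id>[\w-]+)\[(?P<label>(?:Figure|Table) \d+)\b[^\]]*\]\s*$"
)


def apply_toc_references(text: str) -> str:
    """Build label -> cross-reference replacements from the TOC, apply them, drop the TOC."""
    replacements = {}
    body = []
    for line in text.splitlines():
        m = TOC_ENTRY.match(line)
        if m:
            replacements[m["label"]] = f"<<{m['id']},{m['label']}>>"
        else:
            body.append(line)  # TOC lines are not copied, which deletes the TOC
    if not replacements:
        return "\n".join(body)
    labels = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, labels)) + r")\b")
    return "\n".join(
        # Block titles (".Figure 1 ...") are link targets, so they stay plain.
        line if line.startswith(".")
        else pattern.sub(lambda m: replacements[m.group(1)], line)
        for line in body
    )
```

The trailing `\b` keeps "Figure 1" from matching inside "Figure 12", so labels missing from the TOC are left untouched.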
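As an illustration of the reordering step, the sketch below normalizes any three adjacent lines consisting of a block anchor, a block title and an image macro into the order anchor, title, image. It assumes the three lines are directly adjacent and use the `[[id]]`, `.Title` and `image::path[]` forms; the actual pandoc output and the real script may differ, and `normalize_figure_headers` is a hypothetical name.

```python
import re

ANCHOR = re.compile(r"^\[\[[^\]]+\]\]$")   # [[_Ref1]]
TITLE = re.compile(r"^\.\S.*$")            # .Figure 1 Overview
IMAGE = re.compile(r"^image::\S+\[.*\]$")  # image::media/image1.png[]


def _rank(line):
    """Return the desired position of a header line, or None for other lines."""
    if ANCHOR.match(line):
        return 0
    if TITLE.match(line):
        return 1
    if IMAGE.match(line):
        return 2
    return None


def normalize_figure_headers(text: str) -> str:
    """Reorder adjacent anchor/title/image triples into a fixed order."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        group = lines[i:i + 3]
        ranks = [_rank(line) for line in group]
        if None not in ranks and sorted(ranks) == [0, 1, 2]:
            out.extend(sorted(group, key=_rank))  # anchor, title, image
            i += 3
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out)
```

Keeping the anchor in place, rather than deleting it, is what lets the cross-references built from the TOC resolve.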

monsieuremre commented 11 months ago

I initially solved this and merged it, then reverted the merge. It still works, but it adds extra manual fixing work. I am going to make it optional and test it more to make sure I am not missing anything.