retext-project / pymarkups

Wrapper around various text markups, used in ReText

https://pymarkups.readthedocs.io/en/latest/

BSD 3-Clause "New" or "Revised" License

35 stars 8 forks source link

Added support for asciidoc #17

Closed dkuhlman closed 2 years ago

dkuhlman commented 2 years ago

Here are changes that add support for converting Asciidoc files.

Could you please take a look. If you are interested and think it's good enough, please merge it.

I'm hoping that someday, Asciidoc support will be added to ReText.

If you are interested in adding this, here are additional comments and questions:

Reporting errors and warnings

Asciidoc exceptions (errors and warnings) are written to stderr. Does that work for other use cases? Is there something else that I should do with them.

Extra required modules

The added code in support of Asciidoc uses the packages: (1) asciidoc3, (2) bs4 (BeautifulSoup4), and (3) lxml.

Need to add bs4 to settings.conf? I'm not sure how to specify dependencies.

Coding conventions

I've tried to follow the conventions in the other .py files. Except that I've used 4 spaces for indentation rather than a tab character.

PEP 8 says this:

Tabs or Spaces?

Spaces are the preferred indentation method.
Tabs should be used solely to remain consistent with code that is
already indented with tabs.
Python disallows mixing tabs and spaces for indentation.

So the question is whether we want consistency within each file or across all files.

Do you have an opinion on this? I can use the command line tools expand/unexpand to convert the new Asciidoc support code to use tabs for indentation, if that's your preference. My preference is to use the standard 4 spaces indent.

Unit tests

I added unit test for Asciidoc.

When I run them, I see several warnings: DeprecationWarning. I can hide them by running the following::

python -Wignore::DeprecationWarning -m unittest test_asciidoc

dkuhlman commented 2 years ago

Dmitry,

Thanks for your comments.

I've made most of the changes that you suggested. See more detailed notes, below. I've pushed these changes to Github.

What I've done:

About using a different parser -- I've switched from bs4 to lxml, which can be installed with pip (pip install lxml).
I modified method available so that it also checks for importing lxml.

I hope this is acceptable. I cannot find a parser in the Python standard distribution that supports HTML, except the one you suggested (html.parser), which is a SAX style event based parser and, so, is not usable for my needs (specifically, extracting the body, title, stylesheet, etc.
Moved that import of asciidoc3 from module level to inside the functions that use it.
Removed references to settings_overrides and self.overrides
Copied the contents of tests/test_asciidoc_data.py into tests/test_ascidoc.py and removed file tests/test_asciidoc_data.py.
Removed functions/tests test_whole_html and test_toc_backrefs from file tests/test_asciidoc.py.
Checked code with flake8. It uses PEP-8, I believe.
Reverted markup2html.py to the revision with your last change. I did not intend to change that file. I apologize.

Dave

dkuhlman commented 2 years ago

Dmitry,

I believe I have addressed the items in your recent suggestions.

I have not yet pushed the changes to Github.

First, a question:

In unit tests, re-worked the test for stylesheet in a attempt to make it less brittle. I'm thinking we should be testing our code in markups to ensure that it works. We should not be testing the code in asciidoc3 for changes in that code. As long as we have captured the title text, body text, and stylesheet text, our code should pass the unit test. So, I've replaced the tests with these:

self.assertTrue(len(body) > 100)
self.assertEqual(title_expected, title)
self.assertTrue(len(stylesheet) > 100)

What do you think?

Dave

dkuhlman commented 2 years ago

Dmitry,

I believe that I have addressed all of you second round of comments/suggestions.

I have pushed these changes to my fork of pymarkups. So, they should show up in the pull request.

Some details --- What I've done:

Removed use of self.converter_class.
Simplified use of StringIO.
Eliminated storing head in ConvertedAsciidoc constructor.
Captured multiple Javascript elements, although I believe there is only one. Better to be safe.
In unit tests, re-worked the test for stylesheet in a attempt to make it less brittle. We should be testing the code in markups to ensure that it works. We should not be testing the code in asciidoc3 for changes in that code. Our unit tests should not fail if changes/improvements are made to the stylesheet and HTML output produced by asciidoc3.

Do you agree?

As long as we have captured the title text, body text, and stylesheet text, our code should pass the unit test. So, I've replaced the tests with these:

self.assertTrue(len(body) > 100) self.assertEqual(title_expected, title) self.assertTrue(len(stylesheet) > 100)

What do you think?
Changed globals in tests/test_asciidoc.py to all upper case. In particular: Changed global Basic_text to BASIC_TEXT.
Removed commented out debug code.
Added install of dependencies to .github/workflows/main.yml:

python -m pip install lxml asciidoc3

Dave

dkuhlman commented 2 years ago

Dmitry,

Thanks for these suggestions. I've pushed my changes to my fork.

What I have done:

Changed unit test code to use `self.assertGreater(...) -- 2 instances.
Removed constructor method (__init__) from class AsciidocMarkup.
Removed class ConvertedAsciidoc. Returned instance of ConvertedMarkup instead.
removed import re in test_asciidoc.py -- Not used.

About the copyright notice -- It's fine with me to leave it as is. You have done so much more work on this project than I have and they are really you ideas. I suggest that you leave it as is. But, I'll leave it up to you. If you do add me, I'd be pleased. My name would be "Dave Kuhlman".

Dave

codecov[bot] commented 2 years ago

Codecov Report

Merging #17 (1333bee) into master (c13ae66) will increase coverage by 0.29%. The diff coverage is 97.59%.

@@            Coverage Diff             @@
##           master      #17      +/-   ##
==========================================
+ Coverage   95.06%   95.36%   +0.29%     
==========================================
  Files          10       12       +2     
  Lines         608      690      +82     
==========================================
+ Hits          578      658      +80     
- Misses         30       32       +2

Impacted Files	Coverage Δ
markups/asciidoc.py	`96.15% <96.15%> (ø)`
markups/__init__.py	`87.50% <100.00%> (+0.32%)`	:arrow_up:
tests/test_asciidoc.py	`100.00% <100.00%> (ø)`
tests/test_public_api.py	`94.11% <100.00%> (+0.56%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c13ae66...1333bee. Read the comment docs.

mitya57 commented 2 years ago

I pushed some commits, hope you don't mind.

Now the code looks good, but I have some remaining questions:

Why did you choose asciidoc3 and not https://asciidoc-py.github.io/ aka https://asciidoc.org/? It is the first Google result for “AsciiDoc”, and its website suggests that AsciiDoc3 is unofficial implementation:

The AsciiDoc3 implementation you can find on this website asciidoc3.org is a personal project.

And what I like about that implementation is:
- Changelog for version 10.0.0 says:
  
  Importing asciidoc should no longer require the asciidocapi.py script, and can be done through regular python import, e.g. import asciidoc; asciidoc.execute(...).
  
  This means we no longer need to import two different modules, add __path__ hack, etc.
- It is possible to disable the "Last modified" footer with footer-style=none attribute. Asciidoc3 does not have such an attribute.
Will you mind if I rename AsciidocMarkup to AsciiDocMarkup? That capitalization looks more correct.
What do you think about using Safe Mode? Having sys: syntax which allows executing arbitrary commands seems like a security hole (imagine if someone sends a malicious file to a ReText user).

I will be happy to push the remaining changes myself once you answer these questions.

dkuhlman commented 2 years ago

Dimtry,

Renaming AsciidocMarkup sounds fine to me.

I don't understand the save-mode support, but it sounds like a good idea.

asciidoc-py also sounds like a reasonable option. I've made changes in my local repository so that we use it. I was also able to suppress the footer as you suggested. We can switch to asciidoc-py later if you like.

I used asciidoc3 originally because:

It was what I found.
It worked for me.
There was a Python API. (asciidoc-py has one too, of course)

Both asciidoc-py and asciidoc3 seem to be actively supported.

If you feel better with asciidoc-py and since I can't give any reasons for using one over the other, we can switch to it. Tell me if/when you want my patches for asciidoc-py.

Dave

dkuhlman commented 2 years ago

Here is a diff against my fork containing my changes in support of asciidoc-py.

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 5190acc..cf1427c 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -18,7 +18,7 @@ jobs:
         run: |
           python -m pip install Markdown python-markdown-math pymdown-extensions
           python -m pip install docutils textile pygments codecov PyYAML
-          python -m pip install lxml asciidoc3
+          python -m pip install lxml asciidoc
       - name: Install the project
         run: python setup.py install
       - name: Run tests
diff --git a/markups/asciidoc.py b/markups/asciidoc.py
index bfd5539..eeaefc1 100644
--- a/markups/asciidoc.py
+++ b/markups/asciidoc.py
@@ -27,27 +27,26 @@ class AsciidocMarkup(AbstractMarkup):
     @staticmethod
     def available():
         try:
-            importlib.import_module('asciidoc3')
+            importlib.import_module('asciidoc')
             importlib.import_module('lxml')
         except ImportError:
             return False
         return True

     def convert(self, text):
-        import asciidoc3
-        from asciidoc3 import asciidoc3api
+        import asciidoc
         from lxml import etree
         outfile = StringIO()
         infile = StringIO(text)
-        converter = asciidoc3api.AsciiDoc3API(
-            list(asciidoc3.__path__)[0] + '/asciidoc3.py')
-        converter.attributes['newline'] = r'\n'
-        try:
-            converter.execute(infile, outfile)
-        except asciidoc3api.AsciiDoc3Error:
-            # Note: asciidoc3 reports/prints these errors for us.
-            # We catch the exception to avoid multiple reports.
-            pass
+        opts = [
+            ('--out-file', outfile),
+            ('--backend', 'xhtml11'),
+            # uncomment next line to suppress footer.
+            # ('-a', 'footer-style=none'),
+            ('--safe', None),
+        ]
+        cmd = 'asciidoc'
+        asciidoc.execute(cmd, opts, [infile])
         result = outfile.getvalue()
         result = result.encode()
         parser = etree.HTMLParser()
diff --git a/setup.cfg b/setup.cfg
index 0229089..c855252 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -37,7 +37,7 @@ markdown = Markdown>=2.6; PyYAML
 restructuredtext = docutils
 textile = textile
 highlighting = Pygments
-asciidoc = asciidoc3; lxml
+asciidoc = asciidoc; lxml

 [options.entry_points]
 pymarkups =

mitya57 commented 2 years ago

Sorry, I accidentally pushed tip of this repo's master to your branch and GitHub auto-closed this PR.

Anyway, these changes are merged as 363181ba2f81127b37d5582e2e33b44cc73d9f4e. Thank you for your patience!