open-contracting-archive / extensions-data-collector

Superseded by open-contracting/extension_registry.py
https://github.com/open-contracting/extension_registry.py

Where does HTML conversion and i18n live? #11

Closed by kindly 5 years ago

kindly commented 6 years ago

This issue is a discussion about where (initially only for core extensions) the following code lives:

The three locations that this could happen are:

Other requirements

There needs to be some in-lining of HTML tables into some of the longer docs/readmes. Also, we would like to be able to keep our current Sphinx directives that do the table in-lining and potentially other things.

So Where?

Preferably everything would be in one location; maybe the HTML rendering could be separate from the i18n and live in a different place. However, currently the Sphinx i18n ties these together, i.e. the only way to internationalize is whilst converting to HTML. Also, internationalizing outside Sphinx, i.e. translating the Markdown directly, would require a new approach to i18n and would take a lot of work.

Suggestion

Let all conversion to HTML and i18n live in the data collector. This would involve some hacking of Sphinx to make it convert/translate individual pages without any theme/border/toc (or just extract the main section). It would also mean having HTML in the data output (which to some people's tastes is not great practice). However, this seems better to me than having the i18n live in different places.

jpmckinney commented 6 years ago

First, adding some notes from our call:

I'm open to replacing Sphinx directives with Jinja2 filters, Markdown plugins, or something else; on our call, I was probing to check why we would replace them, and what the implications would be for repositories not directly implicated in the extension explorer.

I'm also open to not using Sphinx to build any part of the extension explorer. The reasons expressed by others that I noted from the call:


I think it's useful to split up 'internationalize' into different tasks, each of which may not need to be performed in the same place. I'm restating a lot of shared knowledge, but please bear with me :)

Extract messages

One task is extracting messages from source files. I expect this can be done in the data collector, which will make copies of source files from extension repositories, and then do something very similar or identical to what is described here.

For CSV and JSON files, the dependencies here would just be pybabel and documentation-support, though documentation-support has a long list of requirements, because it's presently only used in the context of Sphinx builds. I can extract the Babel extractors to a new repository, so that the dependencies are only pybabel and that new repo.
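
For illustration, a custom Babel extraction method for codelist CSVs could look roughly like the sketch below (the function name, the column choices and the comment format are assumptions, not the actual documentation-support code); Babel calls such a method with a binary file object and expects (lineno, funcname, message, comments) tuples back:

    # A rough sketch of a Babel extraction method for codelist CSVs; the column
    # names are assumptions based on OCDS codelist conventions.
    import csv
    import io


    def extract_codelist(fileobj, keywords, comment_tags, options):
        """Yield the Title and Description cells of a codelist CSV as messages."""
        reader = csv.DictReader(io.TextIOWrapper(fileobj, encoding='utf-8'))
        for lineno, row in enumerate(reader, 2):  # row 1 is the header
            for column in ('Title', 'Description'):
                value = (row.get(column) or '').strip()
                if value:
                    # No funcname for non-code extraction; the comment gives
                    # translators some context about the source column.
                    yield lineno, None, value, ['Column: {}'.format(column)]

A mapping file entry along the lines of [extract_codelist: **/codelists/*.csv] (with the method exposed via an entry point or a module:function reference) would then let pybabel extract pick these files up.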

For Markdown files, is there any issue around using the gettext Sphinx builder to extract messages?

Translate messages

Once messages are extracted to POT files, I think we can continue to map POT files to Transifex resources using sphinx-intl (commands here), in the collector.

Is there any issue with continuing to use Sphinx here?

We can then use Transifex as usual, which results in pulling PO files to a local directory.

Translate source files

I documented how we presently translate source CSV, JSON and Markdown files. Although translation of CSV and JSON files occurs during a Sphinx build, it'd be straightforward to call the translate_schema and translate_codelists methods outside such a build. (We can also make those methods more flexible as part of this work, if desired.) We can move these into the same repo as the Babel extractors, if we were to move those.
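
To make the "outside a Sphinx build" point concrete, the core of such a call is just walking the schema and looking titles/descriptions up in a compiled catalog. The sketch below is not documentation-support's actual translate_schema signature; the gettext domain and locale layout are made up:

    # A minimal sketch, assuming MO files compiled by sphinx-intl build and a
    # made-up gettext domain; not documentation-support's actual API.
    import gettext
    import json


    def translate_schema_sketch(path, localedir, language, domain='schema'):
        """Return the schema at `path` with titles and descriptions translated."""
        translator = gettext.translation(domain, localedir, languages=[language], fallback=True)

        def walk(value):
            if isinstance(value, dict):
                return {
                    key: translator.gettext(item)
                    if key in ('title', 'description') and isinstance(item, str)
                    else walk(item)
                    for key, item in value.items()
                }
            if isinstance(value, list):
                return [walk(item) for item in value]
            return value

        with open(path, encoding='utf-8') as f:
            return walk(json.load(f))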

What remains, then, is the translation of Markdown files, after messages have been extracted and translated. (Basically, all the above was to clearly narrow the scope of the problem.)


All uses of csv-table-no-translate, jsonschema and extensiontable (for rendering JSON Schema and codelist CSV files as tables) refer to translated files (at paths containing current_lang or similar). In other words, the messages have already been extracted and the source files already translated – it's just a question of the Sphinx directive, Markdown plugin, etc. building a table based on the input (eval_rst block or otherwise). This rendering can be done in a "build" step separate from the "i18n" step we're discussing – i.e. it can be done in the extension explorer.

So, I think all we're left with is the translation of the messages in Markdown files. I haven't explored the following option at all, but shouldn't it be possible, without a lot of work, to use the CommonMarkParser to translate the Markdown file from English into another language, and avoid rendering HTML?
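
A rough sketch of that idea (not the eventual script): parse with CommonMarkParser, look each block-level text up in a gettext catalog, and keep the result as a doctree instead of rendering HTML. A real version also needs a small Markdown writer to re-emit inline markup; the catalog domain and locale directory here are assumptions:

    import gettext

    from docutils import nodes
    from docutils.frontend import OptionParser
    from docutils.utils import new_document
    from recommonmark.parser import CommonMarkParser


    def translate_doctree(markdown_text, localedir, language):
        # Parse the Markdown into a docutils doctree without going through Sphinx.
        settings = OptionParser(components=(CommonMarkParser,)).get_default_values()
        document = new_document('<string>', settings)
        CommonMarkParser().parse(markdown_text, document)

        catalog = gettext.translation('docs', localedir, languages=[language], fallback=True)

        # Translate whole paragraphs/titles, mirroring what the gettext builder
        # extracts; this naive version drops inline markup within a block.
        for node in document.traverse(lambda n: isinstance(n, (nodes.paragraph, nodes.title))):
            translated = catalog.gettext(node.astext())
            node.children = []
            node.append(nodes.Text(translated))

        return document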

Anyway, this is a long way of getting to a couple questions.

jpmckinney commented 6 years ago

I tested my proposed approach, and I can successfully translate all Markdown files in the standard repo with this script, by running:

    sphinx-intl build -d standard/docs/locale
    find standard/docs/en -name '*.md' -exec python translate_markdown.py \{\} \;

The Markdown in the translated files is also a little cleaner. Try running the script with language set to 'en' to see that the script preserves the Markdown markup. During development, I had a few bits of code to print the nodes being traversed. They might be useful if extensions use some Markdown features not used in the standard (unlikely, but possible – I think only horizontal rules aren't yet handled) and therefore not anticipated by the script.

kindly commented 6 years ago

@jpmckinney that is really good news, that it is feasible to translate Markdown to Markdown using docutils and recommonmark's version of CommonMark. I did a very basic proof of concept using https://github.com/PavloKapyshin/paka.cmark (another CommonMark parser, based on the C library) and found it possible (to extract and translate) too, but it would have required writing some new Babel extractors to do the work. So your way seems simpler.

By the way, the ExtensionTable directive currently does not refer to translated files. The translation is done from the tables generated after the directive is run. We currently do not translate the schemas of the extensions (which, for the extension explorer, we should do).

My feeling is that all the translation and directives/Jinja/Markdown extensions should be done in the data collector. Also, if possible, a Markdown -> Markdown conversion would be better than a Markdown -> HTML one, but the latter would mean that we would get to keep any Sphinx directive regardless of what node tree it built.

As the requirement is to in-line some of the directives/Jinja/Markdown extensions, then if we generate Markdown we would probably want to just use in-lined HTML as the output of these extra functions, mainly because, as far as I am aware, there is no consensus on how to do tables within Markdown extensions.

As a separate point, if we were to use Jinja, I would imagine the templating to look something like:

   {{ extension_table(extension_data, definitions="Lot LotDetails LotGroup") }}

which is a jinja macro or just a python function. This can have the same arguments as the current directive. We would explicitly need to pass in the extension_data for that extension. This would just generate an HTML table and would be very easy to write.
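
As a hedged sketch of that shape (the extension_data layout and the columns shown are assumptions; only the wiring of a plain Python function into Jinja is the point here):

    # Sketch: expose a plain Python function to Jinja as a global and call it
    # from a template. The extension_data structure below is an assumption.
    from jinja2 import Environment
    from markupsafe import Markup, escape


    def extension_table(extension_data, definitions=''):
        """Return a small HTML table for the named definitions' fields."""
        rows = []
        for name in definitions.split():
            properties = extension_data['release_schema']['definitions'][name]['properties']
            for field, prop in properties.items():
                rows.append('<tr><td>{}/{}</td><td>{}</td></tr>'.format(
                    escape(name), escape(field), escape(prop.get('description', ''))))
        return Markup('<table><tbody>{}</tbody></table>'.format(''.join(rows)))


    env = Environment()
    env.globals['extension_table'] = extension_table
    template = env.from_string(
        '{{ extension_table(extension_data, definitions="Lot LotDetails LotGroup") }}')
    # html = template.render(extension_data=data_for_this_extension)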

I am happy for us nonetheless to continue using sphinx to render the pages as long as this rendering/translating is done inside the extensions-data-collector. My biggest concern is the difficulty with flexibility and navigation in the explorer website expressed above and here: https://github.com/open-contracting/extension-explorer/issues/2

jpmckinney commented 6 years ago

You're right about extensiontable! As discussed on the call, it's not so different from jsonschema that we couldn't replace both with a single directive/method.

I used recommonmark because it's what our Sphinx sites use to parse Markdown (and which this collector could/should also use, if we would still be running sphinx-build -q -b gettext $(DOCS_DIR) $(POT_DIR) as suggested in my earlier comment). I figure it is best to use the same versions of the same libraries across repositories, to avoid edge cases from slight differences in implementation.

I think we've answered i18n now, yes?

For HTML conversion (i.e. Markdown -> HTML), in an ideal scenario, wouldn't that be done in the explorer? That was the sense I got from the call, and it seems like it's a viable option now that we have a solution for translating Markdown files without rendering them as HTML.

As for the eval_rst blocks, the extensions presently use the directives: extensiontable, csv-table-no-translate, jsoninclude, and also list-table (because the version of recommonmark we're using doesn't have table support).

For clarity, these are my non-exhaustive user stories for such blocks (whether they're Sphinx directives, Markdown plugins, etc.). As an author of documentation, I want to:

I understand the comment on the call that such instructions mix code with content, but we already mix markup with content (I figure we'd need to use a rich text editor to be pure content), and in my experience, authors with no coding knowledge have rarely expressed issues with learning a few instructions and have in fact been very happy and eager to use such instructions (for reasons above).

If I understand your previous comment in this issue, the collector would be responsible for interpreting those instructions and substituting rendered content, and the rendered content would be HTML, because Markdown has no standard table format for later components to unambiguously interpret. That's fine with me, especially as generating HTML tables has robust solutions compared to generating Markdown tables.
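
A rough sketch of that substitution step in the collector (the marker syntax is a placeholder, since the markup is still TBD, and render_table stands in for whichever helper ends up producing the HTML):

    # Sketch: replace instruction lines in a Markdown file with in-lined HTML.
    # The {{ extension_table(...) }} marker is a placeholder; argument parsing
    # is omitted and the raw argument text is handed to the rendering helper.
    import re

    MARKER = re.compile(r'^\{\{\s*extension_table\((.*)\)\s*\}\}\s*$')


    def substitute_instructions(markdown_text, render_table):
        out = []
        for line in markdown_text.splitlines():
            match = MARKER.match(line)
            if match:
                # Markdown renderers pass raw HTML blocks through untouched.
                out.append(render_table(match.group(1)))
            else:
                out.append(line)
        return '\n'.join(out)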


In summary, it seems we're heading towards the following (roughly sketched in commands after the list):

  1. Write extensions' files to disk
  2. Extract messages from extensions' files into POT files using sphinx-build -b gettext for Markdown files, and PyBabel and Babel extractors for CSV and JSON files, as in current practice.
  3. Map POT files to Transifex resources using sphinx-intl update-txconfig-resources, as in current practice.
  4. Use Transifex, resulting in PO files, as in current practice.
  5. Compile MO files from PO files using sphinx-intl build.
  6. Translate extensions' files using translate_schema and translate_codelists for CSV and JSON files, as in current practice, and using something like my above gist for Markdown files.
  7. Render the instructions (markup TBD) in the Markdown files.
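
A rough sketch of steps 2 through 6 above, driven from the collector (the directory layout, the Babel mapping file and the Transifex project name are all assumptions, and tx is assumed to be configured already):

    # Sketch of chaining the existing tools from the collector; paths and the
    # Transifex project name are assumptions.
    import subprocess


    def run(*args):
        subprocess.run(args, check=True)


    def i18n_pipeline(language):
        # 2. Extract messages: gettext builder for Markdown, Babel for CSV/JSON.
        run('sphinx-build', '-q', '-b', 'gettext', 'build/extensions', 'build/gettext')
        run('pybabel', 'extract', '-F', 'babel.cfg', '-o', 'build/gettext/schema.pot', 'build/extensions')

        # 3. Map POT files to Transifex resources.
        run('sphinx-intl', 'update-txconfig-resources',
            '--pot-dir', 'build/gettext',
            '--locale-dir', 'locale',
            '--transifex-project-name', 'extension-explorer')

        # 4. Push sources to and pull translations from Transifex.
        run('tx', 'push', '-s')
        run('tx', 'pull', '-l', language)

        # 5. Compile MO files from the pulled PO files.
        run('sphinx-intl', 'build', '-d', 'locale')

        # 6. Translate the copied files: translate_schema/translate_codelists
        #    for CSV and JSON, and the gist-style script for Markdown.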

Does that seem right? If so, I figure we can close this issue and then author follow-up issues.

Some follow-ups already mentioned include:

kindly commented 6 years ago

The outline seems correct to me.

As said, we still have to come up with the markup that can be added to the docs.

My only remaining concern is the one you raised on the call: the re-usability of our current Sphinx directives, especially across projects. I have moved over to the other side of the fence on this one and would prefer reuse, at the expense of doing Markdown -> HTML conversion rather than Markdown -> Markdown (with embedded HTML tables) conversion. The reason is that whatever Markdown we produce will only ever be converted to HTML by any of the tools that use it. So even though it feels ugly to produce the (translated) HTML in the collector, I do not think it will affect anybody's use. Also, as Markdown parsers are notoriously incompatible (except perhaps CommonMark ones), we would not have to worry about differing output.

Nonetheless, I have been looking at whether it would be possible to use Sphinx/docutils just to have a way to render a single directive into HTML for embedding in the Markdown, and I think it would be a lot of work to do well. However, if possible, this would sort pretty much all the issues, and we would be able to keep Markdown -> Markdown and our directives.

If we do come up with a different markup for embedding, then should we backport it to the standard, for the directives outlined above?

jpmckinney commented 6 years ago

I did a brief exploration of rendering a single directive (e.g. seeing if I can add the line AutoStructify(document).apply() to translate_markdown, which would run eval_rst blocks, though it wouldn't yet know how to interpret our directives), and it does seem like a lot of machinery is needed, even after taking shortcuts inspired by Sphinx tests.

I'm tending towards thinking that it's fine for the documentation of extensions to not support eval_rst and to only support a small number of custom instructions.

Doing an audit of eval_rst blocks in extensions:

So, it seems like we can actually do away with these directives in extensions. What do you think?


Update: I can render eval_rst blocks.

jpmckinney commented 6 years ago

If we do keep the directives, the above test surfaced a few issues:

  1. csv-table-no-translate refers to a translated codelist file, whose location is build-dependent, i.e. its location may differ between the standard, profiles and explorer.
  2. An extension name is specified in extensiontable, which then uses a config value to determine the version of the extension, whose release-schema.json it will retrieve (the relevant config value changed from extension_registry_git_ref to extension_versions, but the logic is the same). This is fine for the standard and profiles, as all extensiontable occurrences are meant to use the same version of a given extension. For the explorer, some occurrences would be meant to use different versions of a given extension. Since the occurrences in the standard and profiles will be removed once the explorer is ready, we have a lot of flexibility here in terms of solutions.

If we keep directives (which may get us an explorer earlier), I'm flexible on solutions to these issues.

kindly commented 6 years ago

I am very happy to ignore the directives on our first go with the explorer. Also, as there is a way to convert the eval_rst blocks to embedded HTML in the Markdown (thanks @jpmckinney), it may be possible to add them back in, or to decide on some other option. However, I can see that the issues raised (mainly point 2 above) could take a bit of work or a change to the directive.

A jsonschema -> tabular-data library

What I would like to see is a library whose job is to convert JSON Schema (or parts of it) into tabular data, independent of any directive/markup/Jinja code. This library could then be used by any of these, leaving the directives/markup/Jinja just for the view/rendering. With this, I would not be so worried about having these different mechanisms for display, as long as they got the same results. Also, this library would be much easier to test.

There could also be a library to do this for codelists, or for extracting parts of example data. These would be fairly trivial and may not be necessary, but may still be worth it to know that the output would always be the same.

I think it could make sense to write these as part of the extension explorer work. The extension explorer will need to do these as part of displaying the extensions schema/codelists anyway and it would be a shame to have this logic wrapped up in a "view" once again.

Another i18n decision

I have a further concern regarding the i18n process outlined above:

Currently the canonical source of translations for the standard docs and core extensions is the pot/po files in the GitHub repository for that particular version of the docs. Patch releases can have different text (and therefore different po files) from each other. Transifex only cares about the latest patch release and not the historical ones.

For the extension-data-collector, we need a new place for the translations to live for core extensions. So we have the following options:

  1. Have a pot/po file for every patch release in the data collector. This would mean many more files in transifex and could be cumbersome.

  2. Combine all translations of all patch releases into a single pot/po file. This would require a more complicated i18n workflow than above, as we would have to combine po/pot files. Also, translating using Transifex would be more difficult, because text from different patch versions would be intermingled, and giving the context as to where to look for the original text would be confusing.

  3. Just forget about patch releases in the extension explorer/collector and always just show the latest, or accept that i18n for old patch releases will not work.

  4. Push the po/pot files back to the extensions' GitHub repositories themselves. This essentially means moving all i18n work back to the extensions themselves. If we do this, then the codelist/schema translations might as well be pushed back there too.

My feeling is that 4, with good libraries/tooling, is the correct solution in the long run (as it gives the community extensions a way to do it too), but it is probably the most work. However, I am really not sure which is the best option; maybe 3 would be acceptable for now?

kindly commented 6 years ago

Just to expand on point 4. We could make the extensions repositories look something like:

The idea is that the extensions repositories would contain all the files (docs/schema/codelists) fully translated. For the core extensions we would use Transifex/Sphinx/Babel to manage this and make the translated files. However, community extensions could opt to just translate these by hand, or use any translation workflow that made sense. So the translated files would be mandatory, not the po/pot files.

jpmckinney commented 6 years ago

Having a library to support the rendering of tables sounds good to me. I'm imagining one library method would accept similar inputs as the directives (for example: a JSON Schema file and JSON Pointers for what to include) and would output a table/array structure, and then another method would render it (to HTML or otherwise).
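
A hedged sketch of that two-method split (function names, the row layout and the use of the jsonpointer package are illustrative, not a spec for the eventual library):

    # Sketch: one function turns part of a JSON Schema into plain rows, another
    # renders the rows. Any other renderer (Markdown, directive, Jinja) could
    # consume the same rows.
    from jsonpointer import resolve_pointer
    from markupsafe import escape


    def schema_to_rows(schema, pointer=''):
        """Return (field, title, description) rows for the properties at `pointer`."""
        fragment = resolve_pointer(schema, pointer)
        return [
            (field, prop.get('title', ''), prop.get('description', ''))
            for field, prop in fragment.get('properties', {}).items()
        ]


    def rows_to_html(rows, headers=('Field', 'Title', 'Description')):
        head = ''.join('<th>{}</th>'.format(escape(header)) for header in headers)
        body = ''.join(
            '<tr>{}</tr>'.format(''.join('<td>{}</td>'.format(escape(cell)) for cell in row))
            for row in rows
        )
        return '<table><thead><tr>{}</tr></thead><tbody>{}</tbody></table>'.format(head, body)


    # Usage sketch, given a parsed release-schema.json:
    # rows_to_html(schema_to_rows(schema, '/definitions/Lot'))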

For translation, if we were to proceed as we do in other projects, we'd likely do (1). Right now, there are 58 entries in extension_versions.csv, and in OCDS' lifetime, we might have hundreds. That means hundreds of Transifex resources and PO files (and POT files, but those aren't version controlled in other projects). Navigating across hundreds of files is a pain, but in comparison to the pains associated with the other options, I consider it minimal.

2 isn't an option because it makes a translator's job incredibly difficult. Looking at a message in context is a very frequent task for translators.

3 isn't preferred as it means that once a version number rotates, it's impossible to translate old versions into new languages.

For 4, I described my concerns with using the system we have in the standard in all extensions. Whether it's that system or a filename-based system, we've witnessed translators wanting to translate resources authored by others (e.g. the in-progress community translations of the standard). If that translator needs to make pull requests against dozens of extensions, they are unlikely to do it, as that's a big effort. Editing JSON Schema and codelist files directly can also introduce syntax errors. If that translator can instead access a single Transifex project that has all the resources, they are more likely to do it (low effort), and they can't introduce syntax errors.

With any of the options, we can allow community translators to contribute translations to the explorer's Transifex project, just as we do with the standard.

So, in short, what's wrong with 1? Having a lot of files/resources is not a problem.

kindly commented 6 years ago

After looking at sphinxcontrib-jsonschema, to save on repeating work, the library will basically be the non-directive parts of it, i.e. https://github.com/OpenDataServices/sphinxcontrib-jsonschema/blob/master/sphinxcontrib/jsonschema.py#L265. So I think I am just going to extract those parts into a new library for others, so the extension explorer can use it. The good news is that part already has tests.

Regarding translation, the only remaining question about 1 above is where the po/pot files live, as at the moment the collector does not cache any data. The answer could be "Transifex", i.e. that is the canonical source and we just pick up/cache the po files from there when the collector is run. However, we may want a new GitHub repository for this, just to back them all up. I think each extension would probably need three resources (docs/schema/codelists) per version, but agreed, even so I think the numbers are not too high.

I agree that a single Transifex project is a good idea, and potentially also automatic pushing of requested community extensions into Transifex. I think you could still have that with option 4 (if it were up to us to do the pull requests for those extensions), but I am just as happy going with 1, especially as it will not exclude that option in the future.

jpmckinney commented 6 years ago

Sounds good re:

We don't store POT files elsewhere, and I don't know that you can pull POT files from Transifex. Is there a need to keep POT files?

rhiaro commented 6 years ago

After familiarising myself with the various parts of this issue and talking at length with @kindly, I'm going to implement the following in the data collector. Putting this here for future reference, and to give everyone a chance to shout if I've missed something or misunderstood something.

  1. (done) extensions-data-collector crawls extensions and makes the big data.json file in English
  2. Extract all the appropriately sized bits of markdown from data.json (Reusing as appropriate from @jpmckinney's md translation script)
  3. Extract strings from codelists, schema and extension.json
  4. Send it all to Transifex
  5. Fetch any new translations from Transifex
  6. Build a new data.json file which includes all the translations
  7. (done) Extension explorer website reads this and displays

And separately: another github repo backs up all of the PO files from Transifex.

This workflow assumes all translation will take place centrally via Transifex.
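
A rough sketch of what steps 2 and 3 could look like against data.json (the keys walked here and the one-message-per-file granularity are assumptions; in practice the Markdown would be split into appropriately sized bits first):

    # Sketch: pull translatable strings out of data.json and write a POT file
    # with Babel. The 'extensions'/'versions'/'docs' keys are assumptions about
    # data.json's structure.
    import json

    from babel.messages.catalog import Catalog
    from babel.messages.pofile import write_po


    def extract_to_pot(data_json_path, pot_path):
        with open(data_json_path, encoding='utf-8') as f:
            data = json.load(f)

        catalog = Catalog(project='extension-explorer')
        for name, extension in data.get('extensions', {}).items():
            for version, release in extension.get('versions', {}).items():
                location = [('{}/{}'.format(name, version), 1)]
                for key in ('name', 'description'):
                    value = release.get(key)
                    if value:
                        catalog.add(value, locations=location, auto_comments=[key])
                for filename, markdown in release.get('docs', {}).items():
                    catalog.add(markdown, locations=[(filename, 1)])

        with open(pot_path, 'wb') as f:
            write_po(f, catalog)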

jpmckinney commented 6 years ago

In (2), my script doesn't extract strings, but rather translates strings, once translated strings are available, returning translated Markdown text. To extract strings, you can just use the gettext builder in sphinx-build, as described in this earlier comment: https://github.com/open-contracting/extensions-data-collector/issues/11#issuecomment-404567786

For (3), extraction from codelists and schema can be done using existing code (referenced in above comment). extension.json is translated in-place by adding language keys to objects, so isn't typically sent to Transifex. However, if what we're translating is data.json and not extension.json itself, then it might be appropriate to send to Transifex.

For (4), in terms of sending to Transifex, there are some decisions to be made in terms of how to organize the strings into resources. There is discussion in earlier comments about that as well.

I'm realizing that the earlier comments assumed that extension files would be copied, rather than organized into a big JSON file. I don't know if that makes the earlier proposal more difficult; gettext tools work more out-of-the-box when operating on files. Was there a strong motivation to organize file contents into a big JSON file, instead of organizing files on the filesystem? I haven't followed the structure of data.json closely.

jpmckinney commented 5 years ago

Closing as mostly implemented. I created follow-up issues.