metanorma / coradoc

Coradoc is the Core AsciiDoc Parser used by Metanorma

Calculate a "source hash" by expanding includes #137

Open ronaldtse opened 1 month ago

ronaldtse commented 1 month ago

An AsciiDoc document can include multiple source files, as well as files for document elements such as images.

A content hash can be used to determine whether a document needs to be rebuilt, depending on whether its dependencies have changed.

This task is to calculate such a content hash (a per-node hash) for an AsciiDoc document, covering its attributes, content, and included content.

This is needed for Metanorma collections to decide whether to re-compile a document.

ronaldtse commented 1 month ago

@opoudjis please feel free to provide comments.

opoudjis commented 1 month ago

This extends into a general "don't bother recompiling this document" if the source hasn't changed...

... and the version of Metanorma hasn't changed, and it will need to be able to be overridden anyway: we'll actually need the hashes of all commits in the stack if we are not to end up injecting "recompile this" directives all over Metanorma documents whenever we work on them internally.

Which means the question of when to ignore the hash, when deciding whether to recompile a document or not, is FAR more involved than the simplistic question of generating a hash to begin with. This is a proposal to save me 10 minutes at the same time I'm being asked to do a day of work.

In current Metanorma processing, I do an initial resolution of all includes (and then embeds) (and then Lutaml and other preprocessing of fetched blocks), before parsing the document. I have no idea what if anything Coradoc is doing with preprocessing and things like Lutaml, but this hashing cannot be invoked until the source to be compiled is fully resolved and populated.

There is also the issue of when the outside world finds out what the hash is. Presumably, there is a sequence of:

  1. Resolve the source (includes, embeds, preprocessing)
  2. Generate the hash
  3. Parse the resolved document
  4. Convert the parsed document to outputs

The hash comes out of the second step. If I were ever to use Coradoc in Metanorma to compile documents, I would want to know the hash before proceeding to the third and fourth steps: the point is to save time by stopping me from doing so to begin with. (That is a problem with the existing Asciidoc information model, which does not segregate load from convert at all.)

The notion that this should also be checking the binary content of images is... well, I simply don't want to know any more. That is the point where this request for time-saving becomes more trouble than it is worth. Are you really going to be hashing every image you reference as you parse them?

ronaldtse commented 1 month ago

SHA-256 is fast. In Coradoc, step 2 occurs at step 3. The resulting hash tree is just a product of each node's own content hash and all its child hashes (i.e. child includes).
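A minimal sketch of that per-node hashing (the Node structure and source_hash method are hypothetical, not Coradoc's actual classes):

require "digest"

# A node's hash combines its own content with the hashes of every included
# child, so a change anywhere in the include tree changes the root hash.
Node = Struct.new(:content, :children) do
  def source_hash
    digest = Digest::SHA256.new
    digest << content
    children.each { |child| digest << child.source_hash }
    digest.hexdigest
  end
end

leaf = Node.new("included section text", [])
root = Node.new("= Document\ninclude::section.adoc[]", [leaf])
root.source_hash # changes when either file's content changes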

Step 4 never happens here because we don’t need any output, so there’s no wasted time.

There are ways to memoize the hashing of large binary files, such as by keying on the file's last access or modification time.
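A hypothetical sketch of that memoization (the cache and helper names are illustrative):

require "digest"

# Cache SHA-256 digests of large binary files, keyed on path, mtime and size,
# so that unchanged images are not re-read and re-hashed on every build.
FILE_HASH_CACHE = {}

def cached_file_hash(path)
  stat = File.stat(path)
  key = [path, stat.mtime.to_i, stat.size]
  FILE_HASH_CACHE[key] ||= Digest::SHA256.file(path).hexdigest
end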

opoudjis commented 1 month ago

If step 3 in Coradoc were instantaneous, I could just run step 3 in Coradoc to get the hash, inject it, deal with all the problems of determining whether to take it seriously -- which you are still not paying attention to -- and then go back to parsing it from step 1, internally.

Step 3 in Coradoc will not be instantaneous: it is in fact parsing the document. Which, as you already know, takes time, and that's not out of my incompetence or Dan Allen's: document parsing does, in fact, take time, especially when it's got the kitchen sink of preprocessing thrown in, as you have made Metanorma do.

I remind you that in my profiling from last night of a single collection, 60% of time is spent on preprocessing Express in Lutaml, 20% on the rest of Asciidoc parsing to Semantic XML, and 20% on rendering the Semantic XML to outputs. The 60% is what I want to economise on, not the 20%. And I'm hard put to find a scenario when I can get a hash before all the preprocessing is done, including Lutaml.

... So I'm parsing the document, steps 1-3, to work out whether I need to parse + convert the document, step 4. And that looks like spending 60% of the time, in that case, in order to work out whether I need to spend 80% of the time. And if I do, I will be restarting the 60% preprocessing from scratch -- unless I have output from Coradoc that is already parsed and that I can use with minimal disruption to Metanorma.

That, on top of all the conditions that I would need to invoke on whether to ignore the hash, such as software library updates.

As a friend of mine once said, "people spend hours trying to save minutes."

ronaldtse commented 1 month ago

My friend also commented that the time spent finding ways to discount a task is often more than the time spent actually doing it.

opoudjis commented 1 month ago

You asked for feedback. That is my feedback.

opoudjis commented 1 month ago

This is an architectural problem, and any hash needs to be run as fast as possible, if it is to prevent further compilation, and save time. That may well mean that it is NOT aligned to the existing Coradoc processing model. It needs to resolve all includes and imports and file references IMMEDIATELY, before anything else, and generate a hash. No parsing, no processing, no anything. That's the stuff we're trying to preempt: that's the entire point of this.

It needs to happen before your step 3. Step 3 is simply too late, and you will find that once Coradoc parses documents that take 5 minutes to parse, and not 5 seconds.

If you want to parse everything and generate me a hash that will end up doubling my processing time, because it is generated too late into processing, then sure, you can dismiss my objections as taking more time than actually doing the task. And the task that ends up done will be not fit for purpose, because its intended aim is to save time by preempting compilation. I fail to see why you find that objection trivial.

hmdne commented 1 month ago

I have managed to do caching in Opal, and it's working very well. You can take a look at our code, because I think it's self-explanatory and, for the most part, can be reused:

A couple of assumptions though:

Those assumptions are valid per Opal project.

But, while that sped up the process significantly, it was not the patch that gained us the most time. Sometimes the cache needs to be invalidated, and then we had to wait a minute or two. So, I came up with a "novel" idea: multiprocessing.

Most of the Ruby code runs single-threaded, but we can speed that up by splitting the work to utilize all CPU cores. In the case of Opal, since the assumption is that a Compiler class instance is independent of any state, we can compile each file in parallel.

Threads? No, no threads, Ruby has a GIL. Ractors? They were unsuitable: they would require uglifying the code a lot, and some of the code we can't reliably modify (in particular, the Ragel-produced Ruby lexer and parser).

There are two options remaining: fork(2) and system(3). The first doesn't work on Windows and in JavaScript runtimes. The second would kill the state of the code (think: external plugins, monkey-patches). Since most Opal users, from what I know, run modern Unix operating systems, we chose the first.

The code is quite complex, as it involves communication between processes (the main process has to communicate new files to compile to the child processes, and the child processes need to send data back to the main process, along with a list of those files' dependencies, so that they can be compiled in the next step).
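A simplified sketch of that fork-based fan-out (illustrative only, not Opal's actual scheduler; the filenames and the "compilation" step are placeholders):

# Fork one worker per file; each worker sends its result back to the parent
# over a pipe as Marshal data, and the parent collects everything at the end.
files = %w[a.rb b.rb c.rb]

workers = files.map do |file|
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    compiled = File.read(file).upcase # stand-in for the real compilation work
    deps = []                         # dependencies discovered while compiling
    Marshal.dump([file, compiled, deps], writer)
    writer.close
  end
  writer.close
  [pid, reader]
end

results = workers.map do |pid, reader|
  data = Marshal.load(reader.read)
  reader.close
  Process.wait(pid)
  data
end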

Prefork has an issue if C libraries that do weird stuff are loaded, or if custom Fiber schedulers are used. That causes issues for some of our users, but we can easily help them by telling them to fall back to Sequential, or to run Opal out of process (e.g. by deferring to the opal CLI utility).

I am now in the process of reviewing a patch that adds a third scheduler, Threaded (for JRuby and TruffleRuby, as they have real threads). It's also simpler, but it requires the code to be thread-safe (and in Ruby libraries, that's not always a given, especially if they use global state such as class/module instance variables or singleton instances as constants):

https://github.com/janbiedermann/opal/blob/thread_rush/lib/opal/builder/scheduler/threaded.rb
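The same fan-out shape with threads instead of forks might look roughly like this (a sketch, not the code linked above; the work inside each thread must be thread-safe):

# One thread per file; useful on runtimes with real parallel threads.
files = %w[a.rb b.rb c.rb]

threads = files.map do |file|
  Thread.new { [file, File.read(file).upcase] } # stand-in for real work
end
results = threads.map(&:value)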

Those are, in general, practical options that have been in deployment for about two years, I think, and can for the most part be copied and pasted (or abstracted away). But the real work is in verifying the assumptions.

For both points, the main assumption is stateless compilation. Is it really possible, with all the needs of AsciiDoc parsing, for that compilation to be stateless as defined above (the only state being the output, with some metadata)? I have seen issue #125, for which one solution would definitely break this assumption.

I don't know a lot about Metanorma, but if AsciiDoc parsing amounts to just 20% of the time, then perhaps a better idea would be to pull those deeper into the stack, all while ensuring the main assumption stands (and it most likely does not always hold). In the case of Opal, the speedup was significant: we can even run the whole building process for each request (at least in developer mode) and assume the asset will be produced in a fraction of a second (everything is stateless, except for a last step that is basically just a concatenation of the resulting text files).

Perhaps, in the case of Metanorma, this can be solved with a multiprocessing queue of small stateless processes, where each process would only depend on previous processes' outputs. Where possible, there could be a cache, and Coradoc could plug into such a system (or run such a system standalone, e.g. if the CLI is used). But as I noted above, I don't know a lot about Metanorma, so I will certainly need help integrating it with such a system.

opoudjis commented 1 month ago

I don't know a lot about Metanorma, but if AsciiDoc parsing amounts to just 20% of the time, then perhaps a better idea would be to pull those deeper into the stack.

Generating the XML from Asciidoc takes 20% to 80%. The collections in ISO-10303 are a difficult subcase, in that they involve extremely heavy preprocessing of external files and templating. Those are however the key value-add of Metanorma. Files that are just normal documents with no external dependencies don't take 5 minutes (let alone 20) to compile.

And the reason rendering the HTML and PDF and DOC outputs doesn't take longer than it does is that Metanorma runs a concurrent thread for each of them.

But the time-consuming component is not the Asciidoc parsing. (This is one of my ongoing disgruntlements with the architecting of Metanorma: resources are being expended on things that are not the bottleneck.) The time-consuming component is preprocessing the Asciidoctor source, to run templates on external data, and postprocessing the XML. A hash generated while parsing Asciidoc preempts only postprocessing the XML and generating outputs. It does not preempt the preprocessing, which is the biggest timesink for big documents.

That aside, I can only repeat that this is overengineered for the problem posed: I want to know whether I should recompile the document or not, because preprocessing, specifically, is so expensive for a major subclass of documents. That means I want to know if any source files have changed. I should not have to compile the document at all to know whether I need to recompile it. All I should need to do is a single preprocessing step that fetches and resolves all includes and dependencies, and compare its hash to the output of that preprocessing the last time I did it. That's it.
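A hypothetical sketch of that flow (the include handling is greatly simplified; real AsciiDoc includes also support attributes, tags and line ranges):

require "digest"

# Expand include:: directives recursively without any other processing,
# hash the flattened source, and recompile only when the hash changes.
def resolve_includes(path)
  File.readlines(path).map do |line|
    if line =~ /^include::(.+?)\[\]/
      resolve_includes(File.expand_path($1, File.dirname(path)))
    else
      line
    end
  end.join
end

def needs_recompile?(adoc, hash_file)
  current  = Digest::SHA256.hexdigest(resolve_includes(adoc))
  previous = File.exist?(hash_file) ? File.read(hash_file).strip : nil
  File.write(hash_file, current)
  current != previous
end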

If you are going to make compilation so fast that I don't need to ask whether I need to recompile it, that's great in theory, and given how much postprocessing of the XML I still have to do, likely irrelevant in practice. The Express module in ISO-10303 is C code iterating over a massive document schema to generate Asciidoctor: there is NOTHING you can do to optimise that.

If I'm going to do source control for compilation, in fact, it makes more sense to write an alternate mode on the existing Metanorma preprocessors, to fetch the raw text of the includes and imports, with no templating and no parsing, and hash the input file generated. In the case of Express, I don't know if that's even feasible.

You can calculate whatever you calculate, but my use case is that 60% of compilation time in that document was being spent here: https://github.com/metanorma/metanorma-plugin-lutaml . And if coradoc can only run once that has been run to generate the needed Asciidoc... then any hash it generates for me is too late to be useful for me when it matters.

Generate the hash anyway, I guess it can be used for source code validation or something. But this is not solving my problem with this class of documents. And an alternative preprocessor would.

ronaldtse commented 1 month ago

I don't want to be distracted. The key issue here is simple - we need a way to determine if a deliverable's dependencies have changed, using hash calculations.

Yes, I believe dependency checking is faster than compilation. @hmdne has already listed many of the relevant techniques.

Yes, spending hours to shave off seconds is worth it, because we are talking about a one-time cost versus the hundreds of hours spent on compilation. I think of all the times I have waited for ISO 10303 to compile, and it is absolutely worthwhile.

This is exactly like Make, Rake, and the case of Opal compilation.
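For illustration, a Rakefile-style sketch of that shape (the dependency list and the command are placeholders, not Metanorma's actual build):

# Rake rebuilds the target only when one of its dependencies is newer.
file "document.xml" => ["document.adoc", "section.adoc", "image.png"] do
  sh "metanorma compile document.adoc"
end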

Taking this one step further, we can generalize this issue into dependency tracking and workflow. There are nuances, such as plugins (e.g. a plugin depends on other files that are not in the typical AsciiDoc feature set), but those can also be defined as separate dependencies (the plugin code can tell the parser its dependencies), as sketched below.
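A hypothetical sketch of a plugin reporting its own file dependencies (the class and hook names are illustrative, not an existing Metanorma or Coradoc API):

# While preprocessing, the plugin records every external file it reads so the
# parser can add those files to the document's dependency set.
class ExternalDataPlugin
  attr_reader :extra_dependencies

  def initialize
    @extra_dependencies = []
  end

  def preprocess(data_path)
    @extra_dependencies << data_path
    File.read(data_path) # stand-in for templating over the external data
  end
end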

GitHub Actions provides a workflow language, and so does the Common Workflow Language; both are in YAML.

Right now, a Metanorma site builds all documents regardless, but there is actually a hierarchy:

In order to create a dependency tree we need to work out the dependencies of all these things. Coradoc is handling the 3rd step.

So I hope this will be the first step towards dependency tracking for faster compilation across the entire stack.

Let's proceed.

hmdne commented 1 month ago

So, to implement this task thoroughly, so that we would be able to:

  1. Solve the issue at hand
  2. Ensure it's ready for use in Metanorma to solve other issues
  3. Prepare for multiprocessing as well

I would propose the following early draft of an API:

class Coradoc::Task
  include UTask::Task

  # List of dependencies in a form of programs being used
  # to produce an output. $LOADED_FEATURES is an array that
  # is being tested for that. This is so that cache is
  # invalidated whenever the programs change.
  utask_feature_dependency "lib/coradoc/"
  utask_feature_dependency "lib/parslet/"

  # It may depend on some other properties.
  utask_value_dependency RUBY_VERSION
  utask_value_dependency RUBY_ENGINE

  # Value can also be deduced at runtime
  utask_value_dependency { Time.now.year }

  # Takes a number of arguments. Each argument must be
  # serializable and process must depend only on those
  # arguments to provide an output.
  #
  # A task can't produce side effects, unless those side
  # effects are wrapped by some future UTask interface.
  # This includes warnings, since they will be squashed
  # by cache.
  #
  # A Task can call other UTasks, but only using the API
  # described below. (This needs further design).
  #
  # Returns a `Result` - a class that must conform to an
  # interface described below.
  def process(*args, **kwargs)
    Result.new(Coradoc.do_something(*args, **kwargs))
  end

  class Result
    # Result must be serializable. The serialization format
    # of choice is Marshal, since it's native to Ruby and it's
    # fast. It's also insecure, but we assume that it's
    # only insecure if foreign Marshaled objects are provided.
    # This is not the case here. This mostly means that
    # the Result instance can't contain instances of the following
    # classes:
    # - Proc
    # - IO

    # Can define file dependencies. This must be an array
    # of filenames. This is defined for a single process.
    #
    # A dependency is defined as data on whose contents
    # the return value of a process depends.
    #
    # A file dependency is a filename that contains this
    # data.
    #
    # I.e. if a file mentions another file, but that file's
    # removal, modification, etc. doesn't change the return
    # value, it's not a dependency.
    #
    # If input is provided as an input variable, its
    # source shouldn't be specified here as well. This
    # is just for inferred dependencies.
    #
    # A file dependency can also be a directory, in
    # which case its entire contents are hashed.
    attr_reader :file_dependencies
  end
end

# The task must be called using the following:
Coradoc::Task.($stdin.read, {option1: true}).await

# Or with a helper function:
def coradoc_process(file)
  Coradoc::Task.(File.read(file)).await
end

# The Coradoc::Task.call function is doing all those things
# in the background (optionally) and they are guaranteed to
# work if the assumptions are valid:
# - caching
# - multiprocessing
# - time tracking
# - progress displaying
#
# For caching, we create a hash of:
# - feature dependencies
# - value dependencies
# - file dependencies
# - process arguments
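#
# (Sketch, not part of the proposed API.) A cache key could be built
# roughly like this, with hypothetical variable names:
#
#   require "digest"
#   cache_key = Digest::SHA256.hexdigest(Marshal.dump([
#     feature_dependency_digests, # digests of lib/coradoc/, lib/parslet/ sources
#     value_dependencies,         # RUBY_VERSION, RUBY_ENGINE, runtime values
#     file_dependency_digests,    # digests of files in Result#file_dependencies
#     args, kwargs,               # the serialized process arguments
#   ]))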
#
# For multiprocessing to work (perhaps in the future) we must
# assume that Coradoc::Task.call returns some kind of a Promise
# that first needs to be awaited to extract the return value.

# Creation of a pipeline is outside of the scope of UTask. It
# may be done this way, for instance:

files = %w[a b c d e]
data = files.map { |i| File.read(i) }

coradoc_data_promises = files.map { |i| Coradoc::Input::Task.(i) }
coradoc_data = coradoc_data_promises.map(&:await).map(&:data)

metanorma_preprocessed_data_promises = coradoc_data_promises.map { |i| Metanorma::Preprocessor::Task.(i) }
metanorma_preprocessed_data = metanorma_preprocessed_data_promises.map(&:await).map(&:data)

metanorma_concatenated_data = Metanorma::Concatenator.(metanorma_preprocessed_data).await.data

# Likely though, we would be working on some nicer APIs that would wrap the task calling.

# This is designed to be possible to wrap existing code as UTasks
# and also for partial replacement.