Add cache support - Githubissues

zakjan commented 3 years ago

Fixes #49

ronaldtse commented 3 years ago

@zakjan are we ready to merge this? Thanks.

zakjan commented 3 years ago

May I ask how is the cache going to be used? in which environments, how long can it live, how do you expect to refresh it etc. I'd prefer to remove Expressir version from it, keep the functionality that exports/imports gzipped yaml only. is it ok?

ronaldtse commented 3 years ago

Right now the cache is only used in compiling metanorma documents.

It can live for a very long time as long as the referenced schemas are not changed.

It would probably be useful to keep a "cache schema version" in it to ensure that the writer and reader of the cache is compatible.

zakjan commented 3 years ago

for a very long time as long as the referenced schemas are not changed

Does it mean it would be stored together in the repo with source files? I expected it to be used locally only.

to ensure that the writer and reader of the cache is compatible

Yes, but it's going to fail only in the client. Or should it fallback to the parser? Every Expressir release would need to trigger regenerating the cache, so that it's always valid with the most recent version.

zakjan commented 3 years ago

In my opinion, such cache persisted to the repo brings unnecessary complication to the architecture. Can I spend some time investigating parser performance instead? It seems that recently introduced post-parsing steps (attaching remarks, hyperlink formatting) have negative impact on the final performance, because they search for objects in the parse tree, and can be improved.

ronaldtse commented 3 years ago

In my opinion, such cache persisted to the repo brings unnecessary complication to the architecture.

Performance improvements can never replace the function of the cache -- we're spending at least 10 minutes in the parser (on my computer) for 24 schemas. There are over 1,000 schemas. We must support cache saving/loading.

Therefore work on the cache must come before the parsing performance.

Can I spend some time investigating parser performance instead?

Sure, let's do that after the cache works.

ronaldtse commented 3 years ago

See this automated build. On GHA, the document finishes compilation after 3 hours. On my computer, 20 minutes. From the percentage of use, I surmise that GHA spends at least 1 hour in the parser for 24 schemas.

https://github.com/metanorma/annotated-express/runs/2055518291?check_suite_focus=true

zakjan commented 3 years ago

There is a slight improvement in parse & format performance thanks to small changes implemented in #70.

Total processing time for 24 documents in annotated-express is ~10.7s in sc4-formatting branch, opposed to ~13.6s in master on my Mac. That's 21% improvement.

What duration would you consider as an acceptable baseline in order to not need the persisted cache?

Benchmark code:

exp_files = [
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/action_schema/action_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/application_context_schema/application_context_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/approval_schema/approval_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/basic_attribute_schema/basic_attribute_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/certification_schema/certification_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/contract_schema/contract_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/date_time_schema/date_time_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/document_schema/document_schema.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/effectivity_schema/effectivity_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/experience_schema/experience_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/external_reference_schema/external_reference_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/group_schema/group_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/language_schema/language_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/location_schema/location_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/management_resources_schema/management_resources_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/measure_schema/measure_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/person_organization_schema/person_organization_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/process_property_schema/process_property_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/product_definition_schema/product_definition_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/product_property_definition_schema/product_property_definition_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/product_property_representation_schema/product_property_representation_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/qualifications_schema/qualifications_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/security_classification_schema/security_classification_schema_annotated.exp'),
  Expressir.root_path.join('../iso-10303-stepmod/data/resources/support_resource_schema/support_resource_schema_annotated.exp'),
]

start = Time.now
repo = Expressir::ExpressExp::Parser.from_files(exp_files)
repo.to_hash(formatter: Class.new(Expressir::ExpressExp::Formatter) do
  include Expressir::ExpressExp::SchemaHeadFormatter
  include Expressir::ExpressExp::HyperlinkFormatter
end)
puts Time.now - start

ronaldtse commented 3 years ago

@zakjan we’re traversing the tree — can you help profile the time LutaML is spending in Expressir?

zakjan commented 3 years ago

Where does Lutaml traverse the tree? I think it only calls repository.to_hash with the custom formatter as in the benchmark code above, returning a hash (example), which is passed to the template.

zakjan commented 3 years ago

I was able to track down only these three relevant usages of Expressir in Lutaml.

Parser.from_files - https://github.com/lutaml/lutaml/blob/master/lib/lutaml/parser.rb#L37
repository.to_hash - https://github.com/lutaml/lutaml/blob/master/lib/lutaml/express/lutaml_path/document_wrapper.rb#L11-L12
Formatter - https://github.com/lutaml/lutaml/blob/master/lib/lutaml/express/lutaml_path/formatter.rb

This is equivalent with the benchmark code.

zakjan commented 3 years ago

Total processing time in Expressir of all 1246 annotated files:

master:

Parsing time: 7m53s Formatting & serializing time: unknown, didn't finished in reasonable time Total time: unknown

sc4-formatting branch:

Parsing time: 6m18s (20% improvement) Formatting & serializing time: 8s Total time: 6m26s

ronaldtse commented 3 years ago

Thanks @zakjan for the detailed results!

Since we are using this in the CI to build the whole collection (which includes all schemas), a cache will allow us to save most of the 6 minutes when the schemas are unchanged (hence cached on GHA).

But in most cases, eg. when building a single part like part 41, 10s is definitely good enough to work without a cache.

zakjan commented 3 years ago

Hmm, caching in GHA cache probably needs a different strategy than caching in the repo. The gem / script running in GHA would need to take care about refreshing the cache when it is invalid.

zakjan commented 3 years ago

24 files:

Parser.from_files time:  10.9s
Cache.to_file time:      0.5s
Cache.from_file time:    0.3s
Repository.to_hash time: 0.2s

1246 files:

Parser.from_files time:  328.9s
Cache.to_file time:      20.4s
Cache.from_file time:    14s
Repository.to_hash time: 7.4s

zakjan commented 3 years ago

I have finished a simple cache implementation in this PR. It throws if Expressir version doesn't match. Current usage is:

begin
  repository = Expressir::ExpressExp::Cache.from_file(cache_file, root_path: root_path)
rescue Expressir::ExpressExp::CacheLoadError
  repository = Expressir::ExpressExp::Parser.from_files(files)
  Expressir::ExpressExp::Cache.to_file(cache_file, repository, root_path: root_path)
end

The usage can be improved further if needed.

ronaldtse commented 3 years ago

@zakjan looks good! @w00lf can you help incorporate this in LutaML? Thanks.

w00lf commented 3 years ago

@zakjan looks good! @w00lf can you help incorporate this in LutaML? Thanks.

Ok, will do

zakjan commented 3 years ago

Is it good to merge and release, or do we need further improvements?

ronaldtse commented 3 years ago

@zakjan let's release since you've confirmed it works. @w00lf can have a separate ticket to use it in lutaml.

zakjan commented 3 years ago

Release 0.2.20 running

w00lf commented 3 years ago

@zakjan let's release since you've confirmed it works. @w00lf can have a separate ticket to use it in lutaml.

Will be implemented in https://github.com/lutaml/lutaml/issues/13

lutaml / expressir

Add cache support #60