Closed zakjan closed 3 years ago
@zakjan are we ready to merge this? Thanks.
May I ask how is the cache going to be used? in which environments, how long can it live, how do you expect to refresh it etc. I'd prefer to remove Expressir version from it, keep the functionality that exports/imports gzipped yaml only. is it ok?
Right now the cache is only used in compiling metanorma documents.
It can live for a very long time as long as the referenced schemas are not changed.
It would probably be useful to keep a "cache schema version" in it to ensure that the writer and reader of the cache is compatible.
for a very long time as long as the referenced schemas are not changed
Does it mean it would be stored together in the repo with source files? I expected it to be used locally only.
to ensure that the writer and reader of the cache is compatible
Yes, but it's going to fail only in the client. Or should it fallback to the parser? Every Expressir release would need to trigger regenerating the cache, so that it's always valid with the most recent version.
In my opinion, such cache persisted to the repo brings unnecessary complication to the architecture. Can I spend some time investigating parser performance instead? It seems that recently introduced post-parsing steps (attaching remarks, hyperlink formatting) have negative impact on the final performance, because they search for objects in the parse tree, and can be improved.
In my opinion, such cache persisted to the repo brings unnecessary complication to the architecture.
Performance improvements can never replace the function of the cache -- we're spending at least 10 minutes in the parser (on my computer) for 24 schemas. There are over 1,000 schemas. We must support cache saving/loading.
Therefore work on the cache must come before the parsing performance.
Can I spend some time investigating parser performance instead?
Sure, let's do that after the cache works.
See this automated build. On GHA, the document finishes compilation after 3 hours. On my computer, 20 minutes. From the percentage of use, I surmise that GHA spends at least 1 hour in the parser for 24 schemas.
https://github.com/metanorma/annotated-express/runs/2055518291?check_suite_focus=true
There is a slight improvement in parse & format performance thanks to small changes implemented in #70.
Total processing time for 24 documents in annotated-express is ~10.7s in sc4-formatting branch, opposed to ~13.6s in master on my Mac. That's 21% improvement.
What duration would you consider as an acceptable baseline in order to not need the persisted cache?
Benchmark code:
exp_files = [
Expressir.root_path.join('../iso-10303-stepmod/data/resources/action_schema/action_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/application_context_schema/application_context_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/approval_schema/approval_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/basic_attribute_schema/basic_attribute_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/certification_schema/certification_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/contract_schema/contract_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/date_time_schema/date_time_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/document_schema/document_schema.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/effectivity_schema/effectivity_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/experience_schema/experience_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/external_reference_schema/external_reference_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/group_schema/group_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/language_schema/language_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/location_schema/location_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/management_resources_schema/management_resources_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/measure_schema/measure_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/person_organization_schema/person_organization_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/process_property_schema/process_property_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/product_definition_schema/product_definition_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/product_property_definition_schema/product_property_definition_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/product_property_representation_schema/product_property_representation_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/qualifications_schema/qualifications_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/security_classification_schema/security_classification_schema_annotated.exp'),
Expressir.root_path.join('../iso-10303-stepmod/data/resources/support_resource_schema/support_resource_schema_annotated.exp'),
]
start = Time.now
repo = Expressir::ExpressExp::Parser.from_files(exp_files)
repo.to_hash(formatter: Class.new(Expressir::ExpressExp::Formatter) do
include Expressir::ExpressExp::SchemaHeadFormatter
include Expressir::ExpressExp::HyperlinkFormatter
end)
puts Time.now - start
@zakjan we’re traversing the tree — can you help profile the time LutaML is spending in Expressir?
Where does Lutaml traverse the tree? I think it only calls repository.to_hash
with the custom formatter as in the benchmark code above, returning a hash (example), which is passed to the template.
I was able to track down only these three relevant usages of Expressir in Lutaml.
This is equivalent with the benchmark code.
Total processing time in Expressir of all 1246 annotated files:
master:
Parsing time: 7m53s Formatting & serializing time: unknown, didn't finished in reasonable time Total time: unknown
sc4-formatting branch:
Parsing time: 6m18s (20% improvement) Formatting & serializing time: 8s Total time: 6m26s
Thanks @zakjan for the detailed results!
Since we are using this in the CI to build the whole collection (which includes all schemas), a cache will allow us to save most of the 6 minutes when the schemas are unchanged (hence cached on GHA).
But in most cases, eg. when building a single part like part 41, 10s is definitely good enough to work without a cache.
Hmm, caching in GHA cache probably needs a different strategy than caching in the repo. The gem / script running in GHA would need to take care about refreshing the cache when it is invalid.
24 files:
Parser.from_files time: 10.9s
Cache.to_file time: 0.5s
Cache.from_file time: 0.3s
Repository.to_hash time: 0.2s
1246 files:
Parser.from_files time: 328.9s
Cache.to_file time: 20.4s
Cache.from_file time: 14s
Repository.to_hash time: 7.4s
I have finished a simple cache implementation in this PR. It throws if Expressir version doesn't match. Current usage is:
begin
repository = Expressir::ExpressExp::Cache.from_file(cache_file, root_path: root_path)
rescue Expressir::ExpressExp::CacheLoadError
repository = Expressir::ExpressExp::Parser.from_files(files)
Expressir::ExpressExp::Cache.to_file(cache_file, repository, root_path: root_path)
end
The usage can be improved further if needed.
@zakjan looks good! @w00lf can you help incorporate this in LutaML? Thanks.
@zakjan looks good! @w00lf can you help incorporate this in LutaML? Thanks.
Ok, will do
Is it good to merge and release, or do we need further improvements?
@zakjan let's release since you've confirmed it works. @w00lf can have a separate ticket to use it in lutaml.
Release 0.2.20 running
@zakjan let's release since you've confirmed it works. @w00lf can have a separate ticket to use it in lutaml.
Will be implemented in https://github.com/lutaml/lutaml/issues/13
Fixes #49