Closed vsch closed 4 years ago
Major improvements in performance and memory requirements for `SegmentedSequence`, the workhorse of the library.
Here is a brief summary of progress:
- Major reorganization and code cleanup of the implementation for the next version, 0.60.0.
- Formatter implementation is now part of the core implementation in the `flexmark` module.
- `Formatter` improved with more options, including wrapping text to margins.
- Added the ability to track and map source offset(s) to their index in the formatted sequence. This feature allows editor caret position preservation across a formatting operation.
- Offset tracking unified using `TrackedOffset`. It is used by `MarkdownParagraph` for text wrapping and by `MarkdownTable` for table formatting, and it can handle caret position during typing and backspace editing operations which are immediately followed by formatting of the edited source.
- Tests cleaned up to eliminate duplication and hacks.
- `flexmark-test-util` made reusable for other projects. Having markdown as the source code for tests is too convenient to have it used only for `flexmark-java` tests.
- Optimized `SegmentedSequence` implementation using binary trees for searching segments and byte-efficient segment packing. Parser performance is either slightly improved or not affected, but the new implementation allows using `SegmentedSequences` for collecting `Formatter` and `HtmlRenderer` output to track the source location of all text with minimal overhead and double the performance of the old implementation.
- New implementation of `LineAppendable`, used for text generation in rendering:
  - It can use `SequenceBuilder` to generate a `BasedSequence` result with original source offsets for those character segments which come from the source. This allows round-trip source tracking from Source -> AST -> Formatted Source -> Source throughout the library.
As an added bonus, formatting to the new appendable is 40% faster than the previous implementation and 160 times (yes, times) more efficient in memory use. For the test below, the old implementation allocated 6GB worth of segmented sequences, the new implementation 37MB. The % overhead is four times greater, but that is after a 43-fold reduction in total overhead bytes: the old implementation allocated 342MB of overhead, the new implementation only 8MB.
As a result of the increased efficiency, two additional files of about 600KB each can be included in the test run; they add only 0.6 sec to the total formatter execution time and only 7.5MB of additional memory.
Tests were run on 1141 markdown files from GitHub projects and some other user samples. The largest was 256KB. The two new 600KB files were not included in the results, so that the results can be compared with the previous implementation.
Description | Old SegmentedSequence | New SegmentedSequence | New LineAppendable |
---|---|---|---|
Total wall clock time | 13.896 sec | 9.672 sec | 8.805 sec |
Parse time | 2.402 sec | 2.335 sec | 2.352 sec |
Formatter appendable | 0.603 sec | 0.602 sec | 0.798 sec |
Formatter sequence builder | 7.264 sec | 3.109 sec | 1.948 sec |
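To make the segment lookup and offset mapping from the summary more concrete, here is a minimal sketch in plain Java. It is not the library's `SegmentedSequence` or `TrackedOffset` code: the class and method names are hypothetical, and it uses a binary search over a sorted segment list where the real implementation is described as using binary trees with packed segments. It only shows the idea of mapping an index in formatted output back to a source offset, and a source offset (such as a caret) forward to its index in the output.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not flexmark's SegmentedSequence. */
final class SegmentMapSketch {
    // Each entry is {outputStart, length, sourceStart}; sourceStart == -1 marks
    // text that was inserted by the formatter and has no source position.
    private final List<int[]> segments = new ArrayList<>();
    private int outputLength = 0;

    /** Append a run of characters copied from the source at sourceStart. */
    void addSourceSegment(int sourceStart, int length) {
        add(sourceStart, length);
    }

    /** Append a run of characters that does not come from the source. */
    void addInsertedText(int length) {
        add(-1, length);
    }

    private void add(int sourceStart, int length) {
        if (length <= 0) return;
        segments.add(new int[] { outputLength, length, sourceStart });
        outputLength += length;
    }

    /** Binary search for the segment containing outputIndex; -1 if none or inserted text. */
    int sourceOffsetAt(int outputIndex) {
        int lo = 0, hi = segments.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int[] seg = segments.get(mid);
            if (outputIndex < seg[0]) {
                hi = mid - 1;
            } else if (outputIndex >= seg[0] + seg[1]) {
                lo = mid + 1;
            } else {
                return seg[2] < 0 ? -1 : seg[2] + (outputIndex - seg[0]);
            }
        }
        return -1;
    }

    /** Map a source offset (e.g. a caret) to its output index; -1 if that character was dropped. */
    int outputIndexOf(int sourceOffset) {
        for (int[] seg : segments) {
            if (seg[2] >= 0 && sourceOffset >= seg[2] && sourceOffset < seg[2] + seg[1]) {
                return seg[0] + (sourceOffset - seg[2]);
            }
        }
        return -1;
    }
}
```

For example, formatting the source `Hello   world` (three spaces) to `Hello world` could be recorded as a source segment at offset 0 of length 5, one inserted space, and a source segment at offset 8 of length 5. Output index 6 then maps back to source offset 8, and a caret at source offset 8 maps forward to output index 6.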
The overhead difference is significant. The totals are for all segmented sequences created during the test run of 1141 files. The parser statistics show the requirements during parsing; the formatter statistics cover only the formatting of the parsed files while accumulating the text as a segmented sequence.
Description | Old Formatter | New Formatter | New LineAppendable | Old Parser | New Parser |
---|---|---|---|---|---|
Bytes for characters of all segmented sequences | 6,029,774,526 | 6,029,774,526 | 37,253,492 | 917,016 | 917,016 |
Bytes for overhead of all segmented sequences | 12,060,276,408 | 342,351,155 | 8,021,677 | 1,845,048 | 93,628 |
Overhead % | 200.0% | 5.7% | 21.5% | 201.2% | 10.2% |
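The "Overhead %" row is the overhead bytes divided by the character bytes in the same column. The small sketch below recomputes it from the figures in the table, along with the reduction factors quoted earlier. The class and method names are made up for illustration and are not part of the library.

```java
/** Recomputes the ratios in the table above; names are illustrative only. */
final class OverheadStats {
    /** Overhead as a percentage of character bytes, rounded to one decimal place. */
    static double overheadPercent(long overheadBytes, long characterBytes) {
        return Math.round(1000.0 * overheadBytes / characterBytes) / 10.0;
    }

    /** How many times smaller the "after" figure is than the "before" figure. */
    static double reductionFactor(long before, long after) {
        return (double) before / after;
    }

    public static void main(String[] args) {
        // Columns: Old Formatter, New Formatter, New LineAppendable, Old Parser, New Parser
        long[] characters = { 6_029_774_526L, 6_029_774_526L, 37_253_492L, 917_016L, 917_016L };
        long[] overhead = { 12_060_276_408L, 342_351_155L, 8_021_677L, 1_845_048L, 93_628L };
        for (int i = 0; i < characters.length; i++) {
            System.out.println(overheadPercent(overhead[i], characters[i]) + "%");
        }
        // Character bytes, old formatter vs. new LineAppendable (roughly the "160 times").
        System.out.println(reductionFactor(6_029_774_526L, 37_253_492L));
        // Overhead bytes, new formatter vs. new LineAppendable (roughly the "43 fold").
        System.out.println(reductionFactor(342_351_155L, 8_021_677L));
    }
}
```

The five percentages come out as 200.0, 5.7, 21.5, 201.2 and 10.2, matching the table. The two reduction factors are a little under 162 and a little under 43.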
Version 0.60 released.
When updating to 0.60, it looks like the `flexmark-ext-gfm-tables` artifact has not been released for this version. It seems like the table functionality still works, however; was that feature rolled into the core library? I didn't see anything about that in the 0.60 release notes or migration guide.
@cjbrooks12, the gfm-tables extension has been deprecated for a long time and was not being updated. The `flexmark-ext-tables` module is a superset of the gfm module and will perform table parsing compatible with GFM when these module options are set:
.set(TablesExtension.COLUMN_SPANS, false)
.set(TablesExtension.APPEND_MISSING_COLUMNS, true)
.set(TablesExtension.DISCARD_EXTRA_COLUMNS, true)
.set(TablesExtension.HEADER_SEPARATOR_COLUMN_MATCH, true)
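For anyone wiring this up from scratch, here is a sketch of how those four options might sit in a complete setup. It assumes the `flexmark` and `flexmark-ext-tables` 0.60 artifacts are on the classpath and uses the 0.60 package locations for `Node` and `MutableDataSet`. The class name and the sample markdown are made up for illustration. Treat it as a starting point, not as the recommended configuration.

```java
import com.vladsch.flexmark.ext.tables.TablesExtension;
import com.vladsch.flexmark.html.HtmlRenderer;
import com.vladsch.flexmark.parser.Parser;
import com.vladsch.flexmark.util.ast.Node;
import com.vladsch.flexmark.util.data.MutableDataSet;

import java.util.Arrays;

// Hypothetical example class; only the option keys come from the comment above.
public class GfmTablesExample {
    public static void main(String[] args) {
        MutableDataSet options = new MutableDataSet();
        // Register the tables extension, then apply the GFM-compatible settings.
        options.set(Parser.EXTENSIONS, Arrays.asList(TablesExtension.create()))
                .set(TablesExtension.COLUMN_SPANS, false)
                .set(TablesExtension.APPEND_MISSING_COLUMNS, true)
                .set(TablesExtension.DISCARD_EXTRA_COLUMNS, true)
                .set(TablesExtension.HEADER_SEPARATOR_COLUMN_MATCH, true);

        Parser parser = Parser.builder(options).build();
        HtmlRenderer renderer = HtmlRenderer.builder(options).build();

        // Sample input, made up for this sketch.
        Node document = parser.parse("| a | b |\n|---|---|\n| 1 | 2 |\n");
        System.out.println(renderer.render(document));
    }
}
```

Passing the same options object to both the parser and the renderer lets the extension register on both sides.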
@vsch I found your note above on gfm-tables very helpful, thanks!
To match GitHub, I am also including the settings I found here, is that appropriate? This adds:
.set(TablesExtension.WITH_CAPTION, false)
.set(TablesExtension.MIN_HEADER_ROWS, 1)
.set(TablesExtension.MAX_HEADER_ROWS, 1)
It appears that `SuperscriptExtension` has been moved from `com.vladsch.flexmark.superscript.SuperscriptExtension` to `com.vladsch.flexmark.ext.superscript.SuperscriptExtension`. Should this be mentioned in the list of breaking changes? (It broke my build anyway.) And is there an explanation as to why it was moved? Does `.ext.` mean this is something outside of the CommonMark specification? (I'm only guessing. An official explanation would be helpful.)
:warning: Release of 0.60.0 has breaking changes due to reorganization, renaming and cleanup of some implementation classes.
Please give feedback if you are not able to adapt your code to the changes.
- Break: split out generic AST utilities from the `flexmark-util` module into separate smaller modules. An IntelliJ IDEA migration to help with the move from 0.50.40 will be provided where the package or class has changed. `com.vladsch.flexmark.util` will no longer contain any files but will contain the separate utilities modules, with the `flexmark-utils` module being an aggregate of all utilities modules, similar to `flexmark-all`:
  - `ast/` classes to `flexmark-util-ast`
  - `builder/` classes to `flexmark-util-builder`
  - `collection/` classes to `flexmark-util-collection`
  - `data/` classes to `flexmark-util-data`
  - `dependency/` classes to `flexmark-util-dependency`
  - `format/` classes to `flexmark-util-format`
  - `html/` classes to `flexmark-util-html`
  - `mappers/` classes to `flexmark-util-sequence`
  - `options/` classes to `flexmark-util-options`
  - `sequence/` classes to `flexmark-util-sequence`
  - `visitor/` classes to `flexmark-util-visitor`
- Convert anonymous classes to lambdas where possible.
- Refactor `flexmark-util` to eliminate dependency cycles between classes in different subdirectories.
- Break: delete deprecated properties, methods and classes.
- Add: `org.jetbrains:annotations:15.0` dependency to have `@Nullable`/`@NotNull` annotations added for all parameters. I use IntelliJ IDEA for development and it helps to have these annotations for analysis of potential problems and for use with Kotlin.
- Break: refactor and clean up tests to eliminate duplicated code and allow easier reuse of test cases with spec example data.
- Break: move formatter tests to the `flexmark-core-test` module to allow sharing of formatter base classes in extensions without causing dependency cycles in the formatter module.
- Break: move the formatter module into `flexmark` core. This module is almost always included anyway because most extensions have a dependency on the formatter for their custom formatting implementations. Having it as part of the core allows relying on its functionality in all modules.
- Break: move `com.vladsch.flexmark.spec` and `com.vladsch.flexmark.util` in `flexmark-test-util` to `com.vladsch.flexmark.test.spec` and `com.vladsch.flexmark.test.util` respectively, to respect the naming convention between modules and their packages.
- Break: `NodeVisitor` implementation details have changed. If you were overriding `NodeVisitor.visit(Node)` in the previous version, it is now `final` to ensure a compile-time error is generated. You will need to change your implementation. See the comment in the class for instructions.
  - :information_source: `com.vladsch.flexmark.util.ast.Visitor` is only needed for the implementation of `NodeVisitor` and `VisitHandler`. If you convert all anonymous implementations of `VisitHandler` to lambdas, you can remove all imports for `Visitor`.
- `com.vladsch.flexmark.util.ast.NodeAdaptedVisitor`, see javadoc for class
- `com.vladsch.flexmark.util.ast.NodeAdaptingVisitHandler`
- `com.vladsch.flexmark.util.ast.NodeAdaptingVisitor`
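
To illustrate the shape of the `NodeVisitor` change without depending on the library, here is a self-contained sketch of the pattern: a `final` dispatch method, with per-node-type behaviour supplied as lambda handlers. Every class here (`SketchNode`, `SketchVisitor` and so on) is a hypothetical stand-in, not a flexmark class, and the real `NodeVisitor` and `VisitHandler` signatures may differ. Follow the comment in the `NodeVisitor` class for the actual migration steps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical mini AST, standing in for the library's node classes.
class SketchNode {
    final List<SketchNode> children = new ArrayList<>();
}

class SketchText extends SketchNode {
    final String text;

    SketchText(String text) {
        this.text = text;
    }
}

class SketchParagraph extends SketchNode {
}

// Hypothetical visitor: handlers are registered per node class as lambdas.
class SketchVisitor {
    private final Map<Class<?>, Consumer<SketchNode>> handlers = new HashMap<>();

    <N extends SketchNode> SketchVisitor on(Class<N> type, Consumer<N> handler) {
        handlers.put(type, node -> handler.accept(type.cast(node)));
        return this;
    }

    // final: a subclass that tries to override this fails to compile, so
    // custom behaviour has to be moved into handlers.
    final void visit(SketchNode node) {
        Consumer<SketchNode> handler = handlers.get(node.getClass());
        if (handler != null) {
            handler.accept(node);
        } else {
            visitChildren(node); // no handler registered: descend into children
        }
    }

    final void visitChildren(SketchNode node) {
        for (SketchNode child : node.children) {
            visit(child);
        }
    }
}

final class VisitorSketchDemo {
    /** Collects the text of all SketchText nodes under root, in document order. */
    static String collectText(SketchNode root) {
        StringBuilder out = new StringBuilder();
        SketchVisitor visitor = new SketchVisitor()
                .on(SketchText.class, text -> out.append(text.text));
        visitor.visit(root);
        return out.toString();
    }
}
```

Because the handler is a lambda, the calling code never names the handler interface, which is the same reason the note above says imports for `Visitor` can be dropped once anonymous `VisitHandler` implementations are converted.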