tweag / topiary

https://topiary.tweag.io/
MIT License
547 stars 23 forks source link

Topiary is pretty slow on formatting OCaml codebases #525

Open Niols opened 1 year ago

Niols commented 1 year ago

Describe the bug

This issue is unrelated to #519. Topiary takes about half a second to format OCaml files, even trivial ones. This is reasonable when formatting one file but it becomes very frustrating when formatting a whole code base.

To Reproduce

For instance, on a relatively small codebase, with the latest version compiled in release mode:

$ git clone https://github.com/colis-anr/morsmall
Cloning into 'morsmall'...
remote: Enumerating objects: 1335, done.
remote: Counting objects: 100% (385/385), done.
remote: Compressing objects: 100% (161/161), done.
remote: Total 1335 (delta 228), reused 340 (delta 199), pack-reused 950
Receiving objects: 100% (1335/1335), 393.89 KiB | 1.21 MiB/s, done.
Resolving deltas: 100% (812/812), done.

$ cd morsmall

$ find -name '*.ml' | wc -l
16

$ find -name '*.ml' -exec wc -l {} +
   26 ./src/ASTUtils.ml
   58 ./src/morsmall.ml
  214 ./src/AST.ml
  244 ./src/utilities/testParser.ml
   46 ./src/equality/located.ml
   46 ./src/equality/nonLocated.ml
   46 ./src/printer/debugNonLocated.ml
   49 ./src/printer/json.ml
  346 ./src/printer/safe.ml
   46 ./src/printer/debug.ml
   49 ./src/printer/jsonNonLocated.ml
   37 ./src/location.ml
  875 ./src/CST_to_AST.ml
  134 ./tests/golden/golden.ml
  215 ./tests/qcheck/generator.ml
  138 ./tests/qcheck/qcheck.ml
 2569 total

$ time find -name '*.ml' -print -exec topiary -if {} \;
./src/ASTUtils.ml
./src/morsmall.ml
./src/AST.ml
./src/utilities/testParser.ml
./src/equality/located.ml
./src/equality/nonLocated.ml
./src/printer/debugNonLocated.ml
./src/printer/json.ml
./src/printer/safe.ml
./src/printer/debug.ml
./src/printer/jsonNonLocated.ml
./src/location.ml
./src/CST_to_AST.ml
./tests/golden/golden.ml
./tests/qcheck/generator.ml
./tests/qcheck/qcheck.ml

real    0m6.976s
user    0m6.880s
sys 0m0.090s

Expected behavior

I would expect Topiary to be reasonably fast to format a whole code base (maybe a few seconds). This could be achieved either by having Topiary be very fast (probably under 100 ms) on separate files or by having Topiary able to process several files at once.

Environment

Additional context

The situation has already been improved a bit by https://github.com/tweag/topiary/pull/523 and its 40% speedup. However, one would need a much more drastic speedup per run to achieve something nicer.

If this helps, I'd imagine a command line interface of:

topiary --in-place FILE [FILE ...]

(or -i for short) handling at least one file (and failing if none were passed), or:

topiary --input-file FILE --output-file FILE

(or -f and -o for short) handling exactly one file and defaulting to stdin and stdout without any flags.

Such an interface would also make for a much easier use in find/xargs and in pre-commit or similar utilities that expect formatters to handle several files at once.

Niols commented 1 year ago

In the stand-up meeting of today we mentioned the idea to have topiary be able to take several --input-file and that would then write everything in the one output file (potentially stdout), concatenating things. However, this would be maybe a bit counter intuitive but it would also behave very badly with a potential future multi-threaded version of Topiary.

Niols commented 1 year ago

Somehow, this reminds me of sed -i that accepts an extension for backing-up files. Typically, one could write sed -i.bak ... to get files to be copied into *.bak files before applying formatting. I am not sure it would make sense for Topiary, though, but I thought I'd throw it in the mix.

As another random idea, if we really wanted to be able to have --input-file and --output-file work with several files at once, we could allow alternating them:

topiary -f file1.ml -o file1-fmt.ml -f file2.ml -o file2-fmt.ml ...

or maybe have --output-file able to take a pattern or something, like:

topiary -f file1.ml -f file2.ml -o '{}-fmt.ml'

but I am really not convinced by this. Again, just throwing them in here.

aspiwack commented 1 year ago

I think it's time to think about subcommands.

There could be a subcommand topiary format which takes a list of files and format them in place. This is the “standard” behaviour of a formatter. This can be multithread, or single thread. But could share compiled queries.

Then we could have topiary dev (not sure it's the best name), with the current interface. Because things like --visualise or the playground.sh script don't necessarily make sense in the multi-file setting.

Would this avoid the complications?