Open sogaiu opened 3 years ago
Main problem atm is that only solution I can come up with involves keeping the whole file in memory, or dumping it to disk. Are there other ways to do this without potentially eating a lot of ram? I'm guessing it might be hard not to keep track of the whole file as long as one would use external tools like jfmt.
Would be neat to implement a diffing algorithm to only keep the changes, rather than the full file. One way that seems pretty doable that I read about (https://wiki.c2.com/?DiffAlgorithm):
The core of diff algorithms seeks to compare two sequences and to discover how the first can be transformed into the second by a sequence of operations using the primitives delete-subsequence, and insert-subseqence. If a delete and an insert coincide on the same range then it can be labeled as a change-subsequence. The user doesn't always want to be informed as to which subsequences remain identical, but this information is nonetheless typically computed as a by-product of the basic algorithm. The sophisticated way to find those editing operations is to sort the two sequences in order of the values of the elements of the sequences; for when the basic unit is a text line, usually a hashcode of each line is used to represent the line, and the sequences are sorted by hashcode in order to discover groups of identical elements/lines. Once identical elements are found, they are clumped together into runs which were adjacent in the pre-sorted sequences. One common variant looks for the longest possible such runs, which is an O(N^2) approach, but for N equal to the number of possibilities in each group, not the number of total elements/lines, so this is usually acceptably fast. At this stage all matching regions between the two sequences/files have been found. Running sequentially through the elements/lines that have not been matched now trivially indicates which regions have been inserted, deleted, or changed. Usually moved regions are indicated as an insert and a delete, but with some extra effort (not typically done by most diff programs) they can be reported as moved. Actually source code in C can of course be found in e.g. the GNU version of diff or the one from Version 7 Unix, also available online. The smallest versions of this algorithm can be implemented in less space than the above text of this article; it's less complicated than it may sound.
I didn't digest the above deeply, but a cursory look reminded me of "the string-to-string correction problem". I came across mention of that in section 3.3.1 of a chapter in "The Architecture of Open Source Applications" about Bash.
It appears that after one formats code using
Ctrl-Shift-F
(at least on Linux) that undo has no effect.