jfcherng / php-diff

A comprehensive library for generating differences between two strings in multiple formats (unified, side by side HTML etc).
https://packagist.org/packages/jfcherng/php-diff
BSD 3-Clause "New" or "Revised" License
412 stars 50 forks source link

New Renderer Suggestion #20

Closed the-turk closed 4 years ago

the-turk commented 4 years ago

What does it take to write a renderer to show diffs as in GitHub (see my edit history)? Currently it duplicates the text and doubles the storage volume when storing a JSON output into the database. We might need two renderers for this kind of functionality (one is JSON and the other one is HTML).

jfcherng commented 4 years ago

If you want to treat old and new text as HTML, I think you will need to implement a new HtmlDiffer to diff DOM nodes. And then implement a new renderer to properly display the diff.

Related: https://github.com/jfcherng/php-diff/issues/9#issuecomment-526808774


If you just want to treat old and new text as plain text which Differ is currently doing, then you just need a new renderer I guess. Or maybe the output of the Json renderer can be used to render in the website frontend via Javascript.

the-turk commented 4 years ago

We might summarize the JSON output.

Check the following example out:

Original:

Donec rutrum, odio id tempus consequat, nunc nisi pulvinar dolor, ac faucibus massa erat non est. Integer blandit, eros quis sodales vulputate, erat ex euismod nisl, gravida vestibulum nibh elit pretium quam. Mauris turpis tortor, lacinia vitae purus eget, vulputate mollis felis.

New:

Donec rutrum, odio id tempus consequat, nunc nisi pulvinar dolor, ac faucibus massa erat non est. Integer blandit, eros quis sodales vulputate, erat ex euismod nisl, gravida vestibulum nibh elit pretium quam. Mauris turpis tortor, lacinia vitae purus eget, vulputate testing mollis felis.

JSON output from our repository (duplicates the text to show both versions):

[
  [
    {
      "tag": 8,
      "old": {
        "offset": 0,
        "lines": [
          "Donec rutrum, odio id tempus consequat, nunc nisi pulvinar dolor, ac faucibus massa erat non est. Integer blandit, eros quis sodales vulputate, erat ex euismod nisl, gravida vestibulum nibh elit pretium quam. Mauris turpis tortor, lacinia vitae purus eget, vulputate mollis felis."
        ]
      },
      "new": {
        "offset": 0,
        "lines": [
          "Donec rutrum, odio id tempus consequat, nunc nisi pulvinar dolor, ac faucibus massa erat non est. Integer blandit, eros quis sodales vulputate, erat ex euismod nisl, gravida vestibulum nibh elit pretium quam. Mauris turpis tortor, lacinia vitae purus eget, vulputate <ins>testing </ins>mollis felis."
        ]
      }
    }
  ]
]

HTML output from our repository: Click to view the image.

Output on GitHub: Click to view the image.

Thoughts:

jfcherng commented 4 years ago

Note that the information from AbstractHtml::getChanges() (from upstream SequenceMatcher) does not directly tell things like line X is modified from line Y. It tells that "block (lines)" X is modified from "block" Y.

Old:

Donec rutrum, test.

New:

Donec rutrum.
There is a new inserted line.

JSON Result:

[
    [
        {
            "tag": "rep",
            "old": {
                "offset": 0,
                "lines": [
                    "Donec rutrum, test."
                ]
            },
            "new": {
                "offset": 0,
                "lines": [
                    "Donec rutrum.",
                    "There is a new inserted line."
                ]
            }
        }
    ]
]

Somehow I feel those codes in Jfcherng\Diff\Renderer\Html\LineRenderer\Word may be inspiring. Treating lines as "a string with some \ns in it".

the-turk commented 4 years ago

@jfcherng i wrote a simple (and little tricky) renderer without playing with JSON format for GitHub similarization, check that out.

\Renderer\Html\Combined.php:

<?php

declare(strict_types=1);

namespace Jfcherng\Diff\Renderer\Html;

use Jfcherng\Diff\SequenceMatcher;
use Jfcherng\Diff\Renderer\RendererConstant;

/**
 * Combined HTML diff generator
 */
final class Combined extends AbstractHtml
{
    /**
     * {@inheritdoc}
     */
    const INFO = [
        'desc' => 'Combined',
        'type' => 'Html',
    ];

    /**
     * {@inheritdoc}
     */
    protected function redererChanges(array $changes): string
    {
        if (empty($changes)) {
            return $this->getResultForIdenticals();
        }

        $wrapperClasses = \array_merge(
            $this->options['wrapperClasses'],
            ['diff', 'diff-html', 'diff-combined']
        );

        $html = '<table class="' . \implode(' ', $wrapperClasses) . '">';

        $html .= $this->renderTableHeader();

        foreach ($changes as $i => $blocks) {
            if ($i > 0 && $this->options['separateBlock']) {
                $html .= $this->renderTableSeparateBlock();
            }

            foreach ($blocks as $change) {
                $html .= $this->renderTableBlock($change);
            }
        }

        return $html . '</table>';
    }

    /**
     * Renderer the table header.
     */
    protected function renderTableHeader(): string
    {
        return
            '<thead>' .
                '<tr>' .
                    (
                        $this->options['lineNumbers']
                        ?
                            '<th>' . $this->_('old_version') . '</th>' .
                            '<th>' . $this->_('new_version') . '</th>'
                        :
                            ''
                    ) .
                    '<th>' . $this->_('differences') . '</th>' .
                '</tr>' .
            '</thead>';
    }

    /**
     * Renderer the table separate block.
     */
    protected function renderTableSeparateBlock(): string
    {
        $colspan = $this->options['lineNumbers'] ? '3' : '';

        return
            '<tbody class="skipped">' .
                '<tr>' .
                    '<td' . $colspan . '></td>' .
                '</tr>' .
            '</tbody>';
    }

    /**
     * Renderer the table block.
     *
     * @param array $change the change
     */
    protected function renderTableBlock(array $change): string
    {
        static $callbacks = [
            SequenceMatcher::OP_EQ => 'renderTableEqual',
            SequenceMatcher::OP_INS => 'renderTableInsert',
            SequenceMatcher::OP_DEL => 'renderTableDelete',
            SequenceMatcher::OP_REP => 'renderTableReplace',
        ];

        return
            '<tbody class="change change-' . self::TAG_CLASS_MAP[$change['tag']] . '">' .
                $this->{$callbacks[$change['tag']]}($change) .
            '</tbody>';
    }

    /**
     * Renderer the table block: equal.
     *
     * @param array $change the change
     */
    protected function renderTableEqual(array $change): string
    {
        $html = '';

        // note that although we are in a OP_EQ situation,
        // the old and the new may not be exactly the same
        // because of ignoreCase, ignoreWhitespace, etc
        foreach ($change['old']['lines'] as $no => $oldLine) {
            // hmm... but this is a inline renderer
            // we could only pick a line from the old or the new to show
            $oldLineNum = $change['old']['offset'] + $no + 1;
            $newLineNum = $change['new']['offset'] + $no + 1;

            $html .=
                '<tr data-type="=">' .
                    $this->renderLineNumberColumns($oldLineNum, $newLineNum) .
                    '<td class="old">' . $oldLine . '</td>' .
                '</tr>';
        }

        return $html;
    }

    /**
     * Renderer the table block: insert.
     *
     * @param array $change the change
     */
    protected function renderTableInsert(array $change): string
    {
        $html = '';

        foreach ($change['new']['lines'] as $no => $newLine) {
            $newLineNum = $change['new']['offset'] + $no + 1;

            $html .=
                '<tr data-type="+">' .
                    $this->renderLineNumberColumns(null, $newLineNum) .
                    '<td class="new">' . $newLine . '</td>' .
                '</tr>';
        }

        return $html;
    }

    /**
     * Renderer the table block: delete.
     *
     * @param array $change the change
     */
    protected function renderTableDelete(array $change): string
    {
        $html = '';

        foreach ($change['old']['lines'] as $no => $oldLine) {
            $oldLineNum = $change['old']['offset'] + $no + 1;

            $html .=
                '<tr data-type="-">' .
                    $this->renderLineNumberColumns($oldLineNum, null) .
                    '<td class="old">' . $oldLine . '</td>' .
                '</tr>';
        }

        return $html;
    }

    /**
     * Renderer the table block: replace.
     *
     * @param array $change the change
     */
    protected function renderTableReplace(array $change): string
    {
        $html = '';

        if (\count($change['old']['lines']) >= \count($change['new']['lines'])) {
            foreach ($change['old']['lines'] as $no => $oldLine) {
                $oldLineNum = $change['old']['offset'] + $no + 1;

                if (isset($change['new']['lines'][$no])) {
                    $newLineNum = $change['old']['offset'] + $no + 1;
                    $newLine = '<span>' . $change['new']['lines'][$no] . '</span>';
                } else {
                    $newLineNum = null;
                    $newLine = '';
                }

                $html .=
                    '<tr>' .
                        $this->renderLineNumberColumns($oldLineNum, $newLineNum) .
                        '<td class="rep"><span>' .
                            $this->mergeDiffs($newLine, $oldLine) .
                        '</span></td>' .
                    '</tr>';
            }
        } else {
            foreach ($change['new']['lines'] as $no => $newLine) {
                $newLineNum = $change['new']['offset'] + $no + 1;

                if (isset($change['old']['lines'][$no])) {
                    $oldLineNum = $change['old']['offset'] + $no + 1;
                    $oldLine = '<span>' . $change['old']['lines'][$no] . '</span>';
                } else {
                    $oldLineNum = null;
                    $oldLine = '';
                }

                $html .=
                    '<tr>' .
                        $this->renderLineNumberColumns($oldLineNum, $newLineNum) .
                        '<td class="rep"><span>' .
                            $this->mergeDiffs($newLine, $oldLine) .
                        '</span></td>' .
                    '</tr>';
            }
        }

        return $html;
    }

    /**
     * Renderer the line number columns.
     *
     * @param null|int $oldLineNum The old line number
     * @param null|int $newLineNum The new line number
     */
    protected function renderLineNumberColumns(?int $oldLineNum, ?int $newLineNum): string
    {
        if (!$this->options['lineNumbers']) {
            return '';
        }

        return
            (
                isset($oldLineNum)
                    ? '<th class="n-old">' . $oldLineNum . '</th>'
                    : '<th></th>'
            ) .
            (
                isset($newLineNum)
                    ? '<th class="n-new">' . $newLineNum . '</th>'
                    : '<th></th>'
            );
    }

    /**
     * Merge diffs between lines.
     *
     * Gets newPart in <ins>newPart</ins>
     * Replaces <del>oldPart</del> with
     * <del>oldPart</del><ins>newPart</ins>
     *
     * @param string $newLine New line
     * @param string $oldLine Old line
     */
    protected function mergeDiffs(string $newLine, string $oldLine): string
    {
        $newParts = $this->getNewPartsWithClosures(
            RendererConstant::HTML_CLOSURES_INS[0],
            RendererConstant::HTML_CLOSURES_INS[1],
            $newLine
        );

        $offset = 0;

        return preg_replace_callback(
            '/' . preg_quote(RendererConstant::HTML_CLOSURES_DEL[1], '/') . '/',
            function($match) use($newParts, &$offset)
            {
            return $newParts[$offset++];
            }, $oldLine
        );
    }

    /**
     * Get new parts of line with </del><ins></ins> closures.
     * This one is adapted from:
     * https://stackoverflow.com/a/27078384/12866913
     *
     * @param string $line Line
     * @param string $leftDelim Left delimiter
     * @param string $rightDelim Right delimiter
     */
     protected function getNewPartsWithClosures(
         string $leftDelim,
         string $rightDelim,
         string $line
     ): array
     {
        $contents = [];
        $leftDelimLength = \strlen($leftDelim);
        $rightDelimLength = \strlen($rightDelim);
        $startFrom = $contentStart = $contentEnd = 0;

        while (($contentStart = \strpos($line, $leftDelim, $startFrom)) !== false) {
            $contentStart += $leftDelimLength;
            $contentEnd = \strpos($line, $rightDelim, $contentStart);

            if ($contentEnd === false) {
              break;
            }

            $contents[] =
                RendererConstant::HTML_CLOSURES_DEL[1] .
                RendererConstant::HTML_CLOSURES_INS[0] .
                \substr(
                    $line,
                    $contentStart,
                    $contentEnd - $contentStart
                ) .
                RendererConstant::HTML_CLOSURES_INS[1];

            $startFrom = $contentEnd + $rightDelimLength;
        }

        return $contents;
    }
}

NEEDS FURTHER TESTING FOR DIFFERENT LEVELS. IT STILL HAVE BUGS

this one is introducing new class name for td tags: rep (representing combined lines). It looks okay with these styles:

.diff-wrapper.diff .rep {
background: #fce7a0; 
}
.diff-wrapper.diff .rep del {
text-decoration: line-through !important;
background: #f7c9c9;
}
.diff-wrapper.diff .rep ins {
background: #bef6be;
}

here are the screenshots from my side:

screen1

screen2

jfcherng commented 4 years ago

That's a good start. :+1:

Here are some things that I notice currently:

Also, maybe create a PR for easier code review and test.

the-turk commented 4 years ago

Hey, it's me again. Could you try to render the diff between these two using Combined (word-level)?

Old:

https://paste.ubuntu.com/p/8yZQpH9h3W/

New:

TESTING IT AS MUCH AS I CAN

Output look like a mess.

Also, is the output okay with the other renderers? It shows old lines as new lines to me. I think i broke it đŸ˜«đŸ˜«

jfcherng commented 4 years ago

image

yeah it produces a bad-looking result. When there are too many differences, force merging them together is no longer meaningful.

jfcherng commented 4 years ago

Here's a naive patch https://github.com/jfcherng/php-diff/commit/273d1b43ca67d239f4da36af78253afcaf9f17c4. I am not sure about the proper ratio. 0.8 is just a random pick at this moment.

the-turk commented 4 years ago

you're as fast as lightning! It fixed. 0.8 looks reasonable for now.

the-turk commented 4 years ago

Damn, you're a genius.

Off topic... This page is useful to generate diffs from a random text: https://csgenerator.com/

jfcherng commented 4 years ago

Off topic... This page is useful to generate diffs from a random text: https://csgenerator.com/

Interesting! But the sentence is too short I guess :smile:

the-turk commented 4 years ago

Hey @jfcherng, could you test this out with word-level and 0.8 threshold? What's wrong with that dot?

Old Airlcets are fnuod in mnay Idno-Eorueapn lgeganaus, Semiitc lnugagaes (olny the dnitefie aclitre), and Plsaeyionn lggenuaas; hwevoer, are fmallroy anebst form many of the wolrd's mjaor laeagnugs idcnliung: Chnisee, Jaanepse, Koearn, Molniagon, mnay Tkiurc laeunaggs (icnl. Taatr, Bikhsar, Tuvan and Cavshuh), many Ualric lganueags (incl. Fininc[a] and Samai lguangeas), Inaoidesnn, Hindi-Urdu, Pnuabji, Tmail, the Baltic leaungags, the mjirotay of Salvic lagaugens, the Bnatu lenagaugs (icnl. Slhaiwi) and Yborua. In smoe lnaeaggus taht do have aiecrlts, such as some Notrh Csiaaaucn lauganegs, the use of atilecrs is oaotnpil; hewveor, in oreths like Elsngih and Garmen it is mrtoanady in all cases. Lntsgiuis bveleie the common anecstor of the Indo-Euopearn legguanas, Porto-Idno-Eorpuaen, did not have aieltrcs. Msot of the lanaguges in this flamiy do not hvae diiefnte or ieiinfntde atrciels: three is no actirle in Ltain or Ssknarit, nor in some meodrn Indo-Eoeapurn lgagaenus, scuh as the falmeiis of Salvic lnggaeuas (epexct for Bulraiagn and Mneiodaacn, whcih are reahtr dviiittcsne amnog the Saivlc lagauegns in thier gmrmaar), Btialc leaauggns and mnay Indo-Aaryn lgugneaas. Aluogthh Csiacslal Greek had a dtiniefe artlcie (wichh has srivvued itno Modren Gerek and wichh bears snotrg fcaiutonnl raeelmnscbe to the Gramen deiinfte arltcie, whcih it is rtealed to), the erailer Hmeoirc Geerk used this alctrie llrgeay as a prnuoon or dioravsttmnee, wareehs the erlaseit known form of Geerk konwn as Meayenacn Gerek did not have any acrlties. Atilcres dvlpeeoed inleenddpenty in svaeerl lagnauge fmieials. Not all luaganges hvae btoh dfeintie and iidtenfnie acertils, and some lunaeaggs have drnifefet tepys of deifnite and iinnetfide aerlitcs to diiutsigsnh fienr seadhs of menniag: for eplamxe, Frecnh and Ilitaan hvae a pavitrite atrclie uesd for iiennitfde msas nuons, wrheaes Clgniaoon has two dcinitst sets of deiitnfe arlciets ianitdncig fuocs and unenuiesqs, and Medcoanain uess defiinte acretils in a dantoistvreme sense, with a trttipraie diioicsnttn (proixaml, mdiael, dastil) bsead on dasitcne form the saepekr or iotutencorlr. The words this and taht (and tiehr parluls, tshee and those) can be ueonrdsotd in Eisnglh as, utelitmaly, froms of the ditinfee article the (woshe dcenleison in Old Eiglsnh ilcuednd thaes, an asarntecl from of tihs/taht and thsee/toshe). In mnay lagnuegas, the form of the atrlice may vary acdrocnig to the gneedr, nbmuer, or csae of its nuon. In smoe laganegus the artcile may be the only idiaitnocn of the case. Many lgangeaus do not use atcerlis at all, and may use other ways of idinntaicg old vesurs new iiotmfonran, scuh as toipc–coemnmt cnictouorntss.
New Articles are found in many Indo-European languages, Semitic languages (only the definite article), and Polynesian languages; however, are formally absent from many of the world's major languages including: Chinese, Japanese, Korean, Mongolian, many Turkic languages (incl. Tatar, Bashkir, Tuvan and Chuvash), many Uralic languages (incl. Finnic[a] and Saami languages), Indonesian, Hindi-Urdu, Punjabi, Tamil, the Baltic languages, the majority of Slavic languages, the Bantu languages (incl. Swahili) and Yoruba. In some languages that do have articles, such as some North Caucasian languages, the utilization of articles is optional; however, in others like English and German it is obligatory in all cases. Linguists believe the mundane forebear of the Indo-European languages, Proto-Indo-European, did not have articles. Most of the languages in this family do not have definite or indefinite articles: there is no article in Latin or Sanskrit, nor in some modern Indo-European languages, such as the families of Slavic languages (except for Bulgarian and Macedonian, which are rather distinctive among the Slavic languages in their grammar), Baltic languages and many Indo-Aryan languages. Albeit Classical Greek had a definite article (which has survived into Modern Greek and which bears vigorous functional resemblance to the German definite article, which it is cognate to), the earlier Homeric Greek utilized this article largely as a pronoun or demonstrative, whereas the earliest kenned form of Greek kenned as Mycenaean Greek did not have any articles. Articles developed independently in several language families. Not all languages have both definite and indefinite articles, and some languages have variants of definite and indefinite articles to distinguish finer shades of denotement: for example, French and Italian have a partitive article utilized for indefinite mass entities, whereas Colognian has two distinct sets of definite articles denoting focus and uniqueness, and Macedonian uses definite articles in a demonstrative sense, with a tripartite distinction (proximal, medial, distal) predicated on distance from the verbalizer or interlocutor. The words this and that (and their plurals, these and those) can be understood in English as, ultimately, forms of the definite article the (whose declension in Old English included thaes, an ancestral form of this/that and these/those). In many languages, the form of the article may vary according to the gender, number, or case of its entity. In some languages the article may be the only designation of the case. Many languages do not utilize articles at all, and may use other ways of denoting old versus incipient information, such as topic–comment constructions.
jfcherng commented 4 years ago

Hey @jfcherng, could you test this out with word-level and 0.8 threshold? What's wrong with that dot?

Old New

I can't see the problem. Maybe a screenshot?

Or maybe the text you provided cannot be copied accurately since they are not in code block.
jfcherng commented 4 years ago

Since you have release the-turk/flarum-diff 1.0.0, I assume the issue no longer exists?

the-turk commented 4 years ago

Since you have release the-turk/flarum-diff 1.0.0, I assume the issue no longer exists?

Ahh, sorry I forgot this. Seems like I can't reproduce it either, I dropped my custom renderers maybe this fixed the issue. Thank you!

the-turk commented 4 years ago

@jfcherng sorry for the ping, are you using Discord?

jfcherng commented 4 years ago

yes, ---------