mantas-done / subtitles

Subtitle/caption converter
https://gotranscript.com/subtitle-converter
MIT License

smi parse error #89

Closed: hyoungki-kim closed this issue 6 months ago

hyoungki-kim commented 6 months ago

Parsing SMI files fails in two cases.

  1. When a SYNC tag contains only a timecode, the parser cannot read a line. Example: <SYNC START=144700>

  2. When a SYNC tag contains a FONT tag, the subtitle content inside the FONT tag is ignored. Example: <SYNC START=145308><FONT COLOR="FFFF00">-야, 처리해. </FONT>
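
To make the two cases concrete, here is a minimal sketch that lists the child node names of each SYNC element after the markup goes through DOMDocument. The helper name `describeSyncChildren` is hypothetical and not part of the library; the `</SYNC>` insertion mirrors what the converter does when closing tags are missing. With libxml's HTML parser, the first block should end up with no children and the second with a single `font` child rather than a `p` or text node.

```php
<?php
// Hypothetical helper (not part of mantas-done/subtitles): report the
// child node names found under each <sync> element.
function describeSyncChildren(string $smi): array
{
    // Close every SYNC block explicitly when the file has no </SYNC> tags.
    if (strpos($smi, '</SYNC>') === false) {
        $smi = str_replace('<SYNC ', '</SYNC><SYNC ', $smi);
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($smi); // silence warnings about invalid HTML

    $result = [];
    foreach ($doc->getElementsByTagName('sync') as $sync) {
        $names = [];
        foreach ($sync->childNodes as $child) {
            $names[] = $child->nodeName; // the HTML parser lowercases tag names
        }
        // Numeric string keys such as "144700" become integer keys in PHP.
        $result[$sync->getAttribute('start')] = $names;
    }

    return $result;
}

$children = describeSyncChildren(
    '<SYNC START=144700><SYNC START=145308><FONT COLOR="FFFF00">-야, 처리해. </FONT>'
);
print_r($children);
```

An empty child list is what breaks a parser that expects at least one node, and a `font` child is skipped by a parser that only looks for `p` or `#text`.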

I suggest the following fix:

$data = []; // Define data array
foreach ($syncElements as $syncElement) {
    $time = $syncElement->getAttribute('start');

    // Skip SYNC elements that have no child nodes (timecode only)
    if(!$syncElement->childNodes->length) {
        continue;
    }

    foreach ($syncElement->childNodes as $childNode) {
        $lines = [];
        $line = '';

        $contentNode = null;

        // Process font tag
        if ($childNode->nodeName === 'p' || $childNode->nodeName === '#text') {
            $contentNode = $childNode;
        } else if ($childNode->nodeName === 'font' && $childNode->childNodes->length) {
            $contentNode = $childNode->childNodes->item(0);
        }

        if($contentNode) {
            $line = $doc->saveHTML($contentNode);
            $line = preg_replace('/<br\s*\/?>/', '<br>', $line); // normalize <br>
            $line = str_replace("\u{00a0}", '', $line); // non-breaking space (&nbsp;)
            $line = str_replace("&amp;nbsp", '', $line); // some files omit the semicolon at the end of &nbsp
            $line = trim($line);
            $lines = explode('<br>', $line);
            $lines = array_map('strip_tags', $lines);
            $lines = array_map('trim', $lines);
            break;
        }
    }

    $data[] = [
        'start' => static::timeToInternal($time),
        'is_nbsp' => trim(strip_tags($line)) === '',
        'lines' => $lines,
    ];
}
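
The line-splitting part of the loop above can be tried in isolation. This is only a sketch under the assumption that the input is the string returned by `saveHTML()` for one content node; the function name `splitSubtitleLines` is hypothetical and does not exist in the library.

```php
<?php
// Hypothetical standalone version of the normalization steps used in the loop.
function splitSubtitleLines(string $html): array
{
    $html = preg_replace('/<br\s*\/?>/', '<br>', $html); // normalize <br>, <br/> and <br />
    $html = str_replace("\u{00a0}", '', $html);          // drop non-breaking spaces
    $html = str_replace('&amp;nbsp', '', $html);         // drop &nbsp written without a semicolon
    $html = trim($html);

    $lines = explode('<br>', $html);          // one array entry per subtitle line
    $lines = array_map('strip_tags', $lines); // remove leftover markup such as <p>
    return array_map('trim', $lines);
}

print_r(splitSubtitleLines('<p>Hello<br/> world</p>'));
```

A block that contains only a non-breaking space collapses to a single empty string, which is what the `is_nbsp` check relies on.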

mantas-done commented 6 months ago

Hi @hyoungki-kim, can you attach a sample file that triggers the error?

hyoungki-kim commented 6 months ago

I have attached two files: smi-exception.zip

My final code is:

    public function fileContentToInternalFormat($file_content, $original_file_content)
    {
        $internal_format = []; // array - where file content will be stored

        // $file_content = mb_convert_encoding($file_content, 'HTML');
        // the 'HTML' (HTML-ENTITIES) encoding is deprecated as of PHP 8.2, so use this function instead
        // https://github.com/mantas-done/subtitles/issues/87
        $file_content = mb_encode_numericentity($file_content, [0x80, 0x10FFFF, 0, ~0], 'UTF-8');

        if (strpos($file_content, '</SYNC>') === false) {
            $file_content = str_replace('<SYNC ', '</SYNC><SYNC ', $file_content);
        }
        $file_content = str_replace("\n", '', $file_content);
        $file_content = str_replace("\t", '', $file_content);
        $file_content = preg_replace('/>\s+</', '><', $file_content);

        $doc = new \DOMDocument();
        @$doc->loadHTML($file_content); // silence warnings about invalid html

        $syncElements = $doc->getElementsByTagName('sync');

        $data = [];
        foreach ($syncElements as $syncElement) {
            $time = $syncElement->getAttribute('start');

            if(!$syncElement->childNodes->length) {
                continue;
            }

            foreach ($syncElement->childNodes as $childNode) {
                $lines = [];
                $line = '';

                $contentNode = null;

                if ($childNode->nodeName === 'p' || $childNode->nodeName === '#text') {
                    $contentNode = $childNode;
                } else if ($childNode->nodeName === 'font' && $childNode->childNodes->length) {
                    $contentNode = $childNode->childNodes->item(0);
                }

                if($contentNode) {
                    $line = $doc->saveHTML($contentNode);
                    $line = preg_replace('/<br\s*\/?>/', '<br>', $line); // normalize <br>
                    $line = str_replace("\u{00a0}", '', $line); // non-breaking space (&nbsp;)
                    $line = str_replace("&amp;nbsp", '', $line); // some files omit the semicolon at the end of &nbsp
                    $line = trim($line);
                    $lines = explode('<br>', $line);
                    $lines = array_map('strip_tags', $lines);
                    $lines = array_map('trim', $lines);
                    break;
                }
            }

            $data[] = [
                'start' => static::timeToInternal($time),
                'is_nbsp' => trim(strip_tags($line)) === '',
                'lines' => $lines,
            ];
        }

        if(empty($data)) {
            return $internal_format;
        }

        $i = 0;
        foreach ($data as $row) {
            if (!isset($internal_format[$i - 1]['end']) && $i !== 0) {
                $internal_format[$i - 1]['end'] = $row['start'];
            }
            if (!$row['is_nbsp']) {
                $internal_format[$i] = [
                    'start' => $row['start'],
                    'lines' => $row['lines'],
                ];
                $i++;
            }
        }
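        // Guard against an empty result (every SYNC block was blank), where index -1 would not exist
        if ($i > 0 && !isset($internal_format[$i - 1]['end'])) {
            $internal_format[$i - 1]['end'] = $internal_format[$i - 1]['start'] + 2.067; // SubtitleEdit adds this time if there is no last nbsp block
        }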

        return $internal_format;
    }
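
The end-time step at the bottom of the function may be easier to follow as a separate sketch. Each cue ends where the next SYNC block starts, blank blocks only close the previous cue, and the last cue gets a fixed fallback duration. The function name `assignEndTimes` and the `$fallback` parameter are hypothetical; the row shape (`start`, `is_nbsp`, `lines`) is the one built in `$data` above.

```php
<?php
// Hypothetical standalone version of the end-time logic, for illustration only.
function assignEndTimes(array $rows, float $fallback = 2.067): array
{
    $cues = [];
    foreach ($rows as $row) {
        $last = count($cues) - 1;
        // The start of this SYNC block closes the previous cue, if it is still open.
        if ($last >= 0 && !isset($cues[$last]['end'])) {
            $cues[$last]['end'] = $row['start'];
        }
        // Blank (nbsp) blocks only mark an end time; they produce no cue.
        if (!$row['is_nbsp']) {
            $cues[] = ['start' => $row['start'], 'lines' => $row['lines']];
        }
    }

    // No trailing blank block: give the last cue the fallback duration.
    $last = count($cues) - 1;
    if ($last >= 0 && !isset($cues[$last]['end'])) {
        $cues[$last]['end'] = $cues[$last]['start'] + $fallback;
    }

    return $cues;
}

$cues = assignEndTimes([
    ['start' => 1.0, 'is_nbsp' => false, 'lines' => ['first']],
    ['start' => 3.0, 'is_nbsp' => true,  'lines' => []],
    ['start' => 5.0, 'is_nbsp' => false, 'lines' => ['second']],
]);
print_r($cues);
```

In this sketch the `$last >= 0` checks keep the function safe when every block is blank, so it returns an empty array instead of touching index -1.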

mantas-done commented 6 months ago

Your code works great! Added your code and also a new unit test. Also made a new release. Thank you! :)