tvgrabbers / tvgrabnlpy

This version is deprecated; see: tvgrabpyAPI
https://github.com/tvgrabbers/tvgrabpyAPI
GNU General Public License v2.0

Non standard characters in XML #44

Closed Waywardnl closed 8 years ago

Waywardnl commented 8 years ago

Hello,

I use bigscreen EPG to parse the XML for Windows Media Center, and I get an error on the XML file. I have added a snippet:

[code]

Bluf Aflevering 5 Dramaserie: Ook Elise is in de ban van Julian. Of gebruikt ze hem alleen om Mark jaloers te maken? Tjé ten slotte komt in de Amsterdam ArenA de vrouw van zijn dromen tegen. "Ik weet niet wat ik in mijn vorige leven heb gedaan om dit te mogen meemaken, maar het was letterlijk een soort jongensdroom: strippers op schoot en voetballen in de ArenA. Ik krijg een eigen nachtclub en word voor het eerst verliefd", vertelt acteur Géza Weisz, die Tjé speelt, in Grazia. "Het... Drama 1 . 4 . 12+ Seks Grof

[/code]

I suspect the letter é is the problem. Could it be converted?

Thank you, Roland de Leeuw

[Attached screenshot: foutmelding-xml-bigscreen-epg]

Waywardnl commented 8 years ago

Added the snippet as text

I cannot add a text file

hikavdh commented 8 years ago

It looks like you're using Windows. The default encoding of the text is utf-8, which is not the default under Windows, but it is set in the header of the xml file. I have only tested a limited number of things under Windows. I have set all direct screen output under Windows to the ancient native Windows codeset. These are tricky things. I cannot really test anything with your software, so what we can do is very limited. Do, however, check the header of your xmltv file!
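Checking the header can be sketched in a few lines of Python. This is only an illustration, not part of tvgrabnlpy: the helper names `declared_encoding` and `is_valid_utf8` are made up for this example. It reads the encoding named in the XML declaration and tests whether the bytes really decode as utf-8.

```python
import codecs
import re
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration, e.g.
# <?xml version="1.0" encoding="UTF-8"?>
_DECL = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(head: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    # Skip a UTF-8 byte order mark, which some Windows tools prepend.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    # Leading whitespace is tolerated here, although strict parsers reject it.
    match = _DECL.match(head.lstrip())
    return match.group(1).decode("ascii") if match else None


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

If the declaration says utf-8 but `is_valid_utf8` returns False for the whole file, the header and the actual bytes disagree, and a strict parser is likely to complain.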

hikavdh commented 8 years ago

I've been thinking some more. The output you show displays é correctly, so I very much doubt that's the problem. It seems to understand utf-8. I could create an option to use the ancient Windows codeset, but I doubt it will help. Your output gives the position of the 'illegal' character. I suspect it is the one shown as a '?'. Please verify this using the given line and position number. It very probably comes from one of the source websites. tvgids.tv especially is very sloppy and regularly gives errors due to invalid encoding. You can try to disable the tvgids.tv source, e.g. add to the [Configuration] section of your config file:

disable_source = 1

Also check whether your program has an option to ignore such invalid characters.

Hika
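To verify which character sits at the reported line and position, a scan like the following may help. It is a rough sketch, not part of tvgrabnlpy, and the function names are hypothetical. It flags every character outside the XML 1.0 "Char" production, and it decodes with `surrogateescape` so that bytes that are not valid utf-8 are flagged as well. Columns are counted in characters, which may differ from the position a given parser reports.

```python
import re

# Characters allowed by the XML 1.0 "Char" production; anything else is illegal.
_ILLEGAL = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_chars(text):
    """Yield (line, column, codepoint) for each character XML 1.0 forbids."""
    # Split on "\n" only: str.splitlines() would also split on control
    # characters such as \x0b and \x0c, hiding exactly what we look for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in _ILLEGAL.finditer(line):
            yield line_no, match.start() + 1, ord(match.group())


def scan_bytes(data):
    """Scan raw file content; undecodable bytes show up as U+DC80..U+DCFF."""
    text = data.decode("utf-8", errors="surrogateescape")
    return list(find_illegal_chars(text))
```

For a file on disk, `scan_bytes(open(path, "rb").read())` returns the list of hits; an empty list means no character was found that XML 1.0 forbids.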

Waywardnl commented 8 years ago

Hello,

I have been thinking too, and I converted all the strange characters with PHP because I just want to catch them all!

Here is the code:

<?php

$bron="/root/.xmltv/tv.xml";
$doel="/root/.xmltv/tv_checked.xml";

#
# Open file to read
#
$handle = fopen($bron, "r") or die("Couldn't get handle");

#
# Open file to write
#
# First create a new one
#
$fp = @fopen($doel, "w");

$schrijf = @fwrite ($fp, " ");

fclose($fp);

$fp = @fopen($doel, "a");

if ($handle) {
    while (!feof($handle)) {
        $buffer = fgets($handle, 4096);
        // Process buffer here..

    ## Change the characters
    #
        $buffer=str_replace("ü", "u", $buffer);
            $buffer = str_replace("&#8230;", ".", $buffer);
            $buffer=str_replace("ú", "u", $buffer);
            $buffer=str_replace("ù", "u", $buffer);
            $buffer=str_replace("û", "u", $buffer);
            $buffer=str_replace("ï", "i", $buffer);
            $buffer=str_replace("í", "i", $buffer);
            $buffer=str_replace("ì", "i", $buffer);
            $buffer=str_replace("î", "i", $buffer);
            $buffer=str_replace("ë", "e", $buffer);
            $buffer=str_replace("é", "e", $buffer);
            $buffer=str_replace("è", "e", $buffer);
            $buffer=str_replace("ê", "e", $buffer);
            $buffer=str_replace("ö", "o", $buffer);
            $buffer=str_replace("ó", "o", $buffer);
            $buffer=str_replace("ò", "o", $buffer);
            $buffer=str_replace("ô", "o", $buffer);
            $buffer=str_replace("ø", "o", $buffer);
            $buffer=str_replace("ÿ", "y", $buffer);
            $buffer=str_replace("ý", "y", $buffer);
            $buffer=str_replace("Ü", "U", $buffer);
            $buffer=str_replace("Ú", "U", $buffer);
            $buffer=str_replace("Ù", "U", $buffer);
            $buffer=str_replace("Û", "U", $buffer);
            $buffer=str_replace("Ï", "I", $buffer);
            $buffer=str_replace("Í", "I", $buffer);
            $buffer=str_replace("Ì", "I", $buffer);
            $buffer=str_replace("Î", "I", $buffer);
            $buffer=str_replace("Ë", "E", $buffer);
            $buffer=str_replace("É", "E", $buffer);
            $buffer=str_replace("È", "E", $buffer);
            $buffer=str_replace("Ê", "E", $buffer);
            $buffer=str_replace("Ý", "Y", $buffer);
            $buffer=str_replace("Ö", "O", $buffer);
            $buffer=str_replace("Ó", "O", $buffer);
            $buffer=str_replace("Ò", "O", $buffer);
            $buffer=str_replace("Ô", "O", $buffer);
            $buffer=str_replace("Ø", "O", $buffer);
            $buffer=str_replace("é", "e", $buffer);
            $buffer=str_replace("â€", "`", $buffer);
            $buffer=str_replace("ï", "i", $buffer);
            $buffer=str_replace("Â ", " ", $buffer);
            $buffer=str_replace("ë", "ee", $buffer);
            $buffer=str_replace("&#8217;", "_", $buffer);
            $buffer=str_replace("´", "_", $buffer);
            $buffer=str_replace("ú", "u", $buffer);
            $buffer=str_replace("è", "e", $buffer);
            $buffer=str_replace("&lt;", "", $buffer);
            $buffer=str_replace("`¦", "", $buffer);
            $buffer=str_replace("ç", "c", $buffer);
            $buffer=str_replace("£", "", $buffer);
            $buffer=str_replace("`œ", "`", $buffer);
            $buffer=str_replace("€", "EURO", $buffer);
            $buffer=str_replace("`™", "", $buffer);
            $buffer=str_replace("â‚", "euro", $buffer);
            $buffer=str_replace("û", "u", $buffer);
            $buffer=str_replace("à ", "a ", $buffer);
            $buffer=str_replace("Ã", "i", $buffer);
            $buffer=str_replace("í‰é", "1", $buffer);
            $buffer=str_replace("%", "procent", $buffer);
    $buffer=str_replace("i¢", "a", $buffer);
    $buffer=str_replace("&amp_", "en", $buffer);
    $buffer=str_replace("i³", "o", $buffer);
    $buffer=str_replace("¤", "", $buffer);
    $buffer=str_replace("i¤", "a", $buffer);
    $buffer=str_replace("i¼", "u", $buffer);
    $buffer=str_replace("i¶", "o", $buffer);
    $buffer=str_replace("iŸ", "s", $buffer);
    $buffer=str_replace("iª", "e", $buffer);
    $schrijf = @fwrite ($fp, $buffer);

    echo ".";

}
fclose($handle);

}

#
# Close the file
#
fclose($fp);

?>

Now the file works
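For comparison, the accent-stripping part of the script above can be sketched in Python with Unicode decomposition instead of one replacement per letter. This is only an illustration with a hypothetical helper name, `strip_accents`. It does not cover everything the PHP list does: letters without a decomposition, such as ø, and symbols such as € pass through unchanged.

```python
import unicodedata


def strip_accents(text):
    """Decompose each character (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Like the PHP version, this throws away information, so it is a workaround for a parser that rejects the file rather than a fix for the underlying encoding problem.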

hikavdh commented 8 years ago

As I said, I don't think the accents are the problem, and I cannot remove them because others want them. You are asking to go to plain ASCII. You kill more than needed, but... I notice some & characters and other weird ones in your list. These might be the cause and come, as I said, from bad encoding at the source. If you have the capacity, try recoding from utf-8 to iso-8859-15 (also called latin9) or windows-1252 (also called cp1252 or Western Europe) and report on it. The first is the general predecessor and the second is the ancient Windows default, which is still used.

I still say: look for a setting to ignore unrecognized characters. If the program is at all decently made, it will have one. But then, that might be asking too much of Microsoft!
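The suggested recoding could be tried with a short Python sketch such as the one below. The function name `recode_xml` is an assumption for this example and is not an option of tvgrabnlpy. It decodes the file as utf-8, rewrites the encoding named in the XML declaration so the header matches the new bytes, and encodes to the target codeset. Characters the target cannot represent are written as numeric character references instead of being dropped.

```python
import re

# Captures the text around the encoding value of the XML declaration.
_ENCODING_ATTR = re.compile(r'(<\?xml[^>]*?encoding\s*=\s*["\'])[^"\']+(["\'])')


def recode_xml(data, target="windows-1252"):
    """Recode UTF-8 XML bytes to `target` and update the declared encoding."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    # Only the first match (the declaration) is rewritten. If the file has no
    # encoding declaration, nothing is added here.
    text = _ENCODING_ATTR.sub(
        lambda m: m.group(1) + target + m.group(2), text, count=1
    )
    # Unrepresentable characters become references such as &#345;
    return text.encode(target, errors="xmlcharrefreplace")
```

Passing `target="iso-8859-15"` gives the latin9 variant. If decoding fails with a `UnicodeDecodeError`, the input was not valid utf-8 to begin with, which would point at bad encoding from a source site.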

Waywardnl commented 8 years ago

Well, the EPG XML-to-Microsoft-MXF parser does have a utf8 setting, but it does nothing. Yes, it is shotgun troubleshooting, but we have to begin somewhere.

When I find the time and the drive to translate it, I will surely post it here, because I think your scraper is awesome!

hikavdh commented 8 years ago

Thanks! In the meantime I will look into an option to set a different encoding for the output. But know that utf-8 (and unicode) are the future. All other encodings will slowly fade out in the coming years. Only Microsoft is always slow to accept something they don't own and control! ;-(

Waywardnl commented 8 years ago

;-) Hear, hear, but I like the WMC backend of Windows Media Center ;-) And it works well with Kodi. I was very amused to read that Microsoft has to be compatible with ODF. But that, my children, is a whole other story ;-)

hikavdh commented 8 years ago

Try: https://github.com/tvgrabbers/tvgrabnlpy/releases/tag/beta-2.2.2-p20151004

hikavdh commented 8 years ago

Did the new option --output-windows-codeset, which uses cp1252 (windows-1252) for the output file, help?

Waywardnl commented 8 years ago

Good morning,

Back from vacation.

I wanted to try the new option with version 2.2.2, but it seems to hang at NPO:

Now fetching details for 68 programs on ITV 1(xmltvid=itv-1) (channel 15 of 92) for 14 days.

Now fetching 17 channels from npo.nl (day 4 of 14).

Waywardnl commented 8 years ago

Spoke too soon, now it goes through