scripting / feedBase

A project to get feeds into a base.
MIT License
33 stars 3 forks source link

Issues with Unicode Titles #22

Open thejeshgn opened 6 years ago

thejeshgn commented 6 years ago

My subscription list is here http://feedbase.io/?username=thej

It has couple of feeds with title in Kannada. Feedbase prints ???

Example

   <outline title="&#xCB5;&#xCB0;&#xCCD;&#xCA4;&#xCAE;&#xCBE;&#xCA8; - Vartamaana" text="&#xCB5;&#xCB0;&#xCCD;&#xCA4;&#xCAE;&#xCBE;&#xCA8; - Vartamaana" description="&#xCB5;&#xCB0;&#xCCD;&#xCA4;&#xCAE;&#xCBE;&#xCA8; - Vartamaana" type="rss" xmlUrl="http://www.vartamaana.com/feed/" htmlUrl="http://www.vartamaana.com"/>

when imported into feedbase has title ???

where as in a standard html it would print ok.

HTML

<span>&#xCB5;&#xCB0;&#xCCD;&#xCA4;&#xCAE;&#xCBE;&#xCA8; - Vartamaana</span>

RENDERING

ವರ್ತಮಾನ - Vartamaana

This feedlist opml is an export from liferea.

scripting commented 6 years ago

Thanks for the report.

Here's a little background.

It doesn't matter what the title is in the input OPML file, we don't even store it. What matters is what we get back when we read the feed. For that we use the node.js package feedParser. I believe it is unicode-aware.

I verified that the question marks are in the database (though seeing it in the feed list is basically verification of that, it's where the data comes from).

I'm going to have to do some more investigating. I have a standalone app that just calls feedParser for debugging situations like this.

I suspect it is an encoding issue that came up in River5 recently, which uses the same feed reading package.

scripting commented 6 years ago

OK, I think I have what may be a fix in feedBase.

Here's the test.

http://feedbase.io/readfeed?feedurl=http://kannadascience.blogspot.com/feeds/posts/default

Is that the correct title?

Dave

thejeshgn commented 6 years ago

Yes. It's correct.

scripting commented 6 years ago

Excellent. Over the next few hours feedBase will re-read the feeds, and with any luck the names on your page should become correct. If you can, let me know either way, if it works or doesn't.

Thanks for your help debugging this. I was getting a lot of feedback on this problem. ;-)

thejeshgn commented 6 years ago

I can still see question marks on my user page

question

scripting commented 6 years ago

Okay -- I've dug in some more and found that I only solved part of the problem, when I added the conversion of foreign charsets, which was quite tricky. Now some more trickiness is to come.

I've subscribed to the feed myself.

http://rashmidiaries.blogspot.com/feeds/posts/default

Then I hooked up the /ping call in feedBase to the exact mechanism that maintains the database, that reads each feed in turn, and updates the title, html link, description.

http://feedbase.io/ping?feedurl=http://rashmidiaries.blogspot.com/feeds/posts/default

As we saw earlier, it's correctly reading the feed over the net. However when I add it to the database, this is what I see.

image

There are the question marks.

I guess I assumed that all text in MySQL would be UTF-8. But apparently not so.

I'm going to link to this from Scripting News, asking for advice.

thejeshgn commented 6 years ago

As far as I know (From my wordpress experience) while creating database you can set the characterset and collate

CREATE DATABASE feedbase_db CHARACTER SET utf8 COLLATE utf8_general_ci;

and then while creating table you can use

CREATE TABLE feeds (

) ENGINE=MyISAM DEFAULT CHARSET=utf8

This should solve both storing and sorting of Varchars.

In our case we can alter the table

ALTER TABLE feeds MODIFY title VARCHAR(250) CHARACTER SET utf8 COLLATE utf8_general_ci;

"If you use ALTER TABLE to convert a column from one character set to another, MySQL attempts to map the data values, but if the character sets are incompatible, there may be data loss."

scripting commented 6 years ago

In another thread I'm asking for advice on how to backup the database. I don't want to do anything radical that could lose data w/o a good backup.

On the other hand I think mainly we would lose data that we don't actually have yet.

So my next thing will be to get a good backup.

jystervinou commented 6 years ago

What he, and they on the other thread, said :)

=> Sequel Pro (or any graphical client) will really show you the structure of the database (the encoding used etc…), will make it easy to import/export data, browse data etc… Very useful when you’re new to MySQL.

JY

Le 3 avr. 2018 à 19:53, Dave Winer notifications@github.com a écrit :

In another thread I'm asking for advice on how to backup the database. I don't want to do anything radical that could lose data w/o a good backup.

On the other hand I think mainly we would lose data that we don't actually have yet.

So my next thing will be to get a good backup.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

timbotron commented 6 years ago

FWIW, I like Sequel Pro for a lot of things, but not their export feature. It always makes a new INSERT statement for every row, which can take forever to import if you have a lot of data. mysqldump does it a lot more intelligently, with INSERT statements that insert thousands of rows.

scripting commented 6 years ago

I'm working on a Node utility that backs up the database to two JSON files, one for feeds and the other for subscriptions.

The feeds file is done, it's 10.7MB. I figure that's not a problem for a Node app even if it's 10 times that size. Much more than that and I guess it might be a problem. That's the way I'm going for now.

Backing it up to a Dropbox folder and a chronologic folder structure, so I will have redundant snapshots.

On Wed, Apr 4, 2018 at 11:33 AM Tim Habersack notifications@github.com wrote:

FWIW, I like Sequel Pro for a lot of things, but not their export feature. It always makes a new INSERT statement for every row, which can take forever to import if you have a lot of data. mysqldump does it a lot more intelligently, with INSERT statements that insert thousands of rows.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-378643674, or mute the thread https://github.com/notifications/unsubscribe-auth/ABm9O5dwQIwshtoRhzgsjIuHPYxMB5Wpks5tlOfBgaJpZM4S_7gn .

scripting commented 6 years ago

I have backup and restore working, so I have the same installation on my Mac desktop as I have on the server. So I can do the experiments here.

Here's the source for the backup and restore apps.

https://github.com/scripting/feedBase/blob/master/backup/

Dave

scripting commented 6 years ago

BTW, one of the fallouts of what I'm doing right now, is that it will be possible, without any changes to the source, to run your own feedBase. All you'll have to do is change values in config.json.

scripting commented 6 years ago

I just did the ALTER command on the feeds table and tried reading in

http://rashmidiaries.blogspot.com/feeds/posts/default

But got an error when I tried to update the table.

The Mac version of MySQL works somewhat differently than the Unix version. Here I get an error message, on Unix it just fails without a message.

This is the SQL command it runs:

replace into feeds (feedUrl, title, htmlUrl, description, whenUpdated, countSubs, ctSecs, code, ctErrors, ctConsecutiveErrors, ctChecks, whenLastError) values ('http://rashmidiaries.blogspot.com/feeds/posts/default', 'ರಶ್ಮಿ ಅಭಯ ಸಿಂಹ', 'http://rashmidiaries.blogspot.com/', '', '2018-04-05 15:51:44', 2, 0.407, 200, 0, 0, 13, NULL);

And the message:

ER_TRUNCATED_WRONG_VALUE_FOR_FIELD: Incorrect string value: '\xE0\xB2\xB0\xE0\xB2\xB6...' for column 'title' at row 1

The Alter command is --

ALTER TABLE feeds MODIFY title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci;

I may just try recreating the database, since I can load this database entirely from the JSON.

jystervinou commented 6 years ago

Where are you running/typing these commands?

Le jeu. 5 avr. 2018 à 21:57, Dave Winer notifications@github.com a écrit :

I just did the ALTER command on the feeds table and tried reading in

http://rashmidiaries.blogspot.com/feeds/posts/default

But got an error when I tried to update the table.

The Mac version of MySQL works somewhat differently than the Unix version. Here I get an error message, on Unix it just fails without a message.

This is the SQL command it runs:

replace into feeds (feedUrl, title, htmlUrl, description, whenUpdated, countSubs, ctSecs, code, ctErrors, ctConsecutiveErrors, ctChecks, whenLastError) values (' http://rashmidiaries.blogspot.com/feeds/posts/default', 'ರಶ್ಮಿ ಅಭಯ ಸಿಂಹ', 'http://rashmidiaries.blogspot.com/', '', '2018-04-05 15:51:44', 2, 0.407, 200, 0, 0, 13, NULL);

And the message:

ER_TRUNCATED_WRONG_VALUE_FOR_FIELD: Incorrect string value: '\xE0\xB2\xB0\xE0\xB2\xB6...' for column 'title' at row 1

The Alter command is --

ALTER TABLE feeds MODIFY title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci;

I may just try recreating the database, since I can load this database entirely from the JSON.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-379057935, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIFBAWnihz3VpM5TDz8Tjl5bR0-_-Y1ks5tlnc4gaJpZM4S_7gn .

scripting commented 6 years ago
  1. It's all happening on my Mac desktop running the official MySQL.

  2. The ALTER command is running at the mysql command line.

  3. The "replace into feeds..." command is being run by feedBase running on the Mac in response to a /ping command.

jystervinou commented 6 years ago

For 3, can you check that the charset option for the connection is correct?

https://github.com/mysqljs/mysql/blob/master/Readme.md#connection-options

JY

Le jeu. 5 avr. 2018 à 22:06, Dave Winer notifications@github.com a écrit :

1.

It's all happening on my Mac desktop running the official MySQL. 2.

The ALTER command is running at the mysql command line. 3.

The "replace into feeds..." command is being run by feedBase running on the Mac in response to a /ping command.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-379060306, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIFBBJo-18Ye3tssaNXAJHdzpYDUtvZks5tlnlFgaJpZM4S_7gn .

scripting commented 6 years ago

I did this --

"database": { "host": "localhost", "port": 3306, "user": "root", "password": "xxx", "charset": "utf8mb4", "connectionLimit": 10, "database": "feedbase", "raise_on_warnings": true }

Did the ping, same problem as before.

I have to take a break now, wil be back in about an hour.

scripting commented 6 years ago

I also have MySQL Workbench running on this machine, btw. I don't know how to use it, but I fumble around trying stuff out. Eventually I'll figure it out. ;-)

image

jystervinou commented 6 years ago

If you used utf8 for the alter, use utf8 for the connection charset too.

Or use utf8mb4 everywhere. (In the alter and the connection )

JY

Le jeu. 5 avr. 2018 à 22:23, Dave Winer notifications@github.com a écrit :

I did this --

"database": { "host": "localhost", "port": 3306, "user": "root", "password": "xxx", "charset": "utf8mb4", "connectionLimit": 10, "database": "feedbase", "raise_on_warnings": true }

Did the ping, same problem as before.

I have to take a break now, wil be back in about an hour.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-379064682, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIFBDPqJ-v-UmV3oV0ZIbTC69bI9A0Yks5tln0ngaJpZM4S_7gn .

jystervinou commented 6 years ago

JY

Le jeu. 5 avr. 2018 à 22:24, Dave Winer notifications@github.com a écrit :

I also have MySQL Workbench running on this machine, btw. I don't know how to use it, but I fumble around trying stuff out. Eventually I'll figure it out. ;-)

[image: image] https://user-images.githubusercontent.com/1686843/38389905-c3046946-38ed-11e8-9451-284bd5a0f337.png

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-379065015, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIFBNbydrYbjco5ckT6VDInxqZZYwvoks5tln1rgaJpZM4S_7gn .

jystervinou commented 6 years ago

Use also utf8mb4_general_ci for the collation. (Note the mb4 part)

CHARACTER SET utf8mb4 COLLATE utf8mb4 _general_ci

JY

scripting commented 6 years ago

Let's review the commands I use to create the database and the tables. I have them documented. Let's get them really dialed in. Then I'll re-create the database, on my Mac, re-import the data, do some pings. Once it works, I'll do the same on the main server, and hopefully we'll have something we can build on for years to come.

Here are the docs for the tables.

https://github.com/scripting/feedBase/blob/master/docs/database.md#sql-commands-to-create-the-tables

And here are the docs for config.json:

https://github.com/scripting/feedBase/blob/master/docs/config.md

I already know that I have to add a charset value in the database sub-object.

scripting commented 6 years ago

I'm back at work this morning.

I did a search on encodings and came up with this article.

https://mathiasbynens.be/notes/mysql-utf8mb4

It seems to have it covered for updating an existing database. I'm still looking for an article that covers starting a new database with new tables with utf8mb4 encoding.

BTW, no question utf8mb4 is the proper encoding to use.

jystervinou commented 6 years ago

You can use a default charset and collation for a table:

CREATE TABLE t1
(
    [... columns definitions...]
) DEFAULT CHARACTER SET utf8mb' COLLATE utf8mb4_general_ci;

https://dev.mysql.com/doc/refman/5.7/en/charset-examples.html

Not sure about the right collation to use, because people turn to utf8_unicode_ci or utf8_general_ci (when using the utf8 charset) :

https://stackoverflow.com/a/367725

Someone mentions:

Update: For newer versions, recommend utf8mb4 and utf8mb4_unicode_520_ci. These give you the rest of Chinese, plus improved collation

https://stackoverflow.com/questions/367711/what-is-the-best-collation-to-use-for-mysql-with-php/367725#comment59285198_367725

jystervinou commented 6 years ago

See the example 4 in the previous link on MySQL.com, you can also set a default charset/collation for the database (instead of repeating it at the tables level)

CREATE DATABASE d1
    DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
    c1 CHAR(10)
);

But i prefer to be explicit at the table level, as it's also what mysqldump does when it dumps the CREATE statements.

jystervinou commented 6 years ago

More details here:

https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

scripting commented 6 years ago

Thank you JY! ;-)

But I'm confused. What is latin2? I'm using utf8mb4, right?

jystervinou commented 6 years ago

Yes yes, sorry. i copy pasted their example for the DEFAULT CHARACTER SET syntax, but do use utf8mb4 when seeing another charset like latin1 etc...

scripting commented 6 years ago

BTW, another question.

When I recreate the tables, I want to use TEXT type instead of VARCHAR.

I don't like having to specify the size limit.

It seems I can't use TEXT for a key field.

jystervinou commented 6 years ago

They have limits too, see here (there is also mediumtext etc..) :

https://stackoverflow.com/a/6766854

Note the part about utf8mb4 using a max of 4 bytes per character.

scripting commented 6 years ago

I've run these three commands. The first worked, the second two got errors.

ALTER DATABASE feedbase CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci; ALTER TABLE feeds CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; ALTER TABLE subscriptions CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

This is the error (same for both):

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

jystervinou commented 6 years ago

The "longest" text column seems to be "description" but even for this one, TEXT may seem overkill. Or not :-)

Difference between VARCHAR and TEXT:

https://stackoverflow.com/a/2023513

scripting commented 6 years ago

Here are the commands used to create the tables.

create table subscriptions (username varchar (255), listname varchar (255), feedUrl varchar (512), whenupdated datetime, PRIMARY KEY (feedUrl, username));

create table feeds (feedUrl varchar (512), title varchar (255), htmlUrl varchar (512), description varchar (512), whenUpdated datetime, countSubs int, ctSecs float, code int, ctErrors INT DEFAULT 0, ctConsecutiveErrors INT DEFAULT 0, ctChecks INT DEFAULT 0, whenLastError DATETIME, PRIMARY KEY (feedUrl));

scripting commented 6 years ago

I don't have a good guess as to what happened. I thought it wouldn't change the size of any of the VARCHARs. But it seems that must be happening. Last I checked 512 < 767.

jystervinou commented 6 years ago

Can you first check if the database is using InnoDB? (in MySQL Workbench)

scripting commented 6 years ago

Where to look? Here's what the front panel says.

image

jystervinou commented 6 years ago

"Schemas", "sys" at the bottom left.

jystervinou commented 6 years ago

then "i" for info.

jystervinou commented 6 years ago

More context on the error, i had to deal with that recently:

https://stackoverflow.com/a/1814594

scripting commented 6 years ago

I don't know what "I" for info means.

scripting commented 6 years ago

image

jystervinou commented 6 years ago

just next to the "sys" database, hover with the mouse, you should see an "i" button

jystervinou commented 6 years ago

same at the Tables level.

scripting commented 6 years ago

image

jystervinou commented 6 years ago

yes, that's the first "Info" tab, you should see other tabs in this new window. Look at the "Tables" tab.

scripting commented 6 years ago

Bing

image

jystervinou commented 6 years ago

mmm, actually the sys schema is used by MySQL

https://dev.mysql.com/doc/refman/5.7/en/sys-schema.html

so MySQL workbench is not currently browsing your own database.

scripting commented 6 years ago

Okay I read the context piece, and am utterly confused, which is understandable. There's a lot of history here to catch up on. I know how this works, I created Frontier. We had history too (though not as much).

The key in the feeds table is feedurl, which as it name implies is the URL of the feed.

How short can we make it? That's the question.

scripting commented 6 years ago

Any chance you could get on Skype? This might go a lot faster...