q2a / question2answer

Question2Answer is a free and open source platform for Q&A sites, running on PHP/MySQL.
http://www.question2answer.org/
GNU General Public License v3.0
1.64k stars 629 forks source link

Bug when favoriting a tag: "teisė" and "teisę" are interpreted as the same tag #47

Closed q2apro closed 9 years ago

q2apro commented 10 years ago

While developping another plugin that deals with tags I stumbled over the following bug:

When a user favorites the tag "teisė" or the tag "teisę" both are favorited as entity "teisę".

Already in the frontend favorite form you can see that both tags have the same T id which is for the example favoriteT1900_0. However, teisė should have 11294 according to the database.

For seeing the bug: http://www.klaustukai.lt/tag/teis%C4%97 http://www.klaustukai.lt/tag/teis%C4%99

The problem lies in qa-db-selects.php with the mysql query:

case QA_ENTITY_TAG:
            $selectspec['source'].=' AND entityid=(SELECT wordid FROM ^words WHERE word=$ LIMIT 1)';
            break;

Querying my database with SELECT wordid FROM qa_words WHERE word="teisė" LIMIT 1 incorrectly returns wordid=1900 instead of wordid=11294. Obviously the special character ė is interpreted as ę.

Any way to fix this? Kai

q2apro commented 10 years ago

I found something about Collation and some explanations here / mysql bug from 2010. Maybe this is the problem?!

The solution for the guy over there was: "I solved my problem with creating a lowercase column where I can do a utf8_bin collation search."


PS: I tried to change the collation of table qa_words to utf8_lithuanian_ci (where both e letters exist), however, q2a just throws the error message "A Question2Answer database query failed when generating this page. A full description of the failure is available in the web server's error log file."

q2apro commented 10 years ago

I found a solution, using the query: SELECT wordid FROM qa_words WHERE word= _utf8 "teisė" COLLATE utf8_bin LIMIT 1

which gives me the correct wordid 11294.

I tried to implement this in qa-db-selects.php but it seems not to work (I still get the wrong id in the favorite form / favorite button):

Instead: $selectspec['source'].=' AND entityid=(SELECT wordid FROM ^words WHERE word=$ LIMIT 1)';

I used: $selectspec['source'].=' AND entityid=(SELECT wordid FROM ^words WHERE word=_utf8 $ LIMIT 1)';

?

gidgreen commented 10 years ago

The simple solution is to remove one of those rows, since only one should be there if both of them match the query. I do not know how they both got in there since Q2A is careful in function qa_db_word_mapto_ids_add() to prevent this happening. It could be due to some kind of collation inconsistency in MySQL.

q2apro commented 10 years ago

Thanks for your answer. But is this really a consistent "simple" solution?

As far as I understand, qa_words holds all words from posted content, titles, tags. As soon as a new post arrives, I guess the qa_words are recalculated. Plus the new words are added to the qa_words table. So the false "teisę" would appear again.

I am still wondering why my manual mysql query works, but implemented in q2a does not?

q2apro commented 10 years ago

I tried to remove the word from the database, it does not work:

SQL query:
DELETE FROM `qa_words` WHERE `qa_words`.`wordid` =1900 LIMIT 1

MySQL said: Documentation
#1451 - Cannot delete or update a parent row: a foreign key constraint fails (`db348277_xy`.`qa_contentwords`, CONSTRAINT `qa_contentwords_ibfk_2` FOREIGN KEY (`wordid`) REFERENCES `qa_words` (`wordid`))
gidgreen commented 10 years ago

OK, so you have to update the rows in qa_contentwords to use the other wordid.

q2apro commented 10 years ago

I just deleted all wordids=1900 from qa_contentwords but still cannot delete the wordid=1900 from qa_words.

Nevertheless, is there no other solution? Why is my mysql query working and the q2a one does not?

svivian commented 10 years ago

@gidgreen Can you explain what you mean by your comment above ("only one should be there if both of them match the query")? Are you saying that teisė and teisę should not both be in the qa_words table?

I just tried tagging questions with each of the words, and both end up in there. And a query such as SELECT * FROM qa_words WHERE word='teisé' (with a third different e) shows both.

gidgreen commented 10 years ago

Yes, exactly, both words should not be in qa_words if they both match a query like SELECT * FROM qa_words WHERE word='teise' - the logic for adding words to qa_words is in function qa_db_word_mapto_ids_add() and it is meant to prevent adding words which are considered equal under the collation for that column.

q2apro commented 10 years ago

I still have no solution. @gidgreen To check if I understand you right. If I remove teisę from the database, then teisė will be returned?

But as I wrote already, I cannot hunt and remove each word that is coming in. Especially because it is not my mother tongue and I am not the administrator.

I think this is quite a big, still undiscovered problem of the q2a core (v1.6.3). Each favoriting could lead to wrong words.

For the newsletter plugin that I am developing currently, it seems that I have to create an extra table for all favorited tags of each users. Otherwise the wrong wordid is picked up and the wrong word showing up frontend.

And again, I don't understand why the manual mysql query SELECT wordid FROM qa_words WHERE word= _utf8 "teisė" COLLATE utf8_bin LIMIT 1 gives me the correct result, but not when I implement the same addition in the q2a core with $selectspec['source'].=' AND entityid=(SELECT wordid FROM ^words WHERE word=_utf8 $ LIMIT 1)';.

Really confusing to me. Greetings, Kai

PS: I deleted the false wordid of teisę from qa_contentwords, then qa_titlewords, then I deleted the word itself from qa_words. Favoriting the tag teisė now, shows me teise as favorited. Sorry to say that, but it feels like something is really wrong here, plus I cannot rely on the core favorite tags system :(

gidgreen commented 10 years ago

If you're seeing this difference why not just add COLLATE utf8_bin to the query in the Q2A core?

q2apro commented 10 years ago

I tried $selectspec['source'].=' AND entityid=(SELECT wordid FROM ^words WHERE word=_utf8 $ COLLATE utf8_bin LIMIT 1)'; as well, it was still the same problem. That's why I am also confused :-)

For my tags newsletter plugin I ended up creating a separate table qa_favtags that holds userid and tagwords. Now I am using this to store the correct UTF8-tagwords.

But well, as I mentioned, the favoriting bug still exists in core...

svivian commented 10 years ago

Just to note, I fixed the specific favoriting problem by adding BINARY to the column selected. IIRC it was the line inside qa_db_user_favorite_tags_selectspec:

'columns' => array('word', 'tagcount'),
// replaced with...
'columns' => array('BINARY word', 'tagcount'),

That's not been committed anywhere. I will be checking into the issue Gideon mentioned above so theoretically they shouldn't both get into the words table.

q2apro commented 10 years ago

@Scott: I have tried this, but when I favorite the one, the other one is favorited as well (frontend).

Backend/Frontend: The same wordid is still outputted for the favorite button.

pupi1985 commented 10 years ago

If the intention was to make sure only unique words were inserted in ^words table. Then why isn't the column unique? That would speed up many queries.

You mentioned the idea was to make sure papä and papá are stored only once. Wouldn't it be better to make sure the table itself won't allow those characters? (I'm not sure if this is possible by playing with charset and collation... most likely not but I need a second opinion here).

Changing to solving this by PHP, I still need to confirm exactly what should happen when saving papä and papá. Would it be correct to say that these words should be processed with qa_string_remove_accents function, removed duplicates and then checked its ("its" instead of "their" because they will turn into papa) presence in the DB?

Now if we do this, then we will be sure the word papá won't be on the ^words table. But this makes me wonder:

Absolutely unrelated: I've seen users (I guess they were chinese), that have tags in their respective languages. This would absolutely change the way they tag and will only allow normalized tags. This means new users will only be able to user normalized tags and old users will have to radically change their tagging system. IMO this is a huge no-go, at least for 1.7, if this was 2.0 then it could be acceptable. I'd say let's stick to binary searching over that table and making the column unique (unless there is a reason not to do so which I'm not getting).

Not sure if I shed some light or complicated things even more :)

q2apro commented 10 years ago

Just for your information, listing questions with the tag teisė was also not working correctly. I have 204 questions tagged exactly with teisė, but q2a lists only 20 questions.

I first tested the mysql query:

SELECT postid FROM qa_posttags WHERE wordid=(SELECT wordid FROM qa_words WHERE word="teisė" AND word="teisė" COLLATE utf8_bin LIMIT 1)

and it shows correctly 202 posts in phpmyadmin. I found the query in qa-db-selects.php with function qa_db_tag_recent_qs_selectspec(...)

I could solve this issue finally by using Scott's suggestion above, and changing qa-db-selects.php in function qa_db_tag_word_selectspec($tag) the line:

 'source' => '^words WHERE word=$',

to:

 'source' => '^words WHERE BINARY word=$',

@svivian Thanks mate!