taw / magic-search-engine

Search engine for Magic cards
MIT License

Normalize CJK characters less aggressively #88

Closed fenhl closed 6 years ago

fenhl commented 6 years ago

My fork has an issue (fenhl/lore-seeker#1) with a card illustrated by 酩憲: it is grouped together with the 5 cards for which MTG JSON has no artist info. Mu Yanling's artist's name is also written in Hanzi and will have the same issue. Of course, this would be less of a problem if MTG JSON had complete artist info, but even then I still think it would be appropriate to not “normalize” CJK characters into a single underscore.

taw commented 6 years ago

There are different normalization rules for foreign names (where CJK character search should work as expected) and for everything else (where normalization is aggressive, because there are no CJK characters on any Magic cards).
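To illustrate the collision discussed in this thread, here is a minimal Ruby sketch. The helper names `aggressive_slug` and `unicode_slug` are hypothetical and are not taken from the project's code, and treating `???` as the placeholder for a missing artist is an assumption. The sketch only shows why stripping every non-ASCII character sends a Hanzi name to the same `_` key as a missing artist, and how keeping Unicode letters avoids that.

```ruby
# Hypothetical helpers for illustration only; not the project's actual code.

# Aggressive variant: every run of characters outside ASCII letters and
# digits collapses into a single underscore.
def aggressive_slug(name)
  name.downcase.gsub(/[^a-z0-9]+/, "_")
end

# Less aggressive variant: any Unicode letter (\p{L}) or number (\p{N})
# is kept, so CJK names survive; only other characters become underscores.
def unicode_slug(name)
  name.downcase.gsub(/[^\p{L}\p{N}]+/, "_")
end

# "???" stands in for a missing artist here (an assumption).
names = ["酩憲", "???", "Rebecca Guay"]

names.each do |name|
  puts "#{name}: aggressive=#{aggressive_slug(name)} unicode=#{unicode_slug(name)}"
end
```

Under the aggressive variant, both `酩憲` and `???` map to `_`, so they end up grouped together. With the Unicode-aware character class, only the placeholder maps to `_` and the Hanzi name keeps its own key.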

I'm surprised Mu Yanling seems to break the rules, and I sort of wonder if Gatherer won't correct the issue by giving official English spelling anyway.

https://mtg.wtf/artist/_ is a separate issue, which the indexer should fix. I vaguely recall reporting it as an mtgjson bug ages ago.

taw commented 6 years ago

Fixed the "??? drew 5 cards." issue at least. Let's wait for Mu Yanling before we address it, as right now nothing in the database cares either way. I'll probably remove CJK stripping just as you propose.

taw commented 6 years ago

I speculatively did this: https://github.com/taw/magic-search-engine/commit/6035924da1179c057548c72e0b727fa5e621ab14

Does it fix the problem?

fenhl commented 6 years ago

You'll want to update the regex in setup_artists! as well.

taw commented 6 years ago

Does it work now?

fenhl commented 6 years ago

It does!

taw commented 6 years ago

Nice.