tom-james-watson / wikitrivia

Wikidata as a trivia card game.
https://wikitrivia.tomjwatson.com
MIT License

Translations #26

Open airon90 opened 2 years ago

airon90 commented 2 years ago

Can you support other languages? You can get the correct labels from Wikidata.

Mte90 commented 2 years ago

Yes, it would help a lot. Since you are already running queries, I guess a language picker that changes something in the endpoint the game uses would be enough.

tuukka commented 2 years ago

As all the game data is currently loaded from a single file at start, I think the best approach might be to provide language-specific versions of this file.
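Roughly, the client-side change could then be as small as picking the file by language. A minimal sketch, assuming per-language files such as `items-en.json` / `items-ro.json` (hypothetical names, not the current layout):

```typescript
// Minimal sketch: load the card file for the chosen language.
// The file naming scheme here is an assumption, not the app's current layout.
async function loadDeck(lang: string): Promise<unknown[]> {
  const res = await fetch(`/items-${lang}.json`);
  if (!res.ok) throw new Error(`No card file for language "${lang}"`);
  return res.json();
}
```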

Approach 0: Instead of having a language-specific file, fetch the data of the Wikidata item each time a card is shown to see if Wikidata (at the moment) contains the desired translations. I'm not sure which endpoints can be accessed directly by the game in the browser, but e.g. these would seem to work: https://www.wikidata.org/wiki/Special:EntityData/Q42.json and https://query.wikidata.org/bigdata/ldf?subject=wd:Q42
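For Approach 0, the per-card lookup in the browser could look roughly like this (a sketch only; the `qid` argument is assumed to be the card's Wikidata ID, e.g. "Q42"):

```typescript
// Approach 0 sketch: look up translations from Special:EntityData when a card
// is shown, falling back to the English fields if the language is missing.
async function fetchTranslations(qid: string, lang: string) {
  const res = await fetch(
    `https://www.wikidata.org/wiki/Special:EntityData/${qid}.json`
  );
  const entity = (await res.json()).entities[qid];
  return {
    label: entity.labels?.[lang]?.value as string | undefined,
    description: entity.descriptions?.[lang]?.value as string | undefined,
  };
}
```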

Approach 1: For each card (Wikidata item) in the original data file, replace the original label, description, and Wikipedia article title (in English) with ones in the desired language from the same Wikidata item. However, they might not be available, or they might be unsuitable (contain the answer or have a mistake).
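Approach 1 could be a one-off build step over the existing file. A rough sketch; the card field names (`id`, `label`, `description`, `wikipedia_title`) are assumptions, not the actual wikitrivia schema:

```typescript
// Approach 1 sketch: rewrite the English fields of each card with translations
// from the same Wikidata item. Field names are assumed; cards without a full
// translation are dropped rather than shown in English.
interface Card {
  id: string; // Wikidata QID, e.g. "Q42"
  label: string;
  description: string;
  wikipedia_title: string;
}

async function translateCards(cards: Card[], lang: string): Promise<Card[]> {
  const out: Card[] = [];
  for (const card of cards) {
    const res = await fetch(
      `https://www.wikidata.org/wiki/Special:EntityData/${card.id}.json`
    );
    const entity = (await res.json()).entities[card.id];
    const label = entity.labels?.[lang]?.value;
    const description = entity.descriptions?.[lang]?.value;
    const title = entity.sitelinks?.[`${lang}wiki`]?.title;
    if (label && description && title) {
      out.push({ ...card, label, description, wikipedia_title: title });
    }
  }
  return out;
}
```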

Approach 2: Generate a new set of cards appropriate in the desired language e.g. by tweaking https://github.com/tom-james-watson/wikitrivia-generator.

EDIT: Approach 3: Generate a new set of cards dynamically from the frontend by calling a suitable SPARQL endpoint such as QLever: https://qlever.cs.uni-freiburg.de/wikidata/

nicolaes commented 2 years ago

I like Approach 2 the most. Approaches 0 and 1 are for me:

I'll try Approach 2 in Romanian to see how it goes.

Edit: I take back liking Approach 2 after seeing the 73GB data source. I will still give it a try, but don't have high hopes.

tuukka commented 2 years ago

@nicolaes :+1: Perhaps together we can find the people needed to make this happen. To make Approach 2 easier, I found some initial discussion on reimplementing the generator based on queries against a SPARQL endpoint. In my experience, the official SPARQL endpoint does not have the performance needed, but QLever (and/or Virtuoso) might be able to answer all the queries we need. Here's a quick test that finds about 9000 results that might be suitable for Romanian cards: https://qlever.cs.uni-freiburg.de/wikidata/30kMrq?exec=true

See also: tom-james-watson/wikitrivia-generator#6 and tom-james-watson/wikitrivia-generator#8

nicolaes commented 2 years ago

@tuukka Thanks for the idea. I appreciate the effort you put into the Romanian version. The quick test with 9000 entries is very relevant; the current English database has 10k entries.

I don't know SPARQL, so I am playing around with the link you provided. My plan is to find a reasonably fast query that returns at least 5000 results, then wire it up with the wikitrivia app.

nicolaes commented 2 years ago

I gave QLever a few tries, then dropped it. I ran a query with all the year types (created, discovered, invented, born, etc.) and the backend connection was lost, probably because of a lack of optimization. Here is the code: https://qlever.cs.uni-freiburg.de/wikidata/aFFkcp

I made progress on processing the raw data source and now have ~1000 usable entries for Romanian. I'm not yet sure if Approaches 0 and 1 are viable, but it might be worth trying them out. My steps to get the Romanian entities were:

Since I don't have many cards, I will account for the scenario where there are no relevant cards left to show. Then I will put this live and see if Romanians actually use it.

tuukka commented 2 years ago

@nicolaes I hadn't thought of the possibility of creating a set of cards dynamically based on a SPARQL query. I've added it as "Approach 3" in my original list. At a glance, an advantage would be that the data would update automatically, but a disadvantage would be that two games couldn't be guaranteed to be played with the same set of cards.

I have reported the QLever crash to its developers - I hope it's something they can easily fix as QLever is very performant in general.

Do you know why you got just 10% of the number of cards compared to English? For example, is it because the Romanian labels are missing, the filter words match more often, or the view counts are lower?

tuukka commented 2 years ago

Update: here's a query for QLever that returns all suitable Wikidata items and their required attributes, sorted by sitelink count (page views are not available in queries). You can change "en" to any other language code: https://qlever.cs.uni-freiburg.de/wikidata/OycBUK
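For anyone who wants to script this rather than use the web UI, the general shape would be something like the sketch below. This is not the exact query behind the link above; the property choices (humans with a date of birth), the endpoint URL, and the sitelink-based ordering are assumptions to illustrate the language-parameterised idea:

```typescript
// Illustrative sketch: build a language-parameterised SPARQL query and send it
// to a public QLever Wikidata endpoint (URL assumed). Not the actual
// wikitrivia-generator query.
const QLEVER_WIKIDATA = "https://qlever.cs.uni-freiburg.de/api/wikidata";

function buildQuery(lang: string, limit = 10000): string {
  return `
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?item ?label ?dob ?sitelinks WHERE {
      ?item wdt:P31 wd:Q5 ;              # instance of: human (example property choice)
            wdt:P569 ?dob ;              # date of birth -> the card's year
            wikibase:sitelinks ?sitelinks ;
            rdfs:label ?label .
      FILTER(LANG(?label) = "${lang}")   # swap "en" for "ro", etc.
    }
    ORDER BY DESC(?sitelinks)
    LIMIT ${limit}`;
}

async function runQuery(lang: string) {
  const res = await fetch(
    `${QLEVER_WIKIDATA}?query=${encodeURIComponent(buildQuery(lang))}`,
    { headers: { Accept: "application/sparql-results+json" } }
  );
  return (await res.json()).results.bindings;
}
```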

tom-james-watson commented 2 years ago

Some really interesting discussion here!

@nicolaes - yeah, unfortunately the wikitrivia-generator process as it stands is slow. I think SPARQL is definitely the future. The example @tuukka has worked on also shows how easy the SPARQL approach would make it to internationalize.

The discussion of how to work out the details of the SPARQL approach should be kept to https://github.com/tom-james-watson/wikitrivia-generator/issues/6.

nicolaes commented 2 years ago

@tuukka sorry for the late reply, my notifications got messed up. I appreciate the time you invested in the SPARQL query. I managed to download the 10k sample you prepared without any QLever issues.

About the low count of Romanian entities: it's because not all pages are translated, and I didn't adjust the view count thresholds correctly (e.g. I reduced them by 40x compared to English, while there are 60x fewer Romanian speakers).

PS: the top hit from the SPARQL query in Romanian is the wiki page for Russia 🤔