Can I help? ^_^
Some ideas about the workflow:
pbzip2 (parallel bzip2) may be used to decompress/process the dump faster.
Linking #2410
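For illustration only, a minimal Rust sketch of that idea: pipe the dump through pbzip2 -dc so decompression uses multiple cores while the consumer reads plain XML from the pipe. The dump file name is just an example, and the page counting stands in for a real XML parser.

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // pbzip2 -d decompresses, -c writes to stdout; the file name is an example.
    let mut child = Command::new("pbzip2")
        .args(["-dc", "enwiki-latest-pages-articles-multistream.xml.bz2"])
        .stdout(Stdio::piped())
        .spawn()?;

    let reader = BufReader::new(child.stdout.take().expect("piped stdout"));
    let mut pages = 0usize;
    for line in reader.lines() {
        // A real implementation would feed these lines to an XML parser;
        // here we only count <page> openings to show the streaming works.
        if line?.contains("<page>") {
            pages += 1;
        }
    }
    child.wait()?;
    println!("saw {pages} pages");
    Ok(())
}
```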
Since the current workflow is in Python, a few libraries to consider:
Python is not the best tool for quickly and efficiently processing large bz2-ed dumps. There are faster tools/languages.
@biodranik How about we consider Golang? It has built-in support for concurrency, which can be utilized to parallelize the task of extracting summary text. Additionally, Golang has a relatively low memory footprint, making it well suited for handling large datasets on machines with limited resources. Apart from Golang, C++ can also be a good choice.
@biodranik Can I go ahead and start working on it in Golang?
Rust, Golang, or C++ are good tools for this task. However, I don't see how concurrency may help here. Can you please elaborate on how you would approach it?
@biodranik This project is listed in GSoC 2023. I can help with Rust; can I get more details about this?
Did you check the existing description and the repository? It already has all the code.
Rust, Golang, or C++ are good tools for this task. However, I don't see how concurrency may help here. Can you please elaborate on how you would approach it?
Sure @biodranik, I think concurrency could be utilized when setting up a mechanism to periodically update the extracted summary information to reflect changes in the wiki articles. This is one use case I can think of.
Hi, I'm interested in working on this as part of GSoC, I have experience with Python, C, and Rust, and some familiarity with Golang and C++.
From looking through the code, my understanding is that the current descriptions are processed into stripped and minified html, which is packaged up into the map files and rendered in the app.
- A separate script/tool to process the dump and extract articles only in required/supported languages.
It looks like the Wikipedia dumps are all in wikitext format, so this would involve stripping the markup/converting it to a simplified HTML summary? And this would be run as part of the maps_generator tool?
Ideally, wiki's simplified HTML should be generated/updated in parallel to the generator. The idea is to reuse old wiki data if necessary for any reason.
The current dependency is that the generator tool extracts wiki article IDs into a separate file, and only these IDs should be extracted by the wiki parser. This task can easily be done independently of the generator tool by using osmconvert or other similar command-line tools. And, obviously, the HTML files are loaded into the map data mwm files.
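Conceptually, the parser side of that contract could be as small as the following Rust sketch. The file name wiki_urls.txt and its one-identifier-per-line format are hypothetical here, not the generator's actual output format.

```rust
use std::collections::HashSet;
use std::fs;

/// Load the list of wiki identifiers produced by the generator into a set.
/// The file name and one-identifier-per-line format are assumptions for
/// illustration, not the real generator output.
fn load_wanted(path: &str) -> std::io::Result<HashSet<String>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .map(|l| l.trim().to_string())
        .filter(|l| !l.is_empty())
        .collect())
}

fn main() -> std::io::Result<()> {
    let wanted = load_wanted("wiki_urls.txt")?;
    // While streaming the dump, only pages whose title/Q-id is in `wanted`
    // would be converted to HTML; everything else is skipped cheaply.
    println!("{} articles requested by the generator", wanted.len());
    Ok(())
}
```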
Any ideas about improving the toolchain are welcome!
A bonus task is to leave wiki image links in the HTML, but somehow limit their automatic loading if the user doesn't want to use the internet.
Hi @biodranik, I am writing to express my interest in this project. Currently, I work as a Java backend engineer; however, during my university years, I gained extensive experience writing Python crawlers. I have successfully dealt with tricky situations such as simulating browsers, avoiding IP bans through IP proxies, and identifying captchas with machine learning algorithms. In addition, I graduated with a degree in Geographical Information Science.
I found this project interesting, and I have started reading the source code of descriptions_downloader.py. I plan to write a detailed proposal within this week. My proposal will include an understanding of the entire process of crawling Wikipedia data, as well as solutions to the problems mentioned in this issue.
I would like to ask if we have a channel like gitter for communicating GSoC ideas?
Apologies for contacting you at such a late stage. I was busy preparing for a project that was suddenly closed tonight. I still wish that I could have the opportunity to apply my skills to exciting ideas.
Issue and comments here clearly state that we want to avoid any scraping and extract information directly from offline wiki dumps, in the fastest possible way.
Thank you for pointing this out. My understanding of the problem is not sufficient. I will take the necessary steps to thoroughly investigate the matter, including researching and evaluating tools for downloading and parsing the relevant data dump.
I'm sorry, I still have some questions. I understand that the core goal of this project is to obtain some information from Wikipedia, whether through direct API queries or by downloading and analysing dump files; both are different ways to solve the problem. The problem now is that there are some issues with the API query method that make data acquisition difficult and slow, so you want to switch to the second method.
I would appreciate it if you could provide more background knowledge. I am curious whether the latter is really faster than the former. I understand that this is only equivalent to doing locally what the Wikipedia servers already do. Have these two methods been clearly compared? Is downloading the full data and parsing it really faster?
Ideally, wiki's simplified HTML should be generated/updated in parallel to the generator.
Is the goal to replicate as much of the original page as possible in HTML? It is unclear to me where #2410 ended up, but it seems like some people want as much content as possible.
If so, it seems there are three options for handling that:
Looking through the list of alternative parsers for the mediawiki format, there are many options but it is unclear how much of the format they actually support, whether they are currently maintained, what the output looks like, and how they handle failure cases.
Wikipedia dumps -> directories of wikitext articles -> directories of html articles
For reference, OsmAnd uses the Java library bliki to convert dumps to HTML, with some preprocessing.
The idea is to reuse old wiki data if necessary for any reason.
The current dependency is that the generator tool extracts wiki article IDs into a separate file, and only these IDs should be extracted by the wiki parser. This task can easily be done independently of the generator tool by using osmconvert or other similar command-line tools.
From what I can tell, the generator framework currently supports falling back to old data, but can't run any tasks in parallel with each other (although some tasks themselves are parallelized). Is that correct?
Is the goal to replicate as much of the original page as possible in HTML?
We can easily experiment with HTML simplification. It should be very easy to do with the offline dump parser, right? Concerns in the PR were about big mwm map sizes. By experimenting we can see the difference.
From what I can tell, the generator framework currently supports falling back to old data, but can't run any tasks in parallel with each other (although some tasks themselves are parallelized). Is that correct?
By "in parallel" I meant running the wiki parser independently. If it recreates/refreshes already existing HTML pages, then it can be run even before running the generator tool. Or in parallel, as another process.
Choosing an appropriate Wiki => HTML convertor can be tricky, you're right. Ideally, it would be great to do it without other frameworks/languages dependency, but that may not be feasible (although https://docs.rs/parse_wiki_text/latest/parse_wiki_text/ can be investigated).
We can easily experiment with HTML simplification. It should be very easy to do with the offline dump parser, right? Concerns in the PR were about big mwm map sizes. By experimenting we can see the difference.
Yes, I think it would be straightforward to run some speed, size, and quality tests, and the existing scraped pages could serve as a baseline.
By "in parallel" I meant running the wiki parser independently. If it recreates/refreshes already existing HTML pages, then it can be run even before running the generator tool. Or in parallel, as another process.
Ah, I see. It looks like the Wikipedia dumps are created every 2 weeks to 1 month; how often are the maps regenerated?
Choosing an appropriate Wiki => HTML convertor can be tricky, you're right. Ideally, it would be great to do it without other frameworks/languages dependency, but that may not be feasible (although https://docs.rs/parse_wiki_text/latest/parse_wiki_text/ can be investigated).
I came across that project; unfortunately, the GitHub account/repos have been deleted, but there are some existing forks. It was unclear to me how complete it really is, and using it would require designing an HTML converter. More reasons to do some experimenting!
Now we update maps together with the app update, approximately monthly, with the goal of doing it at least weekly, and without requiring an app update.
Updating and processing wiki dumps once or twice per month is perfectly fine.
I have just been doing some experimentation with processing the wiki dump data in Go (see here). Here are some of my findings:
Benefits
Drawbacks
Considerations
I don't think any language ticks all the boxes here, but I think Go would be a good choice. I'd be happy to implement this in a different language if a better alternative was decided on.
Thanks for the feedback and experiments!
Let's highlight what is expected:
- Processing should not take tens of hours or days.
- It should not require extreme hardware resources like 256Gb RAM, 128 cores, etc.
- The fewer dependencies on the environment and external tools/databases, the better.
- Output HTML simplification should be tunable and configurable.
- The existing output Wiki HTML file structure can be changed if necessary to simplify the process or make it faster/easier to support. Previously it was a result of web scraping; now we can do it as we want.
- Existing HTML files can be reused in a new mwm maps generation.
- Existing HTML files can be updated/overwritten (if different) to avoid keeping copies of the result dataset.
- Existing related C++ code in the (map) generator can be changed/improved/simplified too, if necessary.
We can choose the format that will work better, whether it is XML, SQL, or something else.
OSM has 1.7M Wikipedia and 2.7M Wikidata references. Even if it grows, all these strings (Wikidata IDs can be converted to numbers!) will easily fit into a RAM hash without any need for Redis. Right?
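To illustrate the "convert Wikidata IDs to numbers" point, here is a minimal Rust sketch (the sample Q-codes are made up); at this scale, a few million u64 keys are on the order of tens of MB, so an in-memory set is comfortable:

```rust
use std::collections::HashSet;

/// "Q22594472" -> 22594472; returns None for malformed codes.
fn qid_to_u64(q: &str) -> Option<u64> {
    q.strip_prefix('Q')?.parse().ok()
}

fn main() {
    // Made-up sample; in practice these come from the OSM wikidata tags.
    let tags = ["Q22594472", "Q6199", "Q3596065"];
    let ids: HashSet<u64> = tags.iter().filter_map(|q| qid_to_u64(q)).collect();

    // Rough scale: 2.7M u64 keys are ~22 MB of raw data; even with HashSet
    // overhead this stays well below typical server RAM.
    println!("{} ids, raw size ~{} bytes", ids.len(), ids.len() * 8);
}
```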
Actually, it would be beneficial to see how the generator uses these wiki articles and wiki data entries now to see the full picture, for better architectural decisions. Making the maps generator faster is one of our top priorities.
Rust or GoLang are both ok. Although learning and using Rust is fancier now and will help our team to learn. In theory, we may completely rewrite our generator someday, maybe it will be written in Rust, who knows? 😉 And if we proceed with bigger server-side tasks like providing traffic data, public transport, photos, and reviews database, this experience may also be useful.
My hunch is that reading the index file to build a map of offsets in memory and then skipping over the combined XML dump would be faster than loading it into MySQL and running a query, in addition to the disk I/O and space savings. Each round of generation would start with a fresh set of data, so there's no benefit to keeping it in MySQL across runs.
Redis would just store it in-memory itself.
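A minimal Rust sketch of that index-file idea, assuming the standard multistream index format of offset:page_id:title per line and the bzip2 crate as a dependency (file names are just the usual enwiki examples): build a title-to-offset map, then seek and decompress only the needed ~100-page stream.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader, Read, Seek, SeekFrom};

use bzip2::read::BzDecoder; // assumed dependency: bzip2 = "0.4"

/// Parse the multistream index (offset:page_id:title per line) into a
/// title -> byte-offset map pointing at the bz2 stream containing that page.
fn load_index(path: &str) -> std::io::Result<HashMap<String, u64>> {
    let reader = BufReader::new(BzDecoder::new(File::open(path)?));
    let mut map = HashMap::new();
    for line in reader.lines() {
        let line = line?;
        let mut parts = line.splitn(3, ':');
        if let (Some(offset), Some(_page_id), Some(title)) =
            (parts.next(), parts.next(), parts.next())
        {
            if let Ok(offset) = offset.parse() {
                map.insert(title.to_string(), offset);
            }
        }
    }
    Ok(map)
}

/// Decompress only the single ~100-page stream that starts at `offset`.
fn read_stream(dump_path: &str, offset: u64) -> std::io::Result<String> {
    let mut file = File::open(dump_path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut xml = String::new();
    // BzDecoder stops at the end of this bz2 stream, so only ~100 pages are read.
    BzDecoder::new(file).read_to_string(&mut xml)?;
    Ok(xml)
}

fn main() -> std::io::Result<()> {
    let index = load_index("enwiki-latest-pages-articles-multistream-index.txt.bz2")?;
    if let Some(&offset) = index.get("Anarchism") {
        let xml = read_stream("enwiki-latest-pages-articles-multistream.xml.bz2", offset)?;
        println!("decompressed {} bytes around that page", xml.len());
    }
    Ok(())
}
```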
Actually, it would be beneficial to see how the generator uses these wiki articles and wiki data entries now to see the full picture, for better architectural decisions.
From what I've found, the logic for importing the wikipedia/wikidata HTML files is in descriptions_section_builder.cpp and updating the map file is in descriptions/serdes, which are called from generator_tool.cpp.
If I'm reading it correctly, for each generated map file (what the app downloads for each region?) it:
I think the flat file approach is reasonable; it covers many requirements well:
I don't think using a database would help much with any of those, but maybe there's something I'm missing.
I don't have any insight into the map file format, but I'd guess that appending a section once is about as optimal as it will get?
Let's highlight what is expected:
- Processing should not take tens of hours or days.
- It should not require extreme hardware resources like 256Gb RAM, 128 cores, etc.
- The fewer dependencies on the environment and external tools/databases, the better.
- Output HTML simplification should be tunable and configurable.
- The existing output Wiki HTML file structure can be changed if necessary to simplify the process or make it faster/easier to support. Previously it was a result of web scraping; now we can do it as we want.
- Existing HTML files can be reused in a new mwm maps generation.
- Existing HTML files can be updated/overwritten (if different) to avoid keeping copies of the result dataset.
- Existing related C++ code in the (map) generator can be changed/improved/simplified too, if necessary.
Understood. The dumps provide a timestamp of when the article was last modified, so making this readily available will be important. The obvious answer is just to include it in the generated file paths, but this decision will depend on how the new HTML files will be structured.
I shall have a look later tonight at its integration into the current generator; I am constrained by the amount of uni work I have to do at the moment, though...
OSM has 1.7M Wikipedia and 2.7M Wikidata references. Even if it grows, all these strings (Wikidata IDs can be converted to numbers!) will easily fit into a RAM hash without any need for Redis. Right? Redis would just store it in-memory itself.
Well, I can do a bit of profiling to test this, but I think this is exactly the sort of thing tools like Redis are made for. They would:
Rust or GoLang are both ok. Although learning and using Rust is fancier now and will help our team to learn. In theory, we may completely rewrite our generator someday, maybe it will be written in Rust, who knows? 😉
I sense a bit of favouritism! I'm not opposed to doing this in Rust. I shall continue with the tests in Go just for speed.
No overengineering, please. Why are several instances needed? What scalability are we talking about? There is some server that does many things, and one of them would be to quickly process the dump and generate (simplified) HTML files.
Output files can always be overwritten in the simplest case.
Sure thing. Well, I don't know your current setup; you may already have utilities that mean instance replication is handled automatically, e.g., Knative Eventing.
And if we proceed with bigger server-side tasks like providing traffic data, public transport, photos, and reviews database, this experience may also be useful.
Apologies, I thought your comment (above) implied that you might want to gather more granular wiki information, etc. This is what I meant by scalability.
I'll try to keep it simple.
Actually, it would be beneficial to see how the generator uses these wiki articles and wiki data entries now to see the full picture, for better architectural decisions. Making the maps generator faster is one of our top priorities.
I began to look into this. I don't have enough time to have something working by this evening. But in the process, I was first trying to understand how the current extraction and generation of Wikipedia data from a .osm file works, so that I would know what information is made available and get an idea of how this might work in production.
Q - How are all maps currently processed? Is it on a per-country basis or something?
I ended up going on a wild debugging goose chase, thinking that there was an error in the current description generation, but I realised I should have just read the docs...
What I have found out, though, is that OSM data only makes an article name available and not the ID. This shouldn't be a problem, seeing as all Wikipedia articles apparently have a unique name (although it is slower to check for equivalence when parsing the dump data).
But thinking about making this process "parallel" and how that can integrate with the existing .mwm generation, my thoughts are as follows (I apologise in advance for any oversight):
Two main execution paths:
For both workflows, there's still a dependency on the 'descriptions_section_builder.cpp'. This could be invoked:
I am not certain about this, but it seems that .osm files don't store anything other than the Wikipedia article name (see the foot of this post; I think the timestamp is not related to the Wikipedia tag), so identifying whether an article has been updated would require either:
Sorry for the long post, and thanks for reading.
<relation id="15318293" version="3" timestamp="2023-01-15T21:30:55Z" changeset="0">
  <member type="way" ref="348325272" role=""/>
  <member type="way" ref="1132283393" role=""/>
  <tag k="destination" v="Porcupine River"/>
  <tag k="name" v="Troochoonjik"/>
  <tag k="old_name" v="Driftwood River"/>
  <tag k="type" v="waterway"/>
  <tag k="waterway" v="river"/>
  <tag k="wikidata" v="Q22594472"/>
  <tag k="wikipedia" v="de:Driftwood River (Porcupine River)"/>
</relation>
The OSM wiki does show that Wikipedia pages are linked by their name, but Wikidata, when linked, should identify the corresponding Wikipedia article.
Wikidata can be queried over the Internet, but it's possible it would also need to be processed as a dump at map generation volumes, which could negate the benefit.
@nvarner It's a good point, I don't know why the relationship between the Wikidata and Wikipedia dumps isn't more structured.
Perhaps I am missing something, but would there be anything wrong with doing the current workflow (but with dump data) of getting the Wikipedia pages, and then populating any newly linked articles found from Wikidata?
Valid questions, thanks for raising them!
Regarding Wikidata: the Wikipedia dump should also contain Wikidata Q-codes for articles. Can you please confirm that they are there? It would mean that you can check whether the currently processed article is in the list from OSM features, and create HTML for it.
Regarding the process:
Wikidata can be queried over the Internet, but it's possible it would also need to be processed as a dump at map generation volumes, which could negate the benefit.
I agree; using the API would require fixing all the problems with the current "scraper" and would be a worst-of-both-worlds situation.
I don't know why the relationship between the Wikidata and Wikipedia dumps isn't more structured.
Regarding Wikidata: the Wikipedia dump should also contain Wikidata Q-codes for articles. Can you please confirm that they are there? It would mean that you can check whether the currently processed article is in the list from OSM features, and create HTML for it.
What I've found while writing my proposal is that the Wikipedia dumps don't have any direct links to Wikidata.
It also seems like the actual XML article content of the articles-multistream, articles, and meta-current dumps is equivalent.
I've sliced out the Anarchism page from all of them below.
The Wikidata Q-code for it, Q6199, doesn't appear in any of them.
There is a related wbc_entity_usage enwiki dump that "Tracks which pages use which Wikidata items or properties and what aspect (e.g. item label) is used", but this seems to include more than a 1-1 mapping of article -> Q-Code, and I haven't dug into it because I think using the wikidata dumps is the better route.
Wikidata provides its own dumps, but the equivalent XML one is a whopping ~130 GB and the "recommended" JSON one is still ~80 GB, larger than the article dumps in all the currently supported languages combined.
Thankfully there's a wb_items_per_site dump which is only 1.5 GB and contains exactly what's needed: a mapping of Wikidata Q-code to Wikipedia article names in all languages.
The only catch is that it's a SQL dump, so parsing it would be a bit hacky, but very doable and I think very fast. Alternatively, deal with MySQL and load it into that.
It looks like this:
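For illustration only, a rough Rust sketch of that "hacky but doable" parsing, assuming the usual wb_items_per_site column order (ips_row_id, ips_item_id, ips_site_id, ips_site_page) inside the dump's big INSERT statements; the sample line is made up:

```rust
use std::collections::HashMap;

use regex::Regex; // assumed dependency: regex = "1"

/// Extract (Q-id, article title) pairs for one wiki (e.g. "enwiki") from the
/// INSERT lines of the wb_items_per_site SQL dump. Column order is assumed to
/// be (ips_row_id, ips_item_id, ips_site_id, ips_site_page).
fn titles_for_site(sql: &str, site: &str) -> HashMap<u64, String> {
    // Matches tuples like (55,3596065,'enwiki','Algeria'); titles may contain
    // escaped quotes (\'), which the second alternative allows for.
    let row = Regex::new(r"\(\d+,(\d+),'([^']+)','((?:[^'\\]|\\.)*)'\)").unwrap();
    let mut out = HashMap::new();
    for cap in row.captures_iter(sql) {
        if &cap[2] == site {
            let qid: u64 = cap[1].parse().unwrap_or(0);
            out.insert(qid, cap[3].replace("\\'", "'"));
        }
    }
    out
}

fn main() {
    // Made-up sample line in the dump's INSERT format.
    let line = r"INSERT INTO `wb_items_per_site` VALUES (55,3596065,'enwiki','Algeria'),(56,3596065,'dewiki','Algerien');";
    let map = titles_for_site(line, "enwiki");
    assert_eq!(map.get(&3596065_u64).map(String::as_str), Some("Algeria"));
}
```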
So you'd have a separate download and run to generate the article names in supported languages from OSM's wikidata IDs, and then the wikipedia extractor can combine those with the article names directly from OSM.
I looked through the listed tools for working with dumps, but they either aren't in our target languages or want the full wikidata xml/json dump.
OsmAnd uses Wikidata JSON dumps; I'm not clear whether that's the full one or a smaller one I haven't seen yet.
Full dump/big size is not an issue for a server.
In that case the wikidata json dump seems like the right choice, depending on processing time.
There is a related wbc_entity_usage enwiki dump that "Tracks which pages use which Wikidata items or properties and what aspect (e.g. item label) is used", but this seems to include more than a 1-1 mapping of article -> Q-Code, and I haven't dug into it because I think using the wikidata dumps is the better route.
This might be OK, especially if there is the intention to use a database for storing the last-modified time of wiki articles, reviews, etc. in the near future.
I can confirm that OSM doesn't store the last-modified time of articles (see here for wiki data filtered from OSM data).
The JSON data does appear to be the best option at the moment. Although it has a large overhead, it has all available languages centralised (see here for a snippet), and of course the "sitelinks" to the corresponding wiki articles.
Also, from reading these Wikidata docs, I just wanted to check whether this proposed workflow is OK. Does the fact that we intend to import these into their own HTML files avoid this licensing problem?
Also, from reading these Wikidata docs, I just wanted to check whether this proposed workflow is OK. Does the fact that we intend to import these into their own HTML files avoid this licensing problem?
I'm pretty sure that's talking about editing Wikidata with info from OSM, not displaying them next to each other or using one to access/convert the other.
Whether the current display of the OSM and Wikipedia data in the app meets the terms of their respective licenses is another question that I don't know the answer to.
While researching this issue, I stumbled upon Wikimedia Enterprise API. They provide monthly Wikipedia HTML dumps for free, and I guess it counts as frequent enough for OM purposes. This luckily means no wikitext parsing is needed!
if there is the intention to use a database for storing the last-modified time of wiki articles, reviews, etc. in the near future.
There is no such intention. Storing something important in files may be ok/enough. Last modified time, for example, can be set directly in the filesystem, if it is really necessary.
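A minimal sketch of that, using the filetime crate (an assumed dependency) to stamp a generated HTML file with the article's revision time; the output layout is hypothetical, not the project's actual structure:

```rust
use std::fs;

use filetime::{set_file_mtime, FileTime}; // assumed dependency: filetime = "0.2"

fn write_article(path: &str, html: &str, modified_unix_secs: i64) -> std::io::Result<()> {
    fs::write(path, html)?;
    // Store the wiki revision time directly in the filesystem, so no extra
    // database or sidecar file is needed to decide whether to regenerate.
    set_file_mtime(path, FileTime::from_unix_time(modified_unix_secs, 0))
}

fn main() -> std::io::Result<()> {
    // Hypothetical output layout: <out dir>/<Q-id>/<lang>.html
    fs::create_dir_all("out/Q6199")?;
    write_article("out/Q6199/en.html", "<html>...</html>", 1_672_531_200)
}
```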
Also, from reading these Wikidata docs, I just wanted to check whether this proposed workflow is OK. Does the fact that we intend to import these into their own HTML files avoid this licensing problem?
I do not understand the problem in our case. Can you please elaborate?
Wiki enterprise costs money. That's not an option now.
I was just checking that the following isn't applicable: "Copying data from Wikidata into OSM is also generally not allowed - see OpenStreetMap:Wikidata#Importing_data for details."
Is this just talking about importing Wikidata into the shared OSM data? So in this case, downloading an OSM file and attaching any additional wiki data isn't a problem, is it?
It is about adding data from the Wikidata database into the OSM database.
It is not our case. We're pulling info from both and mixing it in a viewer app.
It is about adding data from the Wikidata database into the OSM database.
It is not our case. We're pulling info from both and mixing it in a viewer app.
Thanks for clearing that up, and answering my many questions so promptly and clearly.
I think I've got a good understanding of what is expected for this project.
I need to get a proposal together so I'm at least in with a chance of getting to work on this!
Wiki enterprise costs money.
I'm sorry for not describing my findings clearly enough. You're right, Wiki Enterprise does offer paid plans with frequent data updates. But they also publicly release monthly dumps for free:
At the moment I don't have enough hard drive space to download those and play with them. Anyway, I still believe that processing dumped HTML would be much easier than parsing wikitext. (Possibly these dumps could even be fitted in as a replacement for the parser in the current workflow!) And being updated once or twice a month is, IMO, sufficient for an offline map wiki.
If it's a reliable source and they won't stop publishing, and dumps are regular/free, then of course they can also be used.
Here's one of the entries from the category 0 enterprise dump (for this page): myron_reed_enterprise_html.json.txt
Anyway, I still believe that processing dumped HTML would be much easier than parsing wikitext.
I agree!
It does mention that it's currently an experimental service, and I'm unclear whether "partial" and "for a specific set of namespaces" really means all articles, but it does seem to have all the data that's needed: Q-ID, title, and HTML.
It looks like the html doesn't include any images.
@biodranik I have just submitted a proposal for this project. Any feedback on this would be very much appreciated.
The other thing to mention regarding the Enterprise HTML dumps is that they don't use multistream compression like the XML dumps do, so you would have to parse every line. Instead, each article is stored as a JSON object (one per line of an NDJSON file). I've played around with extracting its data, but I haven't done any tests to see what difference in performance it would make just yet.
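For reference, a rough Rust sketch of walking such an NDJSON dump with serde_json; the file name and the field names (name, main_entity.identifier, article_body.html) are assumptions about the Enterprise entries, so treat them as illustrative only:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use serde_json::Value; // assumed dependency: serde_json = "1"

fn main() -> std::io::Result<()> {
    // One JSON object per line (NDJSON); the file name is hypothetical.
    let reader = BufReader::new(File::open("enwiki_namespace_0.ndjson")?);
    for line in reader.lines() {
        let line = line?;
        let Ok(article) = serde_json::from_str::<Value>(&line) else {
            continue; // skip malformed lines rather than aborting the run
        };
        // Field names are assumptions about the dump format.
        let title = article["name"].as_str().unwrap_or_default();
        let qid = article["main_entity"]["identifier"].as_str().unwrap_or_default();
        let html = article["article_body"]["html"].as_str().unwrap_or_default();
        // Here the id/title would be checked against the OSM set and the
        // HTML handed off to the simplifier.
        println!("{qid}\t{title}\t{} bytes of html", html.len());
    }
    Ok(())
}
```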
But I can +1 that not having to battle with Wikitext would be great, particularly given the limited tooling available to convert it to HTML. But how hard can it be to make our own? (Famous last words)...
But how hard can it be to make our own? (Famous last words)...
After reading the READMEs of several MediaWiki parsing libraries, I've got the impression it would be extremely hard. For example, the widespread use of macros and templates (e.g. for pronunciation or even for length measures) makes it necessary to process whole XML dumps, including extra namespaces, and to perform computation for every entry. MediaWiki syntax itself cannot be parsed in a linear manner; recursive tricks and token counting are used extensively. Overall, there is a reason that one of the most starred projects in this field is named mwparserfromhell...
Wow, I didn't know that wikitext is so complex... But is it possible to implement just a simple subset of it? We need just previews/short summaries of the articles, so we can skip pronunciation macros/templates, tables, images, and other optional/complicated stuff.
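Just to gauge how such a subset might look, here is a toy Rust sketch (not a real parser, and not how any existing tool does it) that drops {{templates}} with a simple depth counter, unwraps [[wiki links]] to their visible text, and strips ''italic''/'''bold''' quote markup, leaving everything else as plain text:

```rust
/// Extremely naive wikitext "subset" cleaner for previews: drops templates,
/// unwraps internal links, strips ''italic''/'''bold''' markup.
fn strip_wikitext_subset(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    let mut template_depth = 0usize;
    while let Some(c) = chars.next() {
        match c {
            // {{ ... }} templates, possibly nested: track the depth, emit nothing.
            '{' if chars.peek() == Some(&'{') => {
                chars.next();
                template_depth += 1;
            }
            '}' if chars.peek() == Some(&'}') && template_depth > 0 => {
                chars.next();
                template_depth -= 1;
            }
            _ if template_depth > 0 => {}
            // [[target|label]] or [[target]]: keep only the visible text.
            '[' if chars.peek() == Some(&'[') => {
                chars.next();
                let mut inner = String::new();
                while let Some(&n) = chars.peek() {
                    if n == ']' {
                        chars.next();
                        if chars.peek() == Some(&']') {
                            chars.next();
                        }
                        break;
                    }
                    inner.push(n);
                    chars.next();
                }
                out.push_str(inner.rsplit('|').next().unwrap_or(""));
            }
            // '' and ''' emphasis markers: drop the quotes, keep the text.
            '\'' if chars.peek() == Some(&'\'') => {
                while chars.peek() == Some(&'\'') {
                    chars.next();
                }
            }
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    let src = "'''Anarchism''' is a {{short description|political philosophy}}[[political philosophy|philosophy]].";
    assert_eq!(strip_wikitext_subset(src), "Anarchism is a philosophy.");
}
```

Real articles rely on nested templates, tables, references, and HTML fragments, so anything production-grade would still need much more care, but a subset like this may be enough for short previews.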
@newsch can we close this issue? 🙃
Thanks to @newsch, the wiki parser is implemented: https://github.com/organicmaps/wikiparser/
There are several issues now with our current crawler implementation:
A way better option is to download a dump of all Wiki articles for given languages and extract summaries from there directly.
Any volunteers for this task?
A list of supported languages and the output format can be checked in the existing implementation in tools/python/descriptions.