Can I help? ^_^
Some ideas about the workflow:
pbzip2 (parallel bzip2) may be used to decompress/process the dump faster.
Linking #2410
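For illustration only, a minimal Rust sketch of that idea: pipe the dump through pbzip2 -dc so decompression uses multiple cores while the consumer reads plain XML from the pipe. The dump file name is just an example, and the page counting stands in for a real XML parser.

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // pbzip2 -d decompresses, -c writes to stdout; the file name is an example.
    let mut child = Command::new("pbzip2")
        .args(["-dc", "enwiki-latest-pages-articles-multistream.xml.bz2"])
        .stdout(Stdio::piped())
        .spawn()?;

    let reader = BufReader::new(child.stdout.take().expect("piped stdout"));
    let mut pages = 0usize;
    for line in reader.lines() {
        // A real implementation would feed these lines to an XML parser;
        // here we only count <page> openings to show the streaming works.
        if line?.contains("<page>") {
            pages += 1;
        }
    }
    child.wait()?;
    println!("saw {pages} pages");
    Ok(())
}
```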
Since the current workflow is in Python, a few libraries to consider:
Python is not the best tool for quickly and efficiently processing large bz2-ed dumps. There are faster tools/languages.
@biodranik How about we consider Golang? It has built-in support for concurrency, which can be utilized to parallelize the task of extracting summary text. Additionally, Golang has a relatively low memory footprint, making it well suited for handling large datasets on machines with limited resources. Apart from Golang, C++ can also be a good choice.
@biodranik Can I go ahead and start working on it in Golang?
Rust, Golang, or C++ are good tools for this task. However, I don't see how concurrency may help here. Can you please elaborate on how you would approach it?
@biodranik This project is listed in GSoC 2023. I can help with Rust; can I get more details about this?
Did you check the existing description and the repository? It already has all the code.
Rust, Golang, or C++ are good tools for this task. However, I don't see how concurrency may help here. Can you please elaborate on how you would approach it?
Sure @biodranik, I think concurrency could be utilized when setting up a mechanism to periodically update the extracted summary information to reflect changes in the wiki articles. This is one use case I can think of.
Hi, I'm interested in working on this as part of GSoC, I have experience with Python, C, and Rust, and some familiarity with Golang and C++.
From looking through the code, my understanding is that the current descriptions are processed into stripped and minified html, which is packaged up into the map files and rendered in the app.
- A separate script/tool to process the dump and extract articles only in required/supported languages.
It looks like the Wikipedia dumps are all in wikitext format, so this would involve stripping the markup/converting it to a simplified HTML summary? And this would be run as part of the maps_generator tool?
Ideally, wiki's simplified HTML should be generated/updated in parallel to the generator. The idea is to reuse old wiki data if necessary for any reason.
The current dependency is that the generator tool extracts wiki article IDs into a separate file, and only these IDs should be extracted by the wiki parser. This task can easily be done independently of the generator tool by using osmconvert or other similar command-line tools. And, obviously, the HTML files are loaded into the map data mwm files.
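Conceptually, the parser side of that contract could be as small as the following Rust sketch. The file name wiki_urls.txt and its one-identifier-per-line format are hypothetical here, not the generator's actual output format.

```rust
use std::collections::HashSet;
use std::fs;

/// Load the list of wiki identifiers produced by the generator into a set.
/// The file name and one-identifier-per-line format are assumptions for
/// illustration, not the real generator output.
fn load_wanted(path: &str) -> std::io::Result<HashSet<String>> {
    Ok(fs::read_to_string(path)?
        .lines()
        .map(|l| l.trim().to_string())
        .filter(|l| !l.is_empty())
        .collect())
}

fn main() -> std::io::Result<()> {
    let wanted = load_wanted("wiki_urls.txt")?;
    // While streaming the dump, only pages whose title/Q-id is in `wanted`
    // would be converted to HTML; everything else is skipped cheaply.
    println!("{} articles requested by the generator", wanted.len());
    Ok(())
}
```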
Any ideas about improving the toolchain are welcome!
A bonus task is to leave wiki image links in the HTML, but somehow limit their automatic loading if the user doesn't want to use the internet.
Hi @biodranik, I am writing to express my interest in this project. Currently, I work as a Java backend engineer; however, during my university years, I gained extensive experience writing Python crawlers. I have successfully dealt with tricky situations such as simulating browsers, avoiding IP bans through IP proxies, and identifying captchas with machine learning algorithms. In addition, I graduated with a degree in Geographical Information Science.
I found this project interesting, and I have started reading the source code of descriptions_downloader.py. I plan to write a detailed proposal within this week. My proposal will include an understanding of the entire process of crawling Wikipedia data, as well as solutions to the problems mentioned in this issue.
I would like to ask if we have a channel like gitter for communicating GSoC ideas?
Apologies for contacting you at such a late stage. I was busy preparing for a project that was suddenly closed tonight. I still wish that I could have the opportunity to apply my skills to exciting ideas.
Issue and comments here clearly state that we want to avoid any scraping and extract information directly from offline wiki dumps, in the fastest possible way.
Thank you for pointing this out. My understanding of the problem is not sufficient. I will take the necessary steps to thoroughly investigate the matter, including researching and evaluating tools for downloading and parsing the relevant data dump.
I'm sorry, I still have some questions. I understand that the core goal of this project is to obtain some information from Wikipedia, whether through direct API queries or by downloading and analysing dump files; both are different ways to solve the problem. The problem now is that there are some issues with the API query method that make data acquisition difficult and slow, so you want to switch to the second method.
I would appreciate it if you could provide more background knowledge. I am curious whether the latter is really faster than the former. I understand that this is only equivalent to doing locally what the Wikipedia servers already do. Have these two methods been clearly compared? Is downloading the full data and parsing it really faster?
Ideally, wiki's simplified HTML should be generated/updated in parallel to the generator.
Is the goal to replicate as much of the original page as possible in HTML? It is unclear to me where #2410 ended up, but it seems like some people want as much content as possible.
If so, it seems there are three options for handling that:
Looking through the list of alternative parsers for the mediawiki format, there are many options but it is unclear how much of the format they actually support, whether they are currently maintained, what the output looks like, and how they handle failure cases.
Wikipedia dumps -> directories of wikitext articles -> directories of html articles
For reference, OsmAnd uses the Java library bliki to convert dumps to HTML, with some preprocessing.
The idea is to reuse old wiki data if necessary for any reason.
The current dependency is that the generator tool extracts wiki article IDs into a separate file, and only these IDs should be extracted by the wiki parser. This task can easily be done independently of the generator tool by using osmconvert or other similar command-line tools.
From what I can tell, the generator framework currently supports falling back to old data, but can't run any tasks in parallel with each other (although some tasks themselves are parallelized). Is that correct?
Is the goal to replicate as much of the original page as possible in HTML?
We can easily experiment with HTML simplification. It should be very easy to do with the offline dump parser, right? Concerns in the PR were about big mwm map sizes. By experimenting we can see the difference.
From what I can tell, the generator framework currently supports falling back to old data, but can't run any tasks in parallel with each other (although some tasks themselves are parallelized). Is that correct?
By "in parallel" I meant running the wiki parser independently. If it recreates/refreshes already existing HTML pages, then it can be run even before running the generator tool. Or in parallel, as another process.
Choosing an appropriate Wiki => HTML convertor can be tricky, you're right. Ideally, it would be great to do it without other frameworks/languages dependency, but that may not be feasible (although https://docs.rs/parse_wiki_text/latest/parse_wiki_text/ can be investigated).
We can easily experiment with HTML simplification. It should be very easy to do with the offline dump parser, right? Concerns in the PR were about big mwm map sizes. By experimenting we can see the difference.
Yes, I think it would be straightforward to run some speed, size, and quality tests, and the existing scraped pages could serve as a baseline.
By "in parallel" I meant running the wiki parser independently. If it recreates/refreshes already existing HTML pages, then it can be run even before running the generator tool. Or in parallel, as another process.
Ah, I see. It looks like the Wikipedia dumps are created every 2 weeks to 1 month; how often are the maps regenerated?
Choosing an appropriate Wiki => HTML convertor can be tricky, you're right. Ideally, it would be great to do it without other frameworks/languages dependency, but that may not be feasible (although https://docs.rs/parse_wiki_text/latest/parse_wiki_text/ can be investigated).
I came across that project; unfortunately, the GitHub account/repos have been deleted, but there are some existing forks. It was unclear to me how complete it really is, and using it would require designing an HTML converter. More reasons to do some experimenting!
Now we update maps together with the app update, approximately monthly, with the goal of doing it at least weekly, and without requiring an app update.
Updating and processing wiki dumps once or twice per month is perfectly fine.
I have just been doing some experimentation with processing the wiki dump data in Go (see here). Here are some of my findings:
Benefits
Drawbacks
Considerations
I don't think any language ticks all the boxes here, but I think Go would be a good choice. I'd be happy to implement this in a different language if a better alternative was decided on.
Thanks for the feedback and experiments!
Let's highlight what is expected:
- Processing should not take tens of hours or days.
- It should not require extreme hardware resources like 256Gb RAM, 128 cores, etc.
- The fewer dependencies on the environment and external tools/databases, the better.
- Output HTML simplification should be tunable and configurable.
- The existing output Wiki HTML file structure can be changed if necessary to simplify the process or make it faster/easier to support. Previously it was a result of web scraping; now we can do it as we want.
- Existing HTML files can be reused in a new mwm maps generation.
- Existing HTML files can be updated/overwritten (if different) to avoid keeping copies of the result dataset.
- Existing related C++ code in the (map) generator can be changed/improved/simplified too, if necessary.
We can choose the format that will work better, whether it is XML, SQL, or something else.
OSM has 1.7M Wikipedia and 2.7M Wikidata references. Even if it grows, all these strings (Wikidata IDs can be converted to numbers!) will easily fit into a RAM hash without any need for Redis. Right?
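To illustrate the "convert Wikidata IDs to numbers" point, here is a minimal Rust sketch (the sample Q-codes are made up); at this scale, a few million u64 keys are on the order of tens of MB, so an in-memory set is comfortable:

```rust
use std::collections::HashSet;

/// "Q22594472" -> 22594472; returns None for malformed codes.
fn qid_to_u64(q: &str) -> Option<u64> {
    q.strip_prefix('Q')?.parse().ok()
}

fn main() {
    // Made-up sample; in practice these come from the OSM wikidata tags.
    let tags = ["Q22594472", "Q6199", "Q3596065"];
    let ids: HashSet<u64> = tags.iter().filter_map(|q| qid_to_u64(q)).collect();

    // Rough scale: 2.7M u64 keys are ~22 MB of raw data; even with HashSet
    // overhead this stays well below typical server RAM.
    println!("{} ids, raw size ~{} bytes", ids.len(), ids.len() * 8);
}
```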
Actually, it would be beneficial to see how the generator uses these wiki articles and wiki data entries now to see the full picture, for better architectural decisions. Making the maps generator faster is one of our top priorities.
Rust or GoLang are both ok. Although learning and using Rust is fancier now and will help our team to learn. In theory, we may completely rewrite our generator someday, maybe it will be written in Rust, who knows? 😉 And if we proceed with bigger server-side tasks like providing traffic data, public transport, photos, and reviews database, this experience may also be useful.
My hunch is that reading the index file to build a map of offsets in memory and then skipping over the combined XML dump would be faster than loading it into MySQL and running a query, in addition to the disk I/O and space savings. Each round of generation would start with a fresh set of data, so there's no benefit to keeping it in MySQL across runs.
Redis would just store it in-memory itself.
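A minimal Rust sketch of that index-file idea, assuming the standard multistream index format of offset:page_id:title per line and the bzip2 crate as a dependency (file names are just the usual enwiki examples): build a title-to-offset map, then seek and decompress only the needed ~100-page stream.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader, Read, Seek, SeekFrom};

use bzip2::read::BzDecoder; // assumed dependency: bzip2 = "0.4"

/// Parse the multistream index (offset:page_id:title per line) into a
/// title -> byte-offset map pointing at the bz2 stream containing that page.
fn load_index(path: &str) -> std::io::Result<HashMap<String, u64>> {
    let reader = BufReader::new(BzDecoder::new(File::open(path)?));
    let mut map = HashMap::new();
    for line in reader.lines() {
        let line = line?;
        let mut parts = line.splitn(3, ':');
        if let (Some(offset), Some(_page_id), Some(title)) =
            (parts.next(), parts.next(), parts.next())
        {
            if let Ok(offset) = offset.parse() {
                map.insert(title.to_string(), offset);
            }
        }
    }
    Ok(map)
}

/// Decompress only the single ~100-page stream that starts at `offset`.
fn read_stream(dump_path: &str, offset: u64) -> std::io::Result<String> {
    let mut file = File::open(dump_path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut xml = String::new();
    // BzDecoder stops at the end of this bz2 stream, so only ~100 pages are read.
    BzDecoder::new(file).read_to_string(&mut xml)?;
    Ok(xml)
}

fn main() -> std::io::Result<()> {
    let index = load_index("enwiki-latest-pages-articles-multistream-index.txt.bz2")?;
    if let Some(&offset) = index.get("Anarchism") {
        let xml = read_stream("enwiki-latest-pages-articles-multistream.xml.bz2", offset)?;
        println!("decompressed {} bytes around that page", xml.len());
    }
    Ok(())
}
```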
Actually, it would be beneficial to see how the generator uses these wiki articles and wiki data entries now to see the full picture, for better architectural decisions.
From what I've found, the logic for importing the wikipedia/wikidata HTML files is in descriptions_section_builder.cpp and updating the map file is in descriptions/serdes, which are called from generator_tool.cpp.
If I'm reading it correctly, for each generated map file (what the app downloads for each region?) it:
I think the flat file approach is reasonable; it covers many requirements well:
I don't think using a database would help much with any of those, but maybe there's something I'm missing.
I don't have any insight into the map file format, but I'd guess that appending a section once is about as optimal as it will get?
Let's highlight what is expected:
- Processing should not take tens of hours or days.
- It should not require extreme hardware resources like 256Gb RAM, 128 cores, etc.
- The fewer dependencies on the environment and external tools/databases, the better.
- Output HTML simplification should be tunable and configurable.
- The existing output Wiki HTML file structure can be changed if necessary to simplify the process or make it faster/easier to support. Previously it was a result of web scraping; now we can do it as we want.
- Existing HTML files can be reused in a new mwm maps generation.
- Existing HTML files can be updated/overwritten (if different) to avoid keeping copies of the result dataset.
- Existing related C++ code in the (map) generator can be changed/improved/simplified too, if necessary.
Understood. The dumps provide a timestamp of when the article was last modified, so making this readily available will be important. The obvious answer is just to include it in the generated file paths, but this decision will depend on how the new HTML files will be structured.
I shall have a look later tonight at its integration into the current generator; I am constrained by the amount of uni work I have to do at the moment, though...
OSM has 1.7M Wikipedia and 2.7M Wikidata references. Even if it grows, all these strings (Wikidata IDs can be converted to numbers!) will easily fit into a RAM hash without any need for Redis. Right? Redis would just store it in-memory itself.
Well, I can do a bit of profiling to test this, but I think this is exactly the sort of thing tools like Redis are made for. They would:
Rust or GoLang are both ok. Although learning and using Rust is fancier now and will help our team to learn. In theory, we may completely rewrite our generator someday, maybe it will be written in Rust, who knows? 😉
I sense a bit of favouritism! I'm not opposed to doing this in Rust. I shall continue with the tests in Go just for speed.
No overengineering, please. Why are several instances needed? What scalability are we talking about? There is some server that does many things, and one of them would be to quickly process the dump and generate (simplified) HTML files.
Output files can always be overwritten in the simplest case.
Sure thing. Well, I don't know your current setup; you may already have utilities that mean instance replication is handled automatically, e.g., Knative Eventing.
And if we proceed with bigger server-side tasks like providing traffic data, public transport, photos, and reviews database, this experience may also be useful.
Apologies, I thought your comment (above) implied that you might want to gather more granular wiki information, etc. This is what I meant by scalability.
I'll try to keep it simple.
Actually, it would be beneficial to see how the generator uses these wiki articles and wiki data entries now to see the full picture, for better architectural decisions. Making the maps generator faster is one of our top priorities.
I began to look into this. I don't have enough time to have something working by this evening. But in the process, I was first trying to understand how the current extraction and generation of Wikipedia data from a .osm file works, so that I would know what information is made available and get an idea of how this might work in production.
Q - How are all maps currently processed? Is it on a per-country basis or something?
I ended up going on a wild debugging goose chase, thinking that there was an error in the current description generation, but I realised I should have just read the docs...
What I have found out, though, is that OSM data only makes an article name available and not the ID. This shouldn't be a problem, seeing as all Wikipedia articles apparently have a unique name (although it is slower to check for equivalence when parsing the dump data).
But thinking about making this process "parallel" and how that can integrate with the existing .mwm generation, my thoughts are as follows (I apologise in advance for any oversight):
Two main execution paths:
For both workflows, there's still a dependency on the 'descriptions_section_builder.cpp'. This could be invoked:
I am not certain about this, but it seems that .osm files don't store anything other than the Wikipedia article name (see the foot of this post; I think the timestamp is not related to the Wikipedia tag), so identifying whether an article has been updated would require either:
Sorry for the long post, and thanks for reading.
<relation id="15318293" version="3" timestamp="2023-01-15T21:30:55Z" changeset="0">
  <member type="way" ref="348325272" role=""/>
  <member type="way" ref="1132283393" role=""/>
  <tag k="destination" v="Porcupine River"/>
  <tag k="name" v="Troochoonjik"/>
  <tag k="old_name" v="Driftwood River"/>
  <tag k="type" v="waterway"/>
  <tag k="waterway" v="river"/>
  <tag k="wikidata" v="Q22594472"/>
  <tag k="wikipedia" v="de:Driftwood River (Porcupine River)"/>
</relation>
The OSM wiki does show that Wikipedia pages are linked by their name, but Wikidata, when linked, should identify the corresponding Wikipedia article.
Wikidata can be queried over the Internet, but it's possible it would also need to be processed as a dump at map generation volumes, which could negate the benefit.
@nvarner It's a good point, I don't know why the relationship between the Wikidata and Wikipedia dumps isn't more structured.
Perhaps I am missing something, but would there be anything wrong with doing the current workflow (but with dump data) of getting the Wikipedia pages, and then populating any newly linked articles found from Wikidata?
Valid questions, thanks for raising them!
Regarding Wikidata: the Wikipedia dump should also contain Wikidata Q-codes for articles. Can you please confirm that they are there? It would mean that you can check whether the currently processed article is in the list from OSM features, and create HTML for it.
Regarding the process:
Wikidata can be queried over the Internet, but it's possible it would also need to be processed as a dump at map generation volumes, which could negate the benefit.
I agree; using the API would require fixing all the problems with the current "scraper" and would be a worst-of-both-worlds situation.
I don't know why the relationship between the Wikidata and Wikipedia dumps isn't more structured.
Regarding Wikidata: the Wikipedia dump should also contain Wikidata Q-codes for articles. Can you please confirm that they are there? It would mean that you can check whether the currently processed article is in the list from OSM features, and create HTML for it.
What I've found while writing my proposal is that the Wikipedia dumps don't have any direct links to Wikidata.
It also seems like the actual XML article content of the articles-multistream, articles, and meta-current dumps is equivalent.
I've sliced out the Anarchism page from all of them below.
The Wikidata Q-code for it, Q6199, doesn't appear in any of them.
There is a related wbc_entity_usage enwiki dump that "Tracks which pages use which Wikidata items or properties and what aspect (e.g. item label) is used", but this seems to include more than a 1-1 mapping of article -> Q-Code, and I haven't dug into it because I think using the wikidata dumps is the better route.
Wikidata provides its own dumps, but the equivalent XML one is a whopping ~130 GB and the "recommended" JSON one is still ~80 GB, larger than the article dumps in all the currently supported languages combined.
Thankfully there's a wb_items_per_site dump which is only 1.5 GB and contains exactly what's needed: a mapping of Wikidata Q-code to Wikipedia article names in all languages.
The only catch is that it's a SQL dump, so parsing it would be a bit hacky, but very doable and I think very fast. Alternatively, deal with MySQL and load it into that.
It looks like this:
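For illustration only, a rough Rust sketch of that "hacky but doable" parsing, assuming the usual wb_items_per_site column order (ips_row_id, ips_item_id, ips_site_id, ips_site_page) inside the dump's big INSERT statements; the sample line is made up:

```rust
use std::collections::HashMap;

use regex::Regex; // assumed dependency: regex = "1"

/// Extract (Q-id, article title) pairs for one wiki (e.g. "enwiki") from the
/// INSERT lines of the wb_items_per_site SQL dump. Column order is assumed to
/// be (ips_row_id, ips_item_id, ips_site_id, ips_site_page).
fn titles_for_site(sql: &str, site: &str) -> HashMap<u64, String> {
    // Matches tuples like (55,3596065,'enwiki','Algeria'); titles may contain
    // escaped quotes (\'), which the second alternative allows for.
    let row = Regex::new(r"\(\d+,(\d+),'([^']+)','((?:[^'\\]|\\.)*)'\)").unwrap();
    let mut out = HashMap::new();
    for cap in row.captures_iter(sql) {
        if &cap[2] == site {
            let qid: u64 = cap[1].parse().unwrap_or(0);
            out.insert(qid, cap[3].replace("\\'", "'"));
        }
    }
    out
}

fn main() {
    // Made-up sample line in the dump's INSERT format.
    let line = r"INSERT INTO `wb_items_per_site` VALUES (55,3596065,'enwiki','Algeria'),(56,3596065,'dewiki','Algerien');";
    let map = titles_for_site(line, "enwiki");
    assert_eq!(map.get(&3596065_u64).map(String::as_str), Some("Algeria"));
}
```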
So you'd have a separate download and run to generate the article names in supported languages from OSM's wikidata IDs, and then the wikipedia extractor can combine those with the article names directly from OSM.
I looked through the listed tools for working with dumps, but they either aren't in our target languages or want the full wikidata xml/json dump.
OsmAnd uses Wikidata JSON dumps; I'm not clear whether that's the full one or a smaller one I haven't seen yet.
Full dump/big size is not an issue for a server.
In that case the wikidata json dump seems like the right choice, depending on processing time.
There is a related wbc_entity_usage enwiki dump that "Tracks which pages use which Wikidata items or properties and what aspect (e.g. item label) is used", but this seems to include more than a 1-1 mapping of article -> Q-Code, and I haven't dug into it because I think using the wikidata dumps is the better route.
This might be OK, especially if there is the intention to use a database for storing the last-modified time of wiki articles, reviews, etc. in the near future.
I can confirm that OSM doesn't store the last-modified time of articles (see here for wiki data filtered from OSM data).
The JSON data does appear to be the best option at the moment. Although it has a large overhead, it has all available languages centralised (see here for a snippet), and of course the "sitelinks" to the corresponding wiki articles.
Also, from reading these Wikidata docs, I just wanted to check whether this proposed workflow is OK. Does the fact that we intend to import these into their own HTML files avoid this licensing problem?
Also, from reading these Wikidata docs, I just wanted to check whether this proposed workflow is OK. Does the fact that we intend to import these into their own HTML files avoid this licensing problem?
I'm pretty sure that's talking about editing Wikidata with info from OSM, not displaying them next to each other or using one to access/convert the other.
Whether the current display of the OSM and Wikipedia data in the app meets the terms of their respective licenses is another question that I don't know the answer to.
While researching this issue, I stumbled upon Wikimedia Enterprise API. They provide monthly Wikipedia HTML dumps for free, and I guess it counts as frequent enough for OM purposes. This luckily means no wikitext parsing is needed!
if there is the intention to use a database for storing the last-modified time of wiki articles, reviews, etc. in the near future.
There is no such intention. Storing something important in files may be ok/enough. Last modified time, for example, can be set directly in the filesystem, if it is really necessary.
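A minimal sketch of that, using the filetime crate (an assumed dependency) to stamp a generated HTML file with the article's revision time; the output layout is hypothetical, not the project's actual structure:

```rust
use std::fs;

use filetime::{set_file_mtime, FileTime}; // assumed dependency: filetime = "0.2"

fn write_article(path: &str, html: &str, modified_unix_secs: i64) -> std::io::Result<()> {
    fs::write(path, html)?;
    // Store the wiki revision time directly in the filesystem, so no extra
    // database or sidecar file is needed to decide whether to regenerate.
    set_file_mtime(path, FileTime::from_unix_time(modified_unix_secs, 0))
}

fn main() -> std::io::Result<()> {
    // Hypothetical output layout: <out dir>/<Q-id>/<lang>.html
    fs::create_dir_all("out/Q6199")?;
    write_article("out/Q6199/en.html", "<html>...</html>", 1_672_531_200)
}
```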
Also, from reading these Wikidata docs, I just wanted to check whether this proposed workflow is OK. Does the fact that we intend to import these into their own HTML files avoid this licensing problem?
I do not understand the problem in our case. Can you please elaborate?
Wiki enterprise costs money. That's not an option now.
I was just checking that the following isn't applicable: "Copying data from Wikidata into OSM is also generally not allowed - see OpenStreetMap:Wikidata#Importing_data for details."
Is this just talking about importing Wikidata into the shared OSM data? So in this case, downloading an OSM file and attaching any additional wiki data isn't a problem, is it?
It is about adding data from the Wikidata database into the OSM database.
It is not our case. We're pulling info from both and mixing it in a viewer app.
It is about adding data from the Wikidata database into the OSM database.
It is not our case. We're pulling info from both and mixing it in a viewer app.
Thanks for clearing that up, and answering my many questions so promptly and clearly.
I think I've got a good understanding of what is expected for this project.
I need to get a proposal together so I'm at least in with a chance of getting to work on this!
Wiki enterprise costs money.
I'm sorry for not describing my findings clearly enough. You're right, Wiki Enterprise does offer paid plans with frequent data updates. But they also publicly release monthly dumps for free:
At the moment I don't have enough hard drive space to download those and play with them. Anyway, I still believe that processing dumped HTML would be much easier than parsing wikitext. (Possibly these dumps could even be fitted in as a replacement for the parser in the current workflow!) And being updated once or twice a month is, IMO, sufficient for an offline map wiki.
If it's a reliable source and they won't stop publishing, and dumps are regular/free, then of course they can also be used.
Here's one of the entries from the category 0 enterprise dump (for this page): myron_reed_enterprise_html.json.txt
Anyway, I still believe that processing dumped HTML would be much easier than parsing wikitext.
I agree!
It does mention that it's currently an experimental service, and I'm unclear whether "partial" and "for a specific set of namespaces" really means all articles, but it does seem to have all the data that's needed: Q-ID, title, and HTML.
It looks like the html doesn't include any images.
@biodranik I have just submitted a proposal for this project. Any feedback on this would be very much appreciated.
The other thing to mention regarding the Enterprise HTML dumps is that they don't use multistream compression like the XML dumps do, so you would have to parse every line. Instead, each article is stored as a JSON object (one per line of an NDJSON file). I've played around with extracting its data, but I haven't done any tests to see what difference in performance it would make just yet.
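For reference, a rough Rust sketch of walking such an NDJSON dump with serde_json; the file name and the field names (name, main_entity.identifier, article_body.html) are assumptions about the Enterprise entries, so treat them as illustrative only:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use serde_json::Value; // assumed dependency: serde_json = "1"

fn main() -> std::io::Result<()> {
    // One JSON object per line (NDJSON); the file name is hypothetical.
    let reader = BufReader::new(File::open("enwiki_namespace_0.ndjson")?);
    for line in reader.lines() {
        let line = line?;
        let Ok(article) = serde_json::from_str::<Value>(&line) else {
            continue; // skip malformed lines rather than aborting the run
        };
        // Field names are assumptions about the dump format.
        let title = article["name"].as_str().unwrap_or_default();
        let qid = article["main_entity"]["identifier"].as_str().unwrap_or_default();
        let html = article["article_body"]["html"].as_str().unwrap_or_default();
        // Here the id/title would be checked against the OSM set and the
        // HTML handed off to the simplifier.
        println!("{qid}\t{title}\t{} bytes of html", html.len());
    }
    Ok(())
}
```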
But I can +1 that not having to battle with Wikitext would be great, particularly given the limited tooling available to convert it to HTML. But how hard can it be to make our own? (Famous last words)...
But how hard can it be to make our own? (Famous last words)...
After reading the READMEs of several MediaWiki parsing libraries, I've got the impression it would be extremely hard. For example, the widespread use of macros and templates (e.g. for pronunciation or even for length measures) makes it necessary to process whole XML dumps, including extra namespaces, and to perform computation for every entry. MediaWiki syntax itself cannot be parsed in a linear manner; recursive tricks and token counting are used extensively. Overall, there is a reason that one of the most starred projects in this field is named mwparserfromhell...
Wow, I didn't know that wikitext is so complex... But is it possible to implement just a simple subset of it? We need just previews/short summaries of the articles, so we can skip pronunciation macros/templates, tables, images, and other optional/complicated stuff.
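Just to gauge how such a subset might look, here is a toy Rust sketch (not a real parser, and not how any existing tool does it) that drops {{templates}} with a simple depth counter, unwraps [[wiki links]] to their visible text, and strips ''italic''/'''bold''' quote markup, leaving everything else as plain text:

```rust
/// Extremely naive wikitext "subset" cleaner for previews: drops templates,
/// unwraps internal links, strips ''italic''/'''bold''' markup.
fn strip_wikitext_subset(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    let mut template_depth = 0usize;
    while let Some(c) = chars.next() {
        match c {
            // {{ ... }} templates, possibly nested: track the depth, emit nothing.
            '{' if chars.peek() == Some(&'{') => {
                chars.next();
                template_depth += 1;
            }
            '}' if chars.peek() == Some(&'}') && template_depth > 0 => {
                chars.next();
                template_depth -= 1;
            }
            _ if template_depth > 0 => {}
            // [[target|label]] or [[target]]: keep only the visible text.
            '[' if chars.peek() == Some(&'[') => {
                chars.next();
                let mut inner = String::new();
                while let Some(&n) = chars.peek() {
                    if n == ']' {
                        chars.next();
                        if chars.peek() == Some(&']') {
                            chars.next();
                        }
                        break;
                    }
                    inner.push(n);
                    chars.next();
                }
                out.push_str(inner.rsplit('|').next().unwrap_or(""));
            }
            // '' and ''' emphasis markers: drop the quotes, keep the text.
            '\'' if chars.peek() == Some(&'\'') => {
                while chars.peek() == Some(&'\'') {
                    chars.next();
                }
            }
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    let src = "'''Anarchism''' is a {{short description|political philosophy}}[[political philosophy|philosophy]].";
    assert_eq!(strip_wikitext_subset(src), "Anarchism is a philosophy.");
}
```

Real articles rely on nested templates, tables, references, and HTML fragments, so anything production-grade would still need much more care, but a subset like this may be enough for short previews.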
@newsch can we close this issue? 🙃
Thanks to @newsch, the wiki parser is implemented: https://github.com/organicmaps/wikiparser/
There are several issues now with our current crawler implementation:
A way better option is to download a dump of all Wiki articles for given languages and extract summaries from there directly.
Any volunteers for this task?
A list of supported languages and the output format can be checked in the existing implementation in tools/python/descriptions.