repology / repology-updater

Repology backend service to update repository and package data
https://repology.org
GNU General Public License v3.0

Wikipedia support #434

Closed AMDmi3 closed 5 years ago

AMDmi3 commented 6 years ago

Wikipedia has a lot of articles on F/OSS projects with structured, up-to-date information, including versions. There should be a way to get a dump of it.

https://en.m.wikipedia.org/w/index.php?title=Category:Latest_stable_software_release_templates

AMDmi3 commented 6 years ago

We may also use Wikidata:

https://m.wikidata.org/wiki/Q306144

Wikidata support has moved to #436. However, we still need Wikipedia support: unfortunately, Wikipedia is not integrated with Wikidata, and it has more up-to-date information on software versions.

jtojnar commented 6 years ago

Hmm, they still do not use Wikidata for that. Adding https://en.wikipedia.org/wiki/Template_talk:Infobox_software/FAQ to my ever-growing to-do list.

AMDmi3 commented 5 years ago

Some relevant bits discovered:

  1. Go to https://en.wikipedia.org/wiki/Special:Export
  2. Add pages from category Cross-platform_free_software
  3. This produces a ~5.5 MB XML export containing the relevant articles, 371 of them
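
The export file from these steps can be walked with the Python standard library. This is a minimal sketch, not what Repology does: it assumes the usual MediaWiki export layout (`<page>` elements containing a `<title>` and a `<revision>` with a `<text>` body) and that only the current revision of each page was exported. The function names `local_name` and `iter_pages` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import IO, Iterator, Tuple, Union


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def iter_pages(source: Union[str, IO[bytes]]) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` is a file name or a binary file-like object. The export schema
    namespace changes between versions, so tags are matched by local name.
    """
    for _event, elem in ET.iterparse(source, events=('end',)):
        if local_name(elem.tag) != 'page':
            continue
        title = ''
        text = ''
        # If several revisions were exported, the last <text> seen wins.
        for child in elem.iter():
            name = local_name(child.tag)
            if name == 'title':
                title = child.text or ''
            elif name == 'text':
                text = child.text or ''
        yield title, text
        elem.clear()  # release the page's children once it has been consumed
```

Calling `iter_pages()` with the path of the downloaded export (any file name works) then gives one `(title, wikitext)` pair per article, which is the input needed for looking at infobox fields.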

AMDmi3 commented 5 years ago

Titles:

7-Zip ADMB APT (Package Manager) AbiWord Activiti (software) AeroGear Agda (programming language) Antiword Apache Ant Apache Bloodhound Apache Groovy Apache HTTP Server Apache Karaf Apache MXNet Apache Nutch Apache OJB Apache OpenOffice Apache POI Apache Struts 1 Apache Struts 2 Apache Subversion Apache Tomcat Apache Traffic Server Arc (programming language) Ardour (software) Armitage (computing) Art of Illusion AspectJ Astrolog Audacity (audio editor) Audiveris Authbind Autoconf Automake Avidemux Ballerina (programming language) Bash (Unix shell) Bc (programming language) Berkeley Open Infrastructure for Network Computing Bitcoin Core BleachBit Blender (software) BlueBream BlueJ Brackets (text editor) Bugzilla C shell CGI:IRC CLAN program CURL CVSNT Caddy (web server) Cafu Engine Calibre (software) Camunda Cassandre software Category:Berkeley Open Infrastructure for Network Computing Projects Category:Emacs Category:LibreOffice Category:OpenOffice Category:Vi Celestia Chamilo ChatScript Chromium (web browser) Ciao (programming language) Clam AntiVirus Clean (programming language) Clojure Clozure CL Code::Blocks CodeLite Coding Analysis Toolkit Coherence (UPNP) CommaFeed Common Lisp Conch (SSH) Connotea Cppcheck Create Project Crosswalk Project CudaText Cuneiform (programming language) DBEdit DOSBox Darcs Dasher (software) DaviX DeaDBeeF Dia (software) Dillo Distributed Access Control System Double Commander EC (programming language) ELAN software EMMS (media player) Easyrec Eclipse (software) Eclipse Che Eggdrop Ehcache Embroidermodder Emby Endgame: Singularity Enigmail Eric (software) Erlang (programming language) Eureka Streams Eww (web browser) Ex Falso Exaile F Sharp (programming language) FET (timetabling software) FFmpeg FLTK FLUID Feedbin Finch (software) Firefox Foreman (software) Fossil (software) Free Pascal FreeFileSync Freedup GIMP GNAT Programming Studio GNOWSYS GNU C Library GNU Compiler Collection GNU Gatekeeper GNU Octave GNU TeXmacs GNUnet GNUstep 
GPAC Project on Advanced Content GPlates Ganymede (software) Geany Genie (programming language) Ggplot2 GitLab Glasgow Haskell Compiler Glide (API) Glossword Gmsh Gnuplot Gnus Gnuspeech Go (programming language) GoAgent Golly (program) Gosu (programming language) Gramps Gretl H2O (software) HandBrake Hiawatha (web server) Higan (emulator) Hoodie (software) HuMo-gen Hy I2P IKVM.NET INGENIAS IRIS (transportation software) IUP (software) Idris (programming language) Impressive (presentation program) Infinispan Info-ZIP Inkscape JXL (API) JXTA Jackson (API) Jexus Jitsi JobScheduler Jsish KH Coder Kid3 Kodi (software) Kune (software) Kurso de Esperanto LanguageTool LazPaint Lazarus (IDE) Leiden Open Variation Database LibGDX LibSBML LibVNCServer Libav Libdash LibreOffice Base LibreOffice Calc LibreOffice Writer LibreOffice Libwww Light (web browser) Lighttpd Lightweight Java Game Library Little b (programming language) Lua (programming language) LyX Lynx (web browser) MEncoder MP3Gain MPlayer Mahogany (email client) Maqetta Marabunta (software) Marketcetera Mathomatic Maxima (software) MediaInfo MediaWiki Mercurial Mercury (programming language) Metasploit Project MilkyTracker Mixxx Mod openpgp MonetDB Mozilla Prism Mozilla Thunderbird Mpv (media player) MuseScore Nana (C++ library) Neko (software) NetBeans NetSurf Nginx Nitro (software) Nmap Normaliz OCaml ORBX.js OWASP ZAP Off-the-Record Messaging Opa (programming language) Open Broadcaster Software Open Cobalt OpenBoard OpenDroneMap OpenGrok OpenLDAP OpenLP OpenOffice.org OpenProject OpenSSH OpenSearchServer OpenStudio OpenWebGlobe Orange (software) Oxwall PARI/GP PDF.js PGF/TikZ PHP-Crawler PJIRC Padre (software) Pan (programming language) Panda3D Pcap PeaZip Persistent uniform resource locator PhpLiteAdmin PhpMyAdmin PhpWiki Pidgin (software) Pinta (software) Pisg (software) PlayCanvas Polipo Posadis Previous (software) Prey (software) Processing (programming language) Programming with Big Data in R PuTTY PukiWiki 
Puppet (software) Pure (programming language) PyQt Pylons (web framework) Pylons Framework Pylons project Pyramid (web framework) Python (programming language) QDevelop QEMU Qt Creator QtWeb Quantitative Discourse Analysis Package Query Abstraction Layer QuickFIX Quod Libet (software) R (programming language) REPLAY (software) RQDA Racket (programming language) Radare2 Reason (programming language) Redmine Renjin RetroShare Ring (programming language) SNAMP SOFA Statistics SQLite Sahana FOSS Disaster Management System Scala (programming language) SchoolTool Scuttle (software) Self (programming language) Sigil (application) SmallBASIC Smalltalk Snappy (compression) Spring Web Flow Squeak Squid (software) Squirrel (programming language) Stevie (text editor) Subtitle Edit SwellRT Syncthing Syndie Synfig TANGO TYPO3 Tahoe-LAFS Taskfreak Taskwarrior Tcl Tcpdump Tcsh TeXworks TetGen Thttpd TiddlyWiki TigerVNC Tiki Wiki CMS Groupware Trac Trojitá UWSGI Ultracopier Urbiscript VIPS (software) VLC media player VTD-XML Vala (programming language) Vim (text editor) VirtualBox Virtuoso Universal Server VisualEditor W3af WASTE WSO2 Mashup Server Waarp Wget Wireshark WordGrinder Workrave XOWA Xdebug Xen Xinu ZMap (software) ZeroTier Zope 2 Zope

AMDmi3 commented 5 years ago

Release version examples:

| latest release version = <!-- If you update [[Template:phpMyAdmin version]], it will automatically update this page and [[Comparison of database tools]]--> {{phpMyAdmin version}}
| latest release version = 0.2.0-beta
| latest release version = 3.3
| latest release version = 13.0
| latest_release_version = 2.6<ref>{{cite web|url=https://jsish.org/fossil/jsi/taglist|title=Tags|accessdate=14 November 2018}}</ref>
| latest release version = 4.11 (2019-03 R)<ref>{{cite web|title=Simultaneous Release - Eclipsepedia|url=https://wiki.eclipse.org/Simultaneous_Release|website=Wiki.eclipse.org|accessdate=2018-03-23}}</ref>
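
To illustrate what pulling a version out of lines like these involves, here is a rough sketch using regular expressions. It is an assumption-laden illustration, not a reliable parser: it only handles a field whose value sits on a single line, strips HTML comments and `<ref>` footnotes, and leaves template calls such as `{{phpMyAdmin version}}` unresolved. The function name `latest_release_version` is hypothetical.

```python
import re
from typing import Optional

# Field name appears with either spaces or underscores in the examples above.
_FIELD = re.compile(
    r'^[ \t]*\|[ \t]*latest[ _]release[ _]version[ \t]*=[ \t]*(.*)$',
    re.MULTILINE,
)
_COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
# Matches self-closing <ref ... /> as well as <ref ...>...</ref> pairs.
_REF = re.compile(r'<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>', re.DOTALL | re.IGNORECASE)


def latest_release_version(wikitext: str) -> Optional[str]:
    """Return the raw 'latest release version' value, or None if absent/empty.

    Comments and <ref> footnotes are removed; templates are returned as-is,
    because expanding them would need the template's own page.
    """
    match = _FIELD.search(wikitext)
    if match is None:
        return None
    value = _COMMENT.sub('', match.group(1))
    value = _REF.sub('', value)
    value = value.strip()
    return value or None
```

Even on the six examples above the result is mixed: plain values like `3.3` come out clean, `4.11 (2019-03 R)` still carries a release label, and the phpMyAdmin entry yields only a template reference.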

jtojnar commented 5 years ago

Might be better to do this in Repology. They already parse Wikidata.

Or, even better, work on the Wikipedia template so that it uses Wikidata (with proper sourcing support). See the FAQ at https://en.wikipedia.org/wiki/Template_talk:Infobox_software

Edit: I confused this with another issue tracker :-P

AMDmi3 commented 5 years ago

Yep, I had figured out that Wikipedia doesn't use Wikidata, so I'm looking for ways to get the data from Wikipedia directly. For now I don't see a reliable one, as I haven't found any way to get structured data out of Wikipedia.

AMDmi3 commented 5 years ago

It seems that Wikipedia is using Wikidata more and more, so it doesn't make much sense to parse Wikipedia itself (which is hardly technically possible, given what a mess it is).

AMDmi3 commented 5 years ago

For the record, here are some examples of Wikipedia entries that use data from Wikidata. Unfortunately, Wikidata is not yet commonly used on English Wikipedia, but it's already widely used in some language versions, namely French and Russian.

https://ru.wikipedia.org/wiki/Banshee
https://ru.wikipedia.org/wiki/Tux_Racer
https://fr.wikipedia.org/wiki/NGINX
https://ru.wikipedia.org/wiki/Libvirt
https://fr.wikipedia.org/wiki/Libvirt
https://fr.wikipedia.org/wiki/Claws_Mail
https://pl.wikipedia.org/wiki/Claws_Mail
https://ru.wikipedia.org/wiki/Claws_Mail