strinking / docflow

A Discord Bot for evaluating code and viewing documentation, built for the Programming Discord Server.

Migrate to DevDocs Scraper #34

Open jchristgit opened 7 years ago

jchristgit commented 7 years ago

DevDocs is an open source service that displays documentation on its website. The project uses a scraper which pulls documentation data for tons of different languages and even frameworks. It's written in Ruby, but since it exports to JSON, the resulting data should be easy to integrate into the Bot, and it would save us a lot of time compared to writing the scrapers ourselves.

jchristgit commented 7 years ago

I did some testing with this. The scraped data is stored as simple HTML. For example, this was scraped from the JavaScript documentation:

<h1>Functions</h1> <p>Generally speaking, a function is a "subprogram" that can be <em>called</em> by code external (or internal in the case of recursion) to the function. Like the program itself, a function is composed of a sequence of statements called the <em>function body</em>. Values can be <em>passed</em> to a function, and the function will <em>return</em> a value.</p> <p>In JavaScript, functions are first-class objects, because they can have properties and methods just like any other object. What distinguishes them from other objects is that functions can be called. In brief, they are <code><a href="global_objects/function">Function</a></code> objects.</p> <p>For more examples and explanations, see also the <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Functions">JavaScript guide about functions</a>.</p> <h2 id="Description">Description</h2> <p>Every function in JavaScript is a <code>Function</code> object. See <a href="global_objects/function"><code>Function</code></a> for information on properties and methods of <code>Function</code> objects.</p> <p>To return a value other than the default, a function must have a <code><a href="statements/return">return</a></code> statement that specifies the value to return. A function without a return statement will return a default value. In the case of a <a href="global_objects/object/constructor">constructor</a> called with the <code><a href="operators/new">new</a></code> keyword, the default value is the value of its <code>this</code> parameter. For all other functions, the default return value is <a href="global_objects/undefined"><code>undefined</code></a>.</p> <p>The parameters of a function call are the function's <em>arguments</em>. Arguments are passed to functions <em>by value</em>. If the function changes the value of an argument, this change is not reflected globally or in the calling function. 
However, object references are values, too, and they are special: if the function changes the referred object's properties, that change is visible outside the function, as shown in the following example:</p> <pre data-language="js">/* Declare the function 'myFunc' */
function myFunc(theObject) {
   theObject.brand = "Toyota";
 }

(from https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Functions, I guess)
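
To get a feel for how the Bot could turn this HTML into something postable in a Discord message, here is a minimal sketch using only Python's standard `html.parser`. The class name `DocTextExtractor` and the helper `html_to_text` are made up for illustration and are not part of DevDocs or the Bot. The sketch assumes we only care about visible text, want `<code>` rendered with backticks, and can drop whitespace-only text between tags.

```python
from html.parser import HTMLParser


class DocTextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text from a scraped article."""

    # Tags after which we start a new paragraph in the output.
    BLOCK_TAGS = {"p", "h1", "h2", "h3", "pre"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Render inline code with backticks, as Discord markdown does.
        if tag == "code":
            self.parts.append("`")

    def handle_endtag(self, tag):
        if tag == "code":
            self.parts.append("`")
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Skip whitespace-only text between tags (an assumption of this sketch).
        if data.strip():
            self.parts.append(data)


def html_to_text(html):
    """Return the article's text with tags removed."""
    parser = DocTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


# Shortened sample in the same shape as the scraped article above.
sample = (
    "<h1>Functions</h1> "
    "<p>Every function in JavaScript is a <code>Function</code> object.</p>"
)
print(html_to_text(sample))
```

Links are flattened to their text here. Resolving the relative `href` values against the documentation root would be a separate step.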

Each language directory contains an index.json file which maps names to their articles. Here's an excerpt from the JavaScript index.json:

"entries":[  
   {  
      "name":"!",
      "path":"operators/logical_operators#Logical_NOT",
      "type":"Operators"
   },
   {  
      "name":"!=",
      "path":"operators/comparison_operators#Inequality",
      "type":"Operators"
   },
   {  
      "name":"!==",
      "path":"operators/comparison_operators#Nonidentity",
      "type":"Operators"
   },
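
Given that layout, looking up a name could be as simple as the sketch below. It assumes the index is a JSON object with an `entries` list shaped like the excerpt above, and that the part of `path` after `#` is an anchor inside the article. `build_lookup` and `SAMPLE_INDEX` are hypothetical names used only for this example.

```python
import json

# The three entries from the excerpt above, closed off so the JSON parses.
SAMPLE_INDEX = """
{"entries": [
  {"name": "!", "path": "operators/logical_operators#Logical_NOT", "type": "Operators"},
  {"name": "!=", "path": "operators/comparison_operators#Inequality", "type": "Operators"},
  {"name": "!==", "path": "operators/comparison_operators#Nonidentity", "type": "Operators"}
]}
"""


def build_lookup(index_text):
    """Map each entry name to (article path, anchor or None, type)."""
    lookup = {}
    for entry in json.loads(index_text)["entries"]:
        # Split "article#anchor" into its two parts; anchor may be absent.
        article, _, anchor = entry["path"].partition("#")
        lookup[entry["name"]] = (article, anchor or None, entry["type"])
    return lookup


lookup = build_lookup(SAMPLE_INDEX)
print(lookup["!=="])  # ('operators/comparison_operators', 'Nonidentity', 'Operators')
```

A plain dict keeps exact-name lookups cheap. Fuzzy matching for user queries would need something on top of this.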

Looking at the other scraped files, it appears that they all share a common format. Now the question is: how do we properly add the scraper to our repository? A submodule would probably be the easiest way, but we don't need the entire DevDocs stack. What are your thoughts?