learn-awesome / learn

A social network of lifelong learners built around humanity's universal learning map.
https://learnawesome.org/

WIP add Nat Eliason notes scraper, partially resolve #188 #238

Closed · Kawsay closed this 1 year ago

Kawsay commented 3 years ago

Before going further, I'd like to ask a couple of questions about how the classes in /app/utilities/ come into play with the rest of the application.

Assumptions (please tell me if I'm wrong or too imprecise):

Could I have a brief recap of the flow that leads to the classes in /app/utilities/?

The PR is mostly here to compare my understanding of these features with the expected behaviour. I'm not expecting it to be merged, as more work remains to be done.

nileshtrivedi commented 3 years ago

Wow! @Kawsay You're the first one to even peek at this part of the code. This forces me to document how it works and what future improvements are needed. :-)

You understood most of it correctly. First, understand LearnAwesome's entity model. A book/course/podcast etc. is called an Item, which must have one ItemType and can have multiple Links. We also connect related items by having each Item belong to an IdeaSet. For example: if the author of a book has given a TED Talk, a podcast interview, or a blog post summarizing the book, these are all different Items, each with a different ItemType, but they all belong to the same IdeaSet. Items themselves don't have a relationship with our Topics; instead, it's the IdeaSet that has a many-to-many relationship with Topics.
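To make those relationships concrete, here is a minimal plain-Ruby sketch of the entity model (hypothetical; the real app uses Rails/ActiveRecord models, and only the class names Item, ItemType, Link, IdeaSet, and Topic come from the description above):

```ruby
# Minimal non-ActiveRecord sketch of LearnAwesome's entity model.
Topic    = Struct.new(:name)
ItemType = Struct.new(:name)   # e.g. "book", "video", "podcast"
Link     = Struct.new(:url)

# An IdeaSet groups related Items and carries the many-to-many Topic tagging.
class IdeaSet
  attr_reader :items, :topics
  def initialize
    @items  = []
    @topics = []
  end
end

# An Item has exactly one ItemType, many Links, and belongs to an IdeaSet.
class Item
  attr_reader :item_type, :links, :idea_set
  def initialize(item_type:, idea_set:)
    @item_type = item_type
    @links     = []
    @idea_set  = idea_set
    idea_set.items << self
  end
end

# A book and a TED Talk by the same author are distinct Items sharing
# one IdeaSet; topics attach to the IdeaSet, not to the Items.
set = IdeaSet.new
set.topics << Topic.new("habits")
book = Item.new(item_type: ItemType.new("book"),  idea_set: set)
talk = Item.new(item_type: ItemType.new("video"), idea_set: set)
```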

book is one of the ItemTypes. We store various attributes for items, such as cover image URL, ISBN, page count, and author's name.

So, now let's think about what it takes to add a new link to our database:

This is why, when we try to add a link via the browser extension, this is the form that is shown. The Second Topic and Difficulty Level fields are optional: [screenshot of the add-link form]

Now, let's think about books. Different sources of book data (Amazon, GoodReads, OpenLibrary, bloggers like Derek Sivers or Nat Eliason who share summaries of all the books they have read, etc.) provide us with different sets of fields. So, in app/utilities, we have a pipeline of processors. We have a plain-Ruby Book object (not an ActiveRecord model) and pass it through various processors to enrich it. You can see one example here: https://github.com/learn-awesome/learn/blob/master/app/utilities/book.rb#L70
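The processor pipeline might look roughly like this. This is a hypothetical sketch, not the code in app/utilities: the processor names, the lookup table, and the `enrich` helper are all illustrative; only the idea of a plain-Ruby Book passed through source-specific enrichers comes from the comment above.

```ruby
# Sketch of the enrichment pipeline: a plain-Ruby Book (not ActiveRecord)
# is passed through processors that each fill in whatever fields their
# source provides, leaving already-populated fields untouched.

class Book
  attr_accessor :title, :isbn, :cover_url, :author, :page_count
  def initialize(title:)
    @title = title
  end
end

# Hypothetical processor: fills in the ISBN from a (stubbed) lookup.
class OpenLibraryProcessor
  def call(book)
    book.isbn ||= lookup_isbn(book.title)
    book
  end

  private

  def lookup_isbn(title)
    # The real pipeline would call an external API; stubbed for the sketch.
    { "Atomic Habits" => "9780735211292" }[title]
  end
end

# Hypothetical processor: derives a cover URL once an ISBN is known.
class CoverProcessor
  def call(book)
    book.cover_url ||= "https://covers.example.org/#{book.isbn}.jpg" if book.isbn
    book
  end
end

# Each processor enriches the same object in turn.
PIPELINE = [OpenLibraryProcessor.new, CoverProcessor.new]

def enrich(book)
  PIPELINE.reduce(book) { |b, stage| stage.call(b) }
end

book = enrich(Book.new(title: "Atomic Habits"))
```

Ordering matters here by design: a later processor (covers) can depend on a field that an earlier one (ISBN lookup) filled in.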

Theoretically, each source of book data (GoodReads, Amazon, etc.) should have extract, list, and search methods:
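That per-source interface could be sketched as a small Ruby duck type. Only the three method names (extract, list, search) come from the comment above; the module name, signatures, and stub behaviour are assumptions:

```ruby
# Hypothetical interface every book-data source would implement.
module BookSource
  # extract: given one item URL, return a hash of fields for that book.
  def extract(url)
    raise NotImplementedError
  end

  # list: given a listing URL, return field hashes for every book on it.
  def list(url)
    raise NotImplementedError
  end

  # search: given a query string, return candidate matches.
  def search(query)
    raise NotImplementedError
  end
end

# Illustrative stub implementation of the interface.
class StubSource
  include BookSource

  def extract(url)
    { url: url, title: "Example Book" }  # fields are stubbed
  end

  def list(url)
    [extract("#{url}/1"), extract("#{url}/2")]
  end

  def search(query)
    list("https://example.org/search?q=#{query}")
  end
end
```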

nileshtrivedi commented 3 years ago

None of these import jobs are currently scheduled as periodic tasks; they are only run manually as one-off tasks. ImportGoodreadsListJob exists only because, instead of having to write a scraper for press.stripe.com, I was able to find an existing GoodReads list. The primary use case for this extraction pipeline is in methods like Book.import_four_minute_book_summaries, Book.import_sivers_book_summaries, and Book.import_blas. Topic tagging being manual (see #189) is the biggest obstacle to fully automating these and making them periodic. Additionally, not every item in a list (such as Nat Eliason's notes) is worth adding to our repository. So what I have been doing is: export a list to JSON, manually add topics, remove entries that we don't want, and only then run it through the extraction pipeline.

But yes, they are supposed to be idempotent and update existing records (instead of creating new ones) whenever a match is found. Sometimes I have to break the import process into two parts: somehow convert a listing (such as from blas.com) into a JSON file, and then instantiate Book objects from it in order to pass them through the enrichment pipeline. Exporting a JSON file at some point allows me to assign topics manually, with high quality. As I wrote above, a fully automated pipeline is possible, but it leads to data pollution, especially when it comes to topics.
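The idempotency described above amounts to a find-or-update. Here is a sketch with an in-memory Hash standing in for the database (the real code uses ActiveRecord, and the choice of ISBN as the lookup key is an assumption for illustration):

```ruby
# Hypothetical idempotent import: re-running the import updates the
# existing record (keyed by ISBN here) instead of creating a duplicate.
STORE = {}  # isbn => attributes hash, stands in for the books table

def import_book(attrs)
  record = STORE[attrs[:isbn]] ||= {}  # find existing record or create one
  record.merge!(attrs)                 # update in place; never duplicates
  record
end

import_book(isbn: "9780735211292", title: "Atomic Habits")
import_book(isbn: "9780735211292", page_count: 320)  # re-run only enriches
```

With ActiveRecord the same shape is typically `find_or_initialize_by` on the lookup key followed by `update`, which is why re-running an import is safe.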

See here for an example: https://github.com/learn-awesome/learn/blob/master/app/utilities/book.rb#L85