Wow! @Kawsay You're the first one to even peek at this part of the code. This forces me to document how it works and what future improvements are needed. :-)
You understood most of it correctly. First, understand LearnAwesome's entity model. A book, course, podcast, etc. is called an `Item`, which must have one `ItemType` and can have multiple `Link`s. We also connect related items by having each `Item` belong to an `IdeaSet`. For example: if the author of a book has given a TED Talk or a podcast interview, or written a blog post summarizing the book, they are all different `Item`s, each with a different `ItemType`. However, they all belong to the same `IdeaSet`. `Item`s themselves don't have a relationship with our `Topic`s. Instead, it's the `IdeaSet` that can have a many-to-many relationship with `Topic`s.
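To make the relationships above concrete, here is a minimal plain-Ruby sketch. It uses `Struct`s rather than the app's actual ActiveRecord models, and every attribute name (`name`, `links`, `topics`, and so on) is an assumption made for illustration.

```ruby
# Plain-Ruby sketch of the entity model (not the app's ActiveRecord classes).
# All attribute names are assumptions made for illustration.
Topic    = Struct.new(:name)
ItemType = Struct.new(:name)
Link     = Struct.new(:url)
IdeaSet  = Struct.new(:title, :topics, :items)
Item     = Struct.new(:name, :item_type, :links, :idea_set)

# Builds one IdeaSet holding a book and a talk about the same idea.
def build_example_idea_set
  # Topics hang off the IdeaSet, never off an individual Item.
  idea_set = IdeaSet.new("Example Book", [Topic.new("economics")], [])

  book = Item.new("Example Book", ItemType.new("book"),
                  [Link.new("https://example.com/book")], idea_set)
  talk = Item.new("Example Book (author talk)", ItemType.new("video"),
                  [Link.new("https://example.com/talk")], idea_set)
  idea_set.items.push(book, talk)
  idea_set
end

idea_set = build_example_idea_set
# Each item reaches its topics only through the shared IdeaSet.
idea_set.items.each do |item|
  puts "#{item.item_type.name}: #{item.idea_set.topics.map(&:name).join(', ')}"
end
```

The point of the indirection is that a new `Item` added to an existing `IdeaSet` is already tagged with the right topics.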
`book` is one of the `ItemType`s. We store various attributes for items, such as cover image URL, ISBN, page count, and author's name.
So, now let's think about what it takes to add a new link to our database:
- [...] `Link` table.
- [...] `ItemType` and `Topic`s. `ItemType` can sometimes be inferred, but not always. For example: if the URL is from YouTube, we can take this to be a `video`. If it's GoodReads, then it's a `book`.
- [...] `IdeaSet` and one `Item` belonging to this `IdeaSet`.
- [...] `ItemType`. So both a book and its summary belong to the same `IdeaSet`.

This is why, when we try to add a link via the browser extension, this is the form that is shown. The second Topic and Difficulty Level fields are optional:

[Screenshot: the browser extension's add-link form]
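As a rough illustration of the `ItemType` inference mentioned in the steps above, the sketch below maps a URL's host to a type name. The method name `infer_item_type` and the lookup table are hypothetical; only the YouTube and GoodReads cases come from the discussion.

```ruby
require "uri"

# Hypothetical host-to-type table; only these two cases come from the discussion.
HOST_ITEM_TYPES = {
  "youtube.com"   => "video",
  "goodreads.com" => "book"
}.freeze

# Returns an item type name, or nil when the type has to be chosen by hand.
def infer_item_type(url)
  host = URI.parse(url).host
  return nil if host.nil?

  # Drop a leading "www." so both host forms hit the same table entry.
  HOST_ITEM_TYPES[host.downcase.sub(/\Awww\./, "")]
rescue URI::InvalidURIError
  nil
end

puts infer_item_type("https://www.youtube.com/watch?v=abc").inspect
puts infer_item_type("https://example.com/post").inspect
```

Returning `nil` for unknown hosts keeps the "sometimes, but not always" caveat explicit: the caller still has to ask the user for a type.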
Now, let's think about books. Different sources of book data (Amazon, GoodReads, OpenLibrary, and bloggers like Derek Sivers or Nat Eliason who share summaries of all the books they have read) provide us with different sets of fields. So, in `app/utilities`, we have a pipeline of processors. We have a plain-Ruby object `Book` (which is not an ActiveRecord object) and pass it through various processors to enrich it. You can see one example here: https://github.com/learn-awesome/learn/blob/master/app/utilities/book.rb#L70
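The idea of passing a plain-Ruby `Book` through a chain of processors can be sketched as follows. This is not the code in `app/utilities`; the `Book` fields, the two processors, and the `enrich` helper are all assumptions, and the processors use canned values where real ones would query a data source.

```ruby
# Stand-in for the plain-Ruby Book object (field names are assumptions).
Book = Struct.new(:url, :title, :isbn, :page_count, :author, keyword_init: true)

# Each hypothetical processor takes a Book and returns it with more fields set.
# Real processors would call a data source; these use canned values.
TitleProcessor = lambda do |book|
  book.title ||= "Untitled"
  book
end

PageCountProcessor = lambda do |book|
  book.page_count ||= 0
  book
end

# Runs the book through every processor in order.
# Fields that are already set are left untouched by the ||= assignments.
def enrich(book, processors)
  processors.reduce(book) { |b, processor| processor.call(b) }
end

book = enrich(Book.new(url: "https://example.com/b/1", title: "Known Title"),
              [TitleProcessor, PageCountProcessor])
puts book.to_h
```

Because each processor only fills blank fields, the order of processors decides which source wins when two of them could supply the same field.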
- [...] `Book` object through some of the processors in `app/utilities`.
- [...] `Book` object with most fields filled in: title, description, cover image URL, topic(s), page_count, ISBN, author, etc.
- [...] `Item.create_or_update_book(b)`, which implements the logic described above for adding a new link.

Theoretically, each source of book data (GoodReads, Amazon, etc.) should have `extract`, `list` and `search` methods:
- `list`: parse some kind of listing page and return an array of `Book`s with at least one field, for example a URL
- `search`: take a `Book` and prepare it for `extract`
- `extract`: take a `Book` object and enrich it

None of these import jobs are currently scheduled as periodic tasks. They are only run manually as one-off tasks. `ImportGoodreadsListJob` was made just because, instead of having to write a scraper for press.stripe.com, I was able to find an existing GoodReads list. The primary use case for this extraction pipeline is in methods like `Book.import_four_minute_book_summaries`, `Book.import_sivers_book_summaries`, and `Book.import_blas`. Manual topic tagging (see #189) is the biggest obstacle to fully automating these and making them periodic. Additionally, not every item in a list (such as Nat Eliason's notes) is worth adding to our repository. So, what I have been doing is: export a list to JSON, manually add topics, remove entries that we don't want, and only then run it through the extraction pipeline.
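For reference, the `list` / `search` / `extract` contract described above might look roughly like this for a single source. `ExampleSource` and its canned `PAGES` data are hypothetical; a real source would fetch and parse pages, and the `Book` fields here are assumptions.

```ruby
# Stand-in for the plain-Ruby Book object (field names are assumptions).
Book = Struct.new(:url, :title, :author, keyword_init: true)

# Hypothetical in-memory source; a real one would fetch and parse pages.
class ExampleSource
  PAGES = {
    "https://example.com/b/1" => { title: "First Book",  author: "A. Author" },
    "https://example.com/b/2" => { title: "Second Book", author: "B. Author" }
  }.freeze

  # list: return Books carrying at least one field (here, the URL).
  def list
    PAGES.keys.map { |url| Book.new(url: url) }
  end

  # search: prepare a Book for extract, here by resolving a title to a URL.
  def search(book)
    return book if book.url

    match = PAGES.find { |_url, data| data[:title] == book.title }
    book.url = match.first if match
    book
  end

  # extract: enrich the Book, filling only fields that are still blank.
  def extract(book)
    data = PAGES.fetch(book.url, {})
    book.title  ||= data[:title]
    book.author ||= data[:author]
    book
  end
end

source = ExampleSource.new
books = source.list.map { |b| source.extract(source.search(b)) }
puts books.map(&:title).inspect
```

Keeping the three steps separate is what makes the manual curation step possible: the output of `list` can be exported, edited by hand, and only then fed to `search` and `extract`.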
But yes, they are supposed to be idempotent and update existing records (instead of creating new ones) whenever a match is found. Sometimes, I have to break the import process into two parts: somehow convert a listing into a JSON file (such as from blas.com), and then instantiate `Book` objects from it in order to pass them through the enrichment pipeline. Exporting a JSON file at some point allows me to assign topics manually, with high quality. As I wrote above, a fully automated pipeline is possible, but it leads to data pollution, especially when it comes to topics.
See here for an example: https://github.com/learn-awesome/learn/blob/master/app/utilities/book.rb#L85
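The two-part import and the idempotency requirement can be illustrated with a small self-contained sketch. It is not the real `Item.create_or_update_book`; `ItemStore` is a hypothetical in-memory stand-in for the database, and using the URL as the lookup key is an assumption.

```ruby
require "json"

# In-memory stand-in for the items table, keyed by URL (an assumed natural key).
class ItemStore
  attr_reader :records

  def initialize
    @records = {}
  end

  # Updates the existing record for this URL, or creates one. Safe to re-run.
  def create_or_update(attrs)
    url = attrs.fetch("url")
    record = (@records[url] ||= {})
    # For keys present on both sides, keep the old value when the new one is nil.
    record.merge!(attrs) { |_key, old, new| new.nil? ? old : new }
    record
  end
end

# A hand-curated export: topics added manually, unwanted entries removed.
curated_json = <<~JSON
  [
    {"url": "https://example.com/b/1", "title": "First Book", "topics": ["economics"]},
    {"url": "https://example.com/b/2", "title": "Second Book", "topics": ["history"]}
  ]
JSON

store = ItemStore.new
# Running the import twice must not create duplicate records.
2.times { JSON.parse(curated_json).each { |attrs| store.create_or_update(attrs) } }
puts store.records.size
```

Whatever key the real implementation matches on, the property to preserve is the same: re-running an import leaves the record count unchanged and never blanks out a field that was already filled.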
Before going further, I'd like to ask a couple of questions about how classes from `/app/utilities/` come into play with the rest of the application.

Assumptions (please tell me if I'm wrong or too imprecise):
- [...] `:search`, `:extract` and `:list`, which are meant to be called by a `Book` object (defined in `/app/utilities/book.rb`); if the resource is a book
- [...] `Book` instance it gets as a parameter, in order to instantiate an `Item` object. As many of the `Book`'s attributes as possible should be set
- [...] `ImportGoodReadListJob`, and therefore needs to be idempotent (and return early if possible)

Could I have a brief recap of the flow that leads to classes from `/app/utilities/`?

The PR is mostly here to compare my understanding of those features with the expected ones. I'm not expecting it to be merged, as it requires more work.