learn-awesome / learn

A social network of lifelong learners built around humanity's universal learning map.
https://learnawesome.org/

WIP add Nat Eliason notes scraper, partially resolve #188 #238

Closed · Kawsay closed this 1 year ago

Kawsay commented 3 years ago

Before going further, I'd like to ask a couple of questions about how the classes in /app/utilities/ come into play with the rest of the application.

Assumptions (please tell me if I'm wrong or too imprecise):

Could I have a brief recap of the flow that leads to the classes in /app/utilities/?

The PR is mostly here to compare my understanding of these features with the expected behaviour. I'm not expecting it to be merged, as more work remains to be done.

nileshtrivedi commented 3 years ago

Wow! @Kawsay You're the first one to even peek at this part of the code. This forces me to document how it works and what future improvements are needed. :-)

You understood most of it correctly. First, understand LearnAwesome's entity model. A book/course/podcast etc. is called an Item, which must have one ItemType and can have multiple Links. We also connect related items by having each Item belong to an IdeaSet. For example: if the author of a book has given a TED Talk, a podcast interview, or a blog post summarizing the book, these are all different Items, each with a different ItemType, but they all belong to the same IdeaSet. Items themselves don't have a relationship with our Topics; instead, it's the IdeaSet that has a many-to-many relationship with Topics.
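To make those relationships concrete, here is a minimal plain-Ruby sketch of the entity model (hypothetical; the real app uses Rails/ActiveRecord models, and only the class names Item, ItemType, Link, IdeaSet, and Topic come from the description above):

```ruby
# Minimal non-ActiveRecord sketch of LearnAwesome's entity model.
Topic    = Struct.new(:name)
ItemType = Struct.new(:name)   # e.g. "book", "video", "podcast"
Link     = Struct.new(:url)

# An IdeaSet groups related Items and carries the many-to-many Topic tagging.
class IdeaSet
  attr_reader :items, :topics
  def initialize
    @items  = []
    @topics = []
  end
end

# An Item has exactly one ItemType, many Links, and belongs to an IdeaSet.
class Item
  attr_reader :item_type, :links, :idea_set
  def initialize(item_type:, idea_set:)
    @item_type = item_type
    @links     = []
    @idea_set  = idea_set
    idea_set.items << self
  end
end

# A book and a TED Talk by the same author are distinct Items sharing
# one IdeaSet; topics attach to the IdeaSet, not to the Items.
set = IdeaSet.new
set.topics << Topic.new("habits")
book = Item.new(item_type: ItemType.new("book"),  idea_set: set)
talk = Item.new(item_type: ItemType.new("video"), idea_set: set)
```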

book is one of the ItemTypes. We store various attributes for items, such as cover image URL, ISBN, page count, and author's name.

So, now let's think about what it takes to add a new link to our database:

This is why, when we try to add a link via the browser extension, this is the form that is shown. The Second Topic and Difficulty Level fields are optional: [screenshot of the add-link form]

Now, let's think about books. Different sources of book data (Amazon, GoodReads, OpenLibrary, bloggers like Derek Sivers or Nat Eliason who share summaries of all the books they have read, etc.) provide us with different sets of fields. So, in app/utilities, we have a pipeline of processors. We have a plain-Ruby Book object (not an ActiveRecord model) and pass it through various processors to enrich it. You can see one example here: https://github.com/learn-awesome/learn/blob/master/app/utilities/book.rb#L70
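The processor pipeline might look roughly like this. This is a hypothetical sketch, not the code in app/utilities: the processor names, the lookup table, and the `enrich` helper are all illustrative; only the idea of a plain-Ruby Book passed through source-specific enrichers comes from the comment above.

```ruby
# Sketch of the enrichment pipeline: a plain-Ruby Book (not ActiveRecord)
# is passed through processors that each fill in whatever fields their
# source provides, leaving already-populated fields untouched.

class Book
  attr_accessor :title, :isbn, :cover_url, :author, :page_count
  def initialize(title:)
    @title = title
  end
end

# Hypothetical processor: fills in the ISBN from a (stubbed) lookup.
class OpenLibraryProcessor
  def call(book)
    book.isbn ||= lookup_isbn(book.title)
    book
  end

  private

  def lookup_isbn(title)
    # The real pipeline would call an external API; stubbed for the sketch.
    { "Atomic Habits" => "9780735211292" }[title]
  end
end

# Hypothetical processor: derives a cover URL once an ISBN is known.
class CoverProcessor
  def call(book)
    book.cover_url ||= "https://covers.example.org/#{book.isbn}.jpg" if book.isbn
    book
  end
end

# Each processor enriches the same object in turn.
PIPELINE = [OpenLibraryProcessor.new, CoverProcessor.new]

def enrich(book)
  PIPELINE.reduce(book) { |b, stage| stage.call(b) }
end

book = enrich(Book.new(title: "Atomic Habits"))
```

Ordering matters here by design: a later processor (covers) can depend on a field that an earlier one (ISBN lookup) filled in.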

Theoretically, each source of book data (GoodReads, Amazon, etc.) should have extract, list, and search methods:
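That per-source interface could be sketched as a small Ruby duck type. Only the three method names (extract, list, search) come from the comment above; the module name, signatures, and stub behaviour are assumptions:

```ruby
# Hypothetical interface every book-data source would implement.
module BookSource
  # extract: given one item URL, return a hash of fields for that book.
  def extract(url)
    raise NotImplementedError
  end

  # list: given a listing URL, return field hashes for every book on it.
  def list(url)
    raise NotImplementedError
  end

  # search: given a query string, return candidate matches.
  def search(query)
    raise NotImplementedError
  end
end

# Illustrative stub implementation of the interface.
class StubSource
  include BookSource

  def extract(url)
    { url: url, title: "Example Book" }  # fields are stubbed
  end

  def list(url)
    [extract("#{url}/1"), extract("#{url}/2")]
  end

  def search(query)
    list("https://example.org/search?q=#{query}")
  end
end
```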

nileshtrivedi commented 3 years ago

None of these import jobs are currently scheduled as periodic tasks; they are only run manually as one-off tasks. ImportGoodreadsListJob exists only because, instead of having to write a scraper for press.stripe.com, I was able to find an existing GoodReads list. The primary use case for this extraction pipeline is in methods like Book.import_four_minute_book_summaries, Book.import_sivers_book_summaries, and Book.import_blas. Topic tagging being manual (see #189) is the biggest obstacle to fully automating these and making them periodic. Additionally, not every item in a list (such as Nat Eliason's notes) is worth adding to our repository. So what I have been doing is: export a list to JSON, manually add topics, remove entries that we don't want, and only then run it through the extraction pipeline.

But yes, they are supposed to be idempotent and update existing records (instead of creating new ones) whenever a match is found. Sometimes I have to break the import process into two parts: somehow convert a listing (such as from blas.com) into a JSON file, and then instantiate Book objects from it in order to pass them through the enrichment pipeline. Exporting a JSON file at some point allows me to assign topics manually, with high quality. As I wrote above, a fully automated pipeline is possible, but it leads to data pollution, especially when it comes to topics.
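The idempotency described above amounts to a find-or-update. Here is a sketch with an in-memory Hash standing in for the database (the real code uses ActiveRecord, and the choice of ISBN as the lookup key is an assumption for illustration):

```ruby
# Hypothetical idempotent import: re-running the import updates the
# existing record (keyed by ISBN here) instead of creating a duplicate.
STORE = {}  # isbn => attributes hash, stands in for the books table

def import_book(attrs)
  record = STORE[attrs[:isbn]] ||= {}  # find existing record or create one
  record.merge!(attrs)                 # update in place; never duplicates
  record
end

import_book(isbn: "9780735211292", title: "Atomic Habits")
import_book(isbn: "9780735211292", page_count: 320)  # re-run only enriches
```

With ActiveRecord the same shape is typically `find_or_initialize_by` on the lookup key followed by `update`, which is why re-running an import is safe.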

See here for an example: https://github.com/learn-awesome/learn/blob/master/app/utilities/book.rb#L85