learn-awesome / learndb

Curated learning resources with topics, formats, difficulty levels, expert reviews and metadata tags
123 stars 22 forks source link

Refactor/jsonlines data #51

Closed widged closed 2 years ago

widged commented 2 years ago

Replaced data being served through datasette by data being simply loaded from a jsonline format. Reading data could be migrated to a backend-service but this is not required if all data are read only

nileshtrivedi commented 2 years ago

Good refactor. A few observations:

nileshtrivedi commented 2 years ago

In items.js, the value of »_ can currently be like this:

"summary|https://neilkakkar.com/sapiens.html;video|https://www.youtube.com/watch?v=nzj7Wg4DAbs;video|https://www.youtube.com/watch?v=UTchioiHM0U;audio|https://www.theguardian.com/science/audio/2014/sep/15/sapiens-brief-history-humankind-yuval-noah-harari-podcast;book|https://www.amazon.com/Sapiens-Humankind-Yuval-Noah-Harari/dp/0099590085?tag=learnawesome-20;video|https://www.youtube.com/watch?v=B2H9Q92RkfE"

But being JSON, you can keep the original structure intact and don't have to use separators like | and ;:

So the new value would be:

[
  ["summary", "https://neilkakkar.com/sapiens.html"],
  ["video", "https://www.youtube.com/watch?v=nzj7Wg4DAbs"],
  ["video", "https://www.youtube.com/watch?v=UTchioiHM0U"],
  ["audio","https://www.theguardian.com/science/audio/2014/sep/15/sapiens-brief-history-humankind-yuval-noah-harari-podcast"],
  ["book","https://www.amazon.com/Sapiens-Humankind-Yuval-Noah-Harari/dp/0099590085?tag=learnawesome-20"],
  ["video","https://www.youtube.com/watch?v=B2H9Q92RkfE"]
]

That should be in a separate PR though because frontend code assumes the existing format in many places.

More importantly, and BEFORE we do this 👆, I think we should just make a decision on keeping only one of the data formats in the repo, SQLite or JSONLines. If we keep both formats, we would have to deal with keeping both datasets in sync with each other.

About the jsonlines data approach, I like the following:

What is nice with SQLite/Datasette:

nileshtrivedi commented 2 years ago

Noticed a few issues:

widged commented 2 years ago

On the schema, this is done directly in io/jsonlines.js. THere is an "asItem", "asTopic", "asReview" that spells it out. I did it to minimize space and see how much I could gain from it. My recommendation is to rather find 3 letter codes that are more intuitive.

widged commented 2 years ago

Datasette... there is absolutely no reason to have it as a dependency. To require both javascript and python in the toolchain brings far more cons than benefits. That it makes csv easier to import is not a benefit. The constraint of csv will stiffle future improvements.

On "Join queries, foreign keys, full text-search, views etc.". The move I made gives you the option to start supporting fuzzy search. I have used lunr in the past -- https://github.com/olivernn/lunr.js. It will let you rapidly define (text) searchable content and built all required indexes for it.

widged commented 2 years ago
  • Some pages like http://localhost:8080/#/topic/lg%C2%BBcss don't load at all.

Thanks for mentioning it. I will need to use decodeURI(). I pushed a fix for this to this PR

widged commented 2 years ago

But being JSON, you can keep the original structure intact and don't have to use separators like | and ;:

Indeed. Best is to make it into two steps. In this first step, I aimed at simply replacing the services. Make sure I got all data in the exact equivalent of what was returned by the old services. Making changes that also require changes in the components that consume the data should be made as a separate step.

widged commented 2 years ago

Note that the primary motivation for all these changes is to make it much much easier to clean up the data. You have a lot of inconsistencies in the way topics are coded. With far too often, the same topic appearing in singular and plural form, but with different topic ids (both newborn and newborns), with occasional typos (both scandanavia and scandinavia) or alternative spellings (both south-korea and southkorea). Some harder to detect (both bicycle and cycling). Doing that cleanup requires to make updates in parallel to data in items and data in topics. The simpler the data storage, the easier to set it up.

nileshtrivedi commented 2 years ago

On "Join queries, foreign keys, full text-search, views etc.". The move I made gives you the option to start supporting fuzzy search. I have used lunr in the past -- https://github.com/olivernn/lunr.js. It will let you rapidly define (text) searchable content and built all required indexes for it.

Alright, then. I agree that eliminating Python would drastically simplify development and deployment (both on a server and user's local machine) both. Also, agree that CSV was too constrained already.

Feel free to merge the PR when conflicts are resolved and issues are fixed. And thanks for your amazing contribution! 🍻 😄

widged commented 2 years ago

The "explore the map" section will need to be fixed

d3.json("/learn/topics.json?_shape=array&_size=5000", function(alltopics){
widged commented 2 years ago

I ended up completely rewriting the explore map. I turned it into a svelte component for uniformity. And upgraded the d3 version from d3v3 to d3v7 (latest). I relied on the d3.stratify() function to reduce code size. It is only listing parent group names. Clicking sends to the page for that topic. There have been extensive rewrite of tags (in sync between topics and items).

You are invited to check. I will merge tomorrow night if I don't hear of anything in need to be fixed.

nileshtrivedi commented 2 years ago

I had a look:

nileshtrivedi commented 2 years ago

I am able to run the new TreeMap explorer after cleaning node_modules.

I'd say the new explorer is clearly inferior to the zoomable treemap that was earlier present. The objective was to navigate the entire hierarchy of topics visually while being able to easily switch to parent or sibling nodes.

The current map shows "condition" at the top view, which is actually a child of the "Medicine" topic. Similarly, "physics" is a child of "science" but both are shown at the same level. Also, there are gaps in the way rectangles are laid out.

image
widged commented 2 years ago

i invite you to look into the tagging I have introduced. Then make decisions with that information taken into account.

Conditions do not always have medicine as parent. I have used condition loosely as any state of being, to also include concepts like happiness or mental disorders.

The implication is that there will soon be a need to tag topics with multiple parents. Which a tree map representation won’t support.

which was my main reason to not spend too much time on it. The initial code was unlegible (poorly organised), pointing to on outdated version of d3 (with code breaking when pointed to the new version; the old version didn’t seem to support es module import) and there was no information ion intent.

I really do not believe it is the best presentation option for the intent while supporting changes made as well as future needs. I would invite a rethink of the representation.

nileshtrivedi commented 2 years ago

Pinged you on Discord.