tutors-sdk / tutors

The core Tutors Reader application.
https://tutors.dev
MIT License

Text Search PDF content #685

Open edeleastar opened 5 months ago

edeleastar commented 5 months ago

Devise a strategy to extend Tutors Search to PDF content.

Possible approach: consider a modification to the generator project:

When generating a project, produce a JSON representation of the text within each PDF. Use this companion content as the basis for the search algorithm.

lgriffin commented 2 weeks ago

@edeleastar I played around with this and made a small tool at https://github.com/lgriffin/pdf-to-json that converts PDF to JSON. A few decisions need to be made here before I integrate something into Tutors:

This isn't a big uplift to implement. Once you give me clarity on the above, I can quickly get this into a PR.

edeleastar commented 2 weeks ago

hi @lgriffin

Fascinating work. I would be inclined to consider an experiment whereby the JSON for PDFs is included in tutors.json (just like lab text is). My sense is that it would not be as verbose as the labs, and with compression (when we enable it) it should have negligible impact on performance.

Thus, the roadmap could be:

(1) Create a new npm package, "tutors-publish-s", a superset of "tutors-publish", which generates a tutors.json (as currently) but also contains the contents of each pdf/json embedded in the talk object. This could be typed like this:

type Talk = { // type name assumed
  type: "talk";
  pdf: string; // route to pdf for the lo
  pdfFile: string; // pdf file name
  contentJson: any; // textual contents of pdf
};
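As a rough illustration of step (1), the sketch below shows how a publish step might attach extracted text to a talk entry. It assumes the extracted text arrives as one string per page; the names `PdfPage`, `TalkWithPdfText` and `attachPdfText`, the per-page shape of `contentJson`, and the sample route are all hypothetical and not taken from pdf-to-json or tutors-publish.

```typescript
// Hypothetical shape for one page of extracted PDF text (an assumption,
// not the actual output format of pdf-to-json).
type PdfPage = { page: number; text: string };

// The talk fields proposed above; contentJson is narrowed from `any`
// to a per-page array purely for this sketch.
type TalkWithPdfText = {
  type: "talk";
  pdf: string;
  pdfFile: string;
  contentJson: PdfPage[];
};

// Build the talk entry that would be written into tutors.json.
function attachPdfText(pdf: string, pdfFile: string, pageTexts: string[]): TalkWithPdfText {
  const contentJson = pageTexts.map((text, i) => ({
    page: i + 1, // slides are numbered from 1
    text: text.replace(/\s+/g, " ").trim(), // collapse whitespace to keep the JSON small
  }));
  return { type: "talk", pdf, pdfFile, contentJson };
}

// Placeholder route and file name, for illustration only.
const talk = attachPdfText("/talk/example-course/topic-01/talk-1", "intro.pdf", [
  "  Intro\n to  Tutors ",
  "Search",
]);
console.log(JSON.stringify(talk.contentJson));
```

Keeping the page number next to each block of text is what would later let the search report a slide rather than just a file.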

See

(2) Update the search module in the Tutors Reader to also search this object (currently it only searches notes and labs)

See

This would have to identify the slide number the search text occurs on
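A minimal sketch of what identifying the slide number could look like, assuming the embedded text is stored per page. The `PdfPage` and `SlideHit` types and the `searchPdfPages` function are hypothetical; the existing Tutors search module may be structured quite differently.

```typescript
// Assumed per-page representation of the PDF text (hypothetical).
type PdfPage = { page: number; text: string };

// One search result: which file, which slide, and some surrounding text.
type SlideHit = { pdfFile: string; page: number; snippet: string };

// Case-insensitive substring search that reports the slide each match is on.
function searchPdfPages(pdfFile: string, pages: PdfPage[], term: string, context = 20): SlideHit[] {
  const needle = term.toLowerCase();
  if (needle.length === 0) return [];
  const hits: SlideHit[] = [];
  for (const { page, text } of pages) {
    const at = text.toLowerCase().indexOf(needle);
    if (at === -1) continue; // term does not occur on this slide
    // Keep a little text either side of the first match as a preview.
    const start = Math.max(0, at - context);
    const end = Math.min(text.length, at + needle.length + context);
    hits.push({ pdfFile, page, snippet: text.slice(start, end) });
  }
  return hits;
}

const samplePages: PdfPage[] = [
  { page: 1, text: "Welcome to the course" },
  { page: 2, text: "Svelte stores and routing" },
  { page: 3, text: "More on stores" },
];
const slideHits = searchPdfPages("intro.pdf", samplePages, "Stores");
console.log(slideHits.map((h) => h.page));
```

This reports at most one hit per slide, which is probably enough for linking to a slide; reporting every occurrence would need a loop over `indexOf` with a start offset.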

(3) Update the Talk component to include a slide number in its URL. This is not implemented yet, but should be relatively straightforward. It is also a useful feature in its own right, making it possible to bookmark or send a link to an individual slide. The URL would be updated as the user navigates the slides.
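One possible way to carry the slide number is a query parameter, sketched below with the standard `URL` API. The parameter name `slide`, the helper names `withSlide` and `readSlide`, and the example address are assumptions for illustration; the Talk component might equally use a path segment or a hash.

```typescript
// Return the same address with the slide number set as a query parameter.
function withSlide(href: string, slide: number): string {
  const url = new URL(href);
  url.searchParams.set("slide", String(slide));
  return url.toString();
}

// Read the slide number back, falling back to slide 1 for missing or
// malformed values and clamping to the number of pages in the PDF.
function readSlide(href: string, pageCount: number): number {
  const raw = new URL(href).searchParams.get("slide");
  const n = raw === null ? 1 : Number.parseInt(raw, 10);
  if (!Number.isInteger(n) || n < 1) return 1;
  return Math.min(n, pageCount);
}

// Hypothetical talk address, for illustration only.
const talkUrl = "https://tutors.dev/talk/example-course/talk-1";
const linked = withSlide(talkUrl, 4);
console.log(linked, readSlide(linked, 10));
```

In the browser, the result of `withSlide` could be passed to `history.replaceState` on each slide change so the address stays bookmarkable without adding a history entry per slide.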

Additionally, it might be a good time to roll the encryption/compression work you did earlier into the "tutors-publish-s" npm package. If all goes well, we could replace "tutors-publish" with this new version once it is stable (and we have verified backward compatibility).

The addition of the above would mean we have a full textual model of a complete course's contents.

Great stuff

lgriffin commented 5 days ago

That makes sense. If I can offer a suggestion for this (and other future changes): make a dedicated project Kanban board and populate it, as I see several key steps in this.

The issue type doesn't really give the depth that you are expressing in your response, especially with a fundamental change to the flow. You can then disconnect the actions and quickly get the new module published. I don't mind if you bring it in as a dependency or shift the whole logic in; it's only a small experiment app.

I would see the search being a two-step process. The first step is a general search: text matching against the JSON (trivial) to tell you that a term appears in a certain PDF. We should check whether that alone is useful for students, because the second step, a more expressive search that references where in the PDF the match is, would require a bit of thought, and we could be over-engineering a solution that nobody wants.
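The two steps could be kept separate along the lines of the sketch below, so the cheap step can ship first and the finer one only runs if it proves useful. The `PdfText` shape and both function names are hypothetical.

```typescript
// Assumed representation: the text of each PDF as one string per page (hypothetical).
type PdfText = { pdfFile: string; pages: string[] };

// Step 1 (coarse): which PDFs mention the term at all?
function pdfsMentioning(docs: PdfText[], term: string): string[] {
  const needle = term.trim().toLowerCase();
  if (needle === "") return [];
  return docs
    .filter((doc) => doc.pages.some((p) => p.toLowerCase().includes(needle)))
    .map((doc) => doc.pdfFile);
}

// Step 2 (fine, run on demand for a single PDF): which pages mention it?
function pagesMentioning(doc: PdfText, term: string): number[] {
  const needle = term.trim().toLowerCase();
  if (needle === "") return [];
  const found: number[] = [];
  doc.pages.forEach((p, i) => {
    if (p.toLowerCase().includes(needle)) found.push(i + 1); // 1-based page numbers
  });
  return found;
}

const docs: PdfText[] = [
  { pdfFile: "intro.pdf", pages: ["Welcome", "Course outline"] },
  { pdfFile: "routing.pdf", pages: ["Routes", "Nested routes", "Summary"] },
];
const matchingPdfs = pdfsMentioning(docs, "routes");
const matchingPages = pagesMentioning(docs[1], "routes");
console.log(matchingPdfs, matchingPages);
```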

You then have a possible task to check the performance, then integrate compression and compare. There are a number of compression approaches out there. I picked one of the more popular ones, but the JSON files were already small enough that compression would not be noticeable. With a dozen PDFs the size will certainly grow and give a better testbed for picking the right compression tool.
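A comparison of this kind could start from something like the sketch below, which measures the raw and compressed size of a JSON payload using Node's built-in `zlib`. gzip and Brotli are used here only because they ship with Node; they are not necessarily the tool used in the earlier compression work, and the sample data is synthetic.

```typescript
import { brotliCompressSync, gzipSync } from "node:zlib";

// Sizes in bytes of a value serialised as JSON, before and after compression.
function compressionReport(value: unknown): { raw: number; gzip: number; brotli: number } {
  const json = JSON.stringify(value);
  return {
    raw: Buffer.byteLength(json, "utf8"),
    gzip: gzipSync(json).length,
    brotli: brotliCompressSync(json).length,
  };
}

// Synthetic stand-in for the text of a dozen PDFs with 20 slides each.
const samplePdfs = Array.from({ length: 12 }, (_, d) => ({
  pdfFile: `talk-${d + 1}.pdf`,
  pages: Array.from({ length: 20 }, (_, p) => `Slide ${p + 1}: an overview of topic ${d + 1} in the course`),
}));

const report = compressionReport(samplePdfs);
console.log(report);
```

Synthetic text like this is far more repetitive than real slides, so the ratios it produces will flatter every algorithm; the real comparison would need to run against an actual course.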

All in all, this looks great, and if you can break out the steps I can pick up some of the parts.