open-source-ideas / ideas

💡 Looking for inspiration for your next open source project? Or perhaps you've got a brilliant idea you can't wait to share with others? Open Source Ideas is a community built specifically for this! 👋

Google Books alternative - Full text browsing and search #11

Closed ocdtrekkie closed 7 years ago

ocdtrekkie commented 7 years ago

Project description

I have a lot of ebooks, mostly in PDF format, but some EPUBs as well, and I feel the information in them is often better than on the Internet. But I've got no way to find that information easily. I'd like a self-hosted (ideally web-based) platform where I can just upload all my books and then search and browse them from wherever I am.

I asked for something like this on HN, and other than general full text search apps to run on my desktop, there really wasn't anything to do this.

Relevant Technology

Needs to support hosting on a Linux box. Should ideally be usable from Windows, Linux, Android, Mac, etc. through a web interface.

Who is this for

Probably at least somewhat savvy users since we're talking about self-hosting here, but hopefully one could admin a server for their family to use or something.

mikaelbr commented 7 years ago

One technology to build on for indexing is Elasticsearch with the mapper-attachments plugin.

ghost commented 7 years ago

I'm doing this. Will update here :)

mikaelbr commented 7 years ago

@mysticmode Cool! Really looking forward to this. There's a lot of potential here to make an open source ebook-indexing library. One could do auto-fetching of book covers, an overview of all books, fetching reviews, highlights/bookmarks, etc.

I think there is a lot of work that could be done here and there might be room for a collaboration for several people at some point.

This'll be awesome 💪

mikaelbr commented 7 years ago

This could also be easily wrapped in something like electron if that adds any value.

FredrikAugust commented 7 years ago

To make it a bit easier you could just make a website and use kiosk view/app view in Chrome. They probably have something similar in Firefox etc. That way you don't have to deal with Electron as well.

jvanbruegge commented 7 years ago

There is already a server mode for the ebook management software calibre, but it's rather ugly and feature-poor. Developing something with a good server client architecture would be good. Server should do the scraping, searching and managing of ebooks (maybe throw in a few converter plugins for different formats), the client would display it neatly, possibly via chrome or electron. Would be interested, especially serverside.

CyrisXD commented 7 years ago

What about building an electron wrapped app, that authenticates and syncs with dropbox. We could use the API to search the contents of PDF files on your dropbox and bring the results back to the Electron app. I don't think the contents search supports EPUBs though.

ocdtrekkie commented 7 years ago

A dependency on Dropbox, a proprietary cloud service, would mostly defeat the point.

la0rg commented 7 years ago

Great idea! I'm a heavy user of Google Books, but would really like an open source solution. How do you plan to search in PDF files? Some kind of OCR will be required. It's also worth mentioning that I have non-English books as well.

ocdtrekkie commented 7 years ago

@la0rg Most PDFs are already OCR'd. Or, in many cases, were always digital to begin with. Page images are rare, PDFs are a multimedia format.

jvanbruegge commented 7 years ago

Yes, pdf as file format should be supported, but for OCR you are better served with tesseract, which is Open Source and does an amazing job.

ghost commented 7 years ago

Elastic search is good. I'm trying to take baby steps here and implement something which is well-thought and discussed rather than just another ebook reader which is open source and web client based.

First step. Domain registration http://www.libreread.org/ :) I'm not sure, if we can discuss the whole process here. So, I'll create a slack team and let you know.

I feel the information in them is often better than on the Internet. But I've got no way to find that information easily

@ocdtrekkie Could you elaborate, with examples, on how you need the full-text search to work?

Also, if you take existing ebook readers like Kindle, iBooks, etc., apart from the fact that they are proprietary, which features that you need are missing from them or from other ebook readers?

Answer from anyone on the features is appreciated! :)

Thanks!

ocdtrekkie commented 7 years ago

@mysticmode If I'm looking for a section on a programming structure, for example, in my programming books, I'd expect searching it to show me which books mention it, and let me open that book to the first mention.
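To make that search behaviour concrete, here is a minimal sketch in plain Python. It assumes the text has already been extracted page by page; the `books` structure and the function names are made up for illustration, and a real setup would more likely hand this job to a search engine such as Elasticsearch.

```python
import re
from collections import defaultdict


def build_index(books):
    """Map each lowercased word to {title: first page where it appears}.

    `books` is a hypothetical structure: {title: [page_text, ...]}.
    """
    index = defaultdict(dict)
    for title, pages in books.items():
        for page_number, text in enumerate(pages, start=1):
            for word in re.findall(r"\w+", text.lower()):
                # Pages are visited in order, so setdefault keeps the earliest page.
                index[word].setdefault(title, page_number)
    return index


def search(index, query):
    """Return sorted (title, page) pairs for books containing every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = [index.get(word, {}) for word in words]
    titles = set(hits[0]).intersection(*hits[1:])
    # Open each book at the earliest page that mentions any of the query words.
    return sorted((title, min(hit[title] for hit in hits)) for title in titles)


# Made-up sample data for illustration.
books = {
    "Learning C": ["intro", "a while loop repeats", "loop again"],
    "Cooking": ["boil water"],
}
index = build_index(books)
print(search(index, "while loop"))
```

Each result names a book that contains all the query words, plus a page to open it at.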

Another feature that would be super important would be the ability to import a set of my ebooks data in some format I could easily create. In my case, I have a homegrown (and relatively shoddy) database app which indexes my books, and I'd want to be able to import the metadata I have into a format I could feed into this system, rather than having to sit there and reenter all of that information.
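As a rough illustration of such an import, the sketch below reads metadata from a CSV file using only Python's standard library. The column names (`path`, `title`, `author`) and the rule that `path` and `title` are required are assumptions for the example, not something settled in this thread.

```python
import csv
import io

# Assumed minimum set of columns; not a format agreed on in this discussion.
REQUIRED_FIELDS = ("path", "title")


def load_metadata(csv_file):
    """Read book metadata rows from an open CSV file, keyed by file path."""
    reader = csv.DictReader(csv_file)
    library = {}
    for row in reader:
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            raise ValueError(f"line {reader.line_num}: missing {', '.join(missing)}")
        library[row["path"].strip()] = {
            key: (value or "").strip()
            for key, value in row.items()
            if key not in (None, "path")  # None holds any surplus columns
        }
    return library


# Made-up sample export from a homegrown database.
sample = io.StringIO("path,title,author\nbooks/sicp.pdf,SICP,Abelson\n")
print(load_metadata(sample))
```

A plain CSV is easy to export from almost any homegrown database, which is the main reason it is used here; JSON would work just as well.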

I'm not super concerned about the actual reading elements of this, much more than embedding the PDF reader in my web browser. But that's just my personal use case, I suppose.

ghost commented 7 years ago

Features:

I'm looking into Elastic Search for PDF formats

jvanbruegge commented 7 years ago

Basically this would be Plex, but for books.

mikaelbr commented 7 years ago

It's certainly cool to have a lot of features, and there is a lot of potential here, but I think it's smart to start with the "core feature" and scope out from that once it's solved. I'd say start small, solve the critical core feature (which is creating a full-text index for PDF ebooks) and expand on that over time. Along the way you'll learn a lot about the problem and probably see new ways to improve it.

Just creating the infrastructure for a distributable project with Elasticsearch & attachments is a job in and of itself. Maybe a good way to start would be to just test it out with some Elasticsearch plugin for searching your indexes, and then create a small web server and a UI as an extension of that once you have the index fundamentals in place. When you have that foundation, it's also much easier for people who want to join in to work in parallel, I think. Just my two cents. I think the chance of success increases drastically if the scope is small from the beginning. 😄

sunsided commented 7 years ago

I'm with @SuperManitu here about calibre. I've always been thinking of something that either uses calibre's own database (which is probably a bad idea) or just imports from and/or exports to it as a starting point. I also thought it could be something peer-to-peer, e.g. utilizing torrent technology to broadcast metadata or content. That might really come in handy in an academic context with open data in mind.

ghost commented 7 years ago

I agree with @la0rg on OCR support for PDFs. I think full-text search on the extracted data would work well in most cases, since most ebooks in PDF format already contain text. But OCR support should be in the pipeline.

As @mikaelbr suggested, I'll try to start with the basic implementation of PDF extraction and search and share it here. Then if people find it good, we can move on from there.
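One way the "extract first, OCR later" idea could look is sketched below. Both `extract_text` and `run_ocr` are hypothetical callables standing in for whatever PDF extractor and OCR engine end up being used, and the `min_chars` threshold is an arbitrary assumption.

```python
def extract_pages(pages, extract_text, run_ocr, min_chars=20):
    """Return text per page, using OCR only where the embedded text is too short.

    `extract_text` and `run_ocr` are hypothetical callables that take one page
    and return a string; `min_chars` is an arbitrary threshold.
    """
    results = []
    for page in pages:
        text = extract_text(page).strip()
        used_ocr = False
        if len(text) < min_chars:
            # Little or no embedded text: treat the page as a scanned image.
            text = run_ocr(page).strip()
            used_ocr = True
        results.append({"text": text, "ocr": used_ocr})
    return results


# Stub pages and stub callables, purely for illustration.
pages = [
    {"embedded": "This page already has a digital text layer.", "image": ""},
    {"embedded": "", "image": "scanned words"},
]
for result in extract_pages(pages, lambda p: p["embedded"], lambda p: p["image"]):
    print(result)
```

Keeping OCR as a fallback means the first version can ship with a stub for `run_ocr` and gain real OCR support later without changing the callers.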

ghost commented 7 years ago

I'm looking into pdf.js and elastic-search. Through pdf.js we could get the rendered HTML5.

I'm writing this so that anyone who knows these technologies can tell me whether I'm on the right track :)

I'll take the above approach and try to share it in a couple of days.

jvanbruegge commented 7 years ago

I don't think pdf.js is needed, since you will almost never read the PDF in the browser. For those cases I would just open the PDF in a new tab and let the browser do its job. pdf.js is rather slow and not a pleasant reading experience.

ghost commented 7 years ago

No. When the user uploads a book, I'd be using pdf.js and converting to HTML through a headless browser like phantom-js in the server.

ghost commented 7 years ago

I tried Python's pdfminer, but the extracted HTML is messy. PDF.js gives clean output.

jvanbruegge commented 7 years ago

It might be easier to do the server in Java, as Elasticsearch provides a Java API, you can use a REST microframework like Jersey and you have libraries like PDFBox for reading pdfs. Plus Node's performance on bigger files is really bad.

jvanbruegge commented 7 years ago

I would opt for ElasticSearch + Jersey (or similar) REST Server + clientside SPA (preferably typescript) + optionally (later on) nodejs for Server-rendering the SPA.

Since this is a rather complicated setup, we should provide a zero-config Docker container to run it.

ghost commented 7 years ago

Working with filesystems is costly. I need to think about this.

Maybe I could use java as a semi-standalone process to extract pdfs.

But I'm planning to use nodejs as the base for the application. As this is open source, I think using javascript for server is better when it comes to collaboration and it works pretty well for SPA.

I'll try the pdf extractors and see which one suits best. As for text, python pdfminer works well.

jvanbruegge commented 7 years ago

I don't think JavaScript is better for collaboration; a type system can help you a lot when using other people's code. I'll create a working prototype in the next few days.

FredrikAugust commented 7 years ago

What language would you use then?

jvanbruegge commented 7 years ago

As I said, a small REST server written in Java, since Java is easy to adopt and learn, has a good type system and has very good tooling (Maven and Eclipse/IntelliJ). For the SPA I would use TypeScript, as it enhances JavaScript with an unobtrusive type system and is very popular, so you get type definitions for almost all JavaScript libraries.

FredrikAugust commented 7 years ago

What about C++? Also has good tools and isn't much harder to learn than Java IMO

FredrikAugust commented 7 years ago

We could also write it in Python, which is by far easier to learn for newer programmers, and has great support for platforms, with plenty of libs we could use.

jvanbruegge commented 7 years ago

I personally like C++ far more than Java, but it has too many disadvantages in this particular case: Pro:

Cons:

Python: Con:

FredrikAugust commented 7 years ago

Yeah, those are good points. Dependency management isn't really something contributors need to worry about a lot, considering most of that will be done during the initial phase. Consts and memory management are indeed a bit harder for newer devs.

Python however I don't see why we couldn't use. The lack of types isn't really that big of a problem IMO, but it is handy.

With Python we don't have to worry too much about setups and the like, as we have pip for deps and could use something like pep8 for formatting tools, plus most (if not all) IDEs have support for Python.

FutureProg commented 7 years ago

I'm interested in this, and I'm for using Python. The low barrier to entry for newer devs is probably a huge plus for using the language for the backend. Also, @SuperManitu, I believe @mysticmode has already created a repo for this project.

jvanbruegge commented 7 years ago

I wouldn't use Python, because having a type system is really helpful, Python tends to be rather slow compared to Java, and I have had problems with some libs using native code. Plus, using indentation as blocks is ugly.

FutureProg commented 7 years ago

@supermanitu, if we were to use Java, what backend framework would you suggest?

jvanbruegge commented 7 years ago

As said I would use the Jersey RESTful Framework: https://jersey.java.net/documentation/latest/getting-started.html#new-project-structure

Here is a rather good explanation: http://www.vogella.com/tutorials/REST/article.html

FredrikAugust commented 7 years ago

I don't like python either because of the indentation "thing". If you're up for using something like python that doesn't use that system we could use ruby.

FutureProg commented 7 years ago

I'm personally not for using Ruby. PHP and Java are good for me.

ghost commented 7 years ago

Since we are handling files, in our case PDF and EPUB, we could try using Xpdf. I haven't tested it yet, but I've read in a few places that it gives better output than PDFBox. It's written in C++ and licensed under the GPL. If I start this project, I would like it to be under the GPL. There are also multiple PDF extractors available in C++ which we can test.

As far as the language for the core part of our application goes, I certainly won't choose Java. The confusing licensing philosophy of its ecosystem, the court cases, its object-oriented design, and the fact that it's hard for a new programmer to pick up make me prefer a language with a stronger commitment to openness, better performance than Java, and better readability.

I would choose Elixir. If we need multi-threading for this application, it handles that out of the box and performs well compared to Java. The web framework Phoenix is Ruby-inspired. The coding style is readable and far easier to pick up quickly than Java.

ghost commented 7 years ago

I've got to say Elixir performs far better than Ruby, and you get the taste of Ruby's style with the speed of Erlang.

jvanbruegge commented 7 years ago

Never used Elixir, but from what I've seen it looks good. Should be a good choice for the server.

ngoctranfire commented 7 years ago

I think using Node.js with Walmart's Electrode as a framework would be great. The project can be highly modularized. I believe React would work well for this project, and it is such a popular framework that people can easily pick it up and help with development. It's also easy to onboard someone.

FredrikAugust commented 7 years ago

Do we really need electron though? Couldn't it just be a website?

ocdtrekkie commented 7 years ago

I absolutely would like to see a standards-compliant website I can access from anywhere. (As a personal note, I'd like to be able to host it as a Sandstorm.io app, which is possible as long as it's A. web-based and B. runs on 64-bit Linux.)

jvanbruegge commented 7 years ago

For the SPA I would make a simple website. Electron can be added later if needed (but I doubt it will be). For frontend frameworks there is either React + Redux or Cycle.js (which I'm in favour of). I've used Angular 2 and won't use it again.

ghost commented 7 years ago

@SuperManitu Could you explain why we need a front-end framework for our use case? I don't think our front end is complex. Going by what @ocdtrekkie pointed out earlier, the browser-based app should be standards-compliant and minimal in features at first. Most of the processing happens in the backend.

Let's build it with bare-bones JavaScript or maybe TypeScript, and as we go we'll figure out whether we need a framework to help us solve the problems we are facing at that time. I think building the minimal version first -> then getting it right -> then making it better is the way to go.

To kick off this project, I'm doing the initial setup now and will share it here once it's ready.

ghost commented 7 years ago

I think it's better if we take this discussion to chat. I've created a slack team http://libreread.slack.com

Let's discuss the tools and intricacies of the app in the chat. If we want to discuss features, we can post here.

Please share your email if you would like to join the development, so I could add you on slack team.

Thanks!

jvanbruegge commented 7 years ago

Yes, of course, we don't need a framework to begin with. Just standard TypeScript. Awaiting the setup :)

Email is supermanitu@gmail.com

FutureProg commented 7 years ago

@mysticmode email is nickmorrrison09@gmail.com

ghost commented 7 years ago

@FutureProg hey, let me know if you get the invite :) It's bouncing here and I couldn't send the invite again. Weird