How to add an Importer for pdf files?

gerritv commented 7 years ago

I would like to add to your implementation an Importer for pdf files. It would get metadata from the PDF file itself. I envisage using multiple source directories (I don't want to move the files from where they are located) and recursing into them as deep as necessary.

Do I need to only implement something like LuceneImporter or do I need to change code elsewhere to allow choosing an Importer? E.g. in ImportTask.cs the importer is hardcoded to Lucene.

shemanaev commented 7 years ago

The server itself is designed to use index files called inpx and doesn't provide a way to scan the filesystem itself. I haven't tested scenarios other than fb2-ready inpx files, so there might be (and will be, I'm sure 😄) bugs. But basically you need to:

1. Produce an .inpx file in some way for every root directory (i.e. if you have c:\lib1 and c:\lib2 you'll need two files).
2. If you want info that does not fit into the inpx format (cover, annotation), implement IBookParser and register it with BookParsersPool.
3. Import every .inpx with its root directory (i.e. dotopds import c:\lib1 lib1.inpx).

I found a description of the .inpx format only in Russian, so here is a translation

gerritv commented 7 years ago

Thank you, that helps me a lot. I have been reading the code and understand more than when I opened the Issue :-) I can generate the .inpx from my PDF parser, will test that out and then decide what to do next. I am impressed with the design, it looks very expandable.

gerritv commented 7 years ago

I have the pdf scanner added (Utils/PdfParser.cs). I chose to recursively scan the directory and process each pdf rather than creating an intermediate file. I didn't add another parser to Parsers; the generic one there is sufficient, as the class in Utils does all the work, using InpxParser.cs as a template.
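A minimal sketch of the recursive scan described above, using only the standard System.IO API. This is not the code in Utils/PdfParser.cs; the class name PdfScanSketch and the method FindPdfs are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of DotOPDS.
public static class PdfScanSketch
{
    // Lazily yields every .pdf file under root, descending into all subdirectories.
    public static IEnumerable<FileInfo> FindPdfs(string root)
    {
        foreach (var path in Directory.EnumerateFiles(root, "*.pdf", SearchOption.AllDirectories))
        {
            // On .NET Framework a three-character extension pattern can also match
            // longer extensions that start with it, so check the extension explicitly.
            if (string.Equals(Path.GetExtension(path), ".pdf", StringComparison.OrdinalIgnoreCase))
                yield return new FileInfo(path);
        }
    }
}
```

Each FileInfo returned can then be handed to whatever reads the PDF metadata, so no intermediate index file is needed.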

Pondering how to add it to the commands. Would it be better to create another Class in Tasks called PdfScanTask and then a 'pdfscan' command to run it? Much or most of the code in PdfScanCommand.cs would be the same as ImportCommand.cs. I had thought of generalizing ImportTask to make it take an option indicating what to import but that got more complex.

gerritv commented 7 years ago

Ok, upon further pondering over an espresso I modified ImportTask and ImportCommand:

Added required option ImportType=inpx or pdf,
added code in ImportTask to run one of those 2 tasks. Long term it might be best to add a base class for Parser in Parsers and move inpx/pdf parsers to that directory? Now on to testing & debugging
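The option-based dispatch described above could look roughly like the sketch below. It is an illustration only: the ImportType enum, ImportDispatchSketch, and the two delegates standing in for the inpx import and the pdf scan are hypothetical and do not reflect the actual ImportTask code.

```csharp
using System;

// Hypothetical names throughout; not the DotOPDS implementation.
public enum ImportType { Inpx, Pdf }

public static class ImportDispatchSketch
{
    // Maps the option value ("inpx" or "pdf", any casing) to an ImportType.
    public static ImportType ParseImportType(string value)
    {
        switch ((value ?? "").Trim().ToLowerInvariant())
        {
            case "inpx": return ImportType.Inpx;
            case "pdf": return ImportType.Pdf;
            default:
                throw new ArgumentException("Unknown import type '" + value + "', expected inpx or pdf.");
        }
    }

    // Runs exactly one of the two import routines and returns its result
    // (here assumed to be the number of books imported).
    public static int Run(ImportType type, string input,
                          Func<string, int> importInpx, Func<string, int> scanPdf)
    {
        switch (type)
        {
            case ImportType.Inpx: return importInpx(input);
            case ImportType.Pdf: return scanPdf(input);
            default: throw new ArgumentOutOfRangeException(nameof(type));
        }
    }
}
```

Passing the routines in as delegates keeps the command code shared between both import kinds, which is the duplication the separate PdfScanCommand would have introduced.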

gerritv commented 7 years ago

You can see my code changes so far in https://github.com/gerritv/DotOPDS. Scanning of PDFs is working, but I can't get queries working via Aldiko. I tried forcing all books/PDFs to have Genre other,other but still no joy. So, my next question is: where can I learn about using Owin and System.Web.Http to create some different web pages for serving pages?

shemanaev commented 7 years ago

Hey Gerrit, the genre should be its id, not a human-readable string. You should pick one of the ids added by statements like list.Add("sf_history"); in Genres.cs. Your Book model will then look like this:

var args = new Book
{
    Authors = new[] { author },
    Genres = new[] { "other" },
    Title = info.Title,
    File = Path.GetFileNameWithoutExtension(fi.FullName),
    Size = (int)fi.Length,
    Ext = "pdf",
    Date = info.CreationDate,
    Language = "en",
    Keywords = info.Keywords.Split(','),
    Archive = "",
};

I've also pushed some fixes to master; you should pull them. And there is one problem I haven't figured out yet: LuceneImporter always uses RussianAnalyzer for now, as there is neither language autodetection nor a good way to populate it on import.

gerritv commented 7 years ago

Thank you for those fixes/changes. I now have things sort of working using FBReader. Aldiko and OPDSViewer don't like whatever is being returned. I also need to work on the File pathname, as my files can be in a subdirectory of the Library Path. Your solution above strips out the intermediate directories. My initial method was also wrong, as it resulted in the Library Path appearing twice in the download link.
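One way to keep the intermediate directories is to store the path relative to the library root. The sketch below assumes, as the Book snippet above suggests, that File holds the name without its extension; RelativePathSketch and RelativeToLibrary are hypothetical names, and the case-insensitive comparison assumes Windows-style paths.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of DotOPDS.
public static class RelativePathSketch
{
    // Returns fullPath relative to libraryRoot, keeping subdirectories
    // and dropping only the file extension.
    public static string RelativeToLibrary(string libraryRoot, string fullPath)
    {
        // Normalize the root so that it ends with exactly one separator.
        var root = Path.GetFullPath(libraryRoot)
            .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
            + Path.DirectorySeparatorChar;
        var full = Path.GetFullPath(fullPath);

        // Assumption: paths compare case-insensitively, as on Windows.
        if (!full.StartsWith(root, StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException("File is not inside the library root.", nameof(fullPath));

        var relative = full.Substring(root.Length);
        var directory = Path.GetDirectoryName(relative) ?? "";
        return Path.Combine(directory, Path.GetFileNameWithoutExtension(relative));
    }
}
```

Because the result never contains the library root, joining it with the root again when building the download link should not duplicate the Library Path.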

I will close this Issue as I am now well past the original question. I would though appreciate a link or book or something where I can learn about WebApi2/Owin/Nowin in English (or Dutch)

shemanaev commented 7 years ago

I learned WebApi 2 from official docs. Nowin/OWIN is pretty straightforward through Nowin samples and OWIN spec.

> Your solution above strips out the intermediate directories.

Yeah, I don't remember all the .net apis but you get the point 😉

gerritv commented 7 years ago

Thx, The Message LifeCycle diagram is a huge help.

Yes, I got it :-) My setup is a bit unusual. Now trying to figure out how to make some Pull requests without feeding you my pdf solution. (It relies on DebenuPDFLite, which is a bit of a pain to install but is free). Looking at git cherry-pick

shemanaev / DotOPDS

How to add an Importer for pdf files? #7