Open jflesch opened 10 years ago
If this feature does get included into a release, please make it optional. Following Paperwork’s “scan-and-forget” logic, my papers are not sorted by date-on-the-paper, but by date-of-scan. My papers are a FIFO: 1— I scan 2— I put the paper on the FIFO 3— I forget 4— Any paper at the head of the FIFO, older than 10 years, gets discarded. It seems to me, that this is exactly how Paperwork is intended to work, and the date-on-the-paper has no role here.
@tYYGH I don't agree with you. I just started using Paperwork a few weeks ago and I have a lot of papers to scan. In this context, the date-of-scan makes no sense: all my papers have the same dummy date. So retrieving the date-on-the-paper follows the "scan-and-forget" logic more than adding notes manually, for example.
@akarzim I agree with you. You cannot compare first-time setup and regular usage, though. With this feature being optional, it could be enabled for the first-time setup, and then disabled for regular usage. But then, you may disagree on the "regular usage" part. Well, to each their own ;-)
Maybe paperwork should actually keep two dates:
One could even separate the scanned date from the imported date. I scan documents as they come and import them in bulk into Paperwork. The imported date is even less interesting to me than the scanned date (e.g. creation date of the pdf).
Also the document date shouldn't be set automatically. Paperwork should suggest dates based on file contents.
If there's a better way to talk about features please let me know.
Hm, below the calendar we could display a list of suggested dates based on the content of the document. I guess it could help first time users quite a lot.
One of the main problems will be to find the dates in the document. Each culture has its own way of writing the dates ... :/ The regex you provided will catch all the dates written numerically in occidental cultures I think, but not the ones with words ("Septembre the 5th 2012" or "5 septembre 2012").
It's not an easy problem, but it would be a fun one to try to solve :-)
Yes, you're right. Didn't think about different conventions. Even the order of month and day isn't fixed.
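To make the day/month ambiguity concrete, here is a minimal sketch. The pattern below is my own assumption (it is not the regex mentioned above), and `candidate_dates` is a hypothetical helper, not something Paperwork provides. It only handles numeric dates with a four-digit year and returns every valid reading of each match, so dates written with words are still missed.

```python
import re
from datetime import date

# Numeric dates such as 05/09/2012, 5.9.2012 or 5-9-2012.
# Assumed pattern for illustration only.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def candidate_dates(text):
    """Return every valid date that each numeric match could stand for."""
    found = []
    for a, b, year in NUMERIC_DATE.findall(text):
        a, b, year = int(a), int(b), int(year)
        # Try day-month-year first, then month-day-year.
        for day, month in ((a, b), (b, a)):
            try:
                d = date(year, month, day)
            except ValueError:
                continue  # e.g. there is no month 29
            if d not in found:
                found.append(d)
    return found
```

Something like "05/09/2012" yields two candidates, which is one more argument for suggesting dates to the user instead of setting them automatically.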
But parsing a date, without knowing the format already, from a human readable string sounds like something others have already taken on. edit: Found this: https://github.com/kvh/recurrent
So there are two problems:
Looking at my local paperwork db: I could try to extract dates from a paper.words file on the command line.
I suggest we add a big-data AI to Paperwork, and we crowd-feed it all the dates from all our “words”-files, so that it learns… :-p
Just a simple comment to say that this kind of feature would be very handy for me: I frequently have to import a PDF that is not a freshly scanned document. And sometimes a lot of them at the same time.
Relying on import time is just irrelevant in my case. I can't use it for a search for instance. So if during import I could activate an option to use the creation/modification time, it would be handy ! :)
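A rough sketch of what such an import option could look like. `document_date` and its `use_file_time` flag are hypothetical names, not part of Paperwork. It uses the modification time because a true creation time is not available on every platform.

```python
import os
from datetime import datetime


def document_date(path, use_file_time):
    """Pick the date to store for an imported file."""
    if use_file_time:
        # Modification time of the file on disk, as local time.
        return datetime.fromtimestamp(os.path.getmtime(path))
    # Default behaviour: the moment of import.
    return datetime.now()
```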
I don't understand why you want to read the date from the document content. It would be useful for sure (even if it's more complex), but simply using the document date is already handy (I mean the PDF metadata, not the parsed content), as you probably received the document at almost the same date it was scanned/created.
Ok, so we have 2 ideas here:
1) At import time, ask the users if they want to use the current time or the file metadata for the date. Note that you will always find files with crappy metadata like 01-01-1970, so using the metadata won't be the default.
2) On already scanned/imported documents, when editing document properties, suggest the dates found in the document (metadata/content).
Re: 1) Maybe show the dialog on first import and provide a "remember this setting flag" which can be changed in the settings. At least this dialog shouldn't block batch imports of pdfs from folders.
Re: 2) After looking at some recognized text it turns out that a lot of dates are not parsed correctly. e.g.:
29.o9.2o1e
There might be an option to pass parameters to the OCR engine to improve the recognition of numbers or suggest similar looking characters. Also a preceding "Date: " (in different languages) could be handy.
This already sounds a lot like the suggested big-data AI ;)
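Short of an AI, the "similar looking characters" idea can be sketched as a post-processing step on the recognized text. Everything here is an assumption: the look-alike table, the dd.mm.yyyy shape, and the `repair_dates` name. Ambiguous letters such as "e" are left out on purpose, so the example above would stay unrepaired rather than be guessed.

```python
import re

# Letters that OCR may confuse with digits (assumed table, deliberately small).
LOOKALIKES = str.maketrans(
    {"o": "0", "O": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# One "digit": a real digit or one of the look-alike letters.
D = r"[0-9oOlISB]"

# A token shaped like dd.mm.yyyy, not glued to other word characters.
DATE_SHAPED = re.compile(rf"(?<!\w){D}{{1,2}}\.{D}{{1,2}}\.{D}{{4}}(?!\w)")


def repair_dates(text):
    """Replace look-alike letters with digits, but only inside date-shaped tokens."""
    return DATE_SHAPED.sub(lambda m: m.group(0).translate(LOOKALIKES), text)
```

Restricting the substitution to date-shaped tokens keeps ordinary words untouched, though false positives are still possible.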
edit: I was able to get better OCR results (in general and regarding numbers) using those two commands:
convert -density 300 doc.pdf -colorspace gray doc.png
tesseract doc.png doc.words -l deu hocr
At least this dialog shouldn't block batch imports of pdfs from folders.
:+1:
I think the 1) is a lot easier to implement, while I suppose 2) will require more work/testing.
PDFs have metadata, including "creation date" and "last modified". These dates could be used for the document date.
Note: be careful, some PDFs have crappy dates
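A minimal sketch of how such a date could be checked before it is suggested. It assumes the raw string has already been read from the PDF by some library (not shown here) and is in the usual `D:YYYYMMDDHHmmSS` form with an optional timezone suffix, which this sketch ignores. `parse_pdf_date` and the cutoffs for "crappy" are my own assumptions.

```python
import re
from datetime import datetime

# PDF date strings look like D:YYYYMMDDHHmmSS plus an optional timezone;
# every field after the year is optional.
PDF_DATE = re.compile(r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?")


def parse_pdf_date(raw, now=None):
    """Return a naive datetime, or None if the value is missing or implausible."""
    m = PDF_DATE.match(raw or "")
    if not m:
        return None
    year, *rest = m.groups()
    # Missing fields default to month 1, day 1, and 00:00:00.
    defaults = (1, 1, 0, 0, 0)
    month, day, hour, minute, second = (
        int(x) if x else d for x, d in zip(rest, defaults)
    )
    try:
        parsed = datetime(int(year), month, day, hour, minute, second)
    except ValueError:
        return None  # e.g. month 13
    now = now or datetime.now()
    # Assumed cutoffs: reject the Unix epoch year (and earlier) and future dates.
    if parsed.year <= 1970 or parsed > now:
        return None
    return parsed
```

Returning `None` for suspicious values lets the import fall back to the current time without asking.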