Open ptrikutam opened 7 years ago
PT: OK. Yes, there may be some stuff (i.e. ingestion) that we need to work through to make sure things are indexed properly. That said, we could have people manually fill in fields and have the document indexed that way and it'd be much simpler. Let's review this and other features tomorrow in our call.
DW: For a demo, it's fine. In real life, it's a whole kettle of fish, especially if these works are going to be indexed and will show up in search results. Issues include (to be obvious) copyright violations, privacy violations, incorrect metadata added by the person uploading it, spam, and offensive material (porn but also being flooded with climate denial resources?)
@ptrikutam
I was thinking of implementing the functionality for this main use case:
Thoughts?
So we actually may not want to auto-extract the metadata. I think asking the user to manually input some of the metadata might be preferable anyway.
@gtourtellot do you know an example resource with a good amount of metadata included? That can be a template for @nsanta to build the form off of.
@ptrikutam Sorry. I wasn't explicit with what I wanted to explain and suggest:
Ah, now I understand. That makes sense, and would be awesome if it's not too much effort.
I don't think we should expect to fill in much metadata automatically. There's just too much variability in PDFs to get it right.
The Python PDF reader library I was using tries to extract: Author, CreationDate, Creator, ModDate, Producer, and Title. It works OK for some books we've been provided, i.e., it gets author and title right. However, the library fails on other (non-book) PDFs. Extracting the number of pages from a PDF is something we expect to do reliably.
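To make the "suggest, don't trust" idea concrete, here is a minimal sketch of turning whatever a PDF library returns into editable form defaults. It assumes the library hands back the info dictionary as a plain dict keyed by the standard PDF names (/Title, /Author, /CreationDate) plus a page count; the function names and the output keys are hypothetical, not something in the codebase.

```python
import re
from datetime import datetime

# PDF date strings look like "D:YYYYMMDDHHmmSS" with an optional timezone suffix.
# Only the calendar date is kept here; month and day may be absent.
_PDF_DATE = re.compile(r"^D:(\d{4})(\d{2})?(\d{2})?")


def parse_pdf_date(raw):
    """Return the calendar date from a PDF date string, or None if it does not parse."""
    match = _PDF_DATE.match(raw or "")
    if not match:
        return None
    year, month, day = match.groups()
    try:
        return datetime(int(year), int(month or 1), int(day or 1)).date()
    except ValueError:
        # e.g. month 13 or day 32 in a malformed file
        return None


def suggest_form_defaults(info, page_count):
    """Map a PDF info dictionary onto suggested values the user can still edit."""

    def clean(key):
        # Treat missing, empty, and whitespace-only values the same way.
        value = (info.get(key) or "").strip()
        return value or None

    return {
        "title": clean("/Title"),
        "author": clean("/Author"),
        "date": parse_pdf_date(info.get("/CreationDate")),
        "pages": page_count,
    }
```

Anything that comes back as None would simply be left blank on the form for the user to fill in.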
Metadata fields we could present on a form might include: resource_type, title, short_content, date (i.e., pub date), tags (e.g., from applying ClimateTagger to full contents), rights (some string about rights), pages (number of pages), isbn, and content_url (an associated external URL).
For reference, see the schema: https://github.com/greencommons/commons/blob/master/lib/etl/schema/resource_array_schema.json
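As a rough illustration of what the form data could look like, here is a sketch using the field names from the list above. The class name, the defaults, and the validation rules (title required, pages positive, content_url must be http or https) are assumptions for discussion, not taken from the schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResourceForm:
    """Hypothetical container for the manually entered metadata fields."""

    resource_type: str
    title: str
    short_content: str = ""
    date: Optional[str] = None  # publication date, kept as entered
    tags: List[str] = field(default_factory=list)
    rights: str = ""
    pages: Optional[int] = None
    isbn: Optional[str] = None
    content_url: Optional[str] = None

    def errors(self) -> List[str]:
        """Return human-readable validation problems (empty list when valid)."""
        problems = []
        if not self.title.strip():
            problems.append("title is required")
        if self.pages is not None and self.pages < 1:
            problems.append("pages must be a positive number")
        if self.content_url and not self.content_url.startswith(("http://", "https://")):
            problems.append("content_url must be an http(s) URL")
        return problems
```

A form handler could call errors() before saving and show the messages next to the fields.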
@gtourtellot, fair enough. Thanks for sharing the schema.
@nsanta For the time being, let's not worry about pre-populating metadata, and just worry about extracting the PDF contents for the long_content field. I think it makes sense to parse metadata and include it in a future PR.
Those fields are good for books, but we should have some additional ones for articles: journal, pages, volume, number, doi…
David W.
Good point, @dweinberger. We should definitely do a thorough review of current and desired metadata.
Since the upload feature is considered high priority, in the spirit of "build and improve" let's stick with the metadata we have now for implementing this feature. We can return and improve it in the future once we review and expand the metadata.
Currently, users are able to upload only URL resources. DM says we should be able to upload other types for demo purposes.
Update 9/8/17
Please proceed with this task. This is going to involve a few things, but basically, we want to update the Add Resource form to accept a PDF (no other file types should be allowed).
The form should include:
Once the PDF is uploaded, we'd like to use a library like pdf-reader (open to other suggestions here) to actually parse the contents of the PDF and keep the full text in the long_content field of the Resource.
Feel free to split any of these tasks out into their own cards to make smaller, more logical PRs / commits.
Clarification: After this task is completed, a user could specify either a URL or a PDF as the resource. That is, the URL version of resource upload that exists now would be augmented by this task to have the extra metadata fields.
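One way to picture the "URL or PDF, and PDF only" rule is the sketch below. It assumes the two sources are mutually exclusive (the thread does not say what happens if both are given) and checks the upload by file extension plus the "%PDF-" header that PDF files conventionally start with. The function names are made up for illustration.

```python
from typing import Optional

# PDF files conventionally begin with this header, e.g. b"%PDF-1.4".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(filename: str, head: bytes) -> bool:
    """Cheap check on the file name and the first bytes of the upload."""
    return filename.lower().endswith(".pdf") and head.startswith(PDF_MAGIC)


def resource_source(url: Optional[str], filename: Optional[str], head: bytes = b"") -> str:
    """Return "url" or "pdf", or raise ValueError when the submission is not acceptable."""
    has_url = bool(url and url.strip())
    has_file = bool(filename)
    # Assumption: exactly one of the two sources must be supplied.
    if has_url == has_file:
        raise ValueError("provide either a URL or a PDF, not both or neither")
    if has_file and not looks_like_pdf(filename, head):
        raise ValueError("only PDF uploads are allowed")
    return "pdf" if has_file else "url"
```

The header check is only a first filter; a file that passes it can still fail to parse later, so the text extraction step would need its own error handling.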