greencommons / commons

https://greencommons.net

(L) Enable users to upload resources. #212

Open ptrikutam opened 7 years ago

ptrikutam commented 7 years ago

Resolve the myriad issues this entails -- most especially dealing with spammers.

Currently, users are able to upload only URL resources. DM says we should be able to upload other types for demo purposes.


Update 9/8/17

Please proceed with this task. This is going to involve a few things, but basically, we want to update the Add Resource form to accept a PDF (no other file types should be allowed).
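
A minimal sketch of the PDF-only restriction, assuming the form posts a standard multipart field (the `params[:resource][:file]` name is an assumption, not existing code):

```ruby
# Hypothetical controller-level check: reject anything that isn't a PDF.
class ResourcesController < ApplicationController
  def create
    upload = params[:resource][:file]
    if upload.present? && upload.content_type != 'application/pdf'
      flash.now[:error] = 'Only PDF files can be uploaded.'
      return render :new
    end
    # ... existing URL-resource handling continues here ...
  end
end
```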

The form should include:

Once the PDF is uploaded, we'd like to use a library like pdf-reader (open to other suggestions here) to actually parse the contents of the PDF and keep the full text in the long_content field of the Resource.
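
A rough sketch of that parsing step with the pdf-reader gem (the helper name and wiring are assumptions):

```ruby
require 'pdf/reader'

# Extract the full text of an uploaded PDF, one page at a time.
def extract_long_content(path)
  reader = PDF::Reader.new(path)
  reader.pages.map(&:text).join("\n")
end

# e.g. resource.long_content = extract_long_content(upload.path)
```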

Feel free to split any of these tasks out into their own cards to make smaller, more logical PRs / commits.

Clarification: After this task is completed, a user could specify either a URL or a PDF as the resource. That is, the URL version of resource upload that exists now would be augmented by this task to have the extra metadata fields.

ptrikutam commented 7 years ago

PT: OK. Yes, there may be some stuff (e.g., ingestion) that we need to work through to make sure things are indexed properly. That said, we could have people manually fill in fields and have the document indexed that way, which would be much simpler. Let's review this and other features tomorrow in our call.

ptrikutam commented 7 years ago

DW: For a demo, it's fine. In real life, it's a whole kettle of fish, especially if these works are going to be indexed and will show up in search results. Issues include (to be obvious) copyright violations, privacy violations, incorrect metadata added by the person uploading it, spam, and offensive material (porn but also being flooded with climate denial resources?)

nsanta commented 7 years ago

@ptrikutam

I was thinking of implementing the functionality for this main use case:

  1. The user picks a PDF file.
  2. Metadata is extracted when possible.
  3. The file is uploaded directly to S3.
  4. After the upload, the resource is saved with the file path and metadata.
  5. A new background job is triggered to scan, filter, and index the PDF content (see the sketch below).
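
A hedged sketch of step 5 as an ActiveJob (the job name and the `file_url` column are assumptions):

```ruby
require 'open-uri'
require 'pdf/reader'

# Hypothetical background job: fetch the uploaded PDF from S3,
# extract its text, and store it so indexing can pick it up.
class IndexPdfJob < ApplicationJob
  queue_as :default

  def perform(resource_id)
    resource = Resource.find(resource_id)
    # `file_url` is the S3 location saved in step 4 (assumed column name).
    # URI.open requires Ruby 2.5+; older Rubies use Kernel#open from open-uri.
    URI.open(resource.file_url) do |io|
      reader = PDF::Reader.new(io)
      resource.update!(long_content: reader.pages.map(&:text).join("\n"))
    end
  end
end
```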

Thoughts?

ptrikutam commented 7 years ago

So we actually may not want to auto-extract the metadata. I think asking the user to manually input some of the metadata might be preferable anyway.

@gtourtellot do you know an example resource with a good amount of metadata included? That can be a template for @nsanta to build the form off of.

nsanta commented 7 years ago

@ptrikutam Sorry, I wasn't clear about what I wanted to suggest:

  1. The metadata can be extracted to pre-populate the fields when possible -- see the sketch below.
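
For what it's worth, pdf-reader exposes the document's info dictionary, so pre-population could look roughly like this (the field mapping is an assumption):

```ruby
require 'pdf/reader'

# Pull whatever embedded metadata exists to pre-fill the form.
# Keys like :Title and :Author are only present when the PDF supplies them.
def prefill_metadata(path)
  reader = PDF::Reader.new(path)
  info = reader.info || {}
  { title: info[:Title], author: info[:Author], pages: reader.page_count }.compact
end
```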
ptrikutam commented 7 years ago

Ah, now I understand. That makes sense, and would be awesome if it's not too much effort.

gtourtellot commented 7 years ago

I don't think we should expect to fill in much metadata automatically. There's just too much variability in PDFs to get it right.

The Python PDF reader library I was using tries to extract: Author, CreationDate, Creator, ModDate, Producer, and Title. It works OK for some of the books we've been provided, i.e., it gets author and title right. However, it fails on other (non-book) PDFs. Extracting the number of pages, by contrast, is something we can expect to do reliably.

Metadata fields we could present on a form might include: resource_type, title, short_content, date (i.e., pub date), tags (e.g., from applying ClimateTagger to the full contents), rights (some string about rights), pages (number of pages), isbn, and content_url (an associated external URL).

See the schema for reference: https://github.com/greencommons/commons/blob/master/lib/etl/schema/resource_array_schema.json
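
As a hedged illustration, that field list could translate into strong parameters along these lines (names follow the list above, not confirmed schema columns):

```ruby
# Hypothetical strong-params whitelist for the metadata form.
def resource_params
  params.require(:resource).permit(
    :resource_type, :title, :short_content, :date, :rights,
    :pages, :isbn, :content_url, tags: []
  )
end
```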

ptrikutam commented 7 years ago

@gtourtellot, fair enough. Thanks for sharing the schema.

@nsanta For the time being, let's not worry about pre-populating metadata, and just worry about extracting the PDF contents for the long_content field. I think it makes sense to parse metadata and include it in a future PR.

dweinberger commented 7 years ago

Those fields are good for books, but we should have some additional ones for articles: journal, pages, volume, number, doi…


gtourtellot commented 7 years ago

Good point, @dweinberger. We should definitely do a thorough review of current and desired metadata.

Since the upload feature is considered high priority, in the spirit of "build and improve", let's stick with the metadata we have now for implementing this feature. We can return and improve it in the future once we review and expand the metadata.