what we talk about when we talk about _files

michamos commented 7 years ago

The _files field of a Literature record is supposed to contain the metadata that invenio-records-files needs to retrieve the file.

The schema we have for it was copied from Zenodo, so it contains the basic info in the invenio-records-files schema, but also some additional Zenodo-specific stuff (previewer, type) that we probably don't need.

The workflow is using this field in yet another way, writing description and doctype there (for arXiv PDF and extracted plots), which are not currently in the schema. This doesn't cause any error now as the results of _files are discarded anyway and never sent to legacy, but we should decide on what information we really want to have there.

@kaplun and @tsgit know how files are handled on legacy and could share their experience. Discussing with @jacquerie, we identified the following keys that might be useful:

doctype (or document_type?): to signal what kind of document is attached. This would be an enum with values fulltext, plot, what else?
mime_type: how this document is encoded, which might warrant a different handling (e.g. PDF vs XML for a fulltext).
hidden: a flag to indicate whether this file is hidden from public view (would be true for fulltexts used for indexing that we may not serve directly to our users).
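
A minimal sketch of how the three proposed keys could be checked on a single _files entry. The key names follow the proposal above; the helper name `validate_file_entry` and the two-value enum are assumptions for illustration, not the actual inspire-schemas definition.

```python
# Hypothetical sketch: the enum only lists the two values named in this
# thread ("what else?" is still an open question).
DOCTYPES = {"fulltext", "plot"}


def validate_file_entry(entry):
    """Return a list of problems found in one _files entry (empty if none)."""
    problems = []
    if entry.get("doctype") not in DOCTYPES:
        problems.append("doctype must be one of %s" % sorted(DOCTYPES))
    if not isinstance(entry.get("mime_type"), str):
        problems.append("mime_type must be a string")
    # hidden is treated as optional here and defaults to False (an assumption).
    if not isinstance(entry.get("hidden", False), bool):
        problems.append("hidden must be a boolean")
    return problems


example = {"doctype": "fulltext", "mime_type": "application/pdf", "hidden": True}
print(validate_file_entry(example))
```

Treating hidden as optional is only one possible choice; the schema could equally require it on every entry.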

kaplun commented 7 years ago

tsgit commented 7 years ago

doctype:

I think we should carefully consider what enumeration to use here, and it might be a nested structure

for example, fulltext and hidden isn't sufficiently fine grained to distinguish policy decisions and obligations towards individual publishers

arXiv fulltext is hidden for different reasons than Springer fulltext
amount of text in snippets is dependent on publisher
OA content, like arXiv, still may be hidden due to agreement to not divert traffic from arXiv, etc.

fulltext is pretty generic -- is it the fulltext of the article or the fulltext of an addendum?

it is a recurring question how many "fulltext" records INSPIRE holds and there is no definitive answer to this, since in legacy it is ambiguous what fulltext (doctype Inspire-public + springer + main + ...) combined with a certain mime-type actually represents content-wise. the only doctype we can safely assume is not fulltext is PLOT.

this definitely should be cleaned up.

I think it would be useful to distinguish

actual text of the record
text of appendices, errata, ancillary material
figures contained in the document
ancillary figures, etc.
publisher fulltext as opposed to arXiv PDF, or e.g. XML

I think we also need to track

provenance

the mime-type should be based on magic and actual parsing of the file, e.g. PyPDF2, jhove http://jhove.openpreservation.org/ or similar
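
To illustrate what "based on magic" means, here is a deliberately simplified sketch that looks only at the leading bytes of a file and ignores its name. It is a stand-in for a real tool such as libmagic, PyPDF2 or jhove, not a replacement: it does not parse the file, and the function name and the short signature table are assumptions.

```python
# Leading-byte signatures for a few formats relevant to this thread.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"<?xml", "application/xml"),
]


def sniff_mime_type(data):
    """Guess a MIME type from the first bytes of the file content."""
    for magic, mime_type in SIGNATURES:
        if data.startswith(magic):
            return mime_type
    # Unknown content: fall back to the generic binary type.
    return "application/octet-stream"


print(sniff_mime_type(b"%PDF-1.4\n..."))
```

A file that merely starts with the right bytes can still be corrupt, which is why actual parsing is suggested on top of the signature check.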

T

michamos commented 7 years ago

Good points, @tsgit.

arXiv fulltext is hidden for different reasons than Springer fulltext

does the reason for hiding it need to be expressed in the metadata of the records?

amount of text in snippets is dependent on publisher

if it is dependent on publisher, it should probably not live in the literature records then. I have no idea how this record-dependent snippet length should be handled anyway, @jacquerie @kaplun ?

fulltext is pretty generic -- is it the fulltext of the article or the fulltext of an addendum?

we can add a material field for that.

I think we also need to track

provenance

we need to add a source indeed.
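
A rough sketch of how material and source could sit next to doctype, and how they would make the "how many fulltexts do we hold" question answerable. The values used for material and source ("publication", "addendum", "arXiv", "Springer") and the helper name are assumptions for illustration only.

```python
def has_article_fulltext(record):
    """True if the record holds a fulltext of the article itself."""
    return any(
        f.get("doctype") == "fulltext" and f.get("material") == "publication"
        for f in record.get("_files", [])
    )


# Made-up records, only to show the shape of the data.
records = [
    {"_files": [
        {"doctype": "fulltext", "material": "publication", "source": "arXiv"},
        {"doctype": "plot", "material": "publication", "source": "arXiv"},
    ]},
    {"_files": [
        {"doctype": "fulltext", "material": "addendum", "source": "Springer"},
    ]},
]

print(sum(has_article_fulltext(r) for r in records))
```

With this split, a fulltext of an addendum no longer counts as a fulltext of the article, and the source key records where each file came from.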

jacquerie commented 7 years ago

I have no idea how this record-dependent snippet length should be handled anyway, @jacquerie @kaplun ?

Sorry, I don't understand the question. Which snippets are you talking about?

tsgit commented 7 years ago

@jacquerie http://inspirehep.net/search?p=fulltext+"quark" shows text snippets with the search term highlighted. The fulltext can be provided to us by a publisher with restrictions: we can use it for refextract and fulltext indexing, but we are only allowed to display x (a small number of) words, or at most a paragraph, in the results. The value of x may vary by publisher or even by series.

tsgit commented 7 years ago

I note that snippets implementation in legacy isn't working very well for search phrases, and it also may not always show, but that's beside the point here

tsgit commented 7 years ago

so the underlying issue here is that a "file" can have policies tied to it and the schema should have a way to ref those
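
One way this could look, as a sketch only: each file entry carries a reference to a policy, and the policy (not the record) holds the snippet limit. The policy names, the `policy` key, the `max_snippet_words` field and the numbers are all hypothetical.

```python
# Hypothetical policy table, kept outside the literature records.
POLICIES = {
    "publisher-restricted": {"hidden": True, "max_snippet_words": 5},
    "open": {"hidden": False, "max_snippet_words": None},
}


def snippet(text, file_entry, policies=POLICIES):
    """Truncate text to the word limit of the policy the file refers to."""
    limit = policies[file_entry["policy"]]["max_snippet_words"]
    words = text.split()
    if limit is None or len(words) <= limit:
        return text
    return " ".join(words[:limit]) + " ..."


print(snippet("a b c d e f g", {"policy": "publisher-restricted"}))
```

Keeping the limit in the policy means a change in a publisher agreement touches one entry in the table rather than every record that holds a file from that publisher.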

inspirehep / inspire-schemas

what we talk about when we talk about _files #147