Closed michamos closed 7 years ago
doctype:
I think we should carefully consider what enumeration to use here, and it might be a nested structure
for example, fulltext and hidden isn't sufficiently fine grained to distinguish policy decisions and obligations towards individual publishers
- arXiv fulltext is hidden for different reasons than Springer fulltext
- amount of text in snippets is dependent on publisher
- OA content, like arXiv, still may be hidden due to agreement to not divert traffic from arXiv, etc.
fulltext is pretty generic -- is it the fulltext of the article or the fulltext of an addendum?
it is a recurring question how many "fulltext" records INSPIRE holds and there is no definitive answer to this, since in legacy it is ambiguous what fulltext (doctype Inspire-public + springer + main + ...) combined with a certain mime-type actually represents content-wise. the only doctype we can safely assume is not fulltext is PLOT.
this definitely should be cleaned up.
I think it would be useful to distinguish
I think we also need to track
the mime-type should be based on magic and actual parsing of the file, e.g. PyPDF2, jhove http://jhove.openpreservation.org/ or similar
T
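To illustrate the point about deriving the mime-type from the file contents rather than from its name, here is a minimal sketch that only checks leading magic bytes. The helper name `sniff_mime_type` and the signature table are assumptions made for illustration; this is not a substitute for actually parsing the file with PyPDF2 or jhove, which would also catch truncated or malformed files.

```python
# Hypothetical helper: guess a MIME type from the leading bytes of a file
# instead of trusting its extension. Only a few well-known signatures are
# listed; a real check would go on to parse the file as well.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%!PS", "application/postscript"),
]


def sniff_mime_type(data, default="application/octet-stream"):
    """Return the MIME type whose signature matches the start of `data`."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return default
```

A file named `paper.pdf` whose first bytes are not `%PDF-` would fall through to the default, which is the kind of mismatch a declared `mime_type` alone would hide.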
Good points, @tsgit.
> arXiv fulltext is hidden for different reasons than Springer fulltext
does the reason for hiding it need to be expressed in the metadata of the records?
> amount of text in snippets is dependent on publisher
if it is dependent on publisher, it should probably not live in the literature records then. I have no idea how this record-dependent snippet length should be handled anyway, @jacquerie @kaplun ?
> fulltext is pretty generic -- is it the fulltext of the article or the fulltext of an addendum?
we can add a `material` field for that.
> I think we also need to track provenance
we need to add a `source` indeed.
> I have no idea how this record-dependent snippet length should be handled anyway, @jacquerie @kaplun ?
Sorry, I don't understand the question. Which snippets are you talking about?
@jacquerie http://inspirehep.net/search?p=fulltext+"quark"
there are text snippets with the search term highlighted
the fulltext can be provided to us by publisher with restrictions. We can use it for refextract and fulltext indexing, but we are only allowed to display x (small number) words or at most a paragraph in the results -- the value of x may vary by publisher or even series
I note that snippets implementation in legacy isn't working very well for search phrases, and it also may not always show, but that's beside the point here
so the underlying issue here is that a "file" can have policies tied to it and the schema should have a way to ref those
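One way to picture a file entry that refers to a policy, rather than carrying the policy details itself, is sketched below. Everything here is hypothetical: the `policy` key, the policy names, and the word limits are made up for illustration and do not reflect any real publisher agreement or existing INSPIRE schema.

```python
# Hypothetical policy table, kept outside the literature record so that a
# per-publisher limit can change without touching every record.
SNIPPET_POLICIES = {
    "publisher-restricted": {"max_snippet_words": 10},
    "open": {"max_snippet_words": None},  # None means no limit
}

# Hypothetical file entry: it only names the policy that applies to it.
example_file = {"key": "fulltext.pdf", "policy": "publisher-restricted"}


def make_snippet(text, policy_key):
    """Truncate `text` to the number of words the named policy allows."""
    limit = SNIPPET_POLICIES[policy_key]["max_snippet_words"]
    words = text.split()
    if limit is None or len(words) <= limit:
        return " ".join(words)
    return " ".join(words[:limit]) + " ..."
```

With this layout the record answers "which policy applies to this file", and the value of x lives in one place per publisher or series.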
The contents of the `_files` field for Literature records are supposed to contain the metadata to retrieve the file by `invenio-records-files`. The schema we have for it was copied from Zenodo and so contains the basic info in the invenio-records-files schema, but also some additional Zenodo-specific stuff (`previewer`, `type`) that we probably don't need.
The workflow is using this field in yet another way, writing `description` and `doctype` there (for arXiv PDF and extracted plots), which are not currently in the schema. This doesn't cause any error now as the results of `_files` are discarded anyway and never sent to legacy, but we should decide on what information we really want to have there.
@kaplun and @tsgit know how files are handled on legacy and could share their experience. Discussing with @jacquerie, we identified the following keys that might be useful:
- `doctype` (or `document_type`?): to signal what kind of document is attached. This would be an `enum` with values `fulltext`, `plot`, what else?
- `mime_type`: how this document is encoded, which might warrant a different handling (e.g. PDF vs XML for a fulltext).
- `hidden`: a flag to indicate whether this file is publicly visible (would be true for fulltexts used for indexing that we may not serve directly to our users).