The content of a record is stored in 3 different places. Why?
Legacy reasons - we are storing recstruct, recjson, indexes and "other" formats depending on configuration.
What approach shall I use? What's the preferred way to access records?
invenio.modules.records.api:get_record for "querying" by record id, invenio.legacy.search.engine.perform_request_search for other queries.
How to query their content? Let's say I want to find all records where the title contains "Higgs", what's the preferred way to do this? Unless I am wrong, it is unfortunately not possible to use the ORM syntax on the JSON fields, is it?
perform_request_search(p="title:Higgs", of="recjson")
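For illustration, a minimal sketch of these two access paths (assuming a running Invenio application context; exact signatures may differ):

# Sketch only: the two access paths recommended above.
from invenio.modules.records.api import get_record
from invenio.legacy.search.engine import perform_request_search

# Fetch a single record by its identifier.
record = get_record(123)

# Query by search pattern; by default a list of matching record ids is
# returned, which can then be loaded individually.
recids = perform_request_search(p="title:Higgs")
records = [get_record(recid) for recid in recids]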
In case there is documentation about this, I am sorry for not having found it, but please give me the pointers :)
Is the API of records still something open for discussion?
There is always room for improvement.
In the future, how shall we query for records? Is JSONAlchemy going to be the way? Is it going to wrap ElasticSearch?
Hopefully there will be some new cool query object on JSONAlchemy that will wrap all possible backends.
Having an instance of a record at hand, shall we go for allowing to modify it, relying on signals to do all the post-processing?
Right now you can safely modify a record only through BibUpload. We still need some "transactional" system, either BibSched or Celery; however, indexing and other tasks can rely on signals.
Right now you can safely modify a record only through BibUpload. We still need some "transactional" system, either BibSched or Celery; however, indexing and other tasks can rely on signals.
Would it be possible to solve the "transactions" using the SQL/JSONAlchemy transactional commits or is it perhaps more complicated?
(PS: Loving the fact that signals will trigger all necessary things. No more "where is the record? Oh, forgot to run x".)
+1 if this is possible. It would make things much simpler to write and to discover for newcomers (I would never have thought to look at BibUpload to update a record -- i.e., one that is already uploaded. Rather, coming from Django, I would have directly used the DB interface that SQLAlchemy offers). It would also make the whole thing more robust, since triggers would be called automatically for you, without the danger of missing some.
What would be the main roadblock for implementing things in this way at the moment? (I still don't have a clear picture of everything, so this is an honest question.)
In the future, how shall we query for records? Is JSONAlchemy going to be the way? Is it going to wrap ElasticSearch?
There are several query use cases: e.g. looking for records having certain exact text in some field, following some condition. Here, SQLAlchemy will provide the answer natively, depending on where the data is stored. For example, if the data is in PostgreSQL JSON tables, then via a PostgreSQL JSON query.
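To make this concrete, a purely illustrative sketch of what such a native JSON query could look like (assuming PostgreSQL and a JSON-typed RecordMetadata.json column; as discussed below, this is not the currently supported path):

# Hypothetical sketch, assuming PostgreSQL and a JSON-typed
# RecordMetadata.json column exposed through Flask-SQLAlchemy.
from invenio.modules.records.models import RecordMetadata

# Native JSON query via the SQLAlchemy PostgreSQL dialect:
# records whose title contains "Higgs".
matches = RecordMetadata.query.filter(
    RecordMetadata.json['title']['title'].astext.contains('Higgs')
).all()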
I agree with this and I think this is the most reasonable thing to do. However, in practice, what approach shall we use? This suggests using modules.records.models.Record + modules.records.models.RecordMetadata, for which we have an explicit JSON column on which to search. Unless I am missing something, this is in contradiction with the fact that the preferred way should be to use invenio.modules.records.api:get_record instead to query records? (which is indeed the way things are done currently, while the former is used nowhere)
By the way, https://github.com/inveniosoftware/invenio/blob/pu/invenio/modules/records/api.py#L73 seems to suggest that the blob should eventually be fetched from the JSON column (I assume modules.records.models.RecordMetadata instead of modules.formatter.models.Bibfmt). How are things going on this front?
@glouppe you can use only functions from invenio.modules.records.api. The other models are there for legacy reasons. Concerning transactions, it's not so easy for now. @egabancho is working on a new uploader module that should facilitate CRUD actions around records. One can't simply change record metadata without seeing the wider picture - updating history, modifying access rights, indexing, file management, etc.
If you have a specific use case, we can organize a meeting.
@glouppe you can use only functions from invenio.modules.records.api. The other models are there for legacy reasons.
Okay, but does this also exclude queries through the ORM architecture, Record.query.*? (through which I can also query the JSON column, via the corresponding SQLAlchemy dialect) Or shall we only use perform_request_search for non-ID based queries?
If you have a specific use case, we can organize a meeting.
Well, since we are about to port the author module, we will have to deal with records on a daily basis, so I was mainly asking for comments about the preferred way to do things (to have harmony and consistency within the project) rather than going straight ahead without asking for your opinion.
For now, I don't have specific use cases that come to mind, but e.g., for performance reasons I assume that we are very likely to query for multiple records at once on some given fields. In principle, I would like to avoid rewriting our own DB abstraction layer (like it has been done for production, see https://github.com/inveniosoftware/invenio/blob/master/modules/bibauthorid/lib/bibauthorid_dbinterface.py) and rely as much as possible on the tools that come for free with the framework. So basically, for us, it should be as painless as possible to access record fields (which is solved by the SmartJson objects :) [even if I still miss the point of having defined this redundant RecordMetadata model]) or to access records based on complex queries. Also, we should not even have to edit records, so all should be fine. Nevertheless, I believe it would be important for the project to have guidelines on this. After all, records are at the core of the Invenio products. There should be a simple, agreed and unified way to access and modify them.
Anyway, thanks for your answers. We will start prototyping things in the next weeks and get back to you if we get stuck at some point.
One can't simply change record metadata without seeing the wider picture - updating history, modifying access rights, indexing, file management, etc.
Wouldn’t this be nicely handled with signals, though? Would this make sense?:
rec = Record.query.get(123)
rec["title"]["title"] = "Wall of the cave"
rec.save()
# which calls commit() and emits update signal
# now if any receiver of update signal fails, a rollback() is issued
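For illustration, a rough sketch of how the signal part could be wired up with blinker (which Flask's signals are built on); the signal, receiver and session names are made up:

# Rough sketch only; names are illustrative, not an existing Invenio API.
from blinker import Namespace

_signals = Namespace()
record_updated = _signals.signal('record-updated')

@record_updated.connect
def index_record(sender, record=None, **extra):
    # e.g. push the changed record to the indexer; an exception raised
    # here propagates back to save() below.
    pass

def save(record, session):
    # Assumes the record modification is already pending in the session.
    try:
        record_updated.send('records', record=record)  # receivers may raise
        session.commit()   # commit only if every receiver succeeded
    except Exception:
        session.rollback()
        raise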
I am probably not seeing the full picture as @jirikuncar hinted, so forgive my naive method.
@jalavik this might work for simple fields. However not all our backends have full proven transactional support and some actions might require touching other tables or backends too. Hence we might need cross-database transactions.
Or shall we only use perform_request_search for non-ID based queries?
@glouppe that's the safest option for now.
I would never have thought to look at BibUpload to update a record -- i.e., one that is already uploaded.
@glouppe BTW this shouldn't be needed. Once BibUpload has run, the record is as uploaded as it can get :smile: But it is not yet indexed, not yet preformatted, not yet ranked, not yet collection-clustered. For these we have periodical daemon tasks that run in the background. http://invenio-demo.cern.ch/help/admin/howto-run
After all, records are at the core of the invenio products. There should be a simple, agreed and unified way to access and modify them.
I fully agree. We could perhaps take this occasion to improve our existing record-related documentation. See also #1799 and #1835.
Once BibUpload has run, the record is as uploaded as it can get :smile: But it is not yet indexed, not yet preformatted, not yet ranked, not yet collection-clustered.
Yes, and so this RFC could indeed also be used as a way to ask ourselves: is this the way we still want it to be? The fact that after a record is uploaded it is not yet assigned to collections, to indexes, etc. is one of the issues we are often trying to work around in several ways in Invenio (e.g. when we created sequenceid for bibtasks in order to be able to send an email at the end of the processes).
In principle we should be able to consider such a series of tasks as an atomic operation and either do it synchronously or do it asynchronously with message passing.
E.g. in MongoDB/Solr/Elasticsearch, when a document is pushed to them, all the operations needed to update the indexes are already done by the time the responsible function returns. Shall we aim for the same expectation, especially given the modern technologies (Celery & Elasticsearch, etc.) that we have now?
Yes, and so this RFC could be indeed used also as way to ask ourselves: is this the way we still want it to be?
This is quite a different topic than "RFC Querying records?" which was the topic at hand.
A tiny comment on this though: the new uploader would simply fire signals and subscribers would launch whatever post-upload operations are necessary. However, it is still interesting to decouple some of these operations for later: e.g. refextract/plotextract/bibclassify and friends can take N minutes to complete and it would not be feasible to fire them up after every metadata update. Another example: the indexer is much faster when it runs in batch bulk mode instead of progressively updating the index after every single record update. So if someone happens to touch 1,000,000 records on an "exceptional day" in order to move the 65017 tag to 65027 (say), then he/she doesn't really want to use "regular day" signals during the upload phase. The uploader would still "differentiate" between these use cases in the interest of speed.
Anyway, we should really be musing about this side topic elsewhere, notably in #1772.
perform_request_search(p="title:Higgs", of="recjson")
In case there is documentation about this, I am sorry for not having found it, but please give me the pointers :)
It is sort of covered in /help/hacking/search-engine-api on Invenio 1 based instances, but I'll enrich the documentation to mention XML API and JSON API more explicitly, based on some past presentations.
I'll enrich the documentation to mention XML API and JSON API more explicitly, based on some past presentations.
Done:
Can be enriched further...
In reply to https://github.com/inveniosoftware/invenio/issues/2303#issuecomment-56512945 by @jalavik
rec = Record.query.get(123)
rec["title"]["title"] = "Wall of the cave"
# (a) two methods
rec.save()
rec.save_async()
# - or -
# (b) parametrized save
rec.save()
rec.save(_async=True)
save will not directly store anything in the db but will instead trigger an uploader task, and if not _async it will wait for the response from Celery (whatever that means ;)).
Co-authored-by: @egabancho
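For what it's worth, a rough sketch of how option (a) could sit on top of Celery; the task and the methods here are made up for illustration, not the actual uploader API:

# Rough sketch; upload_record and this Record class are illustrative only.
from celery import shared_task

@shared_task
def upload_record(record_json):
    # ... run the uploader workflow: history, access rights, indexing, ...
    return record_json.get('recid')

class Record(dict):
    def save_async(self):
        # Enqueue the uploader task and return immediately (fire and forget).
        return upload_record.delay(dict(self))

    def save(self, timeout=30):
        # Blocking variant: wait until the uploader task has finished.
        return upload_record.delay(dict(self)).get(timeout=timeout)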
@jirikuncar I like (a) for readability.
@jirikuncar @jalavik The option (b) is not bad either: having one function to do the job, and thinking of sync/async as options, would resemble our on_the_fly=True technique. But I think (a) would be more advantageous: cleaner separation. BTW we'd probably have more params, not just async True/False, e.g. to specify which async workflow to run to save the record, e.g. to alert different people in case of async failure depending on record types.
e.g. to specify which async workflow to run to save the record,
Shouldn't this be configured system wide?
e.g. to alert different people in case of async failure depending on record types.
shouldn't this be part of the workflow itself?
If we had only a simple save() call, then we would centralize all the workflow/instance related configuration, rather than spreading it everywhere .save() is called.
What if the record was aware of the workflow that is needed to be executed to deal with updates? Just wild ideas...
@tiborsimko @jirikuncar Yeah, I like (a) for its separation, and in general I have seen that boolean flags are not always that advantageous.
So one solution could be to keep the flag as a property instead of a function argument:
rec.enable_async_save()
rec.save()
But I still like (a) more, I think.
@jalavik the last proposal would introduce side effects that are difficult to debug (imagine if you pass the rec to a function that calls enable_async_save() and you are not aware of it).
So I also think (a) is better.
Shouldn't this be configured system wide?
Nope, I think that would be too broad: various clients may have various record-saving use cases, depending on whether they need to wait for the outcome or not. (Even the same client can.)
shouldn't this be part of the workflow itself?
Yes, naturally. Note that in my example I was only speaking about being able to specify which workflow to run, so only its ID/name would be passed, kinda save_async(workflow=booktest).
What if the record was aware of the workflow that is needed to be executed to deal with updates?
Yes, that's a very good default possibility, e.g. depending on the collection. However, some clients may still want to override it differently, depending on the situation at hand. Say record can change collections, etc.
So one solution could be to keep the flag as a property instead of function argument:
That'd be worse than an optional argument in my eyes :) I think we all like option (a) anyway; nobody has called for (b) or some alternative yet.
This will be split into several smaller, specific RFCs.
As part of my journey to understand how Invenio is working, I have been trying to understand how records are represented, stored and queried in pu. As I found several ways to access records and several places where they are stored, I must admit that I am a bit confused.
1) modules.records.models.Record + modules.records.models.RecordMetadata
Together, these classes define SQLAlchemy models where the content of the record (title, authors, etc.) is stored in JSON in the RecordMetadata.json field.
2) modules.records.api.Record
This class extends the SmartJson class to load a record from the corresponding modules.formatter.models.Bibfmt object, which also stores the content of the record in a binary value field. Through "smart queries", the content of the record can then be accessed.
3) modules.editor.models.*
These redefine all the old bibxxx tables as SQLAlchemy models, thereby linking them to the corresponding modules.records.models.Record instances and enabling the SQLAlchemy ORM syntax on these old fields. Together, all these models also store the content of the record.
This raises the following questions:
- The content of a record is stored in 3 different places. Why?
- What approach shall I use? What's the preferred way to access records? How to query their content? Let's say I want to find all records where the title contains "Higgs", what's the preferred way to do this? Unless I am wrong, it is unfortunately not possible to use the ORM syntax on the JSON fields, is it? In case there is documentation about this, I am sorry for not having found it, but please give me the pointers :)
- Is the API of records still something open for discussion? In the future, how shall we query for records? Is JSONAlchemy going to be the way? Is it going to wrap ElasticSearch? Having an instance of a record at hand, shall we go for allowing to modify it, relying on signals to do all the post-processing?
CC: @MSusik