inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License

Querying records? #2303

Closed glouppe closed 10 years ago

glouppe commented 10 years ago

As part of my journey to understand how Invenio works, I have been trying to understand how records are represented, stored and queried in pu. As I found several ways to access records and several places where they are stored, I must admit that I am a bit confused.

1) modules.records.models.Record + modules.records.models.RecordMetadata

Together, these classes define SQLAlchemy models where the content of the record (title, authors, etc) is stored in JSON in the RecordMetadata.json field.
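As a rough, self-contained sketch of this layout (table and column names are simplified stand-ins for the real SQLAlchemy models, using SQLite instead of the actual backend):

```python
import json
import sqlite3

# Minimal stand-in for the Record/RecordMetadata split described above:
# the record row holds only the id, while the content (title, authors, ...)
# lives as serialized JSON in a separate metadata table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE record_metadata ("
    "  id INTEGER PRIMARY KEY REFERENCES record(id),"
    "  json TEXT NOT NULL)"
)

content = {"title": "A Higgs boson search", "authors": ["Doe, J."]}
conn.execute("INSERT INTO record (id) VALUES (1)")
conn.execute("INSERT INTO record_metadata (id, json) VALUES (?, ?)",
             (1, json.dumps(content)))

row = conn.execute("SELECT json FROM record_metadata WHERE id = 1").fetchone()
print(json.loads(row[0])["title"])  # A Higgs boson search
```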

2) modules.records.api.Record

This class extends the SmartJson class to load a record from the corresponding modules.formatter.models.Bibfmt object, which also stores the content of the record in a binary value field. Through "smart queries", the content of the record can then be accessed.
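A rough stand-in for this second path (a formatter table holding the record content as a serialized binary blob; the dict-based "table", the compression step, and all names here are simplifications, not the real Bibfmt schema):

```python
import json
import zlib

# Bibfmt-style storage: the record content lives as a compressed binary
# blob keyed by (record id, output format).
bibfmt = {}  # (recid, fmt) -> binary value

def store(recid, fmt, content):
    bibfmt[(recid, fmt)] = zlib.compress(json.dumps(content).encode())

def load(recid, fmt):
    return json.loads(zlib.decompress(bibfmt[(recid, fmt)]).decode())

store(1, "recjson", {"title": "A Higgs boson search"})
print(load(1, "recjson")["title"])  # A Higgs boson search
```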

3) modules.editor.models.* redefine all old bibxxx tables as SQLAlchemy models, thereby linking them to the corresponding modules.records.models.Record instances and enabling the SQLAlchemy ORM syntax on these old fields. Together, all these models also store the content of the record.

This raises the following questions:

  • The content of a record is stored in 3 different places. Why?
  • What approach shall I use? What's the preferred way to access records?
  • How to query their content? Let's say I want to find all records where the title contains "Higgs", what's the preferred way to do this? Unless I am wrong, it is unfortunately not possible to use the ORM syntax on the JSON fields, is it?
  • In case there is documentation about this, I am sorry for not having found it, but please give me the pointers :)
  • Is the API of records still something open for discussion?

CC: @MSusik

jirikuncar commented 10 years ago

The content of a record is stored in 3 different places. Why?

Legacy reasons - we are storing recstruct, recjson, indexes and "other" formats depending on configuration.

What approach shall I use? What's the preferred way to access records?

  • invenio.modules.records.api:get_record for "querying" by record id
  • invenio.legacy.search.engine.perform_request_search for other queries

How to query their content? Let's say I want to find all records where the title contains "Higgs", what's the preferred way to do this? Unless I am wrong, it is unfortunately not possible to use the ORM syntax on the JSON fields, is it?

perform_request_search(p="title:Higgs", of="recjson")
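As a backend-agnostic illustration of what such a title query amounts to under the hood (SQLite stand-in with made-up table and field names; a JSON-aware backend such as PostgreSQL could push the filter into the WHERE clause instead of filtering in Python):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record_metadata (id INTEGER PRIMARY KEY, json TEXT)")
rows = [(1, json.dumps({"title": "Observation of a Higgs-like boson"})),
        (2, json.dumps({"title": "Dark matter searches"}))]
conn.executemany("INSERT INTO record_metadata VALUES (?, ?)", rows)

# Portable fallback: pull the JSON out and filter in Python.  With a
# JSON-aware backend (PostgreSQL json/jsonb, SQLite's JSON1 extension)
# the substring test could instead live in SQL.
hits = [rid
        for rid, blob in conn.execute("SELECT id, json FROM record_metadata")
        if "Higgs" in json.loads(blob).get("title", "")]
print(hits)  # [1]
```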

In case there is documentation about this, I am sorry for not having found it, but please give me the pointers :)

Is the API of records still something open for discussion?

There is always room for improvement.

kaplun commented 10 years ago
  • invenio.legacy.search.engine.perform_request_search for other queries

In the future, how shall we query for records? Is JSONAlchemy going to be the way? Is it going to wrap ElasticSearch?

  • invenio.modules.records.api:get_record for "querying" by record id

Having an instance of a record at hand, shall we allow modifying it directly, relying on signals to do all the post-processing?

jirikuncar commented 10 years ago

In the future, how shall we query for records? Is JSONAlchemy going to be the way? Is it going to wrap ElasticSearch?

Hopefully there will be some new cool query object on JSONAlchemy that will wrap all possible backends.

Having an instance of a record at hand, shall we allow modifying it directly, relying on signals to do all the post-processing?

Right now you can safely modify a record only through BibUpload. We still need some "transactional" system (either BibSched or Celery); however, indexing and other tasks can rely on signals.

jalavik commented 10 years ago

Right now you can safely modify a record only through BibUpload. We still need some "transactional" system (either BibSched or Celery); however, indexing and other tasks can rely on signals.

Would it be possible to solve the "transactions" using the SQL/JSONAlchemy transactional commits or is it perhaps more complicated?

(PS: Loving the fact that signals will trigger all necessary things. No more "where is the record? Oh, forgot to run x".)

tiborsimko commented 10 years ago

In the future, how shall we query for records? Is JSONAlchemy going to be the way? Is it going to wrap ElasticSearch?

There are several query use cases:

glouppe commented 10 years ago

Right now you can safely modify a record only through BibUpload. We still need some "transactional" system (either BibSched or Celery); however, indexing and other tasks can rely on signals.

Would it be possible to solve the "transactions" using the SQL/JSONAlchemy transactional commits or is it perhaps more complicated? (PS: Loving the fact that signals will trigger all necessary things. No more "where is the record? Oh, forgot to run x".)

+1 if this is possible. It would make things much simpler to write and to discover for newcomers (I would never have thought to look at BibUpload to update a record -- i.e., one that is already uploaded. Rather, coming from Django, I would have directly used the DB interface that SQLAlchemy offers). It would also make the whole thing more robust since triggers would be automatically called for you, without the danger of missing some.

What would be the main roadblock for implementing things in this way at the moment? (I still don't have a clear picture of everything, so this is an honest question.)

e.g. querying to find records having certain exact text in some field, following some condition. Here, SQLAlchemy will provide the answer natively, depending on where the data is stored. For example, if in PostgreSQL JSON tables, then via a PostgreSQL JSON query.

I agree with this and I think this is the most reasonable thing to do. However, in practice, what approach shall we use? This suggests using modules.records.models.Record + modules.records.models.RecordMetadata, for which we have an explicit JSON column to search on. Unless I am missing something, this contradicts the advice that the preferred way to query records should be invenio.modules.records.api:get_record instead (which is indeed the way things are done currently, while the former is used nowhere).

By the way, https://github.com/inveniosoftware/invenio/blob/pu/invenio/modules/records/api.py#L73 seems to suggest that the blob should eventually be fetched from the JSON column (I assume modules.records.models.RecordMetadata instead of modules.formatter.models.Bibfmt). How are things going on this front?

jirikuncar commented 10 years ago

@glouppe you can use only functions from invenio.modules.records.api. The other models are there for legacy reasons. Concerning transactions it's not so easy for now. @egabancho is working on new uploader module that should facilitate CRUD actions around records. One can't simply change record metadata without seeing the wider picture - updating history, modifying access rights, indexing, file management, etc.

If you have a specific use case, we can organize a meeting.

glouppe commented 10 years ago

@glouppe you can use only functions from invenio.modules.records.api. The other models are there for legacy reasons.

Okay, but does this also exclude queries through the ORM architecture, Record.query.*? (through which I could also query the JSON column, via the corresponding SQLAlchemy dialect) Or shall we only use perform_request_search for non-ID-based queries?

If you have a specific use-case we can have organize a meeting.

Well, since we are about to port the author module, we will have to deal with records on a daily basis, so I was mainly asking for comments about the preferred way to do things (to have harmony and consistency within the project) rather than going straight ahead without asking for your opinion.

For now, I don't have specific use cases that come to mind, but e.g., for performance reasons I assume that we are very likely to query for multiple records at once on some given fields. In principle, I would like to avoid rewriting our own DB abstraction layer (as was done for production, see https://github.com/inveniosoftware/invenio/blob/master/modules/bibauthorid/lib/bibauthorid_dbinterface.py) and rely as much as possible on the tools that come for free with the framework. So basically, for us, it should be as painless as possible to access record fields (which is solved by the SmartJson objects :) [even if I still miss the point of having defined the redundant RecordMetadata model]) or to access records based on complex queries. Also, we should not even have to edit records, so all should be fine.

Nevertheless, I believe it would be important for the project to have guidelines on this. After all, records are at the core of the Invenio products. There should be a simple, agreed and unified way to access and modify them.

Anyway, thanks for your answers. We will start prototyping things in the next weeks and get back to you if we get stuck at some point.

jalavik commented 10 years ago

One can't simply change record metadata without seeing the wider picture - updating history, modifying access rights, indexing, file management, etc.

Wouldn’t this be nicely handled with signals, though? Would this make sense?:

rec = Record.query.get(123)
rec["title"]["title"] = "Wall of the cave"
rec.save()
# which calls commit() and emits update signal
# now if any receiver of update signal fails, a rollback() is issued

I am probably not seeing the full picture as @jirikuncar hinted, so forgive my naive method.

jirikuncar commented 10 years ago

@jalavik this might work for simple fields. However, not all our backends have fully proven transactional support, and some actions might require touching other tables or backends too. Hence we might need cross-database transactions.

Or shall we only use perform_request_search for non-ID based queries?

@glouppe that's the safest option for now.

tiborsimko commented 10 years ago

I would never have thought to look at BibUpload to update a record -- i.e., one that is already uploaded.

@glouppe BTW this shouldn't be needed. Once BibUpload has run, the record is as uploaded as it can get :smile: But it is not yet indexed, not yet preformatted, not yet ranked, not yet collection-clustered. For these we have periodical daemon tasks that run in the background. http://invenio-demo.cern.ch/help/admin/howto-run

tiborsimko commented 10 years ago

After all, records are at the core of the invenio products. There should be a simple, agreed and unified way to access and modify them.

I fully agree. We could perhaps take this occasion to improve our existing record-related documentation:

See also #1799 and #1835.

kaplun commented 10 years ago

Once BibUpload has run, the record is as uploaded as it can get :smile: But it is not yet indexed, not yet preformatted, not yet ranked, not yet collection-clustered.

Yes, and so this RFC could indeed also be used as a way to ask ourselves: is this the way we still want it to be? The fact that after a record is uploaded it is not yet assigned to collections, indexes, etc. is one of the issues we are often trying to work around in several ways in Invenio (e.g. when we created sequenceid for bibtasks in order to be able to send an email at the end of the processes).

In principle we should be able to consider each such series of tasks as an atomic operation and either do it synchronously or do it asynchronously with message passing.

E.g. in MongoDB/Solr/Elasticsearch, when a document is pushed to them, by the time the responsible function returns, all the operations needed to update the indexes are already done. Shall we aim for the same expectation, especially given the modern technologies (Celery & Elasticsearch, etc.) that we have now?

tiborsimko commented 10 years ago

Yes, and so this RFC could be indeed used also as way to ask ourselves: is this the way we still want it to be?

This is quite a different topic than "RFC Querying records?" which was the topic at hand.

A tiny comment on this though: the new uploader would simply fire signals, and subscribers would launch whatever post-upload operations are necessary. However, it is still interesting to decouple some of these operations for later: e.g. refextract/plotextract/bibclassify and friends can take N minutes to complete, and it would not be feasible to fire them up after every metadata update.

Another example: the indexer is much faster when it runs in batch bulk mode instead of progressively updating the index after every single record update. So if someone happens to touch 1,000,000 records on an "exceptional day" in order to move the 65017 tag to 65027 (say), then he/she doesn't really want to use "regular day" signals during the upload phase. The uploader would still "differentiate" between these use cases in the interest of speed.

Anyway, we should really be musing about this side topic elsewhere, notably in #1772.

tiborsimko commented 10 years ago

perform_request_search(p="title:Higgs", of="recjson")

In case there is documentation about this, I am sorry for not having found it, but please give me the pointers :)

It is sort of covered in /help/hacking/search-engine-api on Invenio 1 based instances, but I'll enrich the documentation to mention XML API and JSON API more explicitly, based on some past presentations.

tiborsimko commented 10 years ago

I'll enrich the documentation to mention XML API and JSON API more explicitly, based on some past presentations.

Done:

Can be enriched further...

jirikuncar commented 10 years ago

In reply to https://github.com/inveniosoftware/invenio/issues/2303#issuecomment-56512945 by @jalavik

rec = Record.query.get(123)
rec["title"]["title"] = "Wall of the cave"
# (a) two methods
rec.save()
rec.save_async()

# - or -

# (b) parametrized save 
rec.save()
rec.save(_async=True)

save will not directly store anything in the DB but will instead trigger an uploader task; if not _async, it will wait for the response from Celery (whatever that means ;)).

Co-authored-by: @egabancho
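The two proposed shapes can be mocked up side by side (everything below is illustrative only; none of these classes or return values exist in Invenio, and the string results merely stand in for "blocked until done" vs. "returned a task handle"):

```python
# Option (a): two explicit methods.
class RecordA:
    def save(self):        # blocks until the uploader task finishes
        return self._upload(wait=True)

    def save_async(self):  # fire-and-forget, returns a task handle
        return self._upload(wait=False)

    def _upload(self, wait):
        return "done" if wait else "task-id"

# Option (b): one entry point, behaviour switched by a flag.
class RecordB:
    def save(self, _async=False):
        return "task-id" if _async else "done"

print(RecordA().save(), RecordA().save_async())       # done task-id
print(RecordB().save(), RecordB().save(_async=True))  # done task-id
```

Option (a) keeps the call sites self-describing, which is the readability argument made below; option (b) concentrates the behaviour in one method at the cost of a boolean flag.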

jalavik commented 10 years ago

@jirikuncar I like (a) for readability.

tiborsimko commented 10 years ago

@jirikuncar @jalavik The option (b) is not bad either: having one function do the job, and thinking of sync/async as options, would resemble our on_the_fly=True technique. But I think (a) would be more advantageous: cleaner separation. BTW we'd probably have more params, not just async True/False, e.g. to specify which async workflow to run to save the record, or to alert different people in case of async failure depending on record types.

kaplun commented 10 years ago

e.g. to specify which async workflow to run to save the record,

Shouldn't this be configured system wide?

e.g. to alert different people in case of async failure depending on record types.

shouldn't this be part of the workflow itself?

If we had only a simple save() call, then we would centralize all the workflow/instance-related configuration, rather than spreading it everywhere .save() is called.

What if the record was aware of the workflow that is needed to be executed to deal with updates? Just wild ideas...

jalavik commented 10 years ago

@tiborsimko @jirikuncar Yeah, I like (a) for its separation, and in general I have seen that boolean flags are not always that advantageous.

So one solution could be to keep the flag as a property instead of function argument:

rec.enable_async_save()
rec.save()

But I still like (a) more, I think.

kaplun commented 10 years ago

@jalavik the last proposal would introduce side effects that are difficult to debug (imagine if you pass the rec to a function that calls enable_async_save() and you are not aware of it). So I also think (a) is better.

tiborsimko commented 10 years ago

Shouldn't this be configured system wide?

Nope, I think that would be too broad: various clients may have various record-saving use cases, depending on whether they need to wait for the outcome or not. (Even the same client can.)

shouldn't this be part of the workflow itself?

Yes, naturally. Note that in my example I was only speaking about being able to specify which workflow to run, so only its ID/name would be passed, kinda save_async(workflow="booktest").

What if the record was aware of the workflow that is needed to be executed to deal with updates?

Yes, that's a very good default possibility, e.g. depending on the collection. However, some clients may still want to override it differently, depending on the situation at hand. Say a record can change collections, etc.

tiborsimko commented 10 years ago

So one solution could be to keep the flag as a property instead of function argument:

That'd be worse than an optional argument in my eyes :) I think we all like option (a) anyway; nobody has called for (b) or some alternative yet.

jirikuncar commented 10 years ago

This will be split into several smaller, more specific RFCs.