panosc-eu / panosc-search-scoring


Minor improvements in documentation #5

Open antolinos opened 2 years ago

antolinos commented 2 years ago

I am in the process of populating the scoring DB. It is very well documented but I found some minor pitfalls that could be easily improved. The documentation says that the model has 3 fields:

  1. _id,
  2. _group
  3. _fields

I found some issues when I was playing with it:

  1. The use of _. When I tried to POST /items, the documentation said that each item should match the structure defined in the model section; however, I only got it to work after removing the _, that is, with group instead of _group and fields instead of _fields. I saw that id is an alias of _id: https://github.com/panosc-eu/panosc-search-scoring/blob/a94ab1e5e281c54d25fe71b43d3b6e549d9f6154/app/models/items.py#L9

I would propose keeping only id, at least in the documentation, which seems more coherent to me.

  2. _fields type: it must be a dictionary, even though the documentation says that it can contain just a string:

    It can contain a string or a complex nested json object.

The following item did not work for me:

 {"id":186877495,"group":"datasets","fields": "Ford Mustang 1964"}

It raises the following error:

{'detail': "An exception of type ValidationError occurred. Arguments:\n([ErrorWrapper(exc=DictError(), loc=('fields',))], <class 'app.models.items.ItemModel'>)"}

This worked:

{"id":186877495,"group":"datasets","fields":  {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}}

nitrosx commented 2 years ago

You are correct regarding the name of the fields. Let me know if you could submit a PR for the documentation.

Regarding the type of fields, you are also correct. My intention was to allow both a simple string and a complex structure, but the model requires it to be a structure. In my mind both are valid options, although I tested only with complex structures.

I will check the feasibility of allowing fields to be either a simple string (for example, a simple abstract) or a complex structure (like the one you provided as an example).
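Until the service accepts both forms, a client could wrap a bare string before posting. The following is only a minimal sketch under assumptions taken from this thread: the item keys are `id`, `group` and `fields`, and `fields` must be a dictionary. The helper `normalize_item` and the wrapper key `"text"` are hypothetical names, not part of panosc-search-scoring.

```python
from typing import Any, Dict, Union

# Hypothetical key used to wrap a bare string; the service does not define it.
DEFAULT_FIELD_KEY = "text"


def normalize_item(
    item_id: int,
    group: str,
    fields: Union[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Build an item payload whose "fields" value is always a dictionary."""
    if isinstance(fields, str):
        # Wrap the bare string so that it passes a dict-only validation.
        fields = {DEFAULT_FIELD_KEY: fields}
    elif not isinstance(fields, dict):
        raise TypeError("fields must be a string or a dictionary")
    # Use the un-prefixed key names that were reported to work in this thread.
    return {"id": item_id, "group": group, "fields": fields}


item = normalize_item(186877495, "datasets", "Ford Mustang 1964")
structured = normalize_item(
    186877495, "datasets", {"brand": "Ford", "model": "Mustang", "year": 1964}
)
```

A dictionary is passed through unchanged, so the same helper can be used for both simple and structured items.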

antolinos commented 2 years ago

This was my first and probably naive approach:

{
"id":186877495,
"group":"datasets",
"fields":"cm01  GridSquare_8305671cflat001 Near atomic resolution cryoEM structure of the type 6 secretion system membrane complex mx2005 10.15151/ESRF-ES-186874482 "
}

But I have not yet tried to weight the datasets.

By the way, what are the valid values for group?

nitrosx commented 2 years ago

I would structure the item in the following way:

{
  "id":186877495,
  "group":"datasets",
  "fields":{
    "abstract" : "cm01 GridSquare_8305671cflat001 Near atomic resolution cryoEM structure of the type 6 secretion system membrane complex mx2005 10.15151/ESRF-ES-186874482 "
  }
}

group can be any string that you would like to use, as long as it is consistent in terms of upper or lower case.

antolinos commented 2 years ago

    I would structure the item in the following way: { "id":186877495, "group":"datasets", "fields":{ "abstract" : "cm01 GridSquare_8305671cflat001 Near atomic resolution cryoEM structure of the type 6 secretion system membrane complex mx2005 10.15151/ESRF-ES-186874482 " } }

But it is not the abstract; it is the concatenation of the proposal name, dataset name, proposal title, visit name and DOI. Looking at the weights computation, won't the word abstract bias the results, since it will be treated as a term shared by all datasets?

    group can be any string that you would like to use, as long as it is consistent in terms of upper or lower case.

I did not get that. What does it mean? Should they be upper or lower case? I thought that group (datasets or documents) was used by the search API to distinguish between proposal and publication: [image]

If that is not the case, what is group used for?

nitrosx commented 2 years ago

Based on the latest post, I would structure your item as follows:

{
  "id":186877495,
  "group":"datasets",
  "fields":{
    "proposal_name" : "cm01",
    "dataset_name" : "GridSquare_8305671cflat001",
    "proposal_title" : "Near atomic resolution cryoEM structure of the type 6 secretion system membrane complex",
    "visit_name" : "mx2005",
    "doi" : "10.15151/ESRF-ES-186874482"
  }
}

although doi will not add any value. The word abstract will add some bias, but if you have enough entries, the bias will be minimal.
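To see why a term shared by every item carries little weight, here is a small sketch. It assumes a plain TF-IDF-style weighting with an unsmoothed inverse document frequency, log(N / df). This thread does not state the formula the service actually uses, so treat the function and the toy data as illustrative assumptions only.

```python
import math
from typing import Dict, List


def inverse_document_frequency(documents: List[List[str]]) -> Dict[str, float]:
    """Return log(N / df) per term, where df counts the documents containing it."""
    total = len(documents)
    doc_freq: Dict[str, int] = {}
    for terms in documents:
        # Count each term at most once per document.
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(total / df) for term, df in doc_freq.items()}


# Toy data: "abstract" stands in for a key name that appears in every item.
docs = [
    ["abstract", "cryoem", "secretion"],
    ["abstract", "ford", "mustang"],
    ["abstract", "cryoem", "membrane"],
]
idf = inverse_document_frequency(docs)
# idf["abstract"] is 0.0 because the term appears in all three documents.
```

Under this assumption the shared term contributes nothing to the ranking, while rarer terms such as "ford" get the highest weight. A real implementation may normalize or smooth differently, which would be consistent with the small residual bias mentioned above.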

The group is an arbitrary string that you assign to group together a number of items. For example, if I want to score items derived from datasets separately from items derived from documents, I insert "datasets" in the items belonging to the first group and "documents" in the items belonging to the second one. When I request a score, I can specify that I want scores only for items belonging to the group "datasets", which limits the items I work on and improves performance.
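The effect of group can be pictured as a filter applied before scoring. The sketch below is not the service's code: `select_items` is a hypothetical helper, and the exact string comparison is an assumption based on the advice to keep upper and lower case consistent.

```python
from typing import Any, Dict, Iterable, List, Optional


def select_items(
    items: Iterable[Dict[str, Any]],
    group: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Return the items to score, optionally restricted to a single group."""
    if group is None:
        # No group requested: every item is a candidate.
        return list(items)
    # Exact, case-sensitive match: "Datasets" and "datasets" are different groups.
    return [item for item in items if item["group"] == group]


items = [
    {"id": 1, "group": "datasets", "fields": {"dataset_name": "GridSquare_8305671cflat001"}},
    {"id": 2, "group": "documents", "fields": {"proposal_name": "cm01"}},
    {"id": 3, "group": "Datasets", "fields": {"visit_name": "mx2005"}},
]
datasets_only = select_items(items, group="datasets")
```

Here item 3 is left out of `datasets_only` because its group is spelled with a capital letter, which is the kind of inconsistency to avoid when populating the database.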

The group in the scoring service is not related to the type shown in the portal. Document type is a field of the PaNOSC document model. BTW, thank you for bringing this to my attention. I will make sure to update the documentation to make this clear.