Open antolinos opened 2 years ago
You are correct regarding the name of the fields. Let me know if you could submit a PR for the documentation.
Regarding the type of fields, you are also correct. My intention was to allow either a simple string or a complex structure, but the model currently requires a structure. In my mind both are valid options, although I have tested only with complex structures.
I will check the feasibility of allowing fields to be either a simple string (for example, a simple abstract) or a complex structure (like the one you provided as an example).
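As a rough illustration of what accepting both shapes could look like, here is a minimal sketch that wraps a bare string into a dictionary before it reaches the model. The function name `normalize_fields` and the default key `"text"` are hypothetical and are not part of the scoring service; this is not how the service is implemented.

```python
from typing import Any, Dict, Union


def normalize_fields(
    fields: Union[str, Dict[str, Any]], default_key: str = "text"
) -> Dict[str, Any]:
    """Return `fields` as a dictionary.

    A bare string is wrapped under `default_key` (a hypothetical name);
    a dictionary is returned as a shallow copy.
    """
    if isinstance(fields, str):
        return {default_key: fields.strip()}
    if isinstance(fields, dict):
        return dict(fields)
    raise TypeError(f"fields must be a str or a dict, got {type(fields).__name__}")


# Both shapes end up as a dictionary.
as_string = normalize_fields("cm01 GridSquare_8305671cflat001 mx2005 ")
as_struct = normalize_fields({"proposal_name": "cm01", "visit_name": "mx2005"})
```

With something like this in front of the model, a client could send either form, while the stored item would always be a structure.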
This was my first and probably naive approach:
{
"id":186877495,
"group":"datasets",
"fields":"cm01 GridSquare_8305671cflat001 Near atomic resolution cryoEM structure of the type 6 secretion system membrane complex mx2005 10.15151/ESRF-ES-186874482 "
}
But I have not tried yet to weight the datasets.
By the way, what are the valid values for group?
I would structure the item in the following way:
{ "id":186877495, "group":"datasets", "fields":{ "abstract" : "cm01 GridSquare_8305671cflat001 Near atomic resolution cryoEM structure of the type 6 secretion system membrane complex mx2005 10.15151/ESRF-ES-186874482 " } }
group can be any string that you would like to use, as long as it is consistent in terms of upper or lower case.
But it is not the abstract; it is the concatenation of the proposal name, dataset name, proposal title, visit name and doi. Looking at the weights computation, won't the word abstract bias the results, since it will be used as a term shared by all datasets?
group can be any string that you would like to use, as long as it is consistent in terms of upper or lower case.
I did not get that; what does it mean? Should they be upper or lower case? I thought that group (datasets or documents) was used by the search API to distinguish between proposal and publication.
If it is not the case, what is group used for?
Based on the latest post, I would structure your item as follows:
{ "id":186877495, "group":"datasets", "fields":{ "proposal_name" : "cm01", "dataset_name" : "GridSquare_8305671cflat001", "proposal_title" : "Near atomic resolution cryoEM structure of the type 6 secretion system membrane complex", "visit_name" : "mx2005", "doi" : "10.15151/ESRF-ES-186874482" } }
although doi will not add any value. The word abstract will add some bias, but if you have enough entries, the bias will be minimal.
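For illustration, an item with that structure could be assembled from the separate values like this. It is only a sketch: the helper name `build_item` is hypothetical, and the only keys assumed are the id, group and fields keys used in this thread.

```python
import json
from typing import Dict


def build_item(item_id: int, group: str, fields: Dict[str, str]) -> dict:
    """Assemble one item, dropping empty values and trimming whitespace."""
    cleaned = {key: value.strip() for key, value in fields.items() if value and value.strip()}
    return {"id": item_id, "group": group, "fields": cleaned}


item = build_item(
    186877495,
    "datasets",
    {
        "proposal_name": "cm01",
        "dataset_name": "GridSquare_8305671cflat001",
        "proposal_title": "Near atomic resolution cryoEM structure of the type 6 "
        "secretion system membrane complex",
        "visit_name": "mx2005",
        "doi": "",  # left empty here, so it is omitted from the item
    },
)

# JSON body for a single item; how items are batched in a request is not assumed here.
payload = json.dumps(item)
```

Keeping each value in its own key avoids a shared word such as abstract, and it makes it easy to leave out a value like doi that does not help the scoring.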
The group is an arbitrary string that you assign to group together a number of items. Example: if I want to score separately a group of items that are derived from the datasets and from documents, I will insert "datasets" in the item belonging to the first group, and "documents" in the items belonging to the second one. When I request a score, I can specify that I would like to get scores of items belonging to the group "datasets", so I can limit which items I work on and increase performance.
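The idea can be sketched in a few lines of plain Python. This is only an illustration of grouping by an exact, case-sensitive string. The helper `items_in_group` is hypothetical and says nothing about how the service filters items internally.

```python
from typing import Dict, Iterable, List


def items_in_group(items: Iterable[Dict], group: str) -> List[Dict]:
    """Keep only the items whose group matches exactly (case-sensitive)."""
    return [item for item in items if item.get("group") == group]


items = [
    {"id": 1, "group": "datasets", "fields": {"proposal_name": "cm01"}},
    {"id": 2, "group": "documents", "fields": {"title": "A publication"}},
    {"id": 3, "group": "Datasets", "fields": {"proposal_name": "mx2005"}},
]

datasets = items_in_group(items, "datasets")
dataset_ids = [item["id"] for item in datasets]
```

Item 3 is not selected because "Datasets" and "datasets" are different strings, which is why the group should be written with consistent upper or lower case.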
The group in the scoring is not related to the type shown in the portal. Document type is a field of the panosc document model. BTW, thank you for bringing this to my attention. I will make sure to update the documentation to make this clear.
I am in the process of populating the scoring DB. It is very well documented but I found some minor pitfalls that could be easily improved. The documentation says that the model has 3 fields:
I found some issues when I was playing with it:
The _ prefix. When I was trying to POST /items, the documentation says that each item should match the structure defined in the model section; however, I only got it to work when I removed the _. That means group instead of _group and fields instead of _fields. I saw that id is an alias of _id: https://github.com/panosc-eu/panosc-search-scoring/blob/a94ab1e5e281c54d25fe71b43d3b6e549d9f6154/app/models/items.py#L9
I would propose to keep only id, at least in the documentation, which seems more coherent to me.
The _fields type. It is a dictionary, even though the documentation says that it can contain just a string.
The next item did not work for me:
and will raise the following error:
This worked: