Open stephane54 opened 6 years ago
ContentConnector.search (continuation) ?
To get around the limits of ISTEX searches using from and to, I simulate them by ignoring the first from results. Inefficient but functional. Do these values start at 0 or 1 ? In other words, if from is 1, do I have to ignore 0 or 1 result ?
How should invalid from and to values be handled (either negative values, or from >= to) ? What should from and to be in SearchResult in the following situations:
OMTD Facet in SearchResult cannot hold ISTEX facet results. ISTEX facet results depend on the data type and may contain ranges, counts, minimal and maximal values, ... Things are very different. Maybe the concepts behind the word facet are different in OMTD and ISTEX. Finally, what are the facets used for ? Are they required for OMTD to work properly ? I don't think so. So the best thing to do would be to ignore them. But if they are ignored in search, this method becomes very similar to fetchMetadata, the differences being totalHits, from and to, and the way the results are returned (List vs Stream). In fact, fetchMetadata supersedes search, which could become obsolete. totalHits is the number of results read from fetchMetadata with from = first and to = last, avoiding memory limits on the number of results due to the use of List in search. What could be a concrete use case needing only a specific part of the results on arbitrary bounds ?
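The skipping approach described above can be sketched with the Java Stream API. This is only an illustration: ResultWindow and slice are hypothetical names that are not part of content-connector-api, and the convention used here (from is zero-based and inclusive, to is exclusive) is an assumption, since the thread does not settle it.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of content-connector-api. */
final class ResultWindow {

    /**
     * Returns the results in [from, to), assuming from is zero-based and
     * inclusive and to is exclusive (an assumption, not a documented rule).
     */
    static <T> List<T> slice(Stream<T> results, int from, int to) {
        if (from < 0 || to < from) {
            // Rejecting invalid bounds is one possible policy among others.
            throw new IllegalArgumentException("invalid bounds: " + from + ", " + to);
        }
        return results
                .skip(from)                 // ignore the first 'from' results
                .limit((long) to - from)    // keep at most 'to - from' results
                .collect(Collectors.toList());
    }
}
```

With this convention, slice(Stream.of("a", "b", "c", "d", "e"), 1, 3) keeps "b" and "c". The stream is still read from its beginning, so the cost grows with from, which is the inefficiency mentioned above.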
Replying briefly to some questions; I will come back later in the morning, or we can even arrange a Skype call to have an in-depth discussion:
Regarding the query, check the README in https://github.com/openminted/content-connector-api. I think it explains most of the questions about the search parameters and facet names.
The queries we are sending to the connector(s) are led by the UI in https://test.openminted.eu/resourceRegistration/corpus/searchForPublications. The values of the same facet are combined with OR, and the facets are combined with each other with AND. So, your example would become ((publicationyear = 2000) OR (publicationyear = 2003)) AND ((publicationtype = article) OR (publicationtype = thesis) OR (publicationtype = review)) (exactly as you wrote it!). We don't need range queries, comparisons, negations, etc.
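As an illustration of the OR-within-a-facet, AND-between-facets rule, here is a minimal Java sketch. FacetExpression and build are hypothetical names that are not part of content-connector-api, the params map only mimics the shape of the query parameters discussed here, and the output is a readable expression rather than real ISTEX query syntax. Skipping a facet with no values is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of content-connector-api. */
final class FacetExpression {

    /** Values of one facet are OR-ed; the per-facet groups are AND-ed. */
    static String build(Map<String, List<String>> params) {
        List<String> groups = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : params.entrySet()) {
            List<String> values = entry.getValue();
            if (values == null || values.isEmpty()) {
                continue; // assumption: a facet without values adds no constraint
            }
            String group = values.stream()
                    .map(value -> "(" + entry.getKey() + " = " + value + ")")
                    .collect(Collectors.joining(" OR "));
            groups.add("(" + group + ")");
        }
        return String.join(" AND ", groups);
    }
}
```

Called with an insertion-ordered map holding publicationyear = [2000, 2003] and publicationtype = [article, thesis, review], build returns the same expression as the one written out above. A real connector would translate each group into the syntax of its own index instead of building a display string.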
The keyword is a simple keyword search in all the fields of the metadata, something like a google search but without the extra functionality offered by google. In the OpenAIRE connector, we handle it by searching in all indexed fields of the publication.
ContentConnector.search
When we designed this method, we had in mind to display the publications in a page. Later we decided that it would be too much and not too useful for the user, so we ended up displaying the facets on the left and the number of publications in the center of the page. For the current purposes of the connector, you can get away with ignoring the "from" and "to" parameters and even return an empty list of publications (I'll make sure that this is correct and reply if this is wrong).
The facets are used (along with the keyword) to limit the number of publications that will go in the corpus. They are mandatory in the sense that if you don't return them, the content service will try to build a corpus with all your publications in it (we have a hard limit of ~1000 publications per connector to protect the content providers from abuse but this may rise or be entirely removed in the future)
A general comment that should have been posted first:
The content connectors are used by the OMTD platform to help the users browse the publications and build corpora. The order of the calls to the methods of the connectors is as follows:
The access token is user specific. Up to now, the validity of this token is not time limited, but that could change in the future. So the token can't be stored in the implementation because it would give public access to documents. It can't be stored in the OMTD user's configuration either (supposing it is technically possible) because it may become time limited. It can only be transmitted.
At one end, ISTEX authentication requires user interaction to get the token, so it can't be done elsewhere. At the other end, the connector needs to be given this token. Between the two is OMTD, which must transmit it.
I think that the ContentConnector interface has to be modified.
I've already looked at https://github.com/openminted/content-connector-api, and at the guidelines. Alone, they are not precise enough. But with your explanation of the way a corpus is built, and looking at the URL of the page for doing it (which I hadn't found before), things get much clearer.
PDF can be a reasonable format if it is "native", that is, if PDF contains text, but not if it contains in fact scanned images. In that case, a PDF to text conversion will give nothing. Also, I'm not sure of what will be output if the presentation of the document is not simple, with images, columns, parts to be ignored, ...
So, facets are only and always those listed (Rights, Publication year, Publication Type, Language, Document Type). Content Source should come from getSourceName method, and keyword from the search field. Right ?
publisher doesn't appear as facet. Is it not used (and should be ignored in the connector implementation) or simply not shown because not applicable for Core and OpenAIRE ?
The search field looks like a free text area, except that chars are upcased. Is there any other transformation, or is the content of that area transmitted to the connector as is, letting it translate it to its own syntax ? But what should I do for example with words like AND, OR, NOT, which can be boolean operators, with A, THE, ONE, which are empty words often ignored, with symbols like * ? ( ) [ ] \ " ' frequently used in expressions, ... ? Should THE and THÉ be distinguished or not ? The same keyword string will be used in different connectors, which probably rely on different tools, with their own capabilities, syntax, limits, ... So in order to tend towards a homogeneous behavior, the definition of how to interpret the string should be independent of them, and be specified in a formal and precise manner by OMTD. Also, only very basic search features should be used, hoping they are available in all tools.
Building a corpus from OpenAIRE requires an authentication which seems similar to the one needed for ISTEX. The first time I want to build a corpus, I'm redirected to an edugain authentication page. From there, I suppose that an authentication token is returned, which is stored as a cookie. The question is how does it reach the connector ? The same applies to the ISTEX connector.
Once written (which is about to be done), how can I test my connector ? Where can I find a test tool or environment ?
What should be put in the label field of a Facet in a SearchResult ?
What should be done when a param is specified in query but has no value ? 1) Values being connected by OR, the less values you specify, the less documents you get => extrapolating to zero value, you get zero document. 2) Checking no value on a check list usually means do not filter => ignore param.
Regarding the access token, the OMTD platform does keep an access token for the logged in users and this token could be passed to the connector (with a small change/addition to the interface). However, keep in mind that this token is issued by the OMTD AAI, an IdP acting as a proxy for edugain, social networks, etc. If you choose to use this token, then you'd have to refer to the AAI to retrieve information about the user (organization information is included when applicable) but it also means that you'd have to trust the information that the AAI provides. Is that ok? If yes, I can arrange for the change in the interface and also provide you with example code of how to get user info from the AAI.
PDFs are the easy option for us. If they contain scanned images or a very complex structure, then indeed we don't get very useful text, but we'll have to live with that until we are ready to deal with all the different formats.
Facets and keyword:
you are right about all statements about the facets :)
About the keyword search, we don't have any specs about the keyword, operators, etc because we can't know what each content provider supports. It's absolutely up to the implementations to decide if they are going to support complex queries or pass the keyword to their index and hope for the best. In a future version we might add some specs but in our current implementation, it would add a lot of complexity with little gains.
Regarding the testing and deployment of your connector, what we plan to do (asap) is the following:
The only thing missing is for us to have a look at your code to make sure that you are not going to format my laptop's hard drive (still haven't transferred last summer's photos!) and that you are using spring to set up your code. If you are not using spring, we'll have to write a thin layer to wrap your implementation and allow our spring-based code to pick up your code as well.
The label field of a search result is the label displayed in the UI. Try to use the same labels as the OpenAIRE or CORE connectors for the same facets. To be honest, I don't remember how the UI will behave if there is a mismatch between the labels of the same facets, so try to remain consistent.
Regarding the params, there will not be a parameter in the query without value. If the users don't select a value, no parameter will be set in the query.
@ludovicwalle is away from work this week but I can give you some answers. We do not use spring at all, so you'll have to develop a specific layer. Here is the github repo url for the project so you can have a look at the code: https://github.com/VisaTM/IstexConnector. Finally, we are waiting for the information for the configuration of the Github repo.
For the Github repo, configure your repository to call our Jenkins whenever a push is made (settings -> integration&services -> Jenkins (GitHub plugin)) using "https://builds.openminted.eu/job/istix-connector/build/token=2106875370" as the url.
We'll write the spring layer and let you know when we have a deployed version of your code.
I have never used Jenkins. So I just followed the instructions of the preceding comment, pushed a dummy commit, and clicked on the "Test service" button, which showed me an "Okay, the test payload is on its way" message. Is it enough ?
Dear @antleb, can you tell me about the progress of the ISTEX connection work, particularly concerning the Spring layer implementation and the transmission of the authentication token via the interface ? I wonder if you are waiting for information from us.
Dear @antleb, I have questions about how to automatically forward a corpus to OMTD (initial step before running a workflow on this corpus). Is there a solution for an application:
1) to simply push a corpus, using a REST service instead of the OMTD GUI ? 2) to ask OMTD to use a search API (like the ISTEX API) to build and pull a corpus ? Such a service could consume a search query and a database url, for instance. In our case, that means calling the ISTEX connector through an API instead of the OMTD GUI. I can't find information about this kind of procedure in the documentation.
Thanks @stephane54 for asking these questions, I'm also interested to know. @greenwoodma or @galanisd, maybe you know if remote access by API is possible ?
Continued discussion about "Connector to a repository " : https://groups.google.com/forum/#!topic/openminted-user-forum/khbajvAefPk
ContentConnector.downloadFullText ? An authentication token is needed to allow access to fulltext for ISTEX. This token (how it is obtained is out of scope) has to go from the user's application (a browser, most of the time) to the connector, going through OMTD. How will OMTD transmit this token ? How can the connector get it ? ISTEX fulltext is available in more than one format (text, pdf, tei). How can this format be chosen ? How can the connector get it ?
Query ? Where do all the values in a Query object come from ? What are the possible values for "facets" ? For example, does a "params" containing a "publicationyear" with values "2000" and "2003", and a "publicationtype" with values "article", "thesis" and "review" mean ((publicationyear = 2000) OR (publicationyear = 2003)) AND ((publicationtype = article) OR (publicationtype = thesis) OR (publicationtype = review)) ? Is there no way to express ranges, negation, regexp, comparison, ... ? What values can be expected for each constraint ? How should "keyword" be handled ? Even if it is treated as separate, literal words, it is still ambiguous. How are numbers, letter case and non-letter chars handled ?
ContentConnector.search ? ISTEX has limits on this kind of search. "to" must be at most 10000, and "to" - "from" must be at most 5000. fetchMetadata must be used to go beyond those limits. If it is used for facets, why add publications ?
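The two limits quoted above can be checked before sending a request. The following Java sketch is only an illustration: IstexSearchLimits and fitsSearch are hypothetical names, the two constants come from the figures stated in this comment, and treating negative or reversed bounds as unsupported is an assumption.

```java
/** Hypothetical helper; the two limits are the ones quoted in this thread. */
final class IstexSearchLimits {

    static final int MAX_TO = 10_000;     // "to" must be at most 10000
    static final int MAX_WINDOW = 5_000;  // "to" - "from" must be at most 5000

    /** True when the window can be served by a plain search. */
    static boolean fitsSearch(int from, int to) {
        if (from < 0 || to < from) {
            return false; // assumption: invalid bounds are never searchable
        }
        return to <= MAX_TO && to - from <= MAX_WINDOW;
    }
}
```

A connector could call fitsSearch(from, to) first and fall back to a fetchMetadata-style scan when it returns false.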