ISTEX : Connector to a repository (VISATM Hackathon)

stephane54 commented 6 years ago

Continued discussion about "Connector to a repository": https://groups.google.com/forum/#!topic/openminted-user-forum/khbajvAefPk

ContentConnector.downloadFullText? An authentication token is needed to access ISTEX fulltext. This token (how it is obtained is out of scope) has to travel from the user's application (a browser, most of the time) to the connector, going through OMTD. How will OMTD transmit this token? How can the connector get it? ISTEX fulltext is available in more than one format (text, pdf, tei). How can this format be chosen? How can the connector get it?

Query? Where do all the values in a Query object come from? What are the possible values for "facets"? For example, does a "params" containing a "publicationyear" with values "2000" and "2003", and a "publicationtype" with values "article", "thesis" and "review", mean ((publicationyear = 2000) OR (publicationyear = 2003)) AND ((publicationtype = article) OR (publicationtype = thesis) OR (publicationtype = review))? Is there no way to express ranges, negation, regexp, comparison, ...? What values can be expected for each constraint? How should "keyword" be handled? Even if it is treated as separate, literal words, it is still ambiguous. How are numbers, letter case and non-letter characters handled?

ContentConnector.search? ISTEX has limits on this kind of search: "to" must be at most 10000, and "to" - "from" must be at most 5000. fetchMetadata must be used to go beyond those limits. If search is used for facets, why add publications?
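
As a small illustration of the two limits quoted above, the check could be sketched as follows in Python. The function and constant names are hypothetical, and treating "from" as non-negative and strictly below "to" is an assumption, not something ISTEX or OMTD specifies in this thread.

```python
# Limits quoted in the comment above; the names are hypothetical.
ISTEX_MAX_TO = 10_000
ISTEX_MAX_WINDOW = 5_000


def fits_istex_window(from_: int, to: int) -> bool:
    """Return True if a from/to request respects both ISTEX limits.

    Assumption: from_ must be >= 0 and strictly below to.
    """
    if from_ < 0 or from_ >= to:
        return False
    return to <= ISTEX_MAX_TO and (to - from_) <= ISTEX_MAX_WINDOW
```

A request that fails this check would have to go through fetchMetadata instead, as noted above.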

ludovicwalle commented 6 years ago

ContentConnector.search (continuation) ?

To get around the limits of ISTEX searches using from and to, I simulate them by ignoring the first from results. Inefficient but functional. Do these values start at 0 or 1? In other words, if from is 1, do I have to ignore 0 or 1 result?

How should invalid from and to values be handled (either negative values, or from >= to)? What should from and to be in SearchResult in the following situations:

totalResult < from
from < totalResult < to
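
For reference, the workaround mentioned above (skipping the first from results of a stream) could be sketched like this in Python. It assumes 0-based indices with to exclusive, which is exactly the point the question above leaves open; the helper name `window` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def window(results: Iterable[T], from_: int, to: int) -> Iterator[T]:
    """Yield the items at positions from_ .. to-1 of a result stream.

    Assumption: positions are 0-based and `to` is exclusive.
    The first from_ items are read and discarded, hence the inefficiency.
    """
    return islice(results, from_, to)


print(list(window(range(10), 2, 5)))  # [2, 3, 4]
print(list(window(range(3), 5, 8)))   # totalResult < from: []
print(list(window(range(6), 4, 9)))   # from < totalResult < to: [4, 5]
```

Under these assumptions both situations listed above simply yield fewer items than requested; what from and to should then be in SearchResult remains a question for OMTD.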

The OMTD Facet in SearchResult cannot hold ISTEX facet results. ISTEX facet results depend on the data type, and may contain ranges, counts, minimal and maximal values, ... Things are very different. Maybe the concepts behind the word facet are different in OMTD and ISTEX. Finally, what are the facets used for? Are they required in order for OMTD to work properly? I don't think so. So the best thing to do would be to ignore them. But if they are ignored in search, this method becomes very similar to fetchMetadata, the differences being totalHits, from and to, and the way the results are returned (List vs Stream). In fact, fetchMetadata supersedes search, which could become obsolete. totalHits is the number of results read from fetchMetadata with from = first and to = last, avoiding memory limits on the number of results due to the use of List in search. What could be a concrete use case needing only a specific part of the results on arbitrary bounds?

antleb commented 6 years ago

Replying in short to some questions and I will come back later morning or we can even arrange a skype call to have an in-depth discussion:

ContentConnector.downloadFullText: Regarding the format of fulltext, we certainly prefer pdf. It's the most prevalent format and the easiest to handle by default. Regarding the access token, if we can have an "OMTD" token, common to all users, then you can store it/configure it as a parameter in your implementation of the connector. If you require one token per user, I cannot see an immediate way to handle this. Let's discuss it.

antleb commented 6 years ago

Regarding the query, check the README in https://github.com/openminted/content-connector-api. I think it explains most of the questions about the search parameters and facet names.

The queries we are sending to the connector(s) are led by the UI in https://test.openminted.eu/resourceRegistration/corpus/searchForPublications. The values of the same facet are combined with OR, and the facets are combined with each other using AND. So, your example would become ((publicationyear = 2000) OR (publicationyear = 2003)) AND ((publicationtype = article) OR (publicationtype = thesis) OR (publicationtype = review)) (exactly as you wrote it!). We don't need range queries, comparisons, negations, etc.
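
The semantics described above (OR between the values of one facet, AND between facets) can be sketched in Python as below. The record layout (a flat mapping with one value per facet) and the function name are assumptions made for illustration; a real connector would more likely translate the query into its own index syntax than filter in memory.

```python
from typing import Mapping, Sequence


def matches(record: Mapping[str, str], params: Mapping[str, Sequence[str]]) -> bool:
    """OR within the values of one facet, AND across facets."""
    return all(record.get(facet) in values for facet, values in params.items())


# The example discussed in this thread.
params = {
    "publicationyear": ["2000", "2003"],
    "publicationtype": ["article", "thesis", "review"],
}

print(matches({"publicationyear": "2003", "publicationtype": "thesis"}, params))   # True
print(matches({"publicationyear": "2001", "publicationtype": "article"}, params))  # False
```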

The keyword is a simple keyword search in all the fields of the metadata, something like a google search but without the extra functionality offered by google. In the OpenAIRE connector, we handle it by searching in all indexed fields of the publication.
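
A naive reading of this keyword behaviour, sketched in Python: the whole keyword string is treated as one literal, case-insensitive substring and looked up in every metadata field. This is only an illustration under that assumption; the function name and the record layout are hypothetical, and it is not a description of how the OpenAIRE connector is implemented.

```python
from typing import Mapping


def keyword_matches(record: Mapping[str, object], keyword: str) -> bool:
    """Return True if the keyword occurs, ignoring case, in any field value."""
    needle = keyword.casefold()
    return any(needle in str(value).casefold() for value in record.values())


record = {"title": "Text Mining for Scholarly Literature", "publisher": "Example Press"}
print(keyword_matches(record, "MINING"))     # True
print(keyword_matches(record, "chemistry"))  # False
```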

antleb commented 6 years ago

ContentConnector.search

When we designed this method, we had in mind to display the publications in a page. Later we decided that it would be too much and not too useful for the user, so we ended up displaying the facets on the left and the number of publications in the center of the page. For the current purposes of the connector, you can get away with ignoring the "from" and "to" parameters and even return an empty list of publications (I'll make sure that this is correct and reply if this is wrong).

The facets are used (along with the keyword) to limit the number of publications that will go in the corpus. They are mandatory in the sense that if you don't return them, the content service will try to build a corpus with all your publications in it (we have a hard limit of ~1000 publications per connector to protect the content providers from abuse but this may rise or be entirely removed in the future)

antleb commented 6 years ago

A general comment that should have been posted first:

The content connectors are used by the OMTD platform to help the users browse the publications and build corpora. The order of the calls to the methods of the connectors is as follows:

1. One call is made to the search() method. The results are displayed to the user, who can try to limit the results by selecting a value in the facets (a subsequent call is made to the search method with extra parameters in the Query argument).
2. When the user is happy with the subset of the publications, he/she (no assumptions on gender!!!) clicks the "build" button. This translates to one call to the fetchMetadata() method, which returns a stream with all the metadata of the results (no paging needed at this phase).
3. The stream is split by OMTD into individual metadata records, and for each record:
3a. a file with the record is stored in the metadata folder of the new corpus;
3b. a call is made to the downloadFullText() method and the resulting fulltext is stored in the fulltext folder of the new corpus.

The "hash" metadata element is used by OMTD to cache the fulltext and avoid bombarding the connectors with requests for files. We are trying to play nice!
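
To make the per-record part of this flow (3a, 3b and the hash-based cache) concrete, here is a minimal Python sketch. It is not OMTD's code: the record fields "id" and "hash", the in-memory folders and the `build_corpus` name are all assumptions for illustration, and `download_fulltext` stands in for the connector's fulltext download method.

```python
from typing import Callable, Dict, Iterable, List, Mapping, Tuple

Record = Mapping[str, str]


def build_corpus(
    records: Iterable[Record],
    download_fulltext: Callable[[str], bytes],
    cache: Dict[str, bytes],
) -> Tuple[List[Record], Dict[str, bytes]]:
    """Store each metadata record and its fulltext, reusing cached fulltexts."""
    metadata_folder: List[Record] = []
    fulltext_folder: Dict[str, bytes] = {}
    for record in records:
        # 3a: keep the metadata record in the corpus.
        metadata_folder.append(record)
        # 3b: ask the connector for the fulltext only on a cache miss.
        digest = record["hash"]
        if digest not in cache:
            cache[digest] = download_fulltext(record["id"])
        fulltext_folder[record["id"]] = cache[digest]
    return metadata_folder, fulltext_folder
```

Passing the same cache to several builds means a fulltext whose hash is already known is never requested from the connector again.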

ludovicwalle commented 6 years ago

The access token is user specific. Up to now, the validity of this token is not time limited, but that could change in the future. So the token can't be stored in the implementation, because it would give public access to documents. Nor can it be stored in the OMTD user's configuration (supposing that is technically possible), because it may become time limited. It can only be transmitted.

At one end, ISTEX authentication requires user interaction to get the token, so it can't be done elsewhere. At the other end, the connector needs to be given this token. Between the two is OMTD, which must transmit it.

I think that the ContentConnector interface has to be modified.

ludovicwalle commented 6 years ago

I've already looked at https://github.com/openminted/content-connector-api and at the guidelines. Alone, they are not precise enough. But with your explanation of the way a corpus is built, and looking at the URL of the page for doing it (which I hadn't found before), things get much clearer.

PDF can be a reasonable format if it is "native", that is, if PDF contains text, but not if it contains in fact scanned images. In that case, a PDF to text conversion will give nothing. Also, I'm not sure of what will be output if the presentation of the document is not simple, with images, columns, parts to be ignored, ...

ludovicwalle commented 6 years ago

So, facets are only and always those listed (Rights, Publication year, Publication Type, Language, Document Type). Content Source should come from getSourceName method, and keyword from the search field. Right ?

publisher doesn't appear as facet. Is it not used (and should be ignored in the connector implementation) or simply not shown because not applicable for Core and OpenAIRE ?

The search field looks like a free text area, except that characters are converted to upper case. Is there any other transformation, or is the content of that area transmitted to the connector as is, letting it translate it to its own syntax? But what should I do, for example, with words like AND, OR, NOT, which can be boolean operators, with A, THE, ONE, which are stop words often ignored, with symbols like * ? ( ) [ ] \ " ' frequently used in expressions, ...? Should THE and THÉ be distinguished or not? The same keyword string will be used in different connectors, which probably rely on different tools, with their own capabilities, syntax, limits, ... So in order to tend towards a homogeneous behavior, the definition of how to interpret the string should be independent of them, and be specified in a formal and precise manner by OMTD. Also, only very basic search features should be used, hoping they are available in all tools.

ludovicwalle commented 6 years ago

Building a corpus from OpenAIRE requires an authentication which seems similar to the one needed for ISTEX. The first time I want to build a corpus, I'm redirected to an eduGAIN authentication page. From there, I suppose that an authentication token is returned, which is stored as a cookie. The question is: how does it reach the connector? The same should apply to the ISTEX connector.

ludovicwalle commented 6 years ago

Once written (which is about to be done), how can I test my connector ? Where can I find a test tool or environment ?

ludovicwalle commented 6 years ago

What should be put in the label field of a Facet in a SearchResult ?

ludovicwalle commented 6 years ago

What should be done when a param is specified in the query but has no value? 1) Values being connected by OR, the fewer values you specify, the fewer documents you get => extrapolating to zero values, you get zero documents. 2) Checking no value on a check list usually means do not filter => ignore the param.

antleb commented 6 years ago

Regarding the access token, the OMTD platform does keep an access token for the logged in users and this token could be passed to the connector (with a small change/addition to the interface). However, keep in mind that this token is issued by the OMTD AAI, an IdP acting as a proxy for eduGAIN, social networks, etc. If you choose to use this token, then you'd have to refer to the AAI to retrieve information about the user (organization information is included when applicable) but it also means that you'd have to trust the information that the AAI provides. Is that ok? If yes, I can arrange for the change in the interface and also provide you with example code of how to get user info from the AAI.

antleb commented 6 years ago

PDFs are the easy option for us. If they contain scanned images or a very complex structure, then indeed we don't get very useful text, but we'll have to live with that until we are ready to deal with all the different formats.

antleb commented 6 years ago

Facets and keyword:

you are right about all statements about the facets :)

About the keyword search, we don't have any specs about the keyword, operators, etc because we can't know what each content provider supports. It's absolutely up to the implementations to decide if they are going to support complex queries or pass the keyword to their index and hope for the best. In a future version we might add some specs but in our current implementation, it would add a lot of complexity with little gains.

antleb commented 6 years ago

Regarding the testing and deployment of your connector, what we plan to do (asap) is the following:

you'll configure your GitHub repo to notify our Jenkins when you push changes.
we'll set up a clone of the platform's build to build a platform with a dependency on your connector. This build's last step will be deploying everything to a test machine(s) dedicated to content connector tests.
You'll be able to use the UI to build corpora from all available connectors (including yours) and verify that everything runs smoothly. If you need access to the machine running the services, we can discuss it.

The only thing missing is for us to have a look at your code to make sure that you are not going to format my laptop's hard drive (still haven't transferred last summer's photos!) and that you are using spring to set up your code. If you are not using spring, we'll have to write a thin layer to wrap your implementation and allow our spring-based code to pick up your code as well.

antleb commented 6 years ago

The label field of a search result is the label displayed in the UI. Try to use the same labels as the OpenAIRE or CORE connectors for the same facets. To be honest, I don't remember how the UI will behave if there is a mismatch between the labels of the same facets, so try to remain consistent.

antleb commented 6 years ago

Regarding the params, there will not be a parameter in the query without value. If the users don't select a value, no parameter will be set in the query.

stephane54 commented 6 years ago

Quoting @antleb: "The only thing missing is for us to have a look at your code to make sure that you are not going to format my laptop's hard drive (still haven't transferred last summer's photos!) and that you are using spring to set up your code. If you are not using spring, we'll have to write a thin layer to wrap your implementation and allow our spring-based code to pick up your code as well."

@ludovicwalle is away from work this week but I can give you some answers. We do not use spring at all, so you'll have to develop a specific layer. The code is at https://github.com/VisaTM/IstexConnector if you want to have a look. Finally, we are waiting for the information needed to configure the GitHub repo.

antleb commented 6 years ago

For the GitHub repo, configure your repository to call our Jenkins whenever a push is made (settings -> integration&services -> Jenkins (GitHub plugin)) using "https://builds.openminted.eu/job/istix-connector/build/token=2106875370" as the URL.

We'll write the spring layer and let you know when we have a deployed version of your code.

ludovicwalle commented 6 years ago

I have never used Jenkins. So I just followed the instructions in the preceding comment, pushed a dummy commit, and clicked on the "Test service" button, which showed me an "Okay, the test payload is on its way" message. Is that enough?

stephane54 commented 6 years ago

Dear @antleb, I have questions about how to automatically forward a corpus to OMTD (the initial step before running a workflow on this corpus). Is there a way for an application:

to simply push a corpus, using a REST service instead of the OMTD GUI?
to ask OMTD to use a search API (like the ISTEX API) to build and pull a corpus? Such a service could consume a search query and a database URL, for instance. In our case, that means calling the ISTEX connector through an API instead of the OMTD GUI. I can't find any information about this kind of procedure in the documentation.

stephane54 commented 6 years ago

Dear @antleb, can you tell me about any progress on the ISTEX connection work, particularly concerning the Spring layer implementation and the transmission of the authentication token via the interface? I wonder if you are waiting for information from us.

mandiayba commented 6 years ago

Quoting @stephane54: "Dear @antleb, I have questions about how to automatically forward a corpus to OMTD (the initial step before running a workflow on this corpus). Is there a way for an application:
to simply push a corpus, using a REST service instead of the OMTD GUI?
to ask OMTD to use a search API (like the ISTEX API) to build and pull a corpus? Such a service could consume a search query and a database URL, for instance.
In our case, that means calling the ISTEX connector through an API instead of the OMTD GUI.
I can't find any information about this kind of procedure in the documentation."

Thanks @stephane54 for asking these questions, I'm also interested to know. @greenwoodma or @galanisd, maybe you know if remote access by API is possible?
