hltcoe / patapsco

Cross language information retrieval pipeline

Query language? #38

Open isoboroff opened 2 years ago

isoboroff commented 2 years ago

How does Patapsco parse queries? In particular, when you send a query to the web service, is it parsed as a Lucene query, or something else?

The context is that I'm thinking about ways to handle queries on a combined traditional and simplified Chinese corpus.

Are parameters of the retrieval in the web service controlled by the "queries" and "retrieve" clauses of the config file?

cash commented 2 years ago

@isoboroff when you run the web service like so:

patapsco-web --run path/to/run --port 9090

it reads the configuration file saved in the run directory. It takes the language of the queries from the topics section and the retrieval parameters from the retrieve section.

I think you're asking for the ability to override parts of the config on the command line. Is that right?

isoboroff commented 2 years ago

My main question is how the queries are parsed. The answer seems to be: the same way they are in batch mode. I think that's just word tokens with no operators or anything, right?

I'm adapting my collection search tool, which currently uses ElasticSearch, to use the Patapsco web service, on the hypothesis that it is better at tokenizing the languages I'm working with (Russian, Farsi, Chinese). Elastic has a lot of web service functionality like highlights and faceting and pagination which are nice when building an interactive search tool, and also it's not hard to use Lucene query syntax which supports some common operators.

isoboroff commented 2 years ago

Just adding the minimum configuration:

topics:
  input:
    lang: fas
retrieve:
  name: bm25
  number: 10

There is an error:

patapsco.error.ConfigError: 3 validation errors in configuration
  topics.input.format - missing field
  topics.input.source - missing field
  topics.input.path - missing field

These fields of course don't make sense for interactive queries. Does it mean that the query endpoint is expecting a JSON object like a batch query?

(edited: removed bad stand-in config. I needed a basic "queries" section which was missing.)

dlawrie commented 2 years ago

This is my javascript code

// Searching using Patapsco
var lang = targetLanguage;
lang = 'zho' // FIX ME remove for release

var url = PATAPSCO_URL + '/' + lang ;
const myRequest = new Request(url+'/query/'+inputQuery);

fetch(myRequest)
.then(response => {
    console.log('Response:', response.status);
    if (!response.ok) {
        throw new Error('Network response was not OK');
    }
    return response.json();
})
.then(data => {
    console.log("Patapsco response");
    console.log(data);
    var results = data['results'];
    if (data.query && data.query.text) {
        document.getElementById('target-query').dataset.recent = data.query.text;
    }
    console.log(results);
    for (let i in results) {
        let id = results[i]['doc_id'];
        var doc_num = parseInt(i) + 1;
        let doc_info = [doc_num.toString(), id];
        document_list.push(doc_info);
    }
    console.log(document_list);

    possible_queries[inputQuery] = [inputQuery, document_list];
    console.log(possible_queries);

    buildDocumentList(document_list);
})
.catch(error => {
    document.getElementById('inner-hit-list').classList.remove('no-display');
    const findContainer = document.getElementById('find-document-container');
    findContainer.innerHTML = ' There was an error issuing the query...try again';
    console.error('There has been a problem with your fetch operation:', error);
});
}

Are you getting the error when setting up the web service?


--


Dawn J. Lawrie Ph.D. Senior Research Scientist Human Language Technology Center of Excellence Johns Hopkins University 810 Wyman Park Drive Baltimore, MD 21211 @.*** https://hltcoe.jhu.edu/faculty/dawn-lawrie/

isoboroff commented 2 years ago

Frankly, I'm trying to run the web service and send some queries from the command line so I can understand the request and response formats.

Your JS doesn't clarify the format of the query, and you appear to have a custom URL, which may mean you have a per-language proxy layer in there, or your own web service app.

isoboroff commented 2 years ago

I see in patapsco/topic.py that there seem to be hooks for Lucene query processing, I'll start poking through that.

cash commented 2 years ago

@isoboroff Yes, processing of queries/topics in the web services is controlled by the configuration file used to create the index. Most people use term-based queries or PSQ. I added support for Lucene syntax but it has to be configured for that and is not interoperable with PSQ. The only documentation that I have on this is here: https://github.com/hltcoe/patapsco/blob/master/docs/config.md#lucene-classic-query-parsing

lizekui commented 1 year ago

Hi @dlawrie, your JS code looks so subtle and concise. Could you share your JS project for beginners like me? Thanks!

https://github.com/hltcoe/patapsco/issues/38#issuecomment-1080723547

dlawrie commented 1 year ago

I tested in a web browser by just typing the URL with the query at the end.
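For anyone scripting this instead of typing into the address bar, here is a minimal sketch of the same request. It assumes the URL layout used in the JavaScript earlier in this thread (`<base>/<lang>/query/<text>`) and the response fields that code reads (`results[].doc_id`, `query.text`). Both may be specific to that deployment rather than a documented Patapsco interface, and the helper names `buildQueryUrl` and `toDocumentList` are made up for illustration. The one change from the earlier snippet is percent-encoding the query text, so that characters such as `/`, `?`, or `#` inside a query are not read as URL syntax.

```javascript
// Sketch only: URL layout and response fields are inferred from the snippet
// earlier in this thread, not from Patapsco documentation.

// Build a query URL, percent-encoding the query text so that characters
// such as '/', '?', or '#' stay inside the final path segment.
function buildQueryUrl(baseUrl, lang, queryText) {
    const base = baseUrl.replace(/\/+$/, '');  // drop trailing slashes
    return base + '/' + encodeURIComponent(lang) +
        '/query/' + encodeURIComponent(queryText);
}

// Turn a response body into [rank, doc_id] pairs, with ranks starting at "1".
function toDocumentList(data) {
    const results = (data && data.results) || [];
    return results.map((hit, i) => [String(i + 1), hit.doc_id]);
}

// Hypothetical usage; the base URL is a placeholder for your own service.
// fetch(buildQueryUrl('http://localhost:9090', 'zho', '信息检索'))
//     .then(response => response.json())
//     .then(data => console.log(toDocumentList(data)));
```

Keeping the URL construction and the response handling in small pure functions makes it easy to check them without a running service.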



dlawrie commented 1 year ago

The plain-text query is parsed in the same way the documents were parsed (i.e., normalized, stemmed or not, etc.). Does that answer the question?
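To illustrate why the query has to go through the same processing as the documents, here is a toy sketch. It is not Patapsco's implementation: the steps in `processText` (NFKC normalization, lowercasing, whitespace tokenization) and all function names are assumptions chosen to keep the example small. The point is only that one function is shared between index time and query time, so the query terms are comparable with the indexed terms.

```javascript
// Illustration only: a toy text pipeline, not Patapsco's implementation.
// The processing steps here are assumptions picked for brevity.
function processText(text) {
    return text
        .normalize('NFKC')      // fold compatibility forms, e.g. full-width letters
        .toLowerCase()
        .split(/\s+/)
        .filter(token => token.length > 0);
}

// Index time: reduce each document to a set of processed terms.
function indexDocuments(docs) {
    const index = new Map();
    for (const [docId, text] of Object.entries(docs)) {
        index.set(docId, new Set(processText(text)));
    }
    return index;
}

// Query time: run the query through the same function, then return the ids
// of documents that share at least one term with it.
function search(index, query) {
    const terms = processText(query);
    const hits = [];
    for (const [docId, docTerms] of index) {
        if (terms.some(term => docTerms.has(term))) {
            hits.push(docId);
        }
    }
    return hits;
}

const index = indexDocuments({ d1: 'Ｆｕｌｌ Width Text', d2: 'other words' });
console.log(search(index, 'FULL'));  // ['d1']
```

If the query skipped `processText`, the upper-case `FULL` would not match the lower-cased, normalized term stored for `d1`.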

