uvacw / inca

Make case-sensitive searches possible #390

Open damian0604 opened 6 years ago

damian0604 commented 6 years ago

Although the Elasticsearch query syntax allows for regular expressions, everything is in fact transformed to lowercase (because the text field is 'analyzed', as they call it). This makes searches very fast, but it also makes it impossible to get a timeline for, for instance, the political party DENK, given that it is a frequent Dutch word when lowercased. See the example below, where all three queries return identical counts:

In [14]: myinca.importers_exporters.export_timeline(queries='text:"/DENK/" AND doctype:"nu"', granularity='year')

In [15]: %cat timeline_export.csv
,timestamp,"1. text:""/DEN..."
0,2014-01-01T00:00:00.000Z,596
1,2015-01-01T00:00:00.000Z,124
2,2016-01-01T00:00:00.000Z,1246
3,2017-01-01T00:00:00.000Z,1634
4,2018-01-01T00:00:00.000Z,719

In [16]: myinca.importers_exporters.export_timeline(queries='text:"/Denk/" AND doctype:"nu"', granularity='year')

In [17]: %cat timeline_export.csv
,timestamp,"1. text:""/Den..."
0,2014-01-01T00:00:00.000Z,596
1,2015-01-01T00:00:00.000Z,124
2,2016-01-01T00:00:00.000Z,1246
3,2017-01-01T00:00:00.000Z,1634
4,2018-01-01T00:00:00.000Z,719

In [18]: myinca.importers_exporters.export_timeline(queries='text:"denk" AND doctype:"nu"', granularity='year')

In [19]: %cat timeline_export.csv
,timestamp,"1. text:""denk..."
0,2014-01-01T00:00:00.000Z,596
1,2015-01-01T00:00:00.000Z,124
2,2016-01-01T00:00:00.000Z,1246
3,2017-01-01T00:00:00.000Z,1634
4,2018-01-01T00:00:00.000Z,719
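
To illustrate why the three timelines come out identical, here is a minimal toy sketch of an inverted index, assuming an analyzer that simply splits on non-word characters and lowercases. It is not Elasticsearch's actual implementation; the function names and the three sample sentences are made up for illustration.

```python
import re


def lowercase_analyzer(text):
    """Toy stand-in for an 'analyzed' field: tokenize, then lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def keep_case_analyzer(text):
    """Toy stand-in for a field indexed without lowercasing."""
    return re.findall(r"\w+", text)


def build_index(docs, analyzer):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index


# Hypothetical sample documents: the party name, the verb, and a
# sentence-initial capitalised verb.
docs = [
    "DENK wint drie zetels",
    "Ik denk dat het gaat regenen",
    "Denk goed na voor je stemt",
]

analyzed = build_index(docs, lowercase_analyzer)
verbatim = build_index(docs, keep_case_analyzer)

# With lowercasing, only the term "denk" exists in the index, so the
# party and the verb cannot be told apart.
hits_analyzed = sorted(analyzed.get("denk", set()))

# Without lowercasing, "DENK" is a separate term.
hits_verbatim = sorted(verbatim.get("DENK", set()))

print(hits_analyzed, hits_verbatim)
```

In the lowercased index, the casing of the query no longer matters because the original casing was discarded at index time; only an index (or a post-filter) that preserves case can separate the party from the verb.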

theoaraujo commented 6 years ago

Suggestion for solution:

damian0604 commented 6 years ago

@lisadk93 if you do it as @theoaraujo suggests, then it seems to come down to writing a processor (see docs/how_to_process) that returns True or False...
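
As a rough sketch of what such a True/False check could look like, the standalone function below tests whether a text contains a term with exactly the given casing. This is an assumption for illustration only: `contains_case_sensitive` is a hypothetical name, and the code does not use INCA's processor base class or the interface described in docs/how_to_process.

```python
import re


def contains_case_sensitive(text, term):
    """Return True if `term` occurs in `text` as a whole word with exact casing.

    Hypothetical helper; the real processor would wrap similar logic in
    whatever interface INCA's processors require.
    """
    # re.escape guards against regex metacharacters in the term;
    # \b restricts the match to whole words. No re.IGNORECASE flag,
    # so the comparison is case-sensitive.
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text) is not None


print(contains_case_sensitive("DENK wint drie zetels", "DENK"))
print(contains_case_sensitive("Ik denk dat het gaat regenen", "DENK"))
```

The result of such a check could be stored as a boolean field on each document, so that a timeline query can filter on that field instead of on the lowercased text.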

damian0604 commented 6 years ago

Functionality implemented in PR #422; we only need to write some documentation.