usc-isi-i2 / dig-sandpaper

MIT License
Breaks are converted to carriage returns in highlights #4

Open ThomasSchellenbergNextCentury opened 7 years ago

ThomasSchellenbergNextCentury commented 7 years ago

Breaks (<br/>) from titles/descriptions are being converted into carriage returns and newlines (\r\n or \n) in the highlight results from sandpaper. We need the highlighted title/description text to keep its breaks so that the DIG UI can show line breaks.

Here is the knowledge_graph->description->value:

"<br/> nav <br/> search <br/>            los angeles, ca<br/>           <br/>           <br/>           <br/>            free classifieds<br/>           <br/> The requested ad could not be found.<br/>  <br/>  <br/>    <br/>    <br/>        <br/>           Recent escorts ads. <br/> Posted: Sun. May. 22, 6:21 AM <br/> Posted: Sun. May. 22, 6:17 AM <br/> Posted: Sun. May. 22, 6:17 AM <br/> Posted: Sun. May. 22, 6:16 AM <br/> Posted: Sun. May. 22, 6:13 AM <br/> Posted: Sun. May. 22, 6:12 AM <br/> Posted: Sun. May. 22, 6:10 AM <br/> Posted: Sun. May. 22, 6:09 AM <br/> Posted: Sun. May. 22, 6:06 AM <br/> Posted: Sun. May. 22, 6:06 AM <br/>"

Here is the highlight->content_extraction.content_strict.text:

"\n \n search \n \n \n \n \n \n \r\n            <em>los</em> <em>angeles</em>, ca\r\n           \r\n           \r\n           \r\n            free classifieds\r\n           \n \n \n \n \n \n The requested ad could not be found.\r\n\r\n\r\n  \r\n  \r\n    \r\n    \r\n        \r\n           Recent escorts ads. \n \n Posted: Sun. May. 22, 6:21 AM \n \n \n Posted: Sun. May. 22, 6:17 AM \n \n \n Posted: Sun. May. 22, 6:17 AM \n \n \n Posted: Sun. May. 22, 6:16 AM \n \n \n Posted: Sun. May. 22, 6:13 AM \n \n \n Posted: Sun. May. 22, 6:12 AM \n \n \n Posted: Sun. May. 22, 6:10 AM \n \n \n Posted: Sun. May. 22, 6:09 AM \n \n \n Posted: Sun. May. 22, 6:06 AM \n \n \n Posted: Sun. May. 22, 6:06 AM \n \n \n \n \n"

Here is my sandpaper query on http://10.3.2.82:9876/search/coarse:

{"SPARQL":{"group-by":{"limit":1,"offset":0},"select":{"variables":[{"type":"simple","variable":"?ad"}]},"where":{"clauses":[{"constraint":"los angeles","isOptional":false,"predicate":"city"}],"filters":[],"type":"Ad","variable":"?ad"}},"type":"Point Fact"}

Here is the link to the ES document: http://10.1.94.103:9201/dig-etk-search/ads/CDFDF087781B7FCEFD7CEA46A739DAB72F26434CF6B7BE5D34865CAE48243B76

ThomasSchellenbergNextCentury commented 7 years ago

@jasonslepicka After talking with @saggu, it sounds like the issue is that the <br/> tags exist only in knowledge_graph, while the highlights use only content_extraction or indexed. In order to have both highlights and line breaks, we need to either:

We don't want to just use the raw text because part of the cleaning process includes removing excess newlines.
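
For illustration only, here is a minimal sketch of one way a consumer of the highlight text could get breaks back while still dropping excess newlines. It assumes the highlight string looks like the content_strict.text sample above (runs of \n and \r\n mixed with <em> tags). The helper name `newlines_to_breaks` is hypothetical and is not part of sandpaper or the DIG UI, and collapsing each run of newlines into a single <br/> is an assumption about what "removing excess newlines" should mean here.

```python
import re

# Hypothetical helper, not part of sandpaper or the DIG UI.
# Matches one or more consecutive line endings (\n or \r\n),
# together with the spaces/tabs that surround them.
_NEWLINE_RUN = re.compile(r"(?:[ \t]*\r?\n[ \t]*)+")


def newlines_to_breaks(highlight: str) -> str:
    """Replace every run of newlines with a single <br/> tag.

    Other markup, such as the <em> tags added by highlighting,
    is left untouched.
    """
    return _NEWLINE_RUN.sub("<br/> ", highlight).strip()


if __name__ == "__main__":
    sample = "\n \n search \n \r\n   <em>los</em> <em>angeles</em>, ca\r\n  \r\n free classifieds\r\n"
    print(newlines_to_breaks(sample))
```

This keeps the highlighting intact but loses the distinction between one break and several, so it would only fit if the UI does not need to reproduce the original spacing exactly.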